Business Understanding:
This is the goal of the client:
“Can you analyze the weather data to predict public transport service disruption in Dubai? How can we plan for less disruption in the wake of severe weather conditions and leverage the emergency management plan as well as providing uninterrupted services and products to citizens?”
Data Understanding:
The client has supplied a JSON file with information about the weather in Dubai. This information was recorded each hour between midnight (00:00) on January 1, 2018 and 11 p.m. on March 16, 2020. It contains the main meteorological metrics, such as temperature, pressure, humidity, wind conditions, cloud status, rain status, and other weather-related columns. This JSON file has 19344 rows.
The client also suggests using the information that the RTA provides through the Dubai Pulse website. On that site we can find several files with information related to the bus routes:
- RTA_AVERAGE_SPEED_PER_LINE_BUSES-OPEN. This is a set of files with the datetime, the route, and the speed of the bus. There are 40 files totalling 212 MB.
- RTA_BUS_RIDERSHIP-OPEN. This is a set of files with the riders' check-ins and check-outs, including the datetime, the route, the start station, and the end station. It is a large set with more than 1300 files of about 100 MB each.
I downloaded a sample of each set of files and found that the information in the first one is well suited to the task. I can correlate the bus information with the weather information through the datetime, and I can estimate service disruptions using the average speed.
If the speed of the bus is lower or higher than usual, we can try to correlate that with the weather conditions. The bus speed is directly related to service disruption: if we can predict a slower speed using a regression model, we can predict a disruption of the service.
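To make that definition concrete, here is a minimal sketch of how a predicted speed could be turned into a disruption flag. The 0.8 threshold and the predicted_speed column are assumptions made for illustration; they are not part of the client's data.

import pandas as pd

# Minimal sketch: flag a route/hour as disrupted when the predicted speed
# falls well below the historical median speed of that route.
# The 0.8 factor and the 'predicted_speed' column are illustrative assumptions.
def flag_disruptions(df: pd.DataFrame, factor: float = 0.8) -> pd.DataFrame:
    baseline = df.groupby('route_name')['average_speed'].transform('median')
    df['disruption'] = df['predicted_speed'] < factor * baseline
    return df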
The second set of files focuses on the riders' check-ins and check-outs. We have more days in these files, but we do not have the speed, and there is no easy way to calculate it, because for each rider we would have to find a row with a check-in and a row with a check-out to be able to estimate the speed of that route.
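As a rough illustration of the extra work those files would need, the sketch below pairs each rider's first check-in and last check-out per day to approximate a trip duration. The column names (rider_id, route_name, datetime, event_type) are hypothetical, and real data would need far more careful matching.

import pandas as pd

# Hypothetical column names; the real ridership schema must be checked.
def estimate_trip_durations(rides: pd.DataFrame) -> pd.Series:
    rides = rides.copy()
    rides['datetime'] = pd.to_datetime(rides['datetime'])
    ins = rides[rides['event_type'] == 'check_in']
    outs = rides[rides['event_type'] == 'check_out']
    # Pair each rider's earliest check-in with their latest check-out per day.
    start = ins.groupby(['rider_id', ins['datetime'].dt.date])['datetime'].min()
    end = outs.groupby(['rider_id', outs['datetime'].dt.date])['datetime'].max()
    return (end - start).dropna().rename('trip_duration')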
So, I will use the RTA_AVERAGE_SPEED_PER_LINE_BUSES-OPEN dataset, which is located at this URL.
Data Preparation:
The first step is to download the data. I use a small Python script to download the files with scraping techniques.
The second step is to transform the bus files. These are their columns:
- date
- period
- route_name
- route_direction
- service_type
- average_speed
I join the 40 files into one dataframe called df_bus, then I group this dataframe by date and route name, excluding the route direction and the service type and aggregating the average speed with the mean (as in the code below). Then I convert the datetime to UNIX format, using a lambda function and a parser. The length of this dataframe is 19344 rows.
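As a side note on the datetime conversion, the row-wise lambda used in the notebook works but is slow on large frames; a vectorized equivalent, assuming the date strings are naive timestamps exactly as the notebook treats them, would be:

import pandas as pd

# Vectorized alternative to the per-row parser.parse lambda used in the notebook.
ts = pd.to_datetime(df_bus['date'])
df_bus['dt'] = (ts - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')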
The third step is to transform the weather file. I load the JSON into a dataframe called df_weather. Inside the dataframe there are several columns that contain objects: main, wind, clouds, rain, and weather. With a lambda function I convert these objects into columns of the dataframe and expand the original dataframe with them. After that I drop the original object columns and dt_iso, because I have another column with the datetime in UNIX format. The length of this dataframe is 156866 rows.
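The notebook below does this expansion with repeated apply(pd.Series) calls; an alternative sketch using pd.json_normalize, assuming the main and wind columns hold a dict on every row, looks like this:

import pandas as pd

# Flatten the nested dict columns in one pass; sep='_' yields main_temp,
# wind_speed, etc., matching the renamed columns used later.
records = df_weather[['main', 'wind']].to_dict(orient='records')
flat = pd.json_normalize(records, sep='_')
df_weather = df_weather.drop(columns=['main', 'wind']).join(flat)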
I merge both dataframes into a dataframe called df_total, using the datetime in UNIX format. The merge uses an inner join on that column, and I obtain a new dataframe. The length of this dataframe is 65822 rows.
Now I have, for each bus route and hour, the corresponding weather information.
Modelling:
The goal is to predict the speed of the bus route, so I will use the average speed as the target of the neural network. It is a regressor, so the last layer will use a linear activation and contain a single neuron. The client requests a TensorFlow solution, so I implement a Keras regressor model.
I standardize the dataset and make a train/test split of it. I test several parameters using hyperparameter tuning with GridSearchCV.
I wrap the model in a function so that I can run several tests and modify the function's contents to check different architectures, such as DNN and LSTM. The metrics of each model are in the evaluation step.
Evaluation:
The metric I have selected is the median absolute error (MedAE). It is calculated by taking the median of all absolute differences between the target and the prediction, which makes it robust to outliers. I have used cross-validation with 5 folds, so I calculate the MedAE on each fold and report the mean over the 5 folds at the end.
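As a quick sanity check of the metric, here is a tiny worked example showing why the median of the absolute differences barely moves when a single prediction is far off:

import numpy as np
from sklearn.metrics import median_absolute_error

y_true = np.array([30.0, 32.0, 28.0, 31.0, 29.0])
y_pred = np.array([29.0, 33.0, 27.5, 31.5, 45.0])
print(np.abs(y_true - y_pred))                 # [ 1.   1.   0.5  0.5 16. ]
print(median_absolute_error(y_true, y_pred))   # 1.0, barely affected by the 16.0 outlier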
These are the per-architecture results:
- 5 dense layers – MedAE = 8.21% error
- 10 dense layers – MedAE = 8.16% error
- 12 dense layers – MedAE = 7.48% error
- 15 dense layers – MedAE = 8.01% error
- 5 LSTM layers – MedAE = NaN
- 10 LSTM layers – MedAE = NaN
- 15 LSTM layers – MedAE = NaN
This means we can predict the average speed with an error of 7.74% using the weather information.
Deployment:
I present the code of the model using Colab. If you don't want to use Colab, you are free to export the Jupyter Notebook and modify the data folder.
Here is the Colab with the code.
Data preparation
Scraping of the bus files from the RTA through Dubai Pulse.
This requires installing a browser driver (chromium-chromedriver) under Colab.
# install wget, selenium, and chromium-chromedriver
!pip install wget
!pip install selenium
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
Colab is used with Google Drive to store the datasets.
The code requires a pre-existing folder under your Drive account called '/Colab Notebooks/D2020/Bus/'.
import requests
import logging as log
from lxml import html
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import wget
from google.colab import drive
drive.mount('/content/drive')
datasetFolder='/content/drive/My Drive/Colab Notebooks/D2020/Bus/'
def downloadCSV():
    url = "https://www.dubaipulse.gov.ae/data/allresource/rta-bus/rta_average_speed_per_line_buses-open?organisation=rta&service=rta-bus&count=40&result_per_page=100&as_sfid=AAAAAAXQ-dqMA8yvyDCuTT0GMC-1DU06mAFsPJppMNnkAH0H--Qse7PRib25W90JbA9W4reS3xLHUPX245bxvdszC5WRoOUrM7ULj1s9mogF2S6eDxvzapzC45azH8NR9Popk14%3D&as_fid=9396bc997a5a44ecd2d45b42e3341dee8bf728b5"
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
    driver.set_window_size(1120, 550)
    driver.get(url)
    driver.maximize_window()
    # wait until the list of dataset resources is rendered
    wait = WebDriverWait(driver, 10)
    wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "col-md-width-60")))
    content = driver.page_source
    # collect every dataset link (stopping at the closing quote) and download the CSV ones
    response = re.findall(re.escape('https://www.dubaipulse.gov.ae/dataset') + '[^"]*', content)
    for resp in response:
        if 'csv' in resp:
            wget.download(resp, datasetFolder)

downloadCSV()
Easy scraping: just select the https links that contain the csv substring.
import pandas as pd
import os
from google.colab import drive
drive.mount('/content/drive')
datasetFolder='/content/drive/My Drive/Colab Notebooks/D2020/'
dataset_weather = "https://datacases.s3.us-east-2.amazonaws.com/datathon-2020/Ernst+and+Young/Dubai+Weather_20180101_20200316.txt"
df_weather = pd.read_json(dataset_weather)
df_weather.to_csv(datasetFolder + 'weather.csv', index=False)
# read every downloaded CSV and concatenate them into a single dataframe
files_bus = os.listdir(datasetFolder + 'Bus/')
df_bus = pd.read_csv(datasetFolder + 'Bus/' + files_bus[0])
for file in files_bus[1:]:
    if 'csv' in file:
        df1 = pd.read_csv(datasetFolder + 'Bus/' + file)
        df_bus = df_bus.append(df1, ignore_index=True)
df_bus.to_csv(datasetFolder + 'bus.csv', index=False)
Bus dataframe. I group all the elements by date and route name.
Then I create a new UNIX-format datetime column from the datetime information.
from dateutil import parser
import datetime
df_bus = df_bus.groupby(['date','route_name'], as_index= False)['average_speed'].mean().sort_values(['date', 'route_name'])
df_bus = df_bus.rename(columns = {'date':'dt_iso'})
df_bus['dt'] = df_bus.apply(lambda row : int((parser.parse(row['dt_iso']) - datetime.datetime(1970, 1, 1)).total_seconds()), axis = 1)
df_bus = df_bus.drop('dt_iso', axis=1)
df_bus[:3]
Weather dataframe. I expand the object columns into the dataframe and remove the original objects.
Then I rename the columns and drop the unnecessary ones.
df_weather = df_weather.join(df_weather.apply(lambda x: pd.Series(x['main']), axis = 1))
df_weather = df_weather.drop('main', axis=1)
df_weather = df_weather.join(df_weather.apply(lambda x: pd.Series(x['wind']), axis = 1))
df_weather = df_weather.drop('wind', axis=1)
df_weather = df_weather.join(df_weather.apply(lambda x: pd.Series(x['clouds']), axis = 1))
df_weather = df_weather.drop('clouds', axis=1)
df_weather = df_weather.join(df_weather.apply(lambda x: pd.Series(x['rain']), axis = 1))
df_weather = df_weather.drop('rain', axis=1)
df_weather = df_weather.join(df_weather.apply(lambda x: pd.Series(x['weather'][0]), axis = 1))
df_weather = df_weather.drop('weather', axis=1)
df_weather = df_weather.drop('dt_iso', axis=1)
df_weather = df_weather.drop('all', axis=1)
df_weather = df_weather.drop(0, axis=1)
df_weather = df_weather.rename(columns = {'temp':'main_temp','temp_min':'main_temp_min','temp_max':'main_temp_max','feels_like':'main_feels_like','pressure':'main_pressure','speed':'wind_speed','deg':'wind_deg','1h':'rain_1h','3h':'rain_3h','id':'weather_id','description':'weather_description','icon':'weather_icon'})
df_weather[:3]
Merge both dataframes using an inner join on the datetime column.
df_total = pd.merge(df_weather, df_bus, how = 'inner', on = 'dt')
df_total[:3]
df_total.to_csv(datasetFolder + 'total.csv', index=False)
Compare lengths of dataframes
print('Length of weather dataset:', len(df_weather))
print('Length of bus dataset:', len(df_bus))
print('Length of total dataset:', len(df_total))
Modeling
I use a cross-validation scheme with 3 folds and grid search.
The model-building function is encapsulated so that different tests can be run.
import pandas as pd
from keras.layers import Dense, Dropout, Input, LSTM
from keras.models import Sequential
from keras.layers.merge import add
from keras.optimizers import Adam
from keras.models import Model
from keras import losses
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from keras.wrappers.scikit_learn import KerasRegressor
from numpy.random import uniform
import numpy as np
from keras import backend as K
from tensorflow.python.ops import math_ops
from sklearn.model_selection import KFold,cross_val_score
from numpy import ndarray
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from google.colab import drive
drive.mount('/content/drive')
datasetFolder='/content/drive/My Drive/Colab Notebooks/D2020/'
dataset = pd.read_csv(datasetFolder + 'total.csv')
def LSTMModel(learn_rate=0.01, neurons=12):
    inputs = Input(shape=(xTrain.shape[1], xTrain.shape[2]))
    se1 = LSTM(neurons, activation='relu')(inputs)
    outputs = Dense(1, activation='linear')(se1)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss=losses.mean_absolute_error, optimizer=Adam(lr=learn_rate))
    return model
def DNNModel(learn_rate=0.01, neurons=12):
    inputs = Input(shape=(xTrain.shape[1],))
    se1 = Dense(neurons, activation='relu')(inputs)
    outputs = Dense(1, activation='linear')(se1)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss=losses.mean_absolute_error, optimizer=Adam(lr=learn_rate))
    return model
architecture='DNN'
dataset= dataset._get_numeric_data()
dataset=dataset.fillna(0)
X=dataset.drop(['average_speed'], axis=1)
y=dataset.take([-1], axis=1).values.ravel()
scaler = StandardScaler()
X=scaler.fit_transform(X)
xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2, random_state = 2020)
if architecture == 'DNN':
    model = KerasRegressor(build_fn=DNNModel, verbose=1)
if architecture == 'LSTM':
    xTrain = xTrain.reshape(xTrain.shape[0], 1, xTrain.shape[1])
    model = KerasRegressor(build_fn=LSTMModel, verbose=1)
learn_rate = [0.001, 0.01]
epochs = [10, 20]
param_grid = dict(epochs=epochs, learn_rate=learn_rate)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1, cv=3, verbose=2)
grid_result = grid.fit(xTrain, yTrain)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
Evaluation
This is the best model obtained
import pandas as pd
from keras.layers import Dense, Dropout, Input, LSTM
from keras.models import Sequential
from keras.layers.merge import add
from keras.optimizers import Adam
from keras.models import Model
from keras import losses
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from keras.wrappers.scikit_learn import KerasRegressor
from numpy.random import uniform
import numpy as np
from keras import backend as K
from tensorflow.python.ops import math_ops
from sklearn.model_selection import KFold,cross_val_score
from numpy import ndarray
from google.colab import drive
drive.mount('/content/drive')
datasetFolder='/content/drive/My Drive/Colab Notebooks/D2020/'
dataset = pd.read_csv(datasetFolder + 'total.csv')
def base_model(learn_rate=0.01, neurons=12):
    inputs = Input(shape=(15,))
    se1 = Dense(neurons, activation='relu')(inputs)
    outputs = Dense(1, activation='linear')(se1)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss=losses.mean_absolute_error, optimizer=Adam(lr=learn_rate))
    return model
dataset= dataset._get_numeric_data()
dataset=dataset.fillna(0)
X=dataset.drop(['average_speed'], axis=1)
y=dataset.take([-1], axis=1).values.ravel()
scaler = StandardScaler()
X=scaler.fit_transform(X)
folds=5
result=ndarray((folds,),float)
kf = KFold(n_splits=folds,random_state=None, shuffle=True)
k=0
for train_index, test_index in kf.split(X):
    xTrain, xTest = X[train_index], X[test_index]
    yTrain, yTest = y[train_index], y[test_index]
    # build the best model found in the grid search and evaluate it on this fold
    model = base_model(learn_rate=0.001, neurons=12)
    model.fit(xTrain, yTrain, epochs=20, verbose=1)
    yPred = model.predict(xTest)
    value = metrics.median_absolute_error(yTest, yPred)
    print('Median absolute error: {}'.format(value))
    result[k] = value
    k = k + 1
I use a regression loss to evaluate the model.
The lower the loss, the better the model.
print('Mean median absolute error over the folds: {}'.format(np.mean(result)))
Future tasks:
Now that we have a prediction of the average speed, with this information and the route length we can predict when the bus will arrive at each bus stop. This step is left as future work, but a rough sketch follows.
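A minimal sketch of that future step, assuming a hypothetical remaining route length in kilometres and the predicted average speed in km/h:

# Sketch: estimate minutes until the bus reaches a stop.
# The 3.5 km and 28 km/h figures are hypothetical examples.
def eta_minutes(remaining_km, predicted_speed_kmh):
    # time = distance / speed, converted from hours to minutes
    return 60.0 * remaining_km / predicted_speed_kmh

print(eta_minutes(remaining_km=3.5, predicted_speed_kmh=28.0))  # ~7.5 minutes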