
Datathon 2020 Ernst and Young Challenge – Team Solo


Business Understanding:

This is the goal of the client:

“Can you analyze the weather data to predict public transport service disruption in Dubai? How can we plan for less disruption in the wake of severe weather conditions and leverage the emergency management plan as well as providing uninterrupted services and products to citizens?”

Data Understanding:

The client has supplied a JSON file with information about the weather in Dubai. This information was recorded every hour between 00:00 on January 1, 2018 and 23:00 on March 16, 2020. We have the main meteorological metrics, such as temperature, pressure, humidity, wind conditions, cloud status, rain status, and other weather-related columns. This JSON file has 19,344 rows.

The client also suggests using the information that the RTA provides through the Dubai Pulse website. On that site we can find several files related to the bus routes:

  • RTA_AVERAGE_SPEED_PER_LINE_BUSES-OPEN. This is a set of files containing the datetime, the route, and the speed of the bus. There are 40 files totalling 212 MB.
  • RTA_BUS_RIDERSHIP-OPEN. This is a set of files with the check-ins and check-outs of the riders, including the datetime, the route, the initial station, and the end station. It is a large set with more than 1,300 files of about 100 MB each.

I downloaded a sample of each set of files and saw that the information in the first one fits our needs well. I can correlate the bus information with the weather information through the datetime, and I can estimate service disruptions using the average speed.

If the bus speed is lower or higher than usual, we can try to correlate that with the weather conditions. The bus speed is directly related to service disruption: if we can predict a slower speed with a regression model, we can predict a disruption of the service.
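To make the link between predicted speed and disruption concrete, here is a minimal sketch (the predicted_speed column, the flag_disruptions helper, and the 20 % threshold are illustrative assumptions, not part of the delivered solution):

import pandas as pd

# Illustrative sketch only: flag a route/hour as "disrupted" when the
# predicted average speed falls well below the route's historical median.
# 'predicted_speed' and the 20 % threshold are assumptions for this example.
def flag_disruptions(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    baseline = df.groupby('route_name')['average_speed'].median().rename('baseline_speed')
    df = df.join(baseline, on='route_name')
    df['disrupted'] = df['predicted_speed'] < threshold * df['baseline_speed']
    return df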

The second set of files focuses on the check-ins and check-outs of the riders. We have more days in these files, but we don't have the speed, and there is no easy way to calculate it, because for each rider we would have to find a check-in row and a matching check-out row to estimate the speed on that route.

So I will use the RTA_AVERAGE_SPEED_PER_LINE_BUSES-OPEN dataset, available on Dubai Pulse (the exact URL is in the scraping script below).

Data Preparation:

The first step is to download the data. I use a small Python script to download the files using scraping techniques.

The second step is to make modifications on the bus files. These are the columns of the files:

  • date
  • period
  • route_name
  • route_direction
  • service_type
  • average_speed

I join the 40 files into one dataframe called df_bus, then I group this dataframe by date and route name, excluding the route direction and the service type and aggregating the average speed with the mean. Then I convert the datetime to UNIX format using a lambda function and a parser. The length of this dataframe is 156,866 rows.

The third step is to make modifications to the weather file. I load the JSON into a dataframe called df_weather. Several columns of this dataframe are objects: main, wind, clouds, rain, and weather. With a lambda function I expand these objects into regular columns of the dataframe. After that I drop the original object columns and dt_iso, because I have another column with the datetime in UNIX format. The length of this dataframe is 19,344 rows.

I merge both dataframes into a dataframe called df_total, using the datetime in UNIX format. The merge uses an inner join on that column, and I obtain a new dataframe. The length of this dataframe is 65,822 rows.

Now I have, for each route and hour, the bus information together with the weather information.

Modelling:

The goal is to predict the speed of the bus route, so I will use the average speed as the target of the neural network. It is a regressor, so the last layer uses a linear activation and has a single neuron. The client requests a TensorFlow-based solution, so I implement a Keras regressor model.

I standardize the dataset and make a train/test split. I test several hyperparameters using GridSearchCV.

I encapsulate the model in a function so I can run several tests, modifying the content of the function to check different architectures, such as a DNN, an LSTM, etc. The metrics of each model are in the evaluation step.

Evaluation:

The metric I have selected is the median absolute error. The loss is calculated by taking the median of all absolute differences between the target and the prediction, which makes it robust to outliers. I use cross-validation with 5 folds, calculate the median absolute error in each fold, and report the mean over the 5 folds at the end.
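As a quick toy illustration of the metric (the numbers below are made up), the median of the absolute differences ignores a single large outlier:

import numpy as np
from sklearn.metrics import median_absolute_error

# Toy example with made-up numbers: absolute errors are [1, 1, 12, 0.5],
# so the median absolute error is 1.0 despite the single outlier of 12.
y_true = np.array([10.0, 15.0, 20.0, 25.0])
y_pred = np.array([11.0, 14.0, 32.0, 24.5])
print(median_absolute_error(y_true, y_pred))  # 1.0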

The lower the loss, the better the model. These are the values obtained for the best combination of hyperparameters:
  • 5 Dense layers – median absolute error = 8.21 %
  • 10 Dense layers – median absolute error = 8.16 %
  • 12 Dense layers – median absolute error = 7.48 %
  • 15 Dense layers – median absolute error = 8.01 %
  • 5 LSTM layers – median absolute error = NaN
  • 10 LSTM layers – median absolute error = NaN
  • 15 LSTM layers – median absolute error = NaN

This means we can obtain an average speed prediction with an error of 7.74% using the weather information.

Deployment:

I present the code of the model in Colab. If you don't want to use Colab, you are free to export the Jupyter notebook and modify the data folder.

Here is the Colab notebook with the code.

Data preparation

Scraping of the bus files from the RTA on Dubai Pulse.

This requires the installation of chromedriver under Colab.

In [0]:
# install wget, selenium, and chromium's chromedriver
!pip install wget
!pip install selenium
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

Colab is used with Google Drive to store the datasets.

The code will require a pre-existing folder under your Drive account called '/Colab Notebooks/D2020/Bus/'.
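If the folder does not exist yet, it can be created from the notebook itself; this is a minimal sketch, not part of the original code:

import os
from google.colab import drive

# Sketch: mount Drive and create the expected dataset folder if it is missing.
drive.mount('/content/drive')
busFolder = '/content/drive/My Drive/Colab Notebooks/D2020/Bus/'
os.makedirs(busFolder, exist_ok=True)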

In [0]:
import requests
import logging as log
from lxml import html
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import wget
from google.colab import drive
drive.mount('/content/drive')
datasetFolder='/content/drive/My Drive/Colab Notebooks/D2020/Bus/'

def downloadCSV():
    url = "https://www.dubaipulse.gov.ae/data/allresource/rta-bus/rta_average_speed_per_line_buses-open?organisation=rta&service=rta-bus&count=40&result_per_page=100&as_sfid=AAAAAAXQ-dqMA8yvyDCuTT0GMC-1DU06mAFsPJppMNnkAH0H--Qse7PRib25W90JbA9W4reS3xLHUPX245bxvdszC5WRoOUrM7ULj1s9mogF2S6eDxvzapzC45azH8NR9Popk14%3D&as_fid=9396bc997a5a44ecd2d45b42e3341dee8bf728b5"
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
    driver.set_window_size(1120, 550)
    driver.get(url)
    driver.maximize_window()
    wait = WebDriverWait(driver,10)
    wait.until(EC.visibility_of_element_located((By.CLASS_NAME,"col-md-width-60" )))
    content = driver.page_source
    response = re.findall(re.escape('https://www.dubaipulse.gov.ae/dataset') + '.*', content)
    for resp in response:
        if 'csv' in resp:
            # str.replace takes a literal string, so re.escape is not needed here
            csvFile = resp.replace('">', '')
            wget.download(csvFile, datasetFolder)
            
downloadCSV()
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:20: DeprecationWarning: use options instead of chrome_options

Easy scraping: just select the https links that contain the csv substring.

In [0]:
import pandas as pd
import os
from google.colab import drive
drive.mount('/content/drive')
datasetFolder='/content/drive/My Drive/Colab Notebooks/D2020/'
dataset_weather = "https://datacases.s3.us-east-2.amazonaws.com/datathon-2020/Ernst+and+Young/Dubai+Weather_20180101_20200316.txt"
df_weather = pd.read_json(dataset_weather)
df_weather.to_csv(datasetFolder + 'weather.csv', index=False)
files_bus = os.listdir(datasetFolder + 'Bus/')
df_bus = pd.read_csv(datasetFolder + 'Bus/' + files_bus[0])
for file in files_bus[1:]:
  if 'csv' in file:
    df1 = pd.read_csv(datasetFolder + 'Bus/' + file)
    df_bus = df_bus.append(df1, ignore_index=True)
df_bus.to_csv(datasetFolder + 'bus.csv', index=False)
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Bus dataframe. I group all the elements by date and route name.

Then I create a new UNIX-format datetime column from the date information.

In [0]:
from dateutil import parser
import datetime
df_bus = df_bus.groupby(['date','route_name'], as_index= False)['average_speed'].mean().sort_values(['date', 'route_name'])
df_bus = df_bus.rename(columns = {'date':'dt_iso'})
df_bus['dt'] = df_bus.apply(lambda row : int((parser.parse(row['dt_iso']) - datetime.datetime(1970, 1, 1)).total_seconds()), axis = 1)
df_bus = df_bus.drop('dt_iso', axis=1)
df_bus[:3]
Out[0]:
route_name average_speed dt
0 10 9.072000 1480550400
1 103 15.726500 1480550400
2 104 15.765182 1480550400
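For reference, the per-row parser above could be replaced by a vectorized conversion (a sketch of the same step, applied before dt_iso is dropped):

# Vectorized alternative to the apply/parser above: seconds since the UNIX
# epoch, computed on the whole dt_iso column before it is dropped.
df_bus['dt'] = (pd.to_datetime(df_bus['dt_iso']) - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)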

Weather dataframe. I expand the object columns into the dataframe and remove the original objects.

I rename the columns and drop the unnecessary ones.

In [0]:
df_weather = df_weather.join(df_weather.apply(lambda x: pd.Series(x['main']), axis = 1))
df_weather = df_weather.drop('main', axis=1)
df_weather = df_weather.join(df_weather.apply(lambda x: pd.Series(x['wind']), axis = 1))
df_weather = df_weather.drop('wind', axis=1)
df_weather = df_weather.join(df_weather.apply(lambda x: pd.Series(x['clouds']), axis = 1))
df_weather = df_weather.drop('clouds', axis=1)
df_weather = df_weather.join(df_weather.apply(lambda x: pd.Series(x['rain']), axis = 1))
df_weather = df_weather.drop('rain', axis=1)
df_weather = df_weather.join(df_weather.apply(lambda x: pd.Series(x['weather'][0]), axis = 1))
df_weather = df_weather.drop('weather', axis=1)
df_weather = df_weather.drop('dt_iso', axis=1)
df_weather = df_weather.drop('all', axis=1)
df_weather = df_weather.drop(0, axis=1)
df_weather = df_weather.rename(columns = {'temp':'main_temp','temp_min':'main_temp_min','temp_max':'main_temp_max','feels_like':'main_feels_like','pressure':'main_pressure','speed':'wind_speed','deg':'wind_deg','1h':'rain_1h','3h':'rain_3h','id':'weather_id','description':'weather_description','icon':'weather_icon'})
df_weather[:3]
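The repeated join/drop pattern above could also be replaced with pandas.json_normalize; a minimal sketch, assuming the raw JSON records are fetched with requests (the weather list column would still need the [0] expansion shown above):

import requests

# Sketch: flatten the nested weather fields in one call instead of the
# repeated join/drop pattern. The 'weather' column holds a list of dicts,
# so it still needs the [0] expansion used above.
raw_records = requests.get(dataset_weather).json()
df_weather_alt = pd.json_normalize(raw_records, sep='_')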

Merge both dataframes using an inner join on the datetime column.

In [0]:
df_total = pd.merge(df_weather, df_bus, how = 'inner', on = 'dt')
df_total[:3]
df_total.to_csv(datasetFolder + 'total.csv', index=False)

Compare lengths of dataframes

In [0]:
print('Length of weather dataset:', len(df_weather))
print('Length of bus dataset:', len(df_bus))
print('Length of total dataset:', len(df_total))
Length of weather dataset: 19344
Length of bus dataset: 156866
Length of total dataset: 65822

Modeling

I use a cross-validation scheme with 3 folds and GridSearchCV.

The model function is encapsulated so that different architectures can be tested.

In [1]:
import pandas as pd
from keras.layers import Dense, Dropout, Input, LSTM
from keras.models import Sequential
from keras.layers.merge import add
from keras.optimizers import Adam
from keras.models import Model
from keras import losses
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from keras.wrappers.scikit_learn import KerasRegressor
from numpy.random import uniform
import numpy as np
from keras import backend as K
from tensorflow.python.ops import math_ops
from sklearn.model_selection import KFold,cross_val_score
from numpy import ndarray
from sklearn.model_selection import GridSearchCV
from google.colab import drive
drive.mount('/content/drive')
datasetFolder='/content/drive/My Drive/Colab Notebooks/D2020/'
dataset = pd.read_csv(datasetFolder + 'total.csv')

def LSTMModel(learn_rate=0.01, neurons=12):
    inputs = Input(shape=(xTrain.shape[1], xTrain.shape[2]))
    se1 = LSTM(neurons, activation='relu')(inputs)
    outputs = Dense(1, activation='linear')(se1)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss=losses.mean_absolute_error, optimizer=Adam(lr=learn_rate))
    return model

def DNNModel(learn_rate=0.01, neurons=12):
    inputs = Input(shape=(xTrain.shape[1],))
    se1 = Dense(neurons, activation='relu')(inputs)
    outputs = Dense(1, activation='linear')(se1)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss=losses.mean_absolute_error, optimizer=Adam(lr=learn_rate))
    return model

architecture='DNN'
dataset= dataset._get_numeric_data()
dataset=dataset.fillna(0)
X=dataset.drop(['average_speed'], axis=1)
y=dataset.take([-1], axis=1).values.ravel()
scaler = StandardScaler()
X=scaler.fit_transform(X)
xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2, random_state = 2020)
if architecture=='DNN':
  model = KerasRegressor(build_fn=DNNModel, verbose=1)
if architecture=='LSTM':
  xTrain = xTrain.reshape(xTrain.shape[0], 1, xTrain.shape[1])
  model = KerasRegressor(build_fn=LSTMModel, verbose=1)
learn_rate = [0.001, 0.01]
epochs = [10, 20]
param_grid = dict(epochs=epochs, learn_rate=learn_rate)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1, cv=3, verbose=2)
grid_result = grid.fit(xTrain, yTrain)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
Using TensorFlow backend.
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] epochs=10, learn_rate=0.001 .....................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Epoch 1/10
35104/35104 [==============================] - 2s 70us/step - loss: 12.1043
Epoch 2/10
35104/35104 [==============================] - 2s 61us/step - loss: 8.5589
Epoch 3/10
35104/35104 [==============================] - 2s 61us/step - loss: 8.2046
Epoch 4/10
35104/35104 [==============================] - 2s 62us/step - loss: 8.0736
Epoch 5/10
35104/35104 [==============================] - 2s 62us/step - loss: 8.0207
Epoch 6/10
35104/35104 [==============================] - 2s 61us/step - loss: 8.0004
Epoch 7/10
35104/35104 [==============================] - 2s 61us/step - loss: 7.9866
Epoch 8/10
35104/35104 [==============================] - 2s 61us/step - loss: 7.9799
Epoch 9/10
35104/35104 [==============================] - 2s 60us/step - loss: 7.9740
Epoch 10/10
35104/35104 [==============================] - 2s 60us/step - loss: 7.9714
17553/17553 [==============================] - 0s 26us/step
[CV] ...................... epochs=10, learn_rate=0.001, total=  23.2s
[CV] epochs=10, learn_rate=0.001 .....................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.2s remaining:    0.0s
Epoch 1/10
35105/35105 [==============================] - 2s 65us/step - loss: 12.2686
Epoch 2/10
35105/35105 [==============================] - 2s 63us/step - loss: 8.8189
Epoch 3/10
35105/35105 [==============================] - 2s 63us/step - loss: 8.3627
Epoch 4/10
35105/35105 [==============================] - 2s 63us/step - loss: 8.1731
Epoch 5/10
35105/35105 [==============================] - 2s 63us/step - loss: 8.1110
Epoch 6/10
35105/35105 [==============================] - 2s 65us/step - loss: 8.0841
Epoch 7/10
35105/35105 [==============================] - 2s 63us/step - loss: 8.0611
Epoch 8/10
35105/35105 [==============================] - 2s 63us/step - loss: 8.0470
Epoch 9/10
35105/35105 [==============================] - 2s 64us/step - loss: 8.0352
Epoch 10/10
35105/35105 [==============================] - 2s 66us/step - loss: 8.0253
17552/17552 [==============================] - 1s 29us/step
[CV] ...................... epochs=10, learn_rate=0.001, total=  23.2s
[CV] epochs=10, learn_rate=0.001 .....................................
Epoch 1/10
35105/35105 [==============================] - 2s 65us/step - loss: 14.1641
Epoch 2/10
35105/35105 [==============================] - 2s 63us/step - loss: 8.6018
Epoch 3/10
35105/35105 [==============================] - 2s 63us/step - loss: 8.2072
Epoch 4/10
35105/35105 [==============================] - 2s 63us/step - loss: 8.0955
Epoch 5/10
35105/35105 [==============================] - 2s 64us/step - loss: 8.0428
Epoch 6/10
35105/35105 [==============================] - 2s 63us/step - loss: 8.0128
Epoch 7/10
35105/35105 [==============================] - 2s 63us/step - loss: 7.9908
Epoch 8/10
35105/35105 [==============================] - 2s 62us/step - loss: 7.9801
Epoch 9/10
35105/35105 [==============================] - 2s 63us/step - loss: 7.9688
Epoch 10/10
35105/35105 [==============================] - 2s 64us/step - loss: 7.9614
17552/17552 [==============================] - 0s 27us/step
[CV] ...................... epochs=10, learn_rate=0.001, total=  23.0s
[CV] epochs=10, learn_rate=0.01 ......................................
Epoch 1/10
35104/35104 [==============================] - 2s 64us/step - loss: 8.7764
Epoch 2/10
35104/35104 [==============================] - 2s 62us/step - loss: 8.0405
Epoch 3/10
35104/35104 [==============================] - 2s 61us/step - loss: 8.0274
Epoch 4/10
35104/35104 [==============================] - 2s 62us/step - loss: 8.0082
Epoch 5/10
35104/35104 [==============================] - 2s 61us/step - loss: 8.0055
Epoch 6/10
35104/35104 [==============================] - 2s 61us/step - loss: 8.0017
Epoch 7/10
35104/35104 [==============================] - 2s 61us/step - loss: 7.9909
Epoch 8/10
35104/35104 [==============================] - 2s 62us/step - loss: 8.0072
Epoch 9/10
35104/35104 [==============================] - 2s 62us/step - loss: 7.9912
Epoch 10/10
35104/35104 [==============================] - 2s 66us/step - loss: 7.9907
17553/17553 [==============================] - 1s 29us/step
[CV] ....................... epochs=10, learn_rate=0.01, total=  22.6s
[CV] epochs=10, learn_rate=0.01 ......................................
Epoch 1/10
35105/35105 [==============================] - 2s 68us/step - loss: 8.7036
Epoch 2/10
35105/35105 [==============================] - 2s 66us/step - loss: 8.0617
Epoch 3/10
35105/35105 [==============================] - 2s 65us/step - loss: 8.0488
Epoch 4/10
35105/35105 [==============================] - 2s 65us/step - loss: 8.0370
Epoch 5/10
35105/35105 [==============================] - 2s 62us/step - loss: 8.0329
Epoch 6/10
35105/35105 [==============================] - 2s 61us/step - loss: 8.0203
Epoch 7/10
35105/35105 [==============================] - 2s 62us/step - loss: 8.0213
Epoch 8/10
35105/35105 [==============================] - 2s 61us/step - loss: 8.0234
Epoch 9/10
35105/35105 [==============================] - 2s 61us/step - loss: 8.0189
Epoch 10/10
35105/35105 [==============================] - 2s 61us/step - loss: 8.0180
17552/17552 [==============================] - 0s 27us/step
[CV] ....................... epochs=10, learn_rate=0.01, total=  22.9s
[CV] epochs=10, learn_rate=0.01 ......................................
Epoch 1/10
35105/35105 [==============================] - 2s 62us/step - loss: 8.5827
Epoch 2/10
35105/35105 [==============================] - 2s 61us/step - loss: 8.0265
Epoch 3/10
35105/35105 [==============================] - 2s 61us/step - loss: 8.0093
Epoch 4/10
35105/35105 [==============================] - 2s 61us/step - loss: 7.9931
Epoch 5/10
35105/35105 [==============================] - 2s 62us/step - loss: 7.9763
Epoch 6/10
35105/35105 [==============================] - 2s 61us/step - loss: 7.9627
Epoch 7/10
35105/35105 [==============================] - 2s 60us/step - loss: 7.9705
Epoch 8/10
35105/35105 [==============================] - 2s 61us/step - loss: 7.9608
Epoch 9/10
35105/35105 [==============================] - 2s 60us/step - loss: 7.9649
Epoch 10/10
35105/35105 [==============================] - 2s 61us/step - loss: 7.9608
17552/17552 [==============================] - 0s 26us/step
[CV] ....................... epochs=10, learn_rate=0.01, total=  22.2s
[CV] epochs=20, learn_rate=0.001 .....................................
Epoch 1/20
35104/35104 [==============================] - 2s 62us/step - loss: 12.0080
Epoch 2/20
35104/35104 [==============================] - 2s 60us/step - loss: 8.7657
Epoch 3/20
35104/35104 [==============================] - 2s 62us/step - loss: 8.2866
Epoch 4/20
35104/35104 [==============================] - 2s 62us/step - loss: 8.1024
Epoch 5/20
35104/35104 [==============================] - 2s 63us/step - loss: 8.0441
Epoch 6/20
35104/35104 [==============================] - 2s 63us/step - loss: 8.0194
Epoch 7/20
35104/35104 [==============================] - 2s 62us/step - loss: 8.0048
Epoch 8/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9955
Epoch 9/20
35104/35104 [==============================] - 2s 63us/step - loss: 7.9908
Epoch 10/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9871
Epoch 11/20
35104/35104 [==============================] - 2s 63us/step - loss: 7.9839
Epoch 12/20
35104/35104 [==============================] - 2s 63us/step - loss: 7.9827
Epoch 13/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9780
Epoch 14/20
35104/35104 [==============================] - 2s 63us/step - loss: 7.9785
Epoch 15/20
35104/35104 [==============================] - 2s 64us/step - loss: 7.9755
Epoch 16/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9727
Epoch 17/20
35104/35104 [==============================] - 2s 61us/step - loss: 7.9710
Epoch 18/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9692
Epoch 19/20
35104/35104 [==============================] - 2s 60us/step - loss: 7.9671
Epoch 20/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9674
17553/17553 [==============================] - 1s 29us/step
[CV] ...................... epochs=20, learn_rate=0.001, total=  44.5s
[CV] epochs=20, learn_rate=0.001 .....................................
Epoch 1/20
35105/35105 [==============================] - 2s 64us/step - loss: 11.4584
Epoch 2/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.6010
Epoch 3/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.2296
Epoch 4/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.1089
Epoch 5/20
35105/35105 [==============================] - 2s 63us/step - loss: 8.0672
Epoch 6/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0458
Epoch 7/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0355
Epoch 8/20
35105/35105 [==============================] - 2s 63us/step - loss: 8.0274
Epoch 9/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0236
Epoch 10/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0173
Epoch 11/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0138
Epoch 12/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0108
Epoch 13/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0087
Epoch 14/20
35105/35105 [==============================] - 2s 63us/step - loss: 8.0055
Epoch 15/20
35105/35105 [==============================] - 2s 63us/step - loss: 8.0024
Epoch 16/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0018
Epoch 17/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0022
Epoch 18/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9988
Epoch 19/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9962
Epoch 20/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9958
17552/17552 [==============================] - 0s 27us/step
[CV] ...................... epochs=20, learn_rate=0.001, total=  44.4s
[CV] epochs=20, learn_rate=0.001 .....................................
Epoch 1/20
35105/35105 [==============================] - 2s 64us/step - loss: 11.6671
Epoch 2/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.6389
Epoch 3/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.2119
Epoch 4/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0719
Epoch 5/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0147
Epoch 6/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9853
Epoch 7/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9729
Epoch 8/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9633
Epoch 9/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9558
Epoch 10/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9506
Epoch 11/20
35105/35105 [==============================] - 2s 63us/step - loss: 7.9456
Epoch 12/20
35105/35105 [==============================] - 2s 63us/step - loss: 7.9404
Epoch 13/20
35105/35105 [==============================] - 2s 63us/step - loss: 7.9397
Epoch 14/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9366
Epoch 15/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9337
Epoch 16/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9331
Epoch 17/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9301
Epoch 18/20
35105/35105 [==============================] - 2s 63us/step - loss: 7.9271
Epoch 19/20
35105/35105 [==============================] - 2s 63us/step - loss: 7.9257
Epoch 20/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9237
17552/17552 [==============================] - 0s 26us/step
[CV] ...................... epochs=20, learn_rate=0.001, total=  44.4s
[CV] epochs=20, learn_rate=0.01 ......................................
Epoch 1/20
35104/35104 [==============================] - 2s 63us/step - loss: 8.6074
Epoch 2/20
35104/35104 [==============================] - 2s 62us/step - loss: 8.0392
Epoch 3/20
35104/35104 [==============================] - 2s 63us/step - loss: 8.0155
Epoch 4/20
35104/35104 [==============================] - 2s 61us/step - loss: 8.0052
Epoch 5/20
35104/35104 [==============================] - 2s 61us/step - loss: 7.9955
Epoch 6/20
35104/35104 [==============================] - 2s 61us/step - loss: 7.9971
Epoch 7/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9889
Epoch 8/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9951
Epoch 9/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9840
Epoch 10/20
35104/35104 [==============================] - 2s 63us/step - loss: 7.9779
Epoch 11/20
35104/35104 [==============================] - 2s 61us/step - loss: 7.9799
Epoch 12/20
35104/35104 [==============================] - 2s 61us/step - loss: 7.9792
Epoch 13/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9622
Epoch 14/20
35104/35104 [==============================] - 2s 61us/step - loss: 7.9804
Epoch 15/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9736
Epoch 16/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9635
Epoch 17/20
35104/35104 [==============================] - 2s 61us/step - loss: 7.9654
Epoch 18/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9614
Epoch 19/20
35104/35104 [==============================] - 2s 62us/step - loss: 7.9673
Epoch 20/20
35104/35104 [==============================] - 2s 61us/step - loss: 7.9581
17553/17553 [==============================] - 0s 28us/step
[CV] ....................... epochs=20, learn_rate=0.01, total=  44.2s
[CV] epochs=20, learn_rate=0.01 ......................................
Epoch 1/20
35105/35105 [==============================] - 2s 63us/step - loss: 8.6338
Epoch 2/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0783
Epoch 3/20
35105/35105 [==============================] - 2s 60us/step - loss: 8.0521
Epoch 4/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.0438
Epoch 5/20
35105/35105 [==============================] - 2s 60us/step - loss: 8.0411
Epoch 6/20
35105/35105 [==============================] - 2s 60us/step - loss: 8.0361
Epoch 7/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.0254
Epoch 8/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.0247
Epoch 9/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.0222
Epoch 10/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0172
Epoch 11/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0135
Epoch 12/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0239
Epoch 13/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.0203
Epoch 14/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.0065
Epoch 15/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.0118
Epoch 16/20
35105/35105 [==============================] - 2s 63us/step - loss: 8.0146
Epoch 17/20
35105/35105 [==============================] - 2s 64us/step - loss: 8.0070
Epoch 18/20
35105/35105 [==============================] - 2s 63us/step - loss: 8.0170
Epoch 19/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.0046
Epoch 20/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.0026
17552/17552 [==============================] - 0s 26us/step
[CV] ....................... epochs=20, learn_rate=0.01, total=  44.1s
[CV] epochs=20, learn_rate=0.01 ......................................
Epoch 1/20
35105/35105 [==============================] - 2s 62us/step - loss: 8.5378
Epoch 2/20
35105/35105 [==============================] - 2s 61us/step - loss: 8.0048
Epoch 3/20
35105/35105 [==============================] - 2s 60us/step - loss: 7.9973
Epoch 4/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9848
Epoch 5/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9812
Epoch 6/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9767
Epoch 7/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9742
Epoch 8/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9641
Epoch 9/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9697
Epoch 10/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9629
Epoch 11/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9646
Epoch 12/20
35105/35105 [==============================] - 2s 60us/step - loss: 7.9733
Epoch 13/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9649
Epoch 14/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9638
Epoch 15/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9608
Epoch 16/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9615
Epoch 17/20
35105/35105 [==============================] - 2s 61us/step - loss: 7.9572
Epoch 18/20
35105/35105 [==============================] - 2s 62us/step - loss: 7.9649
Epoch 19/20
35105/35105 [==============================] - 2s 63us/step - loss: 7.9562
Epoch 20/20
35105/35105 [==============================] - 2s 66us/step - loss: 7.9508
17552/17552 [==============================] - 1s 30us/step
[CV] ....................... epochs=20, learn_rate=0.01, total=  44.0s
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:  6.7min finished
Epoch 1/20
52657/52657 [==============================] - 3s 66us/step - loss: 11.0565
Epoch 2/20
52657/52657 [==============================] - 3s 64us/step - loss: 8.3222
Epoch 3/20
52657/52657 [==============================] - 3s 64us/step - loss: 8.1039
Epoch 4/20
52657/52657 [==============================] - 3s 62us/step - loss: 8.0482
Epoch 5/20
52657/52657 [==============================] - 3s 62us/step - loss: 8.0264
Epoch 6/20
52657/52657 [==============================] - 3s 62us/step - loss: 8.0141
Epoch 7/20
52657/52657 [==============================] - 3s 62us/step - loss: 8.0053
Epoch 8/20
52657/52657 [==============================] - 3s 60us/step - loss: 7.9991
Epoch 9/20
52657/52657 [==============================] - 3s 61us/step - loss: 7.9930
Epoch 10/20
52657/52657 [==============================] - 3s 61us/step - loss: 7.9857
Epoch 11/20
52657/52657 [==============================] - 3s 61us/step - loss: 7.9780
Epoch 12/20
52657/52657 [==============================] - 3s 61us/step - loss: 7.9723
Epoch 13/20
52657/52657 [==============================] - 3s 61us/step - loss: 7.9646
Epoch 14/20
52657/52657 [==============================] - 3s 61us/step - loss: 7.9652
Epoch 15/20
52657/52657 [==============================] - 3s 61us/step - loss: 7.9621
Epoch 16/20
52657/52657 [==============================] - 3s 61us/step - loss: 7.9607
Epoch 17/20
52657/52657 [==============================] - 3s 61us/step - loss: 7.9552
Epoch 18/20
52657/52657 [==============================] - 3s 61us/step - loss: 7.9564
Epoch 19/20
52657/52657 [==============================] - 3s 61us/step - loss: 7.9549
Epoch 20/20
52657/52657 [==============================] - 3s 62us/step - loss: 7.9532
Best: -7.982277 using {'epochs': 20, 'learn_rate': 0.001}
-8.001391 (0.047504) with: {'epochs': 10, 'learn_rate': 0.001}
-8.016768 (0.021405) with: {'epochs': 10, 'learn_rate': 0.01}
-7.982277 (0.040320) with: {'epochs': 20, 'learn_rate': 0.001}
-7.984415 (0.067404) with: {'epochs': 20, 'learn_rate': 0.01}
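Since GridSearchCV refits the best configuration on the whole training split by default, the tuned regressor can be reused directly; a minimal sketch:

# refit=True (the default) means grid_result already holds a model retrained
# with the best hyperparameters on the full training split.
best_model = grid_result.best_estimator_
yPredTest = best_model.predict(xTest)
print('Test median absolute error:', metrics.median_absolute_error(yTest, yPredTest))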

Evaluation

This is the best model obtained

In [0]:
import pandas as pd
from keras.layers import Dense, Dropout, Input, LSTM
from keras.models import Sequential
from keras.layers.merge import add
from keras.optimizers import Adam
from keras.models import Model
from keras import losses
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from keras.wrappers.scikit_learn import KerasRegressor
from numpy.random import uniform
import numpy as np
from keras import backend as K
from tensorflow.python.ops import math_ops
from sklearn.model_selection import KFold,cross_val_score
from numpy import ndarray
from google.colab import drive
drive.mount('/content/drive')

datasetFolder='/content/drive/My Drive/Colab Notebooks/D2020/'
dataset = pd.read_csv(datasetFolder + 'total.csv')

def base_model(learn_rate=0.01, neurons=12):
    inputs = Input(shape=(15,))
    se1 = Dense(neurons, activation='relu')(inputs)
    outputs = Dense(1, activation='linear')(se1)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss=losses.mean_absolute_error, optimizer=Adam(lr=learn_rate))
    return model

dataset= dataset._get_numeric_data()
dataset=dataset.fillna(0)
X=dataset.drop(['average_speed'], axis=1)
y=dataset.take([-1], axis=1).values.ravel()

scaler = StandardScaler()
X=scaler.fit_transform(X)

folds=5
result=ndarray((folds,),float)
kf = KFold(n_splits=folds,random_state=None, shuffle=True)
k=0
for train_index, test_index in kf.split(X):
    xTrain, xTest = X[train_index], X[test_index]
    yTrain, yTest = y[train_index], y[test_index]
   
    # build the model with the best hyperparameters found by the grid search
    model = base_model(learn_rate=0.001, neurons=12)
    model.fit(xTrain, yTrain, epochs=20, verbose=1)
    
    yPred = model.predict(xTest)
    value = metrics.median_absolute_error(yTest, yPred)
    print('Median absolute error: {}'.format(value))
    result[k] = value
    k = k + 1

I use the median absolute error over the 5 folds to evaluate the regression.

The lower the loss, the better the model.

In [4]:
print('Mean median absolute error: {}'.format(np.mean(result)))
Mean median absolute error: 7.480626303475118

Future tasks:

Now that we have a prediction of the average speed, we can combine it with the route length to predict when the bus will arrive at each bus stop. This step is left for future work (a rough sketch follows).
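A rough sketch of that future step (route_length_km is a hypothetical input; the datasets used here do not include route geometry):

# Hypothetical sketch: turn a predicted average speed into an estimated travel
# time. route_length_km is assumed to come from route geometry data that is
# not part of the datasets used above.
def estimated_travel_time_minutes(route_length_km, predicted_speed_kmh):
    return 60.0 * route_length_km / predicted_speed_kmh

print(estimated_travel_time_minutes(12.0, 15.0))  # a 12 km route at 15 km/h -> 48 minutes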