Datathon 2020 SolutionsDatathons Solutions

Predicting weather disruption of public transport

The purpose of the project was to build a model that tries to predict potential delays in Dubai’s bus transportation schedule, based on the weather conditions. Additional Extreme Gradient Boosting model was built, which is based on the weather conditions by 5 hours ago, which slightly improved the prediction of a few outliers, although this came at the cost of reducing the prediction accuracy for non-outliers. The overall prediction power was unfortunately unimpressive and could potentially be improved by analyzing the bus transportation data at an hourly level, by including additional data, such as global weather forecasts and traffic estimates, but also by exploring more feature engineering options, for example – seasonality, business activity, hourly segments and outlying flags.

0
votes

1. Business Understanding

As the case intro states, the extreme weather conditions can negatively affect the reliability and capacity of the public transportation by potentially damaging the transport infrastructure. This leads to increased costs - not only for infrastructure and transport vehicles maintenance, but also by causing unexpected delays and decreasing the public transport reliability.

If we can predict such events, we would be able to take precautions - by planning costs in advance, by rescheduling in advance or by choosing (or advising for) alternative transportation methods.

As a side note, public transport disruptions do not affect only the government-provided transportation service - they could be used as a reference; as a signal for all drivers and passengers that there are problems on the road.

2. Data Understanding

Data collection

The data used in this project consists of the following:

  • Dubai Weather data, provided in the case details
  • Dubai public bus transportation data, downloaded from Dubai Pulse website. This includes all 640 files available between 01-01-2018 and 16-03-2020. This period is selected so that it fully covers the period of the weather data.

Data processing

The weather data needs to be cleaned of redundant columns. Visualizations need to be made, in order to look for dependencies, cyclic occurrences, relationships between features.

The bus transportation data needs to be downloaded, read and processed, most importantly - identifying the time difference between two consecutive rides of the same bus line. The aggregated information by date will be joined to the weather data and machine learning models will be applied in order to find relationships between weather conditions and transport duration disruptions. It is assumed that the weather conditions will mostly affect the bus ride duration, therefore the median time interval between bus rides will be evaluated. Median is chosen instead of mean value, because it is more reliable metric when dealing with non-normal data with outliers.

Other public transportation methods are out of scope.

Approach limitations

  • Aggregated vs. granular data

One of the major limitations of the current project approach is that the public transportation data was aggregated at a daily (instead of an hourly) basis and therefore it lacks granularity.

  • Additional data is needed

Looking for additional data could have been helpful, for example any historical infrastructure damages, government spend on road reconstruction by month, public transport rescheduling, etc. Also, weather forecasts for broader region, as well as traffic estimations, could improve the predictins and provide valuable information.

  • Transportation type

Another limitation is that the current project is based on bus transportation data only, while other transport types could also be taken into account. It was assumed that the bus transportation is impacted at highest extent by the weather conditions and by other factors as well (e.g. other drivers on the road), therefore it was preferred over the other transportation types.

  • Seasonal and hourly specifics have not been taken into account

Seasonality has not been implemented into the modeling part. Information about business holidays and hourly segments (day, night, peak, off-peak) could also improve the model results.

3. Data Preparation

3.1. Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import os
import itertools
from urllib.request import urlopen
from pandas.io.json import json_normalize
from datetime import datetime, timedelta
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (10, 5)
%matplotlib inline

3.2. Load raw data

3.2.1. Load weather data

In [2]:
response = urlopen("https://datacases.s3.us-east-2.amazonaws.com/datathon-2020/Ernst+and+Young/Dubai+Weather_20180101_20200316.txt")
rawdata = json.loads(response.read().decode('utf-8', 'replace'))
In [3]:
df = json_normalize(rawdata)
#weather = pd.DataFrame([y for x in df['weather'].values.tolist() for y in x])
In [4]:
for i in range(len(df.weather)):
    if len(df.weather[i]) > 1:
        print(f"{i}, {len(df.weather[i])}")
11221, 2
12875, 2
15466, 2

There are 3 cases with hourly record having more than one weather condition recorded. Only the first one will be kept.

In [5]:
for i in range(len(df.weather)):
    df.weather[i] = df.weather[i][0]
condition = json_normalize(df.weather)
F:\Anaconda\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
In [6]:
df = pd.concat([df, condition], axis=1)
data = df.copy()
In [7]:
data.head(3)
Out[7]:
city_name lat lon weather dt dt_iso timezone main.temp main.temp_min main.temp_max ... main.humidity wind.speed wind.deg clouds.all rain.1h rain.3h id main description icon
0 Dubai 25.07501 55.188761 {'id': 800, 'main': 'Clear', 'description': 's... 1514764800 2018-01-01 00:00:00 +0000 UTC 14400 14.99 13.0 18.0 ... 87 3.1 150 1 NaN NaN 800 Clear sky is clear 01n
1 Dubai 25.07501 55.188761 {'id': 800, 'main': 'Clear', 'description': 's... 1514768400 2018-01-01 01:00:00 +0000 UTC 14400 14.63 13.0 17.0 ... 93 2.6 150 1 NaN NaN 800 Clear sky is clear 01n
2 Dubai 25.07501 55.188761 {'id': 800, 'main': 'Clear', 'description': 's... 1514772000 2018-01-01 02:00:00 +0000 UTC 14400 14.03 12.0 17.0 ... 93 1.5 150 1 NaN NaN 800 Clear sky is clear 01n

3 rows × 22 columns

Creating timestamp:

In [8]:
data = data.reset_index()

for i in range(len(data.dt)):
    data.loc[data.index==i, 'timestamp'] = datetime.fromtimestamp(data.dt[i])
    
data = data.drop(['index', 'dt', 'dt_iso', 'weather', 'id', 'icon', 'city_name', 
                  'lat', 'lon', 'timezone', 'description'], axis=1)

There are missing values in both rain columns - they will be infilled with zeros.

In [9]:
data['rain.1h'].fillna(0, inplace=True)
data['rain.3h'].fillna(0, inplace=True)

3.2.2. Data Exploration

3.2.3.1 Data types
In [10]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19344 entries, 0 to 19343
Data columns (total 13 columns):
main.temp          19344 non-null float64
main.temp_min      19344 non-null float64
main.temp_max      19344 non-null float64
main.feels_like    19344 non-null float64
main.pressure      19344 non-null int64
main.humidity      19344 non-null int64
wind.speed         19344 non-null float64
wind.deg           19344 non-null int64
clouds.all         19344 non-null int64
rain.1h            19344 non-null float64
rain.3h            19344 non-null float64
main               19344 non-null object
timestamp          19344 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(7), int64(4), object(1)
memory usage: 1.9+ MB
In [11]:
data.rename(columns={'main.temp':'temp', 'main.temp_min':'temp_min', 'main.temp_max':'temp_max', 'main.feels_like': 'temp_feel', 
                     'main.pressure':'pressure', 'main.humidity':'humidity', 'clouds.all':'cloudiness',
                     'wind.speed':'wind_speed', 'wind.deg':'wind_deg', 'rain.1h':'rain_1h', 'rain.3h':'rain_3h', 
                     'main':'condition'}, inplace=True)
In [12]:
data.head()
Out[12]:
temp temp_min temp_max temp_feel pressure humidity wind_speed wind_deg cloudiness rain_1h rain_3h condition timestamp
0 14.99 13.0 18.0 13.70 1015 87 3.1 150 1 0.0 0.0 Clear 2018-01-01 02:00:00
1 14.63 13.0 17.0 13.91 1015 93 2.6 150 1 0.0 0.0 Clear 2018-01-01 03:00:00
2 14.03 12.0 17.0 13.89 1016 93 1.5 150 1 0.0 0.0 Clear 2018-01-01 04:00:00
3 13.78 12.0 17.0 13.14 1016 93 2.1 180 1 0.0 0.0 Mist 2018-01-01 05:00:00
4 14.28 12.0 18.0 13.45 1017 93 2.6 160 1 0.0 0.0 Mist 2018-01-01 06:00:00
In [13]:
data.describe()
Out[13]:
temp temp_min temp_max temp_feel pressure humidity wind_speed wind_deg cloudiness rain_1h rain_3h
count 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000
mean 28.102823 26.661868 29.810532 27.684793 1009.416098 52.495089 3.879056 188.379239 13.751964 0.000601 0.003641
std 7.329419 7.580049 7.240840 8.309911 8.017003 21.660800 2.098738 106.258945 26.479664 0.023249 0.068185
min 10.890000 7.000000 12.000000 6.340000 972.000000 4.000000 0.300000 0.000000 0.000000 0.000000 0.000000
25% 22.030000 20.920000 23.840000 20.750000 1003.000000 35.000000 2.315000 100.000000 0.000000 0.000000 0.000000
50% 28.060000 26.670000 30.000000 27.315000 1011.000000 53.000000 3.600000 180.000000 1.000000 0.000000 0.000000
75% 33.880000 32.810000 35.122500 34.890000 1016.000000 69.000000 5.100000 290.000000 19.000000 0.000000 0.000000
max 45.940000 45.360000 48.000000 47.890000 1026.000000 100.000000 14.900000 360.000000 100.000000 2.030000 3.810000
3.2.3.2. Data visualization
In [14]:
f = plt.figure(figsize=(20,25));
gs = f.add_gridspec(4, 2)
ax1 = f.add_subplot(gs[0, 0])
sns.boxplot(data=data, x='condition', y='temp')
plt.title('Temperature Boxplots by Weather Condition')
ax2 = f.add_subplot(gs[0, 1])
sns.boxplot(data=data, x='condition', y='pressure')
plt.title('Atm. Pressure (hPa) by Weather Condition')
ax3 = f.add_subplot(gs[1, 0])
sns.boxplot(data=data, x='condition', y='humidity')
plt.title('Humidity (%) by Weather Condition')
ax4 = f.add_subplot(gs[1, 1])
sns.boxplot(data=data, x='condition', y='cloudiness')
plt.title('Cloudiness (%) by Weather Condition')
ax5 = f.add_subplot(gs[2, 0])
sns.boxplot(data=data, x='condition', y='wind_speed')
plt.title('Wind Speed (m/s) by Weather Condition')
ax6 = f.add_subplot(gs[2, 1])
sns.boxplot(data=data, x='condition', y='wind_deg')
plt.title('Wind Direction (degrees) by Weather Condition')
ax7 = f.add_subplot(gs[3, 0])
sns.boxplot(data=data, x='condition', y='rain_1h')
plt.title('Rain volume in mm (last 1 hour) by Weather Condition')
ax8 = f.add_subplot(gs[3, 1])
sns.boxplot(data=data, x='condition', y='rain_3h')
plt.title('Rain volume in mm (last 3 hours) by Weather Condition')
plt.show();

Several findings:

  • The temperature is highest during Dust weather condition.
  • While it's Clear, the average temperature is relatively high, but the range is also very high. However, highest temperatures (37+ degrees) were achieved only while it's clear, cloudy or dusty. On the other hand, the lowest temperatures were in foggy and misty weather.
  • Lowest pressure is recorded in dusty weather.
  • Low humidity while Clear, Cloudy, Dust and Smoke; highest during Fog, Mist and Haze.
  • High wind spead during Dust, Thunderstorm and Smoke.
  • No rain volume recorded during thunderstorms.
In [15]:
plt.rcParams['figure.figsize'] = (10, 5)
plt.title('Temperature throughout the period');
sns.lineplot(data=data, x='timestamp', y='temp_min', color='b');
sns.lineplot(data=data, x='timestamp', y='temp', color='yellow');
sns.lineplot(data=data, x='timestamp', y='temp_max', color='r');
sns.lineplot(data=data, x='timestamp', y='temp_feel', color='g');
plt.legend(['min','avg','max','feels']);
F:\Anaconda\lib\site-packages\pandas\plotting\_matplotlib\converter.py:103: FutureWarning: Using an implicitly registered datetime converter for a matplotlib plotting method. The converter was registered by pandas on import. Future versions of pandas will require you to explicitly register matplotlib converters.

To register the converters:
	>>> from pandas.plotting import register_matplotlib_converters
	>>> register_matplotlib_converters()
  warnings.warn(msg, FutureWarning)

All temperature features are moving pretty much together, with highest temperatures recorded in July and August and lowest in January and February.

In [16]:
sns.lineplot(data=data, x='timestamp', y='pressure', color='c')
plt.title('Atmospheric pressure');

Generally, the atmospheric pressure varies within tight intervals on monthly basis, but shows a clear seasonal character, being highest in January and lowest in July. Later will be observed a correlation plot between the features that will likely highlight negative correlation between the atmospheric pressure and the temperature.

In [17]:
sns.lineplot(data=data, x='timestamp', y='humidity', color='orange')
plt.title('Humidity');

Although that the volatility of humidity varies greatly, here we can also observe some kind of cyclic behaviour, with highest humidity recorded in January and lowest in May-June.

In [18]:
sns.lineplot(data=data, x='timestamp', y='cloudiness', color='c')
plt.title('Cloudiness');

It's slightly more difficult to look for patterns in the dynamics of cloudiness percentage, but it looks a bit lower in May and June, as well as in some parts of February, August and October.

In [19]:
f=plt.figure(figsize=(20,5))
gs = f.add_gridspec(1,2)
f.add_subplot(gs[0,0])
sns.lineplot(data=data, x='timestamp', y='wind_speed', color='darkcyan')
plt.title('Wind speed');
f.add_subplot(gs[0,1])
sns.lineplot(data=data, x='timestamp', y='wind_deg', color='darkolivegreen')
plt.title('Wind direction (High=N,W; Low=E,S)');
In [20]:
sns.lineplot(data=data, x='timestamp', y='rain_1h', color='y')
sns.lineplot(data=data, x='timestamp', y='rain_3h', color='r')
plt.title('Rain volume (mm) by date')
plt.legend(['last 1 hour', 'last 3 hours']);

Due to Dubai's climate specifics, there is no much evidence of raining. Data for "the last 3 hours" does not exist before the end of 2019 - has either not been observed or not been collected. There is some sort of seasonality, although not so well established:

  • in 2018 the highest rain volumes have been recorded in March and October.
  • in 2019 rains have been recorded in February, April and May, as well as high quantities in November and December.
  • in 2020 the highest figures are in January.

From June to September typically there's no raining.

In [21]:
sns.lineplot(data=data, x='condition', y='rain_1h', color='y');
sns.lineplot(data=data, x='condition', y='rain_3h', color='r');
plt.title('Rain volume by weather condition');
plt.legend(['last 1 hour','last 3 hours']);
In [22]:
corrmatrix = data.iloc[:,:9].corr()
In [23]:
plt.figure(figsize=(10,6))
ax = sns.heatmap(corrmatrix, vmin=-1, vmax=1, annot=True,);
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.title('Correlation matrix')
plt.show();

Several observations:

  • Confirmed is the strong negative correlation between the temperature and the pressure that was mentioned above. Many years ago Gay-Lussac has claimed for the opposite, but it seems that his statement applies only in closed systems with constant gas volume (Source: https://en.wikipedia.org/wiki/Gay-Lussac%27s_law#Pressure-temperature_law)
  • There is somewhat strong negative correlation between the temperature and humidity. However, this negative correlation goes somewhat weaker when calculated between the human perception of the temperature ('feels like temperature') and the humidity
  • There is some evidence of a negative relationship between the wind speed and humidity, meaning that probably the wind carries away the little water drops in the air
  • There is a positive correlatin between the wind speed and wind degrees - meaning that when the wind speed is higher for the higher degree value, and therefore - the winds from North and West more often have higher speed than the ones from South and East (Source: https://snowfence.umn.edu/Components/winddirectionanddegrees.htm)
In [24]:
f = plt.figure(figsize=(20,6))
gs = f.add_gridspec(1,2)
f.add_subplot(gs[0,0])
ax1 = sns.heatmap(data.loc[data.condition=='Clear','temp':'cloudiness'].corr(), vmin=-1, vmax=1, annot=True, cmap='YlGnBu');
bottom, top = ax1.get_ylim()
ax1.set_ylim(bottom + 0.5, top - 0.5)
plt.title('Correlation matrix - Clear condition')

f.add_subplot(gs[0,1])
ax2 = sns.heatmap(data.loc[data.condition=='Thunderstorm','temp':'cloudiness'].corr(), vmin=-1, vmax=1, annot=True, cmap='YlGnBu');
bottom, top = ax2.get_ylim()
ax2.set_ylim(bottom + 0.5, top - 0.5)
plt.title('Correlation matrix - Thunderstorms')
plt.show();

In order to catch what specific changes happen during thunderstorm, we can observe the differences in correlation matrix during clear weather and during thunderstorms. Here are the main differences:

  • In clear weather, there is a strong negative correlation between the temperature and atmospheric pressure. During thunderstorms such correlation almost does not exist.
  • In clear weather, there is some, although weak, positive correlation between the temperature and the wind speed. During thunderstorms, there is no correlation. However, we can see that the human perception of the temperature does not correlate with the wind speed in clear weather, but correlates negatively with the wind speed during thunderstorms, and so does with the wind degrees - meaning that in low temperatures there wind directions are mainly North and West. This is also associated with higher humidity and higher cloudiness percentage.
In [25]:
f = plt.figure(figsize=(20,6))
gs = f.add_gridspec(1,2)
f.add_subplot(gs[0,0])
sns.lineplot(data=data, x='condition', y='pressure')
plt.title('Atmospheric pressure (mm) by weather condition');
gs = f.add_gridspec(1,2)
f.add_subplot(gs[0,1])
sns.lineplot(data=data, x='condition', y='humidity');
plt.title('Humidity (%) by weather condition');
plt.show();

There is some interesting evidence on the two charts above - that both atmospheric pressure and humidity are high during thunderstorms. They could be explored further, in order to find if a thunderstorm could be predicted by the dynamics of pressure and humidity, and probably wind speed/direction. However, this analysis will not be implemented in the current project.

In [26]:
data.tail()
Out[26]:
temp temp_min temp_max temp_feel pressure humidity wind_speed wind_deg cloudiness rain_1h rain_3h condition timestamp
19339 22.85 21.0 25.45 22.19 1015 64 3.6 50 0 0.0 0.0 Clear 2020-03-16 21:00:00
19340 22.35 21.0 24.00 21.17 1015 68 4.6 60 0 0.0 0.0 Clear 2020-03-16 22:00:00
19341 21.52 20.0 23.36 21.43 1015 72 3.1 60 0 0.0 0.0 Clear 2020-03-16 23:00:00
19342 21.04 19.0 23.36 21.19 1014 77 3.1 70 0 0.0 0.0 Clear 2020-03-17 00:00:00
19343 20.31 18.0 23.28 19.83 1014 77 3.6 60 0 0.0 0.0 Clear 2020-03-17 01:00:00

3.2.3. Load public transportation data

The case description does not specify which particular type of public transportation to be analyzed, but due to the tight timelines the best approach would be to focus on the bus transportation - assuming that it should be the most affected type of transportation by weather conditions. Moreover, any bus transportation issues could be a signal for general road transportation ones - this is something that other public transportation types will not be able to hint about.

The Dubai Pulse website API seems to be inaccessible for non-UAE citizens, therefore the only way to acquire the data is by downloading separately each of the 640 CSV files for the bus transportation since 1st January 2018. To reduce the manual effort, Selenium webdriver is going to be set up to download the files one by one - putting a generous 10 seconds wait time between download requests to avoid server overloading. This action took about 5 hours and required a storage size of 60 GB.

In [27]:
from selenium import webdriver
import time
In [28]:
url = 'https://www.dubaipulse.gov.ae/data/allresource/rta-bus/rta_bus_ridership-open?organisation=rta&service=rta-bus&count=1212&result_per_page=1300&as_sfid=AAAAAAVQxFA0BFeFROVV-_FUrwIfaEqwRWpoZA-y-UptqSEqxmERCKYLhWrwqWh3AfDCDdi1moQM5yS3Qjy2NzBMeMFf3DsQYwQOBarG4FRgrDCOeBE9L_Tq7J9m8CMoTDSCXIY%3D&as_fid=ff49229e06fa994326e53390b91e89d1dc5e2954'
In [29]:
driver = webdriver.Firefox(executable_path=r'D:\Library\env\geckodriver.exe')
In [30]:
driver.get(url)

The easiest way to find the position of each 'Download' button is by getting all such buttons by their xpath:

In [167]:
results = driver.find_elements_by_xpath('//div[@class="file-action col-sm-12 col-xs-5"]')
print('Number of results', len(results))
Number of results 1212

There are 1212 bus transportation files uploaded, but some of them are attributed to 2017. Below is the code that downloads the files - executed in several separate runs, using the i-th parameter.

In [170]:
# the code below is commented out in the submitted version of this file
#i=571
#for result in results[i:]:
    #result.click()
    #time.sleep(10)

3.2.3. Explore sample bus transportation data - proof of concept

The purpose of this section is to set up a proof of concept for a later use - explore one of the bus transportation files, observe the data and find potential ways to extract the information that could be useful for the task. The biggest challenge would be find the most efficient way to process 600+ files, having in mind that some of them contain more than a million records. This makes the looping approach undesirable, because iterating through all the files will probably exceed the Datathon article submission deadline!

In [206]:
csv = pd.read_csv(r"D:\Library\Datathon\2020\Bus\bus_ridership_2018-01-16_00-00-00.csv")
In [207]:
def timestamping(date, time):
    datesplit = date.split('-')
    timesplit = time.split(':')
    timestamp = datetime(int(datesplit[0]),int(datesplit[1]),int(datesplit[2]),int(timesplit[0]),int(timesplit[1]),int(timesplit[2]))
    return timestamp
In [237]:
start = datetime.now()
csv['timestamp'] = csv.apply(lambda x: timestamping(x['txn_date'],x['txn_time']), axis=1)
print(f"time: {datetime.now() - start}")
time: 0:00:57.578293

Almost a minute to get the timestamp of the sample file is not great, but definitely better than iterating. The next task is to sort the data and calculate the time difference between consecutive transactions.

In [235]:
csv = csv.sort_values(by=['route_name','end_location','timestamp']).reset_index()
In [238]:
start = datetime.now()
csv['diff_prev'] = csv.timestamp.diff()
print(f"time: {datetime.now() - start}")
time: 0:00:00.035002
In [386]:
csv.head()
Out[386]:
level_0 index txn_type txn_date txn_time start_location end_location route_name start_zone end_zone timestamp diff_prev
0 5276 528179 Check out 2018-01-16 05:29:42 NaN Al Jafiliya Bus Station A 10 Zone 6 Zone 6 2018-01-16 05:29:42 NaT
1 5277 406813 Check out 2018-01-16 05:29:46 NaN Al Jafiliya Bus Station A 10 Zone 6 Zone 6 2018-01-16 05:29:46 00:00:04
2 5278 373183 Check out 2018-01-16 05:29:54 NaN Al Jafiliya Bus Station A 10 Zone 6 Zone 6 2018-01-16 05:29:54 00:00:08
3 5279 395555 Check in 2018-01-16 05:30:28 NaN Al Jafiliya Bus Station A 10 Zone 6 Zone 6 2018-01-16 05:30:28 00:00:34
4 5289 357914 Check out 2018-01-16 05:51:06 NaN Al Jafiliya Bus Station A 10 Zone 6 Zone 6 2018-01-16 05:51:06 00:20:38

The initial approach was to find each transaction that is recorded 30 seconds or more after the previous (or before the following one) and assign a specific (increment) value to each. However, this process did not complete in more than half an hour for just one file and unfortunately the approach was discarded. A quicker one was to identify and get the first occurence of each group and put it into a new data frame.

In [283]:
csv2 = csv.loc[(csv.diff_prev > np.timedelta64(30,'s')) | (csv.diff_prev < np.timedelta64(0,'s')),:]
In [271]:
len(csv)
len(csv2)
Out[271]:
785825
Out[271]:
197461

From the initial 785k records in the file, about as much as 75% of the data was discarded and not used in the project. Now we can calculate the duration between the rides.

In [291]:
csv2['duration'] = (abs(csv2.timestamp.diff(periods=-1)) / np.timedelta64(1,'s'))
F:\Anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

The last record has no duration, therefore needs manual calculation - its timestamp minus the timestamp of the last record from the initial (detailed) data.

In [296]:
csv2.iloc[-1,-1] = (csv.iloc[-1,-2] - csv2.iloc[-1,-3]) / np.timedelta64(1,'s')
In [539]:
csv2.head()
Out[539]:
index txn_type txn_date txn_time start_location end_location route_name start_zone end_zone timestamp diff_prev duration
2 205253 Check in 2020-03-16 17:48:46 Al Bakhit Center 1 Al Bakhit Center 1 10 Zone 5 Zone 5 2020-03-16 17:48:46 08:38:11 5856.0
3 537946 Check out 2020-03-16 19:26:22 NaN Al Bakhit Center 1 10 Zone 5 Zone 5 2020-03-16 19:26:22 01:37:36 10833.0
4 482980 Check out 2020-03-16 22:26:55 NaN Al Bakhit Center 1 10 Zone 5 Zone 5 2020-03-16 22:26:55 03:00:33 4991.0
7 49841 Check in 2020-03-16 23:50:06 Al Bakhit Center 1 Al Bakhit Center 1 10 Zone 5 Zone 5 2020-03-16 23:50:06 01:23:05 21672.0
8 58915 Check in 2020-03-17 05:51:18 Al Bakhit Center 1 Al Bakhit Center 1 10 Zone 5 Zone 5 2020-03-17 05:51:18 06:01:12 1954.0

For the purposes of the project I will only use aggregated data from the bus transportation files.

In [300]:
csv2.duration.describe()
Out[300]:
count    197461.000000
mean       3054.870962
std        9784.843996
min           7.000000
25%         244.000000
50%        1018.000000
75%        1790.000000
max       86382.000000
Name: duration, dtype: float64

Expectedly, the data is non-normal and highly right-skewed. The mean value of duration is three times bigger than the median value, meaning that it is biased and affected by the outliers. The maximum duration (difference between two rides of a single bus line) is almost 24 hours - probably accounting to bus lines riding once per day. The standard deviation is also high - almost 3 hours. The most meaningful piece of the descriptive statistics is the median value, which is approx 17 minutes - this refers to the median time difference between different rides of a single bus line.

In [308]:
sns.distplot(csv2.duration)
plt.title('Ride duration histogram (28-01-2018)');

3.2.4. Read all bus transportation data

Now we have finished the proof of concept and we can try applying it on all the files. This is a huge effort, requiring more than 7 hours of processing time.

In [449]:
ride_duration = pd.DataFrame(columns=['date','count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'])
In [1075]:
# this cell was rerun as blank at the end, before the submission, as the output was too long and irrelevant to the article
files = []
src = "D:/Library/Datathon/2020/Bus/"
i=1

with os.scandir(src) as files:
    for file in files:
        print(f"Processing file: {i}, the time is: {datetime.now()}")
        csv = pd.read_csv(src + file.name)
        
        # sort values, create timestamp, calculate time difference (in seconds) between rides of each bus line
        csv['timestamp'] = csv.apply(lambda x: timestamping(x['txn_date'],x['txn_time']), axis=1)
        csv = csv.sort_values(by=['route_name','end_location','timestamp']).reset_index()
        csv['diff_prev'] = csv.timestamp.diff()
        
        # get the first record in each ride, on each stop, for each bus line
        csv2 = csv.loc[(csv.diff_prev > np.timedelta64(30,'s')) | (csv.diff_prev < np.timedelta64(0,'s')),:].copy()
        
        # calculate ride durations
        csv2['duration'] = (abs(csv2.timestamp.diff(periods=-1)) / np.timedelta64(1,'s'))
        
        # get date and duration descriptive statistics from each provided file
        filedate = file.name.split('_')[2]
        datetime_file = datetime(int(filedate.split('-')[0]), int(filedate.split('-')[1]), int(filedate.split('-')[2]))
        a = []
        b = []
        b.append(csv2.duration.describe().values)
        b = list(itertools.chain.from_iterable(b))
        a.append([datetime_file,b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]])
        
        # save the date and duration descriptive statistics from each file to ride_duration data frame
        ride_duration = pd.concat([ride_duration, pd.DataFrame(a,
                                                               columns=['date','count', 'mean', 'std', 'min', '25%', '50%', 
                                                                        '75%', 'max'])])
        i+=1
In [533]:
ride_duration_backup = ride_duration.copy()
In [534]:
ride_duration.head()
Out[534]:
date count mean std min 25% 50% 75% max
0 2018-01-01 172312.0 3315.272906 10175.650012 31.0 228.0 1027.0 1863.0 86373.0
0 2018-01-02 199509.0 3019.851410 9735.072278 31.0 234.0 1004.0 1744.0 86359.0
0 2018-01-03 198930.0 3017.459257 9718.005803 31.0 239.0 1006.0 1751.0 86349.0
0 2018-01-04 205639.0 3022.186915 9901.432035 31.0 224.0 997.0 1733.0 86330.0
0 2018-01-05 171372.0 3353.207233 10443.294682 29.0 242.0 1041.0 1868.0 86380.0
In [535]:
sns.lineplot(data=ride_duration, x='date', y='50%');
plt.ylabel('median');
plt.title('Median bus ride duration by date');
plt.show();
3.2.4. Combine weather and public transportation data
In [554]:
data['date'] = pd.to_datetime(data.timestamp).apply(lambda x: x.date())
ride_duration['date'] = pd.to_datetime(ride_duration.date).apply(lambda x: x.date())
In [649]:
combined_data = data.join(ride_duration[['date','50%','std']].rename(columns={'50%':'median'}).set_index('date'), on='date')
In [650]:
combined_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19341 entries, 0 to 19340
Data columns (total 17 columns):
temp           19341 non-null float64
temp_min       19341 non-null float64
temp_max       19341 non-null float64
temp_feel      19341 non-null float64
pressure       19341 non-null int64
humidity       19341 non-null int64
wind_speed     19341 non-null float64
wind_deg       19341 non-null int64
cloudiness     19341 non-null int64
rain_1h        19341 non-null float64
rain_3h        19341 non-null float64
condition      19338 non-null object
description    19338 non-null object
timestamp      19341 non-null datetime64[ns]
date           19341 non-null object
median         15334 non-null float64
std            15334 non-null float64
dtypes: datetime64[ns](1), float64(9), int64(4), object(3)
memory usage: 2.5+ MB
In [651]:
combined_data.loc[combined_data['median'].isna()].date.unique()
Out[651]:
array([datetime.date(2018, 1, 10), datetime.date(2018, 1, 11),
       datetime.date(2018, 1, 12), datetime.date(2018, 1, 13),
       datetime.date(2018, 1, 14), datetime.date(2018, 2, 2),
       datetime.date(2018, 2, 10), datetime.date(2018, 2, 18),
       datetime.date(2018, 2, 19), datetime.date(2018, 2, 20),
       datetime.date(2018, 2, 21), datetime.date(2018, 2, 23),
       datetime.date(2018, 2, 25), datetime.date(2018, 2, 26),
       datetime.date(2018, 2, 27), datetime.date(2018, 2, 28),
       datetime.date(2018, 3, 1), datetime.date(2018, 3, 2),
       datetime.date(2018, 3, 3), datetime.date(2018, 3, 4),
       datetime.date(2018, 3, 9), datetime.date(2018, 3, 10),
       datetime.date(2018, 3, 11), datetime.date(2018, 3, 12),
       datetime.date(2018, 3, 13), datetime.date(2018, 3, 18),
       datetime.date(2018, 3, 21), datetime.date(2018, 3, 26),
       datetime.date(2019, 6, 29), datetime.date(2019, 6, 30),
       datetime.date(2019, 7, 1), datetime.date(2019, 7, 2),
       datetime.date(2019, 7, 3), datetime.date(2019, 7, 4),
       datetime.date(2019, 7, 5), datetime.date(2019, 7, 6),
       datetime.date(2019, 7, 7), datetime.date(2019, 7, 8),
       datetime.date(2019, 7, 9), datetime.date(2019, 7, 15),
       datetime.date(2019, 7, 17), datetime.date(2019, 7, 18),
       datetime.date(2019, 7, 19), datetime.date(2019, 7, 23),
       datetime.date(2019, 7, 24), datetime.date(2019, 7, 25),
       datetime.date(2019, 7, 26), datetime.date(2019, 7, 27),
       datetime.date(2019, 7, 28), datetime.date(2019, 7, 29),
       datetime.date(2019, 8, 4), datetime.date(2019, 8, 5),
       datetime.date(2019, 8, 7), datetime.date(2019, 8, 8),
       datetime.date(2019, 8, 10), datetime.date(2019, 8, 17),
       datetime.date(2019, 8, 18), datetime.date(2019, 8, 19),
       datetime.date(2019, 8, 20), datetime.date(2019, 8, 23),
       datetime.date(2019, 8, 29), datetime.date(2019, 8, 30),
       datetime.date(2019, 9, 2), datetime.date(2019, 9, 4),
       datetime.date(2019, 9, 6), datetime.date(2019, 9, 13),
       datetime.date(2019, 9, 23), datetime.date(2019, 9, 24),
       datetime.date(2019, 9, 25), datetime.date(2019, 10, 6),
       datetime.date(2019, 10, 7), datetime.date(2019, 10, 8),
       datetime.date(2019, 10, 9), datetime.date(2019, 10, 10),
       datetime.date(2019, 10, 12), datetime.date(2019, 10, 13),
       datetime.date(2019, 10, 20), datetime.date(2019, 10, 21),
       datetime.date(2019, 10, 22), datetime.date(2019, 10, 23),
       datetime.date(2019, 10, 24), datetime.date(2019, 10, 25),
       datetime.date(2019, 10, 26), datetime.date(2019, 10, 28),
       datetime.date(2019, 10, 29), datetime.date(2019, 10, 30),
       datetime.date(2019, 10, 31), datetime.date(2019, 11, 1),
       datetime.date(2019, 11, 2), datetime.date(2019, 11, 3),
       datetime.date(2019, 11, 4), datetime.date(2019, 11, 5),
       datetime.date(2019, 11, 6), datetime.date(2019, 11, 16),
       datetime.date(2019, 11, 17), datetime.date(2019, 11, 18),
       datetime.date(2019, 11, 19), datetime.date(2019, 11, 20),
       datetime.date(2019, 11, 21), datetime.date(2019, 11, 22),
       datetime.date(2019, 11, 23), datetime.date(2019, 11, 24),
       datetime.date(2019, 11, 25), datetime.date(2019, 11, 26),
       datetime.date(2019, 11, 27), datetime.date(2019, 11, 28),
       datetime.date(2019, 11, 29), datetime.date(2019, 11, 30),
       datetime.date(2019, 12, 1), datetime.date(2019, 12, 2),
       datetime.date(2019, 12, 3), datetime.date(2019, 12, 4),
       datetime.date(2019, 12, 5), datetime.date(2019, 12, 6),
       datetime.date(2019, 12, 7), datetime.date(2019, 12, 11),
       datetime.date(2019, 12, 12), datetime.date(2019, 12, 13),
       datetime.date(2019, 12, 14), datetime.date(2019, 12, 16),
       datetime.date(2019, 12, 17), datetime.date(2019, 12, 19),
       datetime.date(2019, 12, 20), datetime.date(2019, 12, 21),
       datetime.date(2019, 12, 22), datetime.date(2019, 12, 23),
       datetime.date(2019, 12, 24), datetime.date(2019, 12, 25),
       datetime.date(2019, 12, 26), datetime.date(2019, 12, 27),
       datetime.date(2019, 12, 28), datetime.date(2019, 12, 29),
       datetime.date(2019, 12, 30), datetime.date(2020, 1, 1),
       datetime.date(2020, 1, 2), datetime.date(2020, 1, 3),
       datetime.date(2020, 1, 4), datetime.date(2020, 1, 5),
       datetime.date(2020, 1, 6), datetime.date(2020, 1, 7),
       datetime.date(2020, 1, 8), datetime.date(2020, 1, 9),
       datetime.date(2020, 1, 10), datetime.date(2020, 1, 11),
       datetime.date(2020, 1, 14), datetime.date(2020, 1, 15),
       datetime.date(2020, 1, 16), datetime.date(2020, 1, 17),
       datetime.date(2020, 1, 18), datetime.date(2020, 1, 19),
       datetime.date(2020, 1, 20), datetime.date(2020, 1, 22),
       datetime.date(2020, 1, 23), datetime.date(2020, 1, 25),
       datetime.date(2020, 1, 28), datetime.date(2020, 1, 30),
       datetime.date(2020, 1, 31), datetime.date(2020, 2, 1),
       datetime.date(2020, 2, 2), datetime.date(2020, 2, 3),
       datetime.date(2020, 2, 4), datetime.date(2020, 2, 7),
       datetime.date(2020, 2, 11), datetime.date(2020, 2, 13),
       datetime.date(2020, 2, 15), datetime.date(2020, 2, 23),
       datetime.date(2020, 2, 24)], dtype=object)

Unfortunately, bus transportation statistics is indeed not provided for these dates on the Dubai Pulse website. Moreover, the missing dates account for 4007 data records in the weather dataset, which is about 20% of the total. Dropping such a big part of the data is not recommended, therefore the better approach would be to infill the NULLs with the median values for the respective month, and thus, to keep the impact (if any) of the seasonal fluctuations.

In [652]:
combined_data['month'] = combined_data.date.apply(lambda x: str(x)[:7])
In [653]:
monthly_agg = combined_data.groupby('month')['median','std'].median()
In [654]:
combined_data = combined_data.join(monthly_agg.rename(columns={'median':'monthly_median', 'std':'monthly_std'}), on='month')
In [655]:
combined_data.set_index('timestamp', inplace=True)
In [656]:
for i in combined_data.loc[combined_data['median'].isna()].index:
    combined_data.loc[combined_data.index == i, 'median'] = combined_data.loc[combined_data.index == i, 'monthly_median']
    combined_data.loc[combined_data.index == i, 'std'] = combined_data.loc[combined_data.index == i, 'monthly_std']
In [657]:
combined_data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 19341 entries, 2018-01-01 02:00:00 to 2020-03-17 01:00:00
Data columns (total 19 columns):
temp              19341 non-null float64
temp_min          19341 non-null float64
temp_max          19341 non-null float64
temp_feel         19341 non-null float64
pressure          19341 non-null int64
humidity          19341 non-null int64
wind_speed        19341 non-null float64
wind_deg          19341 non-null int64
cloudiness        19341 non-null int64
rain_1h           19341 non-null float64
rain_3h           19341 non-null float64
condition         19338 non-null object
description       19338 non-null object
date              19341 non-null object
median            19341 non-null float64
std               19341 non-null float64
month             19341 non-null object
monthly_median    19341 non-null float64
monthly_std       19341 non-null float64
dtypes: float64(11), int64(4), object(4)
memory usage: 3.6+ MB
In [658]:
combined_data.tail(10)
Out[658]:
temp temp_min temp_max temp_feel pressure humidity wind_speed wind_deg cloudiness rain_1h rain_3h condition description date median std month monthly_median monthly_std
timestamp
2020-03-16 16:00:00 25.79 25.00 27.05 21.53 1014 47 7.7 360 0 0.0 0.0 Clear sky is clear 2020-03-16 1082.0 19287.988736 2020-03 1085.0 21084.159176
2020-03-16 17:00:00 25.13 23.25 27.05 23.13 1014 53 5.1 360 0 0.0 0.0 Clear sky is clear 2020-03-16 1082.0 19287.988736 2020-03 1085.0 21084.159176
2020-03-16 18:00:00 24.56 23.25 27.05 22.78 1015 57 5.1 20 0 0.0 0.0 Clear sky is clear 2020-03-16 1082.0 19287.988736 2020-03 1085.0 21084.159176
2020-03-16 19:00:00 24.34 23.35 25.45 23.13 1015 60 4.6 20 0 0.0 0.0 Clear sky is clear 2020-03-16 1082.0 19287.988736 2020-03 1085.0 21084.159176
2020-03-16 20:00:00 23.80 22.00 25.45 23.14 1015 64 4.1 40 0 0.0 0.0 Clear sky is clear 2020-03-16 1082.0 19287.988736 2020-03 1085.0 21084.159176
2020-03-16 21:00:00 22.85 21.00 25.45 22.19 1015 64 3.6 50 0 0.0 0.0 Clear sky is clear 2020-03-16 1082.0 19287.988736 2020-03 1085.0 21084.159176
2020-03-16 22:00:00 22.35 21.00 24.00 21.17 1015 68 4.6 60 0 0.0 0.0 Clear sky is clear 2020-03-16 1082.0 19287.988736 2020-03 1085.0 21084.159176
2020-03-16 23:00:00 21.52 20.00 23.36 21.43 1015 72 3.1 60 0 0.0 0.0 NaN NaN 2020-03-16 1082.0 19287.988736 2020-03 1085.0 21084.159176
2020-03-17 00:00:00 21.04 19.00 23.36 21.19 1014 77 3.1 70 0 0.0 0.0 NaN NaN 2020-03-17 1085.0 20954.465394 2020-03 1085.0 21084.159176
2020-03-17 01:00:00 20.31 18.00 23.28 19.83 1014 77 3.6 60 0 0.0 0.0 NaN NaN 2020-03-17 1085.0 20954.465394 2020-03 1085.0 21084.159176

As only the last 3 data records have no value for condition, and the several previous have the value 'Clear', we can fill 'Clear' to the last 3 as well. Several columns, including the description, will not be needed and can be dropped.

In [659]:
combined_data.condition.fillna('Clear', inplace=True)
In [667]:
final_data = combined_data.drop(['date', 'description', 'month', 'monthly_median', 'monthly_std'], axis=1)
In [668]:
final_data.head()
Out[668]:
temp temp_min temp_max temp_feel pressure humidity wind_speed wind_deg cloudiness rain_1h rain_3h condition median std
timestamp
2018-01-01 02:00:00 14.99 13.0 18.0 13.70 1015 87 3.1 150 1 0.0 0.0 Clear 1027.0 10175.650012
2018-01-01 03:00:00 14.63 13.0 17.0 13.91 1015 93 2.6 150 1 0.0 0.0 Clear 1027.0 10175.650012
2018-01-01 04:00:00 14.03 12.0 17.0 13.89 1016 93 1.5 150 1 0.0 0.0 Clear 1027.0 10175.650012
2018-01-01 05:00:00 13.78 12.0 17.0 13.14 1016 93 2.1 180 1 0.0 0.0 Mist 1027.0 10175.650012
2018-01-01 06:00:00 14.28 12.0 18.0 13.45 1017 93 2.6 160 1 0.0 0.0 Mist 1027.0 10175.650012

4. Modelling

4.1. Pre-processing

In [670]:
X = final_data.iloc[:,:-2]
y = final_data.iloc[:,-2]

The 'condition' column is a categorical one, therefore needs to be one-hot encoded.

In [669]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
In [672]:
cond_labelenc = LabelEncoder()
X.iloc[:,-1] = cond_labelenc.fit_transform(X.iloc[:,-1])
In [673]:
X.head(2)
Out[673]:
temp temp_min temp_max temp_feel pressure humidity wind_speed wind_deg cloudiness rain_1h rain_3h condition
timestamp
2018-01-01 02:00:00 14.99 13.0 18.0 13.70 1015 87 3.1 150 1 0.0 0.0 0
2018-01-01 03:00:00 14.63 13.0 17.0 13.91 1015 93 2.6 150 1 0.0 0.0 0
In [674]:
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(categories='auto'), [11])], remainder='passthrough')
X = ct.fit_transform(X)
In [677]:
X[0]
Out[677]:
array([1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 0.000e+00, 0.000e+00, 1.499e+01, 1.300e+01, 1.800e+01,
       1.370e+01, 1.015e+03, 8.700e+01, 3.100e+00, 1.500e+02, 1.000e+00,
       0.000e+00, 0.000e+00])
In [689]:
y = np.array(y)
In [709]:
sns.distplot(y);

The distribution of the dependent variable is not normal, so we can improve the models performance if we transform it by log(1+y).

In [710]:
y = np.log1p(y)

Now we can split the data into training and testing sets.

In [711]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

The metric that is going to be used to evaluate the models is the mean absolute error. There will also be a visual representation of the predictions.

In [798]:
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score

4.2. Regression models

The following models will be put into action:

  • Multiple Linear Regression
  • Random Forest Regression
  • Support Vector Regression
  • Extra Gradient Boosting Regressor
4.2.1. Multiple Linear Regression
In [910]:
from sklearn.linear_model import LinearRegression
In [911]:
regressor1 = LinearRegression()
regressor1.fit(X_train, y_train)

y_pred1 = np.expm1(regressor1.predict(X_test))
Out[911]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [912]:
MAE1 = mean_absolute_error(np.expm1(y_test), y_pred1)
print(f"Mean absolute error: {MAE1}")
Mean absolute error: 46.75457318398206
4.2.2. Random Forest Regression

Here we will also apply grid search to find the best parameters for the model

In [701]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
In [900]:
parameters = {'n_estimators': [2, 3, 5, 10, 15, 20, 50, 100, 200, 500, 750, 1000],
              'max_leaf_nodes': [3, 5, 10, 20, 35, 50],
              'random_state': [0]}
grid_search = GridSearchCV(estimator = RandomForestRegressor(),
                           param_grid = parameters,
                           scoring = 'neg_mean_absolute_error',
                           cv = 5,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
print(f"Best MAE: {grid_search.best_score_ * (-1)}")
print(f"Best parameters: {grid_search.best_params_}")
Best MAE: 0.03553088387619551
Best parameters: {'max_leaf_nodes': 5, 'n_estimators': 200, 'random_state': 0}
In [915]:
regressor2 = RandomForestRegressor(n_estimators=200, max_leaf_nodes=5, random_state=0)
regressor2.fit(X_train, y_train)

y_pred2 = np.expm1(regressor2.predict(X_test))
Out[915]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=5,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=200,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)
In [916]:
MAE2 = mean_absolute_error(np.expm1(y_test), y_pred2)
print(f"Mean absolute error: {MAE2}")
Mean absolute error: 45.614449383172776
4.2.3. Support Vector Regression
In [722]:
from sklearn.svm import SVR

The Support Vector models work best if the data is scaled first.

In [729]:
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
sc_y = StandardScaler()
sc_X_train = sc_X.fit_transform(X_train)
sc_y_train = sc_y.fit_transform(y_train.reshape(-1,1))
sc_X_test = sc_X.fit_transform(X_test)
In [890]:
parameters = {'C': [0.5, 1, 5, 10, 20, 50, 100],
              'kernel': ['rbf']}
grid_search = GridSearchCV(estimator = SVR(),
                           param_grid = parameters,
                           scoring = 'neg_mean_absolute_error',
                           cv = 5,
                           n_jobs = -1)
grid_search = grid_search.fit(sc_X_train, sc_y_train)
print(f"Best MAE: {grid_search.best_score_ * (-1)}")
print(f"Best parameters: {grid_search.best_params_}")
F:\Anaconda\lib\site-packages\sklearn\utils\validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Best MAE: 0.35292226606165944
Best parameters: {'C': 20, 'kernel': 'rbf'}
In [892]:
regressor3 = SVR(kernel = 'rbf', C = 20)
regressor3.fit(sc_X_train, sc_y_train)

y_pred3 = regressor3.predict(sc_X_test)
y_pred3 = np.expm1(sc_y.inverse_transform(y_pred3))
F:\Anaconda\lib\site-packages\sklearn\utils\validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Out[892]:
SVR(C=20, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)
In [893]:
MAE3 = mean_absolute_error(np.expm1(y_test), y_pred3)
print(f"Mean absolute error: {MAE3}")
Mean absolute error: 41.307124327395904
4.2.4. Extra Gradient Boosting (XGBoost) Regressor
In [732]:
from xgboost import XGBRegressor
In [956]:
parameters = {'base_score': [0.1, 0.5, 0.7, 1, 2],
              'learning_rate': [0.01, 0.03, 0.1, 0.3],
              'n_estimators': [20, 50, 100, 150, 200, 300, 400, 1000],
              'max_depth': [3, 5, 7, 9]}
grid_search = GridSearchCV(estimator = XGBRegressor(),
                           param_grid = parameters,
                           scoring = 'neg_mean_absolute_error',
                           cv = 5,
                           n_jobs = -1)
#grid_search = grid_search.fit(X_train, y_train)
#print(f"Best MAE: {grid_search.best_score_ * (-1)}")
#print(f"Best parameters: {grid_search.best_params_}")
In [997]:
regressor4 = XGBRegressor(learning_rate=0.02, n_estimators=300, max_depth=9)
regressor4.fit(X_train, y_train)
y_pred4 = np.expm1(regressor4.predict(X_test))
Out[997]:
XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints=None,
             learning_rate=0.02, max_delta_step=0, max_depth=9,
             min_child_weight=1, missing=nan, monotone_constraints=None,
             n_estimators=300, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
             validate_parameters=False, verbosity=None)
In [998]:
MAE4 = mean_absolute_error(np.expm1(y_test), y_pred4)
print(f"Mean absolute error: {MAE4}")
Mean absolute error: 45.40822667772379

5. Evaluation

5.1. Models summary

In [999]:
models_summary = {'Multiple Linear': MAE1, 'Random Forest': MAE2, 'SVR': MAE3, 'XGB': MAE4}
In [1000]:
sns.barplot(x=list(models_summary.keys()), y=list(models_summary.values()), color='r')
plt.title("Mean absolute error by model (the lower, the better)");

The lowest mean absolute error is returned by the SVR model. However, the task is not as trivial as usually - we need not only to find the best fitting model, but also the model that does the best job in predicting the outliers - because they are the ones that are causing the issue described in the business understanding section.

In [1005]:
plt.figure(figsize=(20,10))
sns.lineplot(data=y_test, color='r').set(yscale='log');
sns.lineplot(data=np.log1p(y_pred4), color='b').set(yscale='log');
sns.lineplot(data=np.log1p(y_pred3), color='y').set(yscale='log');
sns.lineplot(data=np.log1p(y_pred2), color='brown').set(yscale='log');
sns.lineplot(data=np.log1p(y_pred1), color='cyan').set(yscale='log');
plt.title('Models summary: predictions vs the test set')
plt.legend(['Test set','XGB prediction','SVM prediction','RF prediction','Linear prediciton']);