In [166]:
import os
import pandas as pd
import numpy as np
import json
import pprint
import seaborn as sns
from pandas.io.json import json_normalize  # deprecated; available as pd.json_normalize in pandas >= 1.0
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import time
import warnings
from statsmodels.formula.api import ols
warnings.filterwarnings('ignore')
Business understanding
The aim of this article is to present a data-driven approach for predicting traffic disruption using weather data. The key point in this endeavor is to define meaningful metrics representing disrupted traffic. Bad weather leads to harsher driving conditions, which affect almost all kinds of public transport, bus transport in particular. Weather also influences social behaviour (e.g. increased passenger flow in buses or more traffic jams on the streets). Therefore, we expect that weather will have some predictive power in determining bus irregularities.
Data understanding
The transport data for this project were obtained from the Roads and Transport Authority (RTA) via DubaiPulse (https://www.dubaipulse.gov.ae/organisation/rta); the weather data were provided by OpenWeather. In particular, the bus routes and bus ridership datasets were used during the analysis phase. The final time period considered in this evaluation is 1 Jan 2019 - 31 Dec 2019.
The weather dataset contains various continuous as well as categorical weather parameters recorded on an hourly basis. The data come from a single weather station in Dubai.
The ridership dataset contains bus passenger actions (check-in and check-out actions) with a timestamp and bus station on a specified route. We have neither vehicle-level data nor timetables available. We will attempt to represent traffic disruption as accurately as possible, even in the absence of some crucial data about bus behaviour.
Weather data preprocessing
The following steps were taken to preprocess weather data:
- load and parse weather data for the period 1 Jan 2018 - 16 Mar 2020
- remove irrelevant components
- investigate correlations, outliers, time trends
- identify a potential predictor: weather.description, with 17 descriptive categories covering most weather conditions except wind
In [304]:
### Open and parse the provided weather data
with open('weather_data.txt','r') as f:
    data = json.load(f)
df_weather = pd.DataFrame()
for i in range(len(data)):
    json_hourly = data[i]
    json_hourly['weather'] = json_hourly['weather'][0]  # take the first weather entry of the hour
    _df = json_normalize(json_hourly)  # flatten the nested json
    df_weather = pd.concat([df_weather, _df])
In [308]:
### View raw dataset
df_weather.describe()
Out[308]:
1. Exclude GPS coordinates, timezone, and city name, because the location is always the same.
2. Exclude dt, as we have the human-readable UTC timestamp anyway.
3. Exclude weather.main and weather.icon, as weather.description contains all this information anyway.
In [309]:
exclude_cols = ['weather.main', 'weather.icon', 'city_name', 'dt', 'lat', 'lon', 'timezone']
df_cleaned = df_weather.drop(exclude_cols, axis='columns')
In [310]:
### These are the meaningful categories for weather.description:
for col in ['weather.description']:
    print(df_weather[col].unique(), len(df_weather[col].unique()))
In [311]:
### Convert the string timestamp to a proper datetime timestamp. Set it as index for convenience.
df_cleaned['dt_iso'] = df_cleaned['dt_iso'].apply(lambda x: pd.to_datetime(x[:-10], infer_datetime_format=True))  # drop the trailing ' +0000 UTC' suffix
df_cleaned = df_cleaned.set_index('dt_iso')
df_cleaned = df_cleaned.sort_index()
### Summary statistics for weather over Jan 2018 - Mar 2020
df_cleaned.describe()
Out[311]:
In [312]:
# Create pickle, in order to load the data faster from now on
#df_cleaned.to_pickle('./weather_data.pkl')
In [2]:
### Only if pickle was already created
df_cleaned = pd.read_pickle('weather_data.pkl')
In [5]:
df_cleaned[['main.temp_min','main.temp_max', 'main.temp']].plot(figsize=(15,5))
Out[5]:
The temperature data is physically reasonable. We can see the seasonal variations.
In [8]:
### Pearson correlation coefficients between weather features
#df_cleaned.corr()
In [7]:
fig, ax = plt.subplots(figsize=(15, 12))
ax = sns.heatmap(df_cleaned.corr(),annot=True)
Temperature and pressure are thermodynamically related, so the strong negative correlation makes sense. We should include only one of the two in the model.
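To make the "keep only one of each strongly correlated pair" rule explicit, a small hypothetical helper (not part of the original notebook; the 0.8 threshold is an assumption) can list the offending pairs:
### Hypothetical helper: flag feature pairs with |Pearson correlation| above a threshold
def correlated_pairs(df, threshold=0.8):
    corr = df.select_dtypes('number').corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 2)))
    return pairs

correlated_pairs(df_cleaned)  # we would expect e.g. ('main.temp', 'main.pressure') to appear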
In [245]:
### Filter weather data only for daytime and aggregate on daily basis.
### This is very important in order to match the aggregated bus ridership data for the modelling stage.
df_cleaned = df_cleaned.between_time('07:00','22:00')
df_cleaned['date'] = df_cleaned.index.date
### Include important, weakly correlated weather variables. This maximizes our information gain from the weather data.
df_agg = df_cleaned.groupby('date').agg({'main.temp': 'mean',
                                         'main.temp_max': 'max',
                                         'main.temp_min': 'min',
                                         'main.humidity': 'mean',
                                         'wind.speed': 'mean'})
### Take the most frequently occurring weather description string as representative for the day
df_agg['weather.description'] = df_cleaned.groupby('date')['weather.description'] \
                                          .agg(lambda x: x.value_counts().index[0])
In [313]:
### Plot the aggregated quantities to confirm weather behaviour still looks fine and we haven't lost information
df_agg[['main.temp']].plot(figsize=(15,5))
Out[313]:
In [314]:
df_agg[['main.humidity']].plot(figsize=(15,5))
Out[314]:
In [315]:
df_agg[['wind.speed']].plot(figsize=(15,5))
Out[315]:
In [248]:
### Create pickle for aggregated weather data
### They are now ready to be joined with traffic data for the model
#df_agg.to_pickle('./weather_data_daily_min_max.pkl')
Ridership data preprocessing
- download all 363 available bus ridership files from DubaiPulse for 1 Jan 2019 - 31 Dec 2019; this is the entire overlap with the weather dataset (the whole of 2019)
- investigate the route information and systematically select a bus line with 2 directions and the same number of bus stops in each (no fewer than 10 stops), for simplicity
- identify and cluster bus stop touchpoints. A touchpoint is the collection of passenger check-in or check-out actions at a specific bus station around a given time; consecutive passenger actions belonging to the same touchpoint must occur within at most 3 minutes of each other. This lets us distinguish simultaneous passenger actions at different stops along the same route (people boarding different buses at the same time). Most importantly, it lets us express traffic disruption with two measures: the time interval between two consecutive buses arriving at a bus stop, and the onboarding duration. (A toy illustration of the clustering rule follows this list.)
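To make the 3-minute rule concrete, here is a toy illustration on made-up timestamps; the preprocessing function further below applies the same diff-and-cumsum trick to the real data:
### Toy illustration of touchpoint clustering (timestamps are made up)
actions = pd.Series(pd.to_datetime(['2019-01-21 08:00:10', '2019-01-21 08:01:00',
                                    '2019-01-21 08:02:30', '2019-01-21 08:20:00',
                                    '2019-01-21 08:21:15']))
# a gap larger than 3 minutes starts a new touchpoint; the cumulative sum yields the touchpoint id
touchpoint_id = (actions.diff() > pd.Timedelta(minutes=3)).cumsum()
print(touchpoint_id.tolist())  # [0, 0, 0, 1, 1] - two touchpoints detected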
Defining disruption
These are a few possible measures describing traffic disruption, listed in order of importance:
- time between buses (TBB) - the time between consecutive bus arrivals at a bus stop on the same route. TBB will increase or become more variable under disruption.
- excessively long onboarding process - the onboarding duration is defined as the time difference between the first and last passenger check-in or check-out action within a bus stop touchpoint. We consider this one of the proxies for bus delays.
- number of passenger actions per bus touchpoint - we expect a busier day, or time of day, with more passengers to lead to more delays. We also expect the length of the onboarding process to correlate with the number of passenger actions.
Our focus in this project will be TBB.
An investigation of the bus routes dataset uncovers that some of the bus lines follow very complex tracks, so aggregating the information for all routes in the subsequent modeling is probably not a good idea. To solve this problem, we decided to systematically check the characteristics of the routes and focus on a suitable bus line which follows the same route in both directions (forward and backward). Line 28 is such a line: it is a popular bus line in the city center and should be fairly representative of bus behaviour.
In [170]:
### Raw data
df_events = pd.read_csv('/Users/romina/Downloads/Bus_Ridership_2019-01-21_00-00-00.csv')
df_events.head()
Out[170]:
In [169]:
### Raw data
df_routes = pd.read_csv('/Users/romina/Downloads/Bus_Routes.csv')
df_routes.head()
Out[169]:
In [316]:
### Find systematically suitable routes which reduce the complexity of the problem using the Bus_Routes.csv file.
df_routes = pd.read_csv('/Users/romina/datathon/Bus_Routes.csv')
routes_ev = df_events['route_name'].unique()
routes = df_routes['route_name'].unique()
for route in routes:
    _df = df_routes[df_routes['route_name']==route]
    dirs = _df['direction'].unique()
    Ndir = len(dirs)
    if Ndir == 2 and route in routes_ev:
        # print(f"Route '{route}' has {Ndir} tracks.")
        if ( len(_df[_df['direction']==dirs[0]]['stop_number'].unique()) ==
             len(_df[_df['direction']==dirs[1]]['stop_number'].unique()) ):
            Nstops = len(_df[_df['direction']==dirs[1]]['stop_number'].unique())
            if Nstops > 10:
                print(f"Route '{route}' has {Ndir} tracks, same number of stops in both directions with {Nstops} stops.")
In [39]:
### Create function to preprocess each day of bus ridership data
### With this function we are able to obtain time between buses (TBB) for all bus stops of route 28
### out of the raw ridership data. Then we can use TBB as a measure of disruption in the model!
def preprocess_daily_file_to_bus_touchpoints(df, filedate):
    # Filter only for the date described in the file name
    df = df[df['txn_date'] == filedate]
    # Convert to timestamp
    df['timestamp'] = pd.to_datetime(df['txn_date'] + ' ' + df['txn_time'], infer_datetime_format=True)
    # Exclude non-relevant and redundant variables
    exclude = ['start_location', 'start_zone', 'end_zone', 'txn_date', 'txn_time']
    df = df.drop(exclude, axis='columns')
    df = df.set_index('timestamp').sort_index()
    df = df.between_time('7:00', '22:00')
    # Just route 28
    df = df[df['route_name']=='28']
    df['dummy'] = 1.0  # counter column, aggregated below to count passenger actions
    df['time'] = df.index
    # Group by "end_location" (= bus stop) and cluster the bus touchpoints,
    # with every passenger action within 3 minutes of each other.
    grouped = df.groupby('end_location')
    df_corrected = pd.DataFrame()
    for name, df_group in grouped:
        # for each bus stop check the time between consecutive passenger actions
        df_group['shifted_time'] = df_group['time'].shift()
        # if > 3 min => start of a new bus touchpoint detected
        df_group['is_new'] = df_group['time'] - df_group['shifted_time'] > timedelta(0, 3*60)
        # cumulative sum to create clusters of passenger actions
        df_group['stop_int'] = df_group['is_new'].cumsum()
        # remove intermediate variables
        df_group = df_group.drop(['shifted_time', 'is_new'], axis=1)
        # unite different bus stops back together
        df_corrected = pd.concat([df_corrected, df_group], axis=0)
    # Aggregate the passenger level data to achieve a bus touchpoint level summary
    df_summary = df_corrected.groupby(by=['stop_int', 'end_location'], as_index=False).agg({'dummy': 'count',
                                                                                            'time': ['min', 'max'],
                                                                                            'route_name': 'last'})
    df_summary.columns = df_summary.columns.map('_'.join)
    df_summary['onboarding_duration'] = (df_summary['time_max'] - df_summary['time_min']).dt.total_seconds()
    df_summary = df_summary.rename(columns={'dummy_count': 'Npassengers',
                                            'time_min': 'onboarding_start',
                                            'time_max': 'onboarding_end',
                                            'route_name_last': 'route_name',
                                            'end_location_': 'end_location'})
    df_summary.drop('stop_int_', axis=1, inplace=True)
    df_summary = df_summary.set_index('onboarding_start').sort_index()
    df_summary = df_summary.reset_index().sort_values(['end_location', 'onboarding_end'])
    # Remove bus touchpoints at the origin (start/end of route) - onboarding duration > 10 minutes
    df_summary = df_summary[df_summary['onboarding_duration'] < 60*10]
    # Calculate TBB
    grouped_bus_stops = df_summary.groupby('end_location')
    df_final = pd.DataFrame()
    for name, df_group in grouped_bus_stops:
        df_group['TBB'] = df_group['onboarding_end'].diff().dt.total_seconds()/60
        df_final = pd.concat([df_final, df_group], axis=0)
    return df_final.dropna(subset=['TBB'], axis=0)
In [40]:
### Create function to get descriptive statistics out of TBB per day
def aggregate_daily_basis(df, filedate):
    data = {'TBB_mean': df['TBB'].mean(),
            'TBB_std': df['TBB'].std(),
            'TBB_median': df['TBB'].median(),
            'TBB_q75': df['TBB'].quantile(.75),
            'TBB_q25': df['TBB'].quantile(.25),
            'onboarding_duration_mean': df[df['onboarding_duration']!=0]['onboarding_duration'].mean()}
    dt = pd.to_datetime(filedate, infer_datetime_format=True)
    return pd.DataFrame(data=data, index=[dt.date()])
In [41]:
### Load, preprocess and aggregate bus data, file by file, for the entire 2019
bus_data_path = '/Users/romina/datathon/2019'
start_time = time.time()
all_files = sorted(os.listdir(bus_data_path))
all_the_data = pd.DataFrame()
for ind, f in enumerate(all_files):
    filedate = f.split('.')[0].split('_')[2]
    filepath = os.path.join(bus_data_path, f)
    df_events = pd.read_csv(filepath)
    preprocessed_data = preprocess_daily_file_to_bus_touchpoints(df_events, filedate)
    agg_data = aggregate_daily_basis(preprocessed_data, filedate)
    all_the_data = pd.concat([all_the_data, agg_data], axis=0)
    if ind % 30 == 0:
        print(f'{round(time.time() - start_time,2)} seconds elapsed so far ({ind}/{len(all_files)} files preprocessed).')
In [43]:
### These are the potential dependent variables that describe traffic disruption
### All TBB related parameters are in minutes, onboarding duration is in seconds
all_the_data.head()
Out[43]:
In [45]:
all_the_data.describe()
Out[45]:
In [129]:
### We can see periodic behaviour. Weekends in Dubai fall on Friday and Saturday, unlike in Europe!
### This can be clearly seen in the data as peaks in the time between buses.
### From the visual investigation, the weekends seem to be the most important factor increasing the time between buses.
### This will be taken into account in the modelling.
all_the_data[[x for x in all_the_data.columns if x != 'onboarding_duration_mean']].plot(figsize=(15,5),marker ='*')
Out[129]:
In [135]:
### Plotting just the data during the working week. We don't observe significant long-term time series trends.
### There seems to be an interesting event during May 2019 with consistently higher TBB metrics.
### To account for possible seasonality within the week, we add a weekend flag and a weekday factor.
all_the_data['weekend_flag'] = ((pd.DatetimeIndex(all_the_data.index).dayofweek).isin([4,5])).astype(int)
all_the_data['weekday'] = pd.DatetimeIndex(all_the_data.index).dayofweek.astype(str)
all_the_data_week = all_the_data[all_the_data['weekend_flag']==0]
all_the_data_week[['TBB_mean','TBB_std','TBB_median','TBB_q75','TBB_q25']].plot(figsize=(15,5),marker ='*')
Out[135]:
In [44]:
### Create a pickle
#all_the_data.to_pickle('./bus_data_daily_5.pkl')
Statistical modelling
- For each route and bus stop within the route, observe the distribution of TBB during the day (between 7am and 10pm). With this we control for the effects of night bus schedules.
- We model the distribution of TBB over the entire day along with weather data aggregated on a daily basis. This allows us to control for the problematic cases when a bus skips a stop (no passengers to get on or off), which may look like a delay but isn't. In a daily distribution over multiple stops of a popular route, such cases are outliers and can be suppressed using the summary metrics of the distribution (see the toy example after this list).
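A toy example (made-up numbers) of why the daily summary metrics suppress such outliers: a single skipped stop inflates one TBB value and drags the mean, but barely moves the median or the interquartile range.
### Toy example: robustness of median/IQR vs. mean (numbers are made up)
tbb = np.array([12, 14, 15, 13, 16, 14, 45.0])  # 45 min likely means a skipped stop, not a delay
print(np.mean(tbb))                                     # ~18.4 min - pulled up by the outlier
print(np.median(tbb))                                   # 14.0 min - robust
print(np.percentile(tbb, 75) - np.percentile(tbb, 25))  # 2.0 min - the IQR is robust too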
In [6]:
# Only if pickle was already created
#weather_data = pd.read_pickle('weather_data_daily_min_max.pkl')
#all_the_data = pd.read_pickle('bus_data_daily_5.pkl')
In [136]:
# Filtering weather data only for 2019
ov_start, ov_end = datetime(2019,1,1).date(), datetime(2020,1,1).date() # get data from the overlap period
weather_data = weather_data[ (weather_data.index >= ov_start) & (weather_data.index < ov_end) ]
### Joining weather data (independent variables) and processed bus ridership data (dependent variables)
model_data = all_the_data.join(weather_data, how = 'left')
model_data.columns = [x.replace(".","_") for x in model_data.columns]
model_data.head()
Out[136]:
Let's fit several regression models! We start with a classical, more robust approach, particularly since we only work with one year of day-level data (about 360 observations). Advanced ML solutions should be considered only if needed.
In [12]:
model = ols("TBB_mean ~ C(weather_description, Treatment(reference = 'sky is clear')) + weekend_flag", data=model_data)
results = model.fit()
results.summary()
Out[12]:
Model 1: We take weather description and weekend flag as independent variables, mean TBB as the dependent variable.
Independent of the weather description, the time between buses is on average 3.96 minutes longer on weekends than on weekdays.
Weekend or not, broken clouds add on average an extra 0.6 minutes to the time between buses compared to clear sky; the weekend flag controls for the weekend effect.
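As a small sketch (not in the original notebook), the effect sizes quoted above can be read programmatically off the fitted statsmodels results, together with 95% confidence intervals:
### Sketch: extract point estimates and 95% CIs from the fitted OLS results
coefs = results.params.to_frame('coef')
ci = results.conf_int()
coefs['ci_low'], coefs['ci_high'] = ci[0], ci[1]
print(coefs.loc['weekend_flag'])  # point estimate ~3.96 min, per the summary above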
In [141]:
### Introduce the interquartile range (IQR) as another metric - a measure of bus irregularity
model_data['TBB_IQR'] = model_data['TBB_q75'] - model_data['TBB_q25']
model = ols("TBB_IQR ~ main_temp_min + main_humidity + wind_speed + weekend_flag", data=model_data)
results = model.fit()
results.summary()
Out[141]:
Model 2: The new information we gain from this model is that wind also has a small effect: a 1 m/s increase in wind speed is associated with a 0.18 minute increase in bus irregularity (the TBB IQR).
In [148]:
### Finally, we augment the TBB_mean model with a weekday predictor which is even more sensitive than the weekend flag
model = ols("TBB_mean ~ C(weather_description, Treatment(reference = 'sky is clear')) + weekday", data=model_data)
results = model.fit()
results.summary()
Out[148]:
Model 3: This is the augmented version of model 1, with the more sensitive weekday predictor instead of the weekend flag. It confirms the conclusions from model 1, but now we also control for possible effects of each weekday. This is the most reasonable model, and it has the highest adjusted R-squared (0.864), i.e. the share of variance explained, adjusted for the number of predictors.
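To back this comparison, the two specifications sharing the same dependent variable (models 1 and 3) can be refit and compared on adjusted R-squared and AIC; a hedged sketch, not part of the original notebook (model 2 predicts TBB_IQR and is not directly comparable):
### Sketch: compare models 1 and 3 (same dependent variable TBB_mean) on fit quality
specs = {'model 1 (weekend_flag)': "TBB_mean ~ C(weather_description, Treatment(reference = 'sky is clear')) + weekend_flag",
         'model 3 (weekday)': "TBB_mean ~ C(weather_description, Treatment(reference = 'sky is clear')) + weekday"}
for name, formula in specs.items():
    res = ols(formula, data=model_data).fit()
    print(f'{name}: adj. R2 = {res.rsquared_adj:.3f}, AIC = {res.aic:.1f}')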
Evaluation
The benefits of linear regression are not only its easy interpretability but also the fact that you only need to split your data into training and test sets - no validation set is needed, since the model is fit analytically rather than tuned. We evaluate model 3 with repeated random train/test splits (Monte Carlo cross-validation): in each iteration, 80% of the observations are assigned to the training set and the remaining 20% to the test set.
In [160]:
def estimate_rmse(prediction, label):
    return np.sqrt(((prediction - label)**2).mean())

### Map rare categories onto more common ones so unseen levels cannot break prediction.
### Note: OpenWeather descriptions use spaces, so the key is 'light rain'
### (the original 'light_rain' would never match and such days would map to NaN).
di = {'sky is clear': 'sky is clear', 'overcast clouds': 'overcast clouds',
      'few clouds': 'few clouds', 'broken clouds': 'broken clouds',
      'dust': 'dust', 'scattered clouds': 'scattered clouds',
      'light rain': 'overcast clouds'}
model_data['refactored_weather_desc'] = model_data['weather_description'].map(di)
counter = 0
rmse_list = []
while counter < 1000:
    train_data = model_data.sample(frac=0.8)
    test_data = model_data.loc[~model_data.index.isin(train_data.index)]
    model = ols("TBB_mean ~ C(refactored_weather_desc, Treatment(reference = 'sky is clear')) + weekday",
                data=train_data)
    results = model.fit()
    test_data['predictions_weather'] = results.predict(test_data)
    rmse_list.append(estimate_rmse(test_data['predictions_weather'], test_data['TBB_mean']))
    counter += 1
In [165]:
avg_rmse = np.mean(rmse_list)
print(f'The average RMSE from 1000 crossvalidation iterations is {round(avg_rmse,2)} minutes.')
In other words, we can predict the average daily TBB in Dubai with an error of 0.71 minutes.
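To put the 0.71 minutes into context, the same procedure can score a naive baseline that always predicts the training-set mean TBB (a hypothetical sanity check, not part of the original analysis):
### Sketch: RMSE of a constant-mean baseline under the same repeated 80/20 splits
baseline_rmses = []
for _ in range(1000):
    train = model_data.sample(frac=0.8)
    test = model_data.loc[~model_data.index.isin(train.index)]
    baseline_rmses.append(estimate_rmse(train['TBB_mean'].mean(), test['TBB_mean']))
print(f'Naive baseline RMSE: {round(np.mean(baseline_rmses), 2)} minutes.')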
Conclusion & Deployment
The model with weather description and weekday provides basic insights into the impact of weather on bus traffic. It can also be used directly to predict the expected time between buses at a higher level - daily. The easy interpretability of this model as well as its higher-level scope are two reasons why such a model has high business value. On the other hand, the conducted analysis makes it clear that the sensitivity of the model is insufficient. We believe this is caused by the daily aggregation it is based on. However, with the limited resources in this particular situation, daily aggregation is the best way to approach the case, because it addresses the two main data insufficiencies:
- There is no guarantee that passengers get on or off at every stop. The bus may simply skip stops where no one wants to board or alight. In such a scenario the bus is in fact still on time, but there are no data to prove it; in the data it looks as if TBB were much longer and there had been delays. Such cases are the outliers in the TBB distribution, and more often than not they do NOT indicate disruption.
- Lack of vehicle-level data (we cannot track the path of each bus separately). We only know that a bus picked up passengers at a specific bus stop. At a given moment in time, several buses serve a given route, each located near a different stop (if that were not the case, people would have to wait a long time). We have neither the number of buses per line nor a vehicle identifier in the dataset - only passenger check-ins and check-outs. Therefore one relevant parameter to extract is exactly the so-called TBB (time between two consecutive buses at a given bus stop).
In order to address these two issues, we decided to work with the daily distribution of the TBB parameter. If we built a model on a more granular level with these data, we would be forced to provide potentially inflated delay times as ground truth. With our approach we intend to capture global variations in the behaviour of TBB potentially caused by weather; a disruption due to weather would, on average, not lead to huge abrupt changes in TBB.
GPS datasets per bus would solve both of the above issues.
Assuming we had such data at our disposal, a model with higher granularity would definitely be more appropriate for production purposes. It would allow us to account for peak vs off-peak traffic in the model, and certainly to make use of weather variations within the day as well.
In summary, we would consider the following further steps:
- Enhance the dataset with GPS data per vehicle.
- Build a joined dataset at hourly (or few-hour) resolution to get finer granularity.
- Add additional routes, not just 28, considering the peculiarities of the different bus line routes.
- Also use the onboarding duration as a measure of traffic disruption, not only TBB.
In [48]:
# plotting distribution of TBB (time between buses) in minutes over the day
plt.figure(figsize=(12,8))
plt.hist(df_summary_day[df_summary_day['end_location']=='Satwa, Square 1']['onboarding_end']
         .diff().dt.total_seconds().values/60, 15)
plt.xlabel('TBB (min)', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Histogram of time between buses for station Satwa, Square 1 on 2019-01-21')
Out[48]:
Archive
Assuming normal operation on that day, we see that TBB has a mean of about 18 minutes and a standard deviation of 8.73 minutes. Under disruption caused by bad weather, the distribution of TBB will probably change; with our model we quantify this effect.
In [57]:
# mean TBB
mean_TBB = np.nanmean(df_summary_day[df_summary_day['end_location']=='Satwa, Square 1']['onboarding_end']
                      .diff().dt.total_seconds().values/60)
# std TBB
std_TBB = np.nanstd(df_summary_day[df_summary_day['end_location']=='Satwa, Square 1']['onboarding_end']
                    .diff().dt.total_seconds().values/60)
mean_TBB, std_TBB
Out[57]:
7 thoughts on “Weather-proof Mobility”
A good approach to the topic - a nice and clean definition of the business understanding. The data analysis is performed with the correct focus, omitting unnecessary segments that would complicate the solution. We are missing the final model/conclusion of what can be observed and used as an outcome of the whole “project”.
Also, I would suggest focusing on a shorter time-series segment (daily patterns) to identify peak/off-peak periods and how they interact with TBB, together with weather conditions. We all know that bus drivers should be professionals, but the majority of “normal” non-bus drivers are not, and they are heavily impacted by distracting sensory inputs (thunderstorms, rain, people cutting in, or even forgetting how to drive when weather conditions change).
Also, I would advise using some additional datasets which were not part of the initial dataset, like aggregated daily traffic estimates on an hourly basis provided by some navigation applications, because that can additionally help with model precision. - I'm adding my last sentence about additional datasets for all teams focusing on this problem, because no one even considered it, and it is something you can always do on any project: focus not only on internal/provided data but find something to augment it.
Hi, romina & bogomil
It's great that you played with the public transport data… And the documentation is well structured.
Here are my comments about the periodic component in the TBB variables. You could take this “seasonal” component into account - e.g. model it, remove it from the data, account for the weather effect, make the forecast, and then reintroduce the periodic component. The reason for splitting the periodic and non-periodic parts (and using different models) is that different factors are responsible for their presence in the data…
Another approach is to add more variables (day of the week), attempting to represent the periodicity…
Nice, clear approach: good work.
That said, have you looked at adding interaction terms in the regression (i.e. weather * weekday)? This might necessitate changing the resolution of the data from daily to half-day or even less.
Also, it might benefit the analysis to look into quantile regression.
Best regards,
Thanks for the comment!
We did explore some models with interaction terms, as we could imagine that different social behavior on weekends vs weekdays could change the way weather influences traffic. From the models that we tried out, we did not discover an interaction effect worth publishing. Indeed, increasing the granularity of the model may very well reveal such effects.
Hi, again
I very much like your work: the considerations related to the data, the interpretation of the outliers, the conclusions, and also the good business understanding.
I have some comments & questions about the final model: looking at the p-values, you kept many non-significant factors in the model - or have I misunderstood something? To reduce the possibility of overfitting, I would remove some of them so that in the end only significant factors remain. And to check the model for overfitting, we should compare R2, adj. R2, RMSE, etc. on both the train and test samples. In the case of the cross-validation you did, we should also do this for the averaged measures of model quality. Also, by using linear regression we impose a particular hypothesis about the type of relation between the factors and the dependent variable. So it would be good to check other models as well, especially non-parametric ones. Nevertheless, I really like what you have done.
Hello Alex, thanks for the detailed feedback!
You are right about the non-significant variables. Regarding model 3, we have 2 categorical variables with multiple levels. Some levels of both variables have p<0.05, therefore we kept them in the model we evaluated. In models 1 and 2, we kept the summary with all coefficients for transparency, but as you pointed out, they should be omitted for the final interpretation of the model.
Indeed, we didn't comment on the potential violation of assumptions and alternative modeling options. Non-parametric models, non-linear models and time series regression models are all valid extensions for adequately describing the relationship between weather and traffic disruptions.