Datathon 2020 Solutions

Weather-proof Mobility

In [166]:
import os
import pandas as pd
import numpy as np
import json
import pprint
import seaborn as sns
from pandas.io.json import json_normalize
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import time
import warnings
from statsmodels.formula.api import ols

warnings.filterwarnings('ignore')

Business understanding

The aim of this article is to present a data-driven approach for predicting traffic disruption using weather data. The key point in this endeavor is to define meaningful metrics representing disrupted traffic. Bad weather leads to harsher driving conditions which affect almost all kinds of public transport, bus transport in particular. Weather also influences social behaviour (e.g. increased passenger flow in buses or more traffic jams on the streets). Therefore, we expect weather to have some predictive power in determining bus irregularities.

Data understanding

The data considered for this project come from the Roads and Transport Authority (RTA) via DubaiPulse (https://www.dubaipulse.gov.ae/organisation/rta), combined with weather data provided by OpenWeather. In particular, bus routes and bus ridership data were used during the analysis phase. The final time period considered in this evaluation is 1 Jan 2019 - 31 Dec 2019.

The weather dataset contains various continuous as well as categorical weather parameters recorded on an hourly basis. The data come from a single weather station in Dubai.

The ridership dataset contains bus passenger actions (check-in and check-out actions) with a timestamp and bus station on a specified route. We have neither vehicle-level data nor timetables available. We will attempt to represent traffic disruption as accurately as possible, even in the absence of some crucial data about bus behaviour.

Weather data preprocessing

The following steps were taken to preprocess weather data:

  • load and parse weather data for the period 1 Jan 2018 - 16 Mar 2020
  • remove irrelevant components
  • investigate correlations, outliers, time trends
  • identify a potential predictor: weather.description, with 17 descriptive categories covering most weather conditions with the exception of wind.
In [304]:
### Open and parse the provided weather data

with open('weather_data.txt','r') as f:
    data = json.load(f)

### Flatten each hourly record into one row (collecting frames and concatenating
### once is much faster than growing the DataFrame inside the loop)
frames = []
for json_hourly in data:
    json_hourly['weather'] = json_hourly['weather'][0]  # unwrap the single-element list
    frames.append(json_normalize(json_hourly))          # flatten the nested json
df_weather = pd.concat(frames)
In [308]:
### View raw dataset
df_weather.describe()
Out[308]:
lat lon dt timezone main.temp main.temp_min main.temp_max main.feels_like main.pressure main.humidity wind.speed wind.deg clouds.all weather.id rain.1h rain.3h
count 1.934400e+04 1.934400e+04 1.934400e+04 19344.0 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 28.000000 85.000000
mean 2.507501e+01 5.518876e+01 1.549582e+09 14400.0 28.102823 26.661868 29.810532 27.684793 1009.416098 52.495089 3.879056 188.379239 13.751964 791.048904 0.415357 0.828706
std 1.291090e-11 2.732107e-11 2.010339e+07 0.0 7.329419 7.580049 7.240840 8.309911 8.017003 21.660800 2.098738 106.258945 26.479664 41.348948 0.456707 0.615405
min 2.507501e+01 5.518876e+01 1.514765e+09 14400.0 10.890000 7.000000 12.000000 6.340000 972.000000 4.000000 0.300000 0.000000 0.000000 200.000000 0.110000 0.130000
25% 2.507501e+01 5.518876e+01 1.532174e+09 14400.0 22.030000 20.920000 23.840000 20.750000 1003.000000 35.000000 2.315000 100.000000 0.000000 800.000000 0.167500 0.380000
50% 2.507501e+01 5.518876e+01 1.549582e+09 14400.0 28.060000 26.670000 30.000000 27.315000 1011.000000 53.000000 3.600000 180.000000 1.000000 800.000000 0.215000 1.000000
75% 2.507501e+01 5.518876e+01 1.566991e+09 14400.0 33.880000 32.810000 35.122500 34.890000 1016.000000 69.000000 5.100000 290.000000 19.000000 800.000000 0.400000 1.000000
max 2.507501e+01 5.518876e+01 1.584400e+09 14400.0 45.940000 45.360000 48.000000 47.890000 1026.000000 100.000000 14.900000 360.000000 100.000000 804.000000 2.030000 3.810000
  1. Exclude GPS coordinates, timezone and city name because the location is always the same.
  2. Exclude dt as we have the human-readable UTC timestamp anyway.
  3. Exclude weather.main and weather.icon as weather.description contains all this info anyway.
In [309]:
exclude_cols = ['weather.main', 'weather.icon', 'city_name', 'dt', 'lat', 'lon', 'timezone']
df_cleaned = df_weather.drop(exclude_cols, axis='columns')
In [310]:
### These are the meaningful categories for weather.description:
for col in ['weather.description']:
    print(df_weather[col].unique(), len(df_weather[col].unique()))
['sky is clear' 'mist' 'haze' 'few clouds' 'broken clouds'
 'scattered clouds' 'light rain' 'overcast clouds' 'fog' 'dust'
 'moderate rain' 'light intensity shower rain' 'thunderstorm'
 'thunderstorm with rain' 'smoke' 'thunderstorm with heavy rain'
 'thunderstorm with light rain'] 17
In [311]:
### Convert the string timestamp to a proper datetime timestamp. Set it as index for convenience.
df_cleaned['dt_iso'] = df_cleaned['dt_iso'].apply(lambda x: pd.to_datetime(x[:-10], infer_datetime_format=True))  # drop the 10-character timezone suffix
df_cleaned = df_cleaned.set_index('dt_iso')
df_cleaned = df_cleaned.sort_index()

### Summary statistics for weather over Jan 2018 - Mar 2020
df_cleaned.describe()
Out[311]:
main.temp main.temp_min main.temp_max main.feels_like main.pressure main.humidity wind.speed wind.deg clouds.all weather.id rain.1h rain.3h
count 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 28.000000 85.000000
mean 28.102823 26.661868 29.810532 27.684793 1009.416098 52.495089 3.879056 188.379239 13.751964 791.048904 0.415357 0.828706
std 7.329419 7.580049 7.240840 8.309911 8.017003 21.660800 2.098738 106.258945 26.479664 41.348948 0.456707 0.615405
min 10.890000 7.000000 12.000000 6.340000 972.000000 4.000000 0.300000 0.000000 0.000000 200.000000 0.110000 0.130000
25% 22.030000 20.920000 23.840000 20.750000 1003.000000 35.000000 2.315000 100.000000 0.000000 800.000000 0.167500 0.380000
50% 28.060000 26.670000 30.000000 27.315000 1011.000000 53.000000 3.600000 180.000000 1.000000 800.000000 0.215000 1.000000
75% 33.880000 32.810000 35.122500 34.890000 1016.000000 69.000000 5.100000 290.000000 19.000000 800.000000 0.400000 1.000000
max 45.940000 45.360000 48.000000 47.890000 1026.000000 100.000000 14.900000 360.000000 100.000000 804.000000 2.030000 3.810000
In [312]:
# Create pickle, in order to load the data faster from now on
#df_cleaned.to_pickle('./weather_data.pkl')
In [2]:
### Only if pickle was already created
df_cleaned = pd.read_pickle('weather_data.pkl')
In [5]:
df_cleaned[['main.temp_min','main.temp_max', 'main.temp']].plot(figsize=(15,5))
[Figure: hourly minimum, mean and maximum temperature, Jan 2018 - Mar 2020]

The temperature data is physically reasonable. We can see the seasonal variations.

In [8]:
### Pearson correlation coefficients between weather features
#df_cleaned.corr()
In [7]:
fig, ax = plt.subplots(figsize=(15, 12))

ax = sns.heatmap(df_cleaned.corr(),annot=True)

Temperature and pressure are thermodynamically related so the strong negative correlation makes sense. We should only include one of those two in the model.
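As a small aside, this kind of redundancy check can be made systematic. The sketch below (not part of the original notebook) lists the feature pairs whose absolute Pearson correlation exceeds a threshold, so that one variable of each pair can be dropped:

### Sketch: list feature pairs with high absolute Pearson correlation,
### to decide which redundant variable of each pair to drop.
corr = df_cleaned.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
high = upper.stack().sort_values(ascending=False)
print(high[high > 0.8])  # candidates for dropping one variable per pair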

In [245]:
### Filter weather data only for daytime and aggregate on daily basis. 
### This is very important in order to match the aggregated bus ridership data for the modelling stage.
df_cleaned = df_cleaned.between_time('07:00','22:00')

df_cleaned['date'] = df_cleaned.index.date

### Include important, weakly correlated weather variables. This maximizes our information gain from the weather data.
df_agg = df_cleaned.groupby('date').agg({'main.temp':'mean',
                                         'main.temp_max':'max',
                                         'main.temp_min':'min',
                                         'main.humidity':'mean',
                                         'wind.speed':'mean'})

### Take the most frequently occurring weather description string as representative for the day
df_agg['weather.description'] = df_cleaned.groupby('date')['weather.description'] \
                                          .agg(lambda x: x.value_counts().index[0])
In [313]:
### Plot the aggregated quantities to confirm weather behaviour still looks fine and we haven't lost information
df_agg[['main.temp']].plot(figsize=(15,5))
[Figure: daily mean temperature after aggregation]
In [314]:
df_agg[['main.humidity']].plot(figsize=(15,5))
[Figure: daily mean humidity after aggregation]
In [315]:
df_agg[['wind.speed']].plot(figsize=(15,5))
[Figure: daily mean wind speed after aggregation]
In [248]:
### Create pickle for aggregated weather data
### They are now ready to be joined with traffic data for the model
#df_agg.to_pickle('./weather_data_daily_min_max.pkl')

Ridership data preprocessing

  • download all 363 available bus ridership files from DubaiPulse, covering 1 Jan 2019 - 31 Dec 2019 (the timeframe that overlaps with the weather dataset)
  • investigate the route information and systematically select a bus line with 2 directions and the same number of bus stops in each direction (more than 10 stops) for simplicity
  • identify and cluster bus stop touchpoints - a touchpoint is defined as the collection of passenger check-in or check-out actions at a specific bus station around a given time. Consecutive passenger actions belonging to the same bus touchpoint must happen within at most 3 minutes of each other. This lets us distinguish simultaneous passenger actions at different stops along the same route (people getting on different buses at the same time). Most importantly, it allows us to express traffic disruption with two measures: the time interval between two consecutive buses arriving at a bus stop, and the onboarding duration. (A toy illustration of the clustering rule follows this list.)
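Before diving into the real data, here is a minimal sketch (toy timestamps, independent of the actual preprocessing code further below) of the 3-minute clustering rule and the resulting TBB computation:

### Toy illustration of the 3-minute touchpoint clustering rule
import pandas as pd
from datetime import timedelta

actions = pd.Series(pd.to_datetime([
    '2019-01-21 08:00:05', '2019-01-21 08:00:40', '2019-01-21 08:01:30',  # bus 1
    '2019-01-21 08:19:10', '2019-01-21 08:20:00',                         # bus 2
]))
is_new = actions.diff() > timedelta(minutes=3)  # a gap of more than 3 min starts a new touchpoint
touchpoint_id = is_new.cumsum()                 # [0, 0, 0, 1, 1] -> two touchpoints
onboarding_end = actions.groupby(touchpoint_id).max()
print(onboarding_end.diff().dt.total_seconds() / 60)  # TBB in minutes (NaN for the first bus)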

Defining disruption

These are a few of the possible measures describing traffic disruption listed in order of importance:

  • time between buses (TBB) - the time between consecutive bus arrivals at a bus stop of the same route. TBB will increase, or vary more strongly, when traffic is disrupted.
  • excessively long onboarding process - the onboarding duration is defined as the time difference between the first and last passenger check-in or check-out action within a bus stop touchpoint. We consider this one of the proxies for bus delays.
  • number of passenger actions per bus touchpoint - we expect a busier day, or time of day, with more passengers to lead to more delays. We also expect the length of the onboarding process to be correlated with the number of passenger actions.

Our focus in this project will be TBB.

An investigation of the bus routes dataset reveals that some bus lines follow very complex tracks, so aggregating the information for all routes in the subsequent modelling is probably not a good idea. To solve this problem we decided to check the characteristics of the routes systematically. We focused on a suitable bus line which follows the same route in both directions (forward and backward). Line 28 is such a line: a popular bus route in the centre of the city which should be rather representative of bus behaviour.

In [170]:
### Raw data
df_events = pd.read_csv('/Users/romina/Downloads/Bus_Ridership_2019-01-21_00-00-00.csv')
df_events.head()
Out[170]:
txn_type txn_date txn_time start_location end_location route_name start_zone end_zone
0 Check in 2019-01-20 15:56:20 Naif Intersection 1 Naif Intersection 1 X13 Zone 5 Zone 5
1 Check out 2019-01-21 07:28:50 NaN Deira City Center Bus Station A 22 Zone 5 Zone 5
2 Check out 2019-01-21 20:18:44 NaN Mirdiff City Centre F10 Zone 4 Zone 4
3 Check in 2019-01-21 19:24:48 ADCB Metro Station Seaside ADCB Metro Station Seaside 28 Zone 6 Zone 6
4 Check out 2019-01-21 12:03:46 NaN Oud Metha Metro Station 2 E16 Zone 0 Zone 0
In [169]:
### Raw data
df_routes = pd.read_csv('/Users/romina/Downloads/Bus_Routes.csv')
df_routes.head()
Out[169]:
route_name route_type direction stop_name stop_number
0 10 Urban GSBS1 -> QZBSB1 Gold Souq Bus Station 1 0
1 10 Urban GSBS1 -> QZBSB1 Naif Intersection 1 1
2 10 Urban GSBS1 -> QZBSB1 Naif, Intersection 1 1
3 10 Urban GSBS1 -> QZBSB1 Burj Nahar, Intersection 1 2
4 10 Urban GSBS1 -> QZBSB1 Al Nakhal 1 1 3
In [316]:
### Systematically find suitable routes which reduce the complexity of the problem, using the Bus_Routes.csv file.
df_routes = pd.read_csv('/Users/romina/datathon/Bus_Routes.csv')

routes_ev = df_events['route_name'].unique()
routes = df_routes['route_name'].unique()

for route in routes:
    _df = df_routes[df_routes['route_name']==route]
    dirs = _df['direction'].unique()
    Ndir = len(dirs)
    if Ndir == 2 and route in routes_ev:
        # number of distinct stops in each direction
        Nstops_fwd = len(_df[_df['direction']==dirs[0]]['stop_number'].unique())
        Nstops_bwd = len(_df[_df['direction']==dirs[1]]['stop_number'].unique())
        if Nstops_fwd == Nstops_bwd and Nstops_bwd > 10:
            print(f"Route '{route}' has {Ndir} tracks, same number of stops in both directions with {Nstops_bwd} stops.")
Route '28' has 2 tracks, same number of stops in both directions with 22 stops.
Route '32C' has 2 tracks, same number of stops in both directions with 35 stops.
Route '83' has 2 tracks, same number of stops in both directions with 31 stops.
Route 'F47' has 2 tracks, same number of stops in both directions with 50 stops.
Route 'N55' has 2 tracks, same number of stops in both directions with 49 stops.
In [39]:
### Create function to preprocess each day of bus ridership data
### With this function we are able to obtain time between buses (TBB) for all bus stops of route 28
### out of the raw ridership data. Then we can use TBB as a measure of disruption in the model!

def preprocess_daily_file_to_bus_touchpoints(df, filedate):
    # Filter only for the date described in the file name
    df = df[df['txn_date'] == filedate]
    
    # Convert to timestamp
    df['timestamp'] = pd.to_datetime( df['txn_date'] + ' ' + df['txn_time'], infer_datetime_format=True)
    
    # Exclude non-relevant and redundant variables
    exclude = ['start_location', 'start_zone', 'end_zone', 'txn_date', 'txn_time']
    df = df.drop(exclude,axis='columns')
    df = df.set_index('timestamp').sort_index()
    df = df.between_time('7:00', '22:00')

    # Just route 28
    df = df[df['route_name']=='28']
    df['dummy'] = 1.0 # helper column - counted later to obtain passengers per touchpoint
    df['time'] = df.index
    
    # Group by "end_location"(=bus stop) and cluster the bus touchpoints 
    # with every passenger action within 3 minutes of each other.
    grouped = df.groupby('end_location')

    df_corrected = pd.DataFrame()

    for name,df_group in grouped:
    
        # for each bus stop check the time between consecutive passenger actions
        df_group['shifted_time'] = df_group['time'].shift()
        # if > 3 min => start of new bus touchpoint detected
        df_group['is_new'] = df_group['time'] - df_group['shifted_time']  > timedelta(0,3*60)
        # cumulative sum to create cluster of passenger actions
        df_group['stop_int'] = df_group['is_new'].cumsum()
        # remove intermediate variables
        df_group = df_group.drop(['shifted_time','is_new'],axis=1)
    
        # unite different bus stops back together
        df_corrected = pd.concat([df_corrected, df_group], axis=0)
        
    # Aggregate the passenger level data to achieve bus touchpoint level summary
    df_summary = df_corrected.groupby(by=['stop_int','end_location'], as_index=False).agg({'dummy':'count',
                                                                 'time': ['min', 'max'],
                                                                 'route_name': 'last'})
    df_summary.columns = df_summary.columns.map('_'.join)
    df_summary['onboarding_duration'] = (df_summary['time_max'] - df_summary['time_min']).dt.total_seconds()
    df_summary = df_summary.rename(columns = {'dummy_count':'Npassengers',
                                              'time_min':'onboarding_start',
                                              'time_max':'onboarding_end',
                                              'route_name_last':'route_name',
                                              'end_location_':'end_location'})
    df_summary.drop('stop_int_', axis=1, inplace = True)
    df_summary = df_summary.set_index('onboarding_start').sort_index()
    df_summary = df_summary.reset_index().sort_values(['end_location','onboarding_end'])
    
    # Remove bus touchpoints with onboarding duration over 10 minutes -
    # these occur at the origin (start/end of the route)
    df_summary = df_summary[df_summary['onboarding_duration']<60*10]
    
    # Calculate TBB 
    grouped_bus_stops = df_summary.groupby('end_location')

    df_final = pd.DataFrame()

    for name,df_group in grouped_bus_stops:
        df_group['TBB'] = df_group['onboarding_end'].diff().dt.total_seconds()/60
        df_final = pd.concat([df_final, df_group], axis=0)
        
    return df_final.dropna(subset=['TBB'],axis=0)
In [40]:
### Create function to get descriptive statistics out of TBB per day 
def aggregate_daily_basis(df, filedate):
    data = {'TBB_mean': df['TBB'].mean(),
            'TBB_std': df['TBB'].std(),
            'TBB_median': df['TBB'].median(),
            'TBB_q75': df['TBB'].quantile(.75),
            'TBB_q25': df['TBB'].quantile(.25),
            'onboarding_duration_mean': df[df['onboarding_duration']!=0]['onboarding_duration'].mean()}
    
    dt =  pd.to_datetime(filedate, infer_datetime_format=True)
    
    return pd.DataFrame(data=data,index=[dt.date()])
In [41]:
### Load, preprocess and aggregate bus data, file by file, for the entire 2019
bus_data_path = '/Users/romina/datathon/2019'

start_time = time.time()
all_files = sorted(os.listdir(bus_data_path))

all_the_data = pd.DataFrame()

for ind,f in enumerate(all_files):
    filedate = f.split('.')[0].split('_')[2]
    filepath = os.path.join(bus_data_path, f)
    df_events = pd.read_csv(filepath)
    
    preprocessed_data = preprocess_daily_file_to_bus_touchpoints(df_events, filedate)   
    agg_data = aggregate_daily_basis(preprocessed_data, filedate)
    
    all_the_data = pd.concat([all_the_data, agg_data], axis=0)
    if ind%30 == 0:
        print(f'{round(time.time() - start_time,2)} seconds elapsed so far ({ind}/{len(all_files)} files preprocessed).')
2.16 seconds elapsed so far (0/363 files preprocessed).
77.43 seconds elapsed so far (30/363 files preprocessed).
155.23 seconds elapsed so far (60/363 files preprocessed).
229.15 seconds elapsed so far (90/363 files preprocessed).
305.99 seconds elapsed so far (120/363 files preprocessed).
376.73 seconds elapsed so far (150/363 files preprocessed).
446.87 seconds elapsed so far (180/363 files preprocessed).
515.39 seconds elapsed so far (210/363 files preprocessed).
588.51 seconds elapsed so far (240/363 files preprocessed).
663.73 seconds elapsed so far (270/363 files preprocessed).
742.31 seconds elapsed so far (300/363 files preprocessed).
817.66 seconds elapsed so far (330/363 files preprocessed).
891.4 seconds elapsed so far (360/363 files preprocessed).
In [43]:
### These are the potential dependent variables that describe traffic disruption
### All TBB related parameters are in minutes, onboarding duration is in seconds
all_the_data.head()
Out[43]:
TBB_mean TBB_std TBB_median TBB_q75 TBB_q25 onboarding_duration_mean
2019-01-01 19.262274 14.953100 15.316667 21.916667 11.600000 61.064315
2019-01-02 17.730909 11.201662 15.183333 20.166667 11.883333 58.625152
2019-01-03 17.479218 10.409402 15.716667 20.183333 12.100000 55.018007
2019-01-04 21.168484 11.627541 19.400000 23.000000 16.000000 55.028914
2019-01-05 21.307247 14.492919 18.200000 23.308333 14.183333 55.664688
In [45]:
all_the_data.describe()
Out[45]:
TBB_mean TBB_std TBB_median TBB_q75 TBB_q25 onboarding_duration_mean
count 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000
mean 18.946133 12.403297 16.435744 21.759011 12.766322 58.278772
std 1.921432 1.430925 1.792802 2.202220 1.559058 2.935356
min 16.972585 9.811547 14.516667 19.325000 10.833333 51.334773
25% 17.548269 11.309576 15.116667 20.216667 11.722917 56.299921
50% 17.999851 12.084714 15.491667 20.833333 11.987500 58.114551
75% 20.986245 13.187111 18.229167 22.779167 14.100000 60.267516
max 26.048686 18.084149 23.333333 32.087500 18.754167 66.734878
In [129]:
### We can see periodic behaviour. Weekends in Dubai fall on Friday and Saturday, unlike in Europe!
### This can be clearly seen in the data as peaks in the time between buses.
### From this visual investigation, the weekend seems to be the most important factor increasing time between buses.
### This will be taken into account in the modelling.

all_the_data[[x for x in all_the_data.columns if x != 'onboarding_duration_mean']].plot(figsize=(15,5),marker ='*')
[Figure: daily TBB summary statistics over 2019; weekly peaks correspond to weekends]
In [135]:
### Plotting only the working days. We don't observe significant long-term time series trends.
### There seems to be an interesting event during May 2019 with consistently higher TBB metrics.

### Flag the weekend days (Friday = 4, Saturday = 5) to account for weekly seasonality
all_the_data['weekend_flag'] = ((pd.DatetimeIndex(all_the_data.index).dayofweek).isin([4,5])).astype(int)
all_the_data['weekday'] = pd.DatetimeIndex(all_the_data.index).dayofweek.astype(str)

all_the_data_week = all_the_data[all_the_data['weekend_flag']==0]
all_the_data_week[['TBB_mean','TBB_std','TBB_median','TBB_q75','TBB_q25']].plot(figsize=(15,5),marker ='*')
[Figure: TBB summary statistics over 2019, working days only]
In [44]:
### Create a pickle
#all_the_data.to_pickle('./bus_data_daily_5.pkl')

Statistical modelling

  • For each route and bus stop within the route, observe the distribution of TBB during the day (between 7am and 10pm). With this we control for the effects of night bus schedules.
  • We model the distribution of TBB over the entire day, joined with weather data aggregated on a daily basis. This allows us to control for the problematic cases where a bus skips a stop (because no passengers get on or off), which may look like delays but are not. In a daily distribution over multiple stops of a popular route, such cases are outliers and can be suppressed using the summary metrics of the distribution (see the toy illustration below).
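The outlier-suppression argument can be made concrete with a toy example (made-up numbers, not real data): a single skipped stop doubles one TBB value, which pulls the daily mean upwards while the median and IQR barely move.

### Toy illustration: one skipped stop doubles a single TBB value
tbb = pd.Series([15.0, 16.0, 14.0, 15.0, 17.0, 16.0, 15.0, 30.0])  # last value: skipped stop
print(tbb.mean())                              # 17.25 - noticeably inflated by the outlier
print(tbb.median())                            # 15.5  - robust
print(tbb.quantile(.75) - tbb.quantile(.25))   # IQR   - also robust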
In [6]:
# Only if pickle was already created
#weather_data = pd.read_pickle('weather_data_daily_min_max.pkl')
#all_the_data = pd.read_pickle('bus_data_daily_5.pkl')
In [136]:
# Filtering weather data only for 2019
ov_start, ov_end = datetime(2019,1,1).date(), datetime(2020,1,1).date() # get data from the overlap period 
weather_data = weather_data[ (weather_data.index >= ov_start) & (weather_data.index < ov_end) ]

### Joining weather data (independent variables) and processed bus ridership data (dependent variables)
model_data = all_the_data.join(weather_data, how = 'left')
model_data.columns = [x.replace(".","_") for x in model_data.columns]

model_data.head()
Out[136]:
TBB_mean TBB_std TBB_median TBB_q75 TBB_q25 onboarding_duration_mean weekend_flag weekday main_temp main_temp_max main_temp_min main_humidity wind_speed weather_description
2019-01-01 19.262274 14.953100 15.316667 21.916667 11.600000 61.064315 0 1 22.857500 27.00 15.00 54.6875 2.989375 sky is clear
2019-01-02 17.730909 11.201662 15.183333 20.166667 11.883333 58.625152 0 2 22.658125 27.18 15.00 51.0625 2.605000 overcast clouds
2019-01-03 17.479218 10.409402 15.716667 20.183333 12.100000 55.018007 0 3 24.063125 30.00 17.00 49.7500 3.248125 overcast clouds
2019-01-04 21.168484 11.627541 19.400000 23.000000 16.000000 55.028914 1 4 23.510625 28.00 15.97 64.5625 4.661875 scattered clouds
2019-01-05 21.307247 14.492919 18.200000 23.308333 14.183333 55.664688 1 5 22.298125 26.59 13.00 63.6250 2.557500 sky is clear

Let's fit several regression models! We would like to start with a classical, more robust approach, particularly since we only work with one year of day-level data (363 observations). Advanced ML solutions should be considered only if needed.

In [12]:
model = ols("TBB_mean ~ C(weather_description, Treatment(reference = 'sky is clear')) + weekend_flag", data=model_data)
results = model.fit()
results.summary()
Out[12]:
OLS Regression Results
Dep. Variable: TBB_mean R-squared: 0.865
Model: OLS Adj. R-squared: 0.863
Method: Least Squares F-statistic: 325.8
Date: Mon, 18 May 2020 Prob (F-statistic): 2.55e-150
Time: 17:03:56 Log-Likelihood: -387.76
No. Observations: 363 AIC: 791.5
Df Residuals: 355 BIC: 822.7
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 17.7866 0.048 373.076 0.000 17.693 17.880
C(weather_description, Treatment(reference='sky is clear'))[T.broken clouds] 0.6036 0.242 2.493 0.013 0.128 1.080
C(weather_description, Treatment(reference='sky is clear'))[T.dust] 0.0275 0.229 0.120 0.904 -0.423 0.478
C(weather_description, Treatment(reference='sky is clear'))[T.few clouds] 0.0091 0.178 0.051 0.959 -0.340 0.359
C(weather_description, Treatment(reference='sky is clear'))[T.light rain] -0.2309 0.505 -0.457 0.648 -1.225 0.763
C(weather_description, Treatment(reference='sky is clear'))[T.overcast clouds] 0.2673 0.177 1.506 0.133 -0.082 0.616
C(weather_description, Treatment(reference='sky is clear'))[T.scattered clouds] -0.2467 0.321 -0.768 0.443 -0.878 0.385
weekend_flag 3.9634 0.083 47.543 0.000 3.799 4.127
Omnibus: 190.588 Durbin-Watson: 1.214
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1205.489
Skew: 2.174 Prob(JB): 1.70e-262
Kurtosis: 10.798 Cond. No. 14.2


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Model 1: We take weather description and the weekend flag as independent variables, and mean TBB as the dependent variable.

Independent of the weather description, time between buses will be on average 3.96 minutes longer on weekends compared to weekdays.

Weekend or not, 'broken clouds' weather adds on average an extra 0.6 minutes to the time between buses compared to clear sky; the weekend flag controls for the weekend effect.
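To make the coefficient reading concrete, here is a worked example using the numbers copied from the summary above:

### Expected mean TBB for a few scenarios, read straight off the model 1 coefficients
intercept, broken_clouds, weekend = 17.7866, 0.6036, 3.9634
print(round(intercept, 1))                            # clear sky, weekday: ~17.8 min
print(round(intercept + broken_clouds, 1))            # broken clouds, weekday: ~18.4 min
print(round(intercept + weekend, 1))                  # clear sky, weekend: ~21.8 min
print(round(intercept + broken_clouds + weekend, 1))  # broken clouds, weekend: ~22.4 min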

In [141]:
### Introduce the interquartile range (IQR) as another metric - a measure of bus irregularity
model_data['TBB_IQR'] = model_data['TBB_q75'] - model_data['TBB_q25']

model = ols("TBB_IQR ~ main_temp_min +  main_humidity +  wind_speed + weekend_flag", data=model_data)
results = model.fit()
results.summary()
Out[141]:
OLS Regression Results
Dep. Variable: TBB_IQR R-squared: 0.125
Model: OLS Adj. R-squared: 0.115
Method: Least Squares F-statistic: 12.75
Date: Mon, 18 May 2020 Prob (F-statistic): 1.03e-09
Time: 20:33:59 Log-Likelihood: -605.47
No. Observations: 363 AIC: 1221.
Df Residuals: 358 BIC: 1240.
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 8.9718 0.512 17.512 0.000 7.964 9.979
main_temp_min -0.0343 0.012 -2.872 0.004 -0.058 -0.011
main_humidity -0.0035 0.006 -0.623 0.534 -0.014 0.007
wind_speed 0.1831 0.062 2.964 0.003 0.062 0.305
weekend_flag 0.8923 0.150 5.933 0.000 0.596 1.188
Omnibus: 136.862 Durbin-Watson: 1.860
Prob(Omnibus): 0.000 Jarque-Bera (JB): 917.043
Skew: 1.420 Prob(JB): 7.36e-200
Kurtosis: 10.250 Cond. No. 405.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Model 2: The new information we gain from this model is that wind also has a small effect: a 1 m/s increase in wind speed corresponds to a 0.18 minute increase in bus irregularity (TBB IQR).

In [148]:
### Finally, we augment the TBB_mean model with a weekday predictor, which is more granular than the weekend flag
model = ols("TBB_mean ~ C(weather_description, Treatment(reference = 'sky is clear')) + weekday", data=model_data)
results = model.fit()
results.summary()
Out[148]:
OLS Regression Results
Dep. Variable: TBB_mean R-squared: 0.868
Model: OLS Adj. R-squared: 0.864
Method: Least Squares F-statistic: 192.1
Date: Mon, 18 May 2020 Prob (F-statistic): 7.06e-146
Time: 20:47:56 Log-Likelihood: -383.82
No. Observations: 363 AIC: 793.6
Df Residuals: 350 BIC: 844.3
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 17.8500 0.102 175.658 0.000 17.650 18.050
C(weather_description, Treatment(reference='sky is clear'))[T.broken clouds] 0.6075 0.241 2.516 0.012 0.133 1.082
C(weather_description, Treatment(reference='sky is clear'))[T.dust] 0.0324 0.228 0.142 0.887 -0.417 0.481
C(weather_description, Treatment(reference='sky is clear'))[T.few clouds] -0.0118 0.177 -0.066 0.947 -0.361 0.337
C(weather_description, Treatment(reference='sky is clear'))[T.light rain] -0.2951 0.507 -0.582 0.561 -1.292 0.701
C(weather_description, Treatment(reference='sky is clear'))[T.overcast clouds] 0.2899 0.178 1.630 0.104 -0.060 0.640
C(weather_description, Treatment(reference='sky is clear'))[T.scattered clouds] -0.2435 0.321 -0.759 0.448 -0.874 0.387
weekday[T.1] -0.0454 0.139 -0.326 0.745 -0.319 0.229
weekday[T.2] -0.1114 0.141 -0.793 0.428 -0.388 0.165
weekday[T.3] -0.0289 0.140 -0.206 0.837 -0.304 0.247
weekday[T.4] 3.7232 0.141 26.455 0.000 3.446 4.000
weekday[T.5] 4.0764 0.141 28.906 0.000 3.799 4.354
weekday[T.6] -0.1316 0.141 -0.934 0.351 -0.409 0.145
Omnibus: 172.716 Durbin-Watson: 1.167
Prob(Omnibus): 0.000 Jarque-Bera (JB): 927.308
Skew: 1.990 Prob(JB): 4.34e-202
Kurtosis: 9.743 Cond. No. 14.5


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Model 3: This is the augmented version of model 1 with the more granular weekday predictor instead of the weekend flag. It confirms the conclusions from model 1, but now we also control for the possible effects of individual weekdays. This is the most reasonable model, and it also has the highest adjusted R-squared (0.864), i.e. the share of variance explained by the model.

Evaluation

The benefits of linear regression are not only its easy interpretability but also the fact that you only need to split your data into a training and a test set - no separate validation set is needed (due to the analytical nature of the fitting procedure). We evaluate model 3 with repeated random train/test splits: in each iteration, 80% of the observations are assigned to the training set and the remaining 20% to the test set.

In [160]:
def estimate_rmse(prediction, label):
    return np.sqrt(((prediction - label)**2).mean())

### Merge the rare 'light rain' category into 'overcast clouds' to keep rare
### levels from destabilizing the random train/test splits
di = {'sky is clear': 'sky is clear', 'overcast clouds': 'overcast clouds',
      'few clouds': 'few clouds', 'broken clouds': 'broken clouds',
      'dust': 'dust', 'scattered clouds': 'scattered clouds', 'light rain': 'overcast clouds'
     }

model_data['refactored_weather_desc'] = model_data['weather_description'].map(di)

counter = 0
rmse_list = []
while counter<1000:
    train_data = model_data.sample(frac = 0.8)
    test_data = model_data.loc[~model_data.index.isin(train_data.index)]
    
    model = ols("TBB_mean ~ C(refactored_weather_desc, Treatment(reference = 'sky is clear')) + weekday", \
                data=train_data)
    results = model.fit()
    test_data['predictions_weather'] = results.predict(test_data)
    
    rmse_list.append(estimate_rmse(test_data['predictions_weather'], test_data['TBB_mean']))
    
    counter += 1
In [165]:
avg_rmse = np.mean(rmse_list)
print(f'The average RMSE from 1000 cross-validation iterations is {round(avg_rmse,2)} minutes.')
The average RMSE from 1000 cross-validation iterations is 0.71 minutes.

In other words, we can predict the average daily TBB on route 28 with an error of 0.71 minutes.
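To put this error into perspective, one could compare it against a naive baseline that ignores weather and weekday entirely and always predicts the training-set mean. Below is a sketch of such a check (not part of the original run), reusing the estimate_rmse helper and splitting procedure from above:

### Sketch: naive baseline for comparison (always predict the training mean).
### The weather+weekday model is only useful insofar as it beats this RMSE.
baseline_rmse_list = []
for _ in range(1000):
    train_data = model_data.sample(frac=0.8)
    test_data = model_data.loc[~model_data.index.isin(train_data.index)]
    baseline_prediction = train_data['TBB_mean'].mean()
    baseline_rmse_list.append(estimate_rmse(baseline_prediction, test_data['TBB_mean']))
print(f'Naive baseline RMSE: {round(np.mean(baseline_rmse_list), 2)} minutes.')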

Conclusion & Deployment

The model with weather description and weekday provides basic insights into the impact of weather on bus traffic. It can also be used directly to predict the expected time between buses at a higher level - daily. The easy interpretability of this model, as well as its higher-level scope, are two reasons why such a model has high business value. On the other hand, the conducted analysis makes clear that the sensitivity of the model is limited. We believe this is caused by the fact that it is based on daily aggregations. However, with the limited resources in this particular situation, daily aggregation is the best way to approach the case, because it addresses the two main data insufficiencies:

  1. There is no guarantee that there are passengers getting on or off at each stop. It is possible that the bus simply skips some stops because no one wants to board or alight there. In such a scenario, the bus is in fact still on time, but there are no data to prove it. In the data it would look as if TBB were much longer and there had been delays. These cases are the outliers in the TBB distribution, and more often than not they do NOT mean disruption.
  2. Lack of vehicle-level data (we can't track the path of each bus separately). We just know that a bus picked up passengers at a specific bus stop. At a given moment in time, several buses drive a given route, each located near a different bus stop (if that weren't the case, people would have to wait a long time). We have neither the number of buses per bus line nor a vehicle identifier in the dataset. We only have passenger check-ins and check-outs, and therefore one of the few relevant parameters we can extract is exactly the so-called TBB (time between two consecutive buses at a given bus stop).

In order to address these two issues, we decided to work with the daily distribution of the TBB parameter. If we built a model on a more granular level with this data, we would be forced to provide potentially inflated delay times as ground truth. With our approach we intend to capture global variations in the behaviour of TBB potentially caused by weather: a disruption due to weather would, on average, not lead to huge abrupt changes in TBB.

GPS datasets per bus would solve both of the above issues.

Assuming we had such data at our disposal, a model with higher granularity would definitely be more appropriate for production-level purposes. It would allow us to account for peak vs. off-peak traffic in the model, and certainly also to make use of the weather variations within the day.

In summary, we would consider the following further steps:

  1. Enhance the dataset with GPS data per vehicle.
  2. Build a joined dataset on an hourly (or several-hour) basis to get finer resolution (a sketch follows this list).
  3. Add additional routes, not just 28, considering the peculiarities of the different bus line routes.
  4. Use the onboarding duration, not only TBB, as a measure of traffic disruption.
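As an illustration of step 2, the weather side of such an hourly join is already straightforward. The sketch below assumes a hypothetical hourly TBB aggregate tbb_hourly, which would itself require the vehicle-level data from step 1:

### Sketch only: the weather side of an hourly join for step 2.
### `tbb_hourly` is hypothetical - computing it properly requires GPS/vehicle data.
weather_hourly = df_cleaned.resample('H').agg({'main.temp': 'mean',
                                               'wind.speed': 'mean',
                                               'weather.description': 'first'})
# model_data_hourly = tbb_hourly.join(weather_hourly, how='left')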
In [48]:
# Plot the distribution of TBB (time between buses) in minutes over one day.
# df_summary_day: single-day touchpoint summary for 2019-01-21
# (an intermediate result of the preprocessing step above)
plt.figure(figsize=(12,8))
plt.hist(df_summary_day[df_summary_day['end_location']=='Satwa, Square 1']['onboarding_end']
         .diff().dt.total_seconds().values/60, 15)
plt.xlabel('TBB (min)', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title(f'Histogram of time between buses for station Satwa, Square 1 on 2019-01-21')
[Figure: histogram of time between buses for station Satwa, Square 1 on 2019-01-21]

Archive

Assuming normal operation on that day, we see that TBB has a mean of about 18 minutes and a standard deviation of 8.7 minutes. Under disruption caused by bad weather, the distribution of TBB will probably change; with our model we quantify this effect.

In [57]:
### Mean and standard deviation of TBB for station 'Satwa, Square 1'
tbb_minutes = (df_summary_day[df_summary_day['end_location']=='Satwa, Square 1']['onboarding_end']
               .diff().dt.total_seconds().values/60)
mean_TBB = np.nanmean(tbb_minutes)
std_TBB = np.nanstd(tbb_minutes)
mean_TBB, std_TBB
Out[57]:
(17.96938775510204, 8.738413998947001)

7 thoughts on “Weather-proof Mobility”

  1. A good approach to the topic - a nice and clean definition of the business understanding. The data analysis is performed with the correct focus, omitting unnecessary segments which would complicate the solution. We are missing the final model/conclusion of what can be observed and used as an outcome of the whole “project”.
    Also, I would suggest focusing on a shorter time-series segment (daily patterns) to identify peak/off-peak periods and how they interact with TBB, together with weather conditions. We all know that bus drivers should be professionals, but the majority of “normal” non-bus drivers are not, and they are heavily affected by distracting sensory inputs (thunderstorm, rain, people cutting in, or even forgetting how to drive when the weather changes).

    1. Also, I would advise using some additional datasets which were not part of the initial dataset, like aggregated daily traffic estimates on an hourly basis provided by some navigation applications, because that can additionally help with model precision. - I'm adding my last sentence about additional datasets to all teams focusing on this problem, because no one even considered it, and that is something you can always do on any project: focus not only on internal/provided data, but find something to augment it 😉

  2. Hi, romina & bogomil,
    It's great that you played with the data about public transport... And the documentation is well structured.
    Here are my comments about the periodic component in the TBB variables. You could take this “seasonal” component into account - e.g. model it, remove it from the data, then account for the weather effect, next make the forecast, and then reintroduce the periodic component. The reason for splitting the periodic and non-periodic parts (and using different models) is that different factors are responsible for their presence in the data...
    Another approach is to add more variables (day of the week), attempting to represent the periodicity...

  3. Nice, clear approach: good work.

    That said, have you looked at adding interaction terms in the regression (i.e. weather * weekday)? This might necessitate changing the resolution of the data from daily to half-day or even less.
    Also, it might benefit the analysis to look into quantile regression.

    Best regards,

    1. Thanks for the comment!

      We did explore some models with interaction terms, as we could imagine that different social behavior on weekends vs weekdays could change the way weather influences traffic. From the models that we tried out, we did not discover an interaction effect worth publishing. Indeed, increasing the granularity of the model may very well reveal such effects.

  4. Hi again 🙂
    I like your work very much: the considerations related to the data, the interpretation of the outliers, the conclusions, and also the good business understanding. 🙂
    I have some comments & questions about the final model: looking at the p-values, you put many non-significant factors into the model, or I have misunderstood something. To reduce the possibility of overfitting, I would remove some of them so that only significant factors remain. And to check the model for overfitting, we should compare R2, adj. R2, RMSE, etc. on both the train and test samples; for the cross-validation you did, we should also do this for the averaged measures of model quality. Also, by using linear regression we impose a particular hypothesis about the type of relation between the factors and the dependent variable. So it would be good to check other models as well, especially non-parametric ones. Nevertheless, I really like what you have done.

    1. Hello Alex, thanks for the detailed feedback!

      You are right about the non-significant variables. Regarding model 3, we have 2 categorical variables with multiple levels. Some levels of both variables have p<0.05, therefore we kept them in the model we evaluated. In models 1 and 2, we kept the summary with all coefficients for transparency, but as you pointed out, they should be omitted for the final interpretation of the model.

      Indeed, we didn't comment on the potential violation of assumptions and alternative modeling options. Non-parametric models, non-linear models and time series regression models are all valid extensions for adequately describing the relationship between weather and traffic disruptions.
