Import libraries
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
Download weather data
!wget https://datacases.s3.us-east-2.amazonaws.com/datathon-2020/Ernst+and+Young/Dubai+Weather_20180101_20200316.txt
Weather Data
From "Predicting weather disruption of public transport" case, weather parameters are described below:
city_name City name
lat Geographical coordinates of the location (latitude)
lon Geographical coordinates of the location (longitude)
main
main.temp Temperature
main.feels_like This temperature parameter accounts for the human perception of weather
main.pressure Atmospheric pressure (on the sea level), hPa
main.humidity Humidity, %
main.temp_min Minimum temperature at the moment. This is deviation from temperature that is possible for large cities and megalopolises geographically expanded (use these parameter optionally).
main.temp_max Maximum temperature at the moment. This is deviation from temperature that is possible for large cities and megalopolises geographically expanded (use these parameter optionally).
wind
wind.speed Wind speed. Unit Default: meter/sec
wind.deg Wind direction, degrees (meteorological)
clouds
- clouds.all Cloudiness, %
rain
rain.1h Rain volume for the last hour, mm
rain.3h Rain volume for the last 3 hours, mm
weather (more info Weather condition codes)
weather.id Weather condition id
weather.main Group of weather parameters (Rain, Snow, Extreme etc.)
weather.description Weather condition within the group
weather.icon Weather icon id
dt Time of data calculation, unix, UTC
dt_isoDate and time in UTC format
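For illustration, a single record in the file presumably looks like the sketch below; the field names follow the list above, while the values are hypothetical and not taken from the actual file.
# Hypothetical sample record; values are illustrative only
record = {
    "city_name": "Dubai",
    "lat": 25.07,
    "lon": 55.17,
    "main": {"temp": 292.59, "feels_like": 290.83, "pressure": 1014,
             "humidity": 58, "temp_min": 291.15, "temp_max": 294.15},
    "wind": {"speed": 3.1, "deg": 300},
    "clouds": {"all": 0},
    # a "rain" object with "1h"/"3h" keys appears only for hours with rainfall
    "weather": [{"id": 800, "main": "Clear",
                 "description": "sky is clear", "icon": "01n"}],
    "dt": 1514764800,
    "dt_iso": "2018-01-01 00:00:00 +0000 UTC",
}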
df = pd.read_json("Dubai+Weather_20180101_20200316.txt")
df.dtypes
Flatten the nested JSON columns
s = df.apply(lambda x: pd.Series(x['weather']), axis=1).stack().reset_index(level=1, drop=True)  # the weather column is a list containing a single object
s.name = 'weather'
df = df.drop('weather', axis=1).join(s)
json_struct = json.loads(df.to_json(orient="records"))
df_flat = pd.json_normalize(json_struct)
for col in df_flat.columns:  # inspect how many unique values each column has
    unique = df_flat[col].unique()
    print(f'{col} unique size: {unique.size}')
Firstly, as observed from the Dubai weather data, the parameters city name, latitude, longitude, time zone, and rain each contain only a single value; hence, they can be dropped from the table.
for col in df_flat.columns:  # remove useless columns
    unique = df_flat[col].unique()
    if unique.size == 1:
        df_flat.drop(col, axis=1, inplace=True)
        print(f'drop {col}')
    else:
        print(f'{col} {unique.shape}: {unique[:5]}')
Secondly, within the weather parameter, each id matches exactly one description, and id is more descriptive than main and icon. Therefore, id alone can represent the weather condition and the other weather columns can be dropped.
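This one-to-one mapping between id and description can be verified before dropping the other columns; a minimal check:
# each weather.id should map to exactly one description if the claim holds
print(df_flat.groupby('weather.id')['weather.description'].nunique())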
df_flat.drop(['weather.main', 'weather.icon', 'weather.description'], axis=1, inplace=True)
Thirdly, as seen in the 1-hour and 3-hour rain columns, most of the values are null because rain is infrequent in Dubai; these nulls can be replaced with 0 instead.
df = df_flat
print(df.isnull().sum())
df.fillna(0, inplace=True)
df.describe()
Convert time format to datetime
print(df.dt_iso, df.dtypes)
df.dt_iso = pd.to_datetime(df.dt_iso, format='%Y-%m-%d %H:%M:%S +0000 UTC')
print(df.dt_iso, df.dtypes)
As observed below, temperature and pressure are negatively correlated with each other, while the min, max, feels-like, and average temperatures are all positively correlated with one another. Therefore, only the temperature needs to be kept; the other temperature columns and the pressure can be dropped.
df.sort_values(by=['dt'], inplace=True)
dfwithouttime = df.drop(['dt','dt_iso'], axis=1)
#dfwithouttime=(dfwithouttime-dfwithouttime.min())/(dfwithouttime.max()-dfwithouttime.min()) #normalize
fig, axs = plt.subplots(3, 4, figsize=(28, 15))
fig.subplots_adjust(hspace=.5)
i = 0
j = 0
for col in dfwithouttime.columns:
    dfwithouttime[col].plot(ax=axs[i][j], title=col)
    j += 1
    if j == 4:
        j = 0
        i += 1
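To quantify these relationships rather than only eyeballing the plots, the pairwise correlations can also be computed; a minimal sketch using the same dataframe:
# correlation of every numeric column with temperature; the temperature
# variants should come out strongly positive and pressure negative
print(dfwithouttime.corr()['main.temp'].sort_values())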
df.drop(['main.temp_min', 'main.temp_max', 'main.feels_like', 'main.pressure'], axis=1, inplace=True)
Traffic data
- acci_time: Accident time
- acci_name: Categorization of the accident
- acci_x: Latitude
- acci_y: Longitude
Download data from Dubai Pulse
!wget http://data.bayanat.ae/ar/dataset/ad38cee7-f70e-4764-9c9d-aab760ce1026/resource/025ea6b2-a806-49c2-8294-4f3a97c09090/download/traffic_incidents-1.csv
!wget https://www.dubaipulse.gov.ae/dataset/c9263194-5ee3-4340-b7c0-3269b26acb43/resource/c3ece154-3071-4116-8650-e769d8416d88/download/traffic_incidents.csv
df1 = pd.read_csv("traffic_incidents.csv")
df2 = pd.read_csv("traffic_incidents-1.csv")
print(df1.dtypes, df2.dtypes)
print(df1.shape[0] + df2.shape[0])
df_union = pd.concat([df1, df2]).drop_duplicates()
print(df_union.shape)
First of all, only the accident time parameter can be paired with the weather data; the other columns can be dropped.
df_union = df_union[['acci_time']].copy()  # copy to avoid SettingWithCopyWarning on the later assignment
print(df_union.shape, df_union.acci_time.unique().shape)
df_union.tail()
df_union.acci_time = pd.to_datetime(df_union.acci_time, format='%d/%m/%Y %H:%M:%S')
df_union.sort_values(by=['acci_time'], inplace=True)
Additionally, the number of traffic accidents that occurred within each hour can be computed and added as a count column.
dfh = df_union.groupby([pd.Grouper(key='acci_time',freq='H')]).size().reset_index(name='count')
dfh['count'].plot()
dfh
After inner joining the weather data and the traffic accident data, only the time range between 2019-06-27 10:00 and 2020-03-16 23:00, a total of 6327 hours of data, can be used for analysis.
result = pd.merge(df, dfh, how='inner', left_on=['dt_iso'], right_on=['acci_time'])
result.drop(['acci_time', 'dt'], axis=1, inplace=True)
result.shape
fig, axs = plt.subplots(3, 3, figsize=(21, 14))
fig.subplots_adjust(hspace=.5)
i = 0
j = 0
for col in set(result.columns) - set(['dt_iso']):
    result.plot(x='dt_iso', y=col, ax=axs[i][j], title=col)
    j += 1
    if j == 3:
        j = 0
        i += 1
As seen above, all of the 1-hour rain values in the joined range are 0, so rain.1h can be dropped.
result.drop('rain.1h', axis=1, inplace=True)
Interestingly, there are some non-zero 3-hour rain values. After some research, it turns out there was a flood during that time. Since it barely rains in Dubai, drainage measures may not be advanced there, and there was a significant increase in traffic accidents during that period.
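As a quick sanity check of this observation, the hours with recorded 3-hour rain can be compared against the overall accident rate; a minimal sketch:
# hours with recorded rain vs. all hours; the flood period should stand out
rainy = result[result['rain.3h'] > 0]
print(rainy[['dt_iso', 'rain.3h', 'count']])
print('mean accidents in rainy hours:', rainy['count'].mean())
print('mean accidents overall:', result['count'].mean())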
Important features to predict accident counts per hour
X = result.drop(['count', 'dt_iso'],axis=1)
y = result['count']
X = (X-X.min())/(X.max()-X.min()) # normalize the data
y = (y-y.min())/(y.max()-y.min()) # normalize the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=23)
random_forest = RandomForestRegressor(n_estimators=100)
random_forest.fit(X_train, y_train)
rf_feature = pd.Series(random_forest.feature_importances_,index=X.columns)
rf_feature = rf_feature.sort_values()
print(rf_feature[::-1][:5])
fig, axs = plt.subplots(1, 2, figsize=(12, 8))
fig.subplots_adjust(wspace=.5)
rf_feature.plot(kind="barh", ax=axs[0], title='Random Forest')
xgb = XGBRegressor(n_estimators=100)
xgb.fit(X_train, y_train)
xgb_feature = pd.Series(xgb.feature_importances_,index=X.columns)
xgb_feature = xgb_feature.sort_values()
xgb_feature.plot(kind="barh", ax=axs[1], title="XGBoost")
print(xgb_feature[::-1][:5])
After evaluating the weather and traffic accident data with both Random Forest and XGBoost, two of the most popular machine learning algorithms, both models identify wind conditions and temperature as important features.
One assumption would be that rain is more observable than wind, so people may stay home on heavily rainy days and the number of cars on the road decreases. However, people may not be aware of strong wind, and as a result wind has a stronger impact on traffic accidents.
Another assumption would be that temperature also plays an important role in traffic accidents. Perhaps people should be aware of climate change driven by greenhouse gases; protecting the environment is important to avoid extreme weather conditions.
def test(model):
    pred = model.predict(X_test)
    print(np.mean((pred - y_test) ** 2))  # mean squared error on the held-out set

def cross_val(model):
    res = cross_val_score(model, X_train, y_train, cv=10, n_jobs=-1)
    print(np.mean(res))  # mean R^2 score across the 10 folds
cross_val(xgb)
test(xgb)
cross_val(random_forest)
test(random_forest)
Both models achieve similar average scores after 10-fold cross-validation.
Conclusion
The government should make people aware of climate change. Specifically, drivers should pay attention to wind speed, and cars could be built heavier and more stable.
Comments
Good data analysis and problem setup, but we are missing closure, i.e. a model that pairs weather forecast data (from some "official" site/location) with historical ride/accident data to predict what will happen in the next period (a short span of hours, up to maybe a day or two).
That way, the government or a public authority could add more buses to minimize any disruption of service.
"Global" problems like climate change are not something we tried to solve here, but rather the influence of weather conditions on people driving cars, buses, etc.
I agree with your comment. I focused more on analysis than on models. I did train Random Forest and XGBoost models that use weather data to predict the number of traffic accidents during the current hour. One way to improve on this, based on your comment, would be to shift the traffic accident data by one hour before training (a sketch follows below). Another interesting addition would be figuring out above how many predicted accidents the government should add or remove buses.
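For reference, the one-hour shift mentioned above could look something like the sketch below, reusing the result dataframe from the post (assuming, as holds after the inner join, that the rows are hourly and sorted by time):
# shift the target so each row's weather predicts the NEXT hour's accidents
shifted = result.copy()
shifted['count'] = shifted['count'].shift(-1)
shifted = shifted.dropna(subset=['count'])  # the last hour has no next-hour label
X_next = shifted.drop(['count', 'dt_iso'], axis=1)
y_next = shifted['count']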
However, I was unable to obtain enough bus ridership data due to the Dubai Pulse API restriction, which only allows citizens with an ID to access it. Otherwise, it would be interesting to predict the number of buses instead of the number of traffic accidents from the weather data.
I agree with your statements. There are some options for how that dataset could be downloaded (web scraping, etc.), but it would take some processing time.
Also, I would advise using some additional datasets that were not part of the initial dataset, like aggregated daily traffic estimates on an hourly basis provided by some navigation applications, because that can additionally help with model precision. We all know that bus drivers should be professionals, but the majority of "normal" non-bus drivers are not, and they are heavily impacted by distracting sensory inputs (thunderstorms, rain, people cutting in, or even forgetting how to drive when weather conditions change). I'm adding my last sentence about additional datasets to all teams focusing on this problem because no one even considered it, and that is something you can always do on any project: focus not only on internal/provided data but find something to augment it :)
For other locations, there are lots of historical datasets for buses. For example, https://transitfeeds.com/p/riverside-transit-agency/531 contains GTFS data for the Riverside Transit Agency in Riverside, California, US. However, the problem statement requires Dubai. I have tried the Google Maps API, HERE API, and TomTom API; all of them only provide real-time traffic for Dubai, and historical datasets are missing. It would be interesting to know who obtained historical data in this competition.
@y2587wan there is the ability to download a CSV for each individual bus line, but as I said, you would need to do some web scraping. It is available to non-UAE people at the link: https://www.dubaipulse.gov.ae/data/allresource/rta-bus/rta_bus_ridership-open?organisation=rta&service=rta-bus&count=1212&result_per_page=1300&as_sfid=AAAAAAVQxFA0BFeFROVV-_FUrwIfaEqwRWpoZA-y-UptqSEqxmERCKYLhWrwqWh3AfDCDdi1moQM5yS3Qjy2NzBMeMFf3DsQYwQOBarG4FRgrDCOeBE9L_Tq7J9m8CMoTDSCXIY%3D&as_fid=ff49229e06fa994326e53390b91e89d1dc5e2954. Nevertheless, you did a really good job in such a short timeframe, and you should look at my comment as something that can help you in future competitions or even in your future profession :)
Thank you. I just realized that clicking "see more" on the bus ridership open page shows more than a few days of data; during the competition, however, I clicked "next page" instead. Sorry for the argument above.
Hi, y2587wan :)
Your data prep is cool and is a good premise for modelling and of course for forecasting, which would be used for further decision-making... If you worked with someone, I believe you would move much further :)