Datathon 2020 Solutions

Predicting weather disruption of public transport – provided by Ernst and Young


This project was inspired by the business case of the Data Science Society Global 2020 Hackathon, hosted from May 15-17, 2020.

Click here for details about the business case and the data dictionary.

Data Sources:

The datasets used in this project were provided by the organizers; the external data sourced for the analysis were obtained here.

The analysis for this project follows the CRISP-DM pipeline, whose stages are:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Data Modelling
  • Results
  • Deployment - Storytelling

Business Understanding

The aim of the project is to predict public transport service disruption in Dubai using weather data.

  • Goal: Can we analyze the weather data to predict public transport service disruption in Dubai? How can we plan for less disruption in the wake of severe weather conditions, leverage the emergency management plan, and provide uninterrupted services and products to citizens?

Data Understanding and Data Preprocessing

This stage involves loading the data and performing the necessary cleaning, preprocessing and feature engineering to prepare it for analysis and modelling.

  • Importing Necessary Libraries
In [56]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os                # used below to list the monthly files in Transport/
import datetime as dt    # used below to convert the Unix timestamps

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

plt.style.use('ggplot')
import plotly.graph_objects as go
  • Loading the datasets into a dataframe
In [939]:
data = pd.read_json('Dubai+Weather_20180101_20200316.txt')
transport = pd.DataFrame(data=None,columns=['year','month','transport_type','trips'])

for i in os.listdir('Transport'):
    month_data = pd.read_csv("Transport/" + i)
    transport = pd.concat([transport,month_data],axis=0)
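As a side note, the same stacking can be done with `ignore_index=True`, which yields a clean 0..n-1 index and makes the later `reset_index` repair step unnecessary. A minimal sketch on two synthetic monthly frames (in the notebook these come from the CSVs in `Transport/`; the trip values here are illustrative only):

```python
import pandas as pd

# Two synthetic monthly frames shaped like the Transport CSVs.
feb = pd.DataFrame({"year": [2018], "month": ["Feb"],
                    "transport_type": ["Bus"], "trips": [11111573]})
mar = pd.DataFrame({"year": [2018], "month": ["Mar"],
                    "transport_type": ["Bus"], "trips": [12000000]})

# ignore_index=True renumbers the rows 0..n-1 during the concat,
# so no reset_index / drop('index') cleanup is needed afterwards.
transport = pd.concat([feb, mar], ignore_index=True)
print(transport.index.tolist())  # [0, 1]
```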
In [940]:
data.shape
Out[940]:
(19344, 11)
In [941]:
data.tail(3)
Out[941]:
city_name lat lon main wind clouds weather dt dt_iso timezone rain
19341 Dubai 25.07501 55.188761 {'temp': 21.52, 'temp_min': 20, 'temp_max': 23... {'speed': 3.1, 'deg': 60} {'all': 0} [{'id': 800, 'main': 'Clear', 'description': '... 1584392400 2020-03-16 21:00:00 +0000 UTC 14400 NaN
19342 Dubai 25.07501 55.188761 {'temp': 21.04, 'temp_min': 19, 'temp_max': 23... {'speed': 3.1, 'deg': 70} {'all': 0} [{'id': 800, 'main': 'Clear', 'description': '... 1584396000 2020-03-16 22:00:00 +0000 UTC 14400 NaN
19343 Dubai 25.07501 55.188761 {'temp': 20.31, 'temp_min': 18, 'temp_max': 23... {'speed': 3.6, 'deg': 60} {'all': 0} [{'id': 800, 'main': 'Clear', 'description': '... 1584399600 2020-03-16 23:00:00 +0000 UTC 14400 NaN
  • Data Preprocessing and Data Cleaning
In [942]:
transport.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 0 to 1
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   year            95 non-null     object
 1   month           95 non-null     object
 2   transport_type  95 non-null     object
 3   trips           95 non-null     object
dtypes: object(4)
memory usage: 3.7+ KB
In [943]:
transport.reset_index(inplace=True)
transport.drop('index',axis=1,inplace=True)
In [944]:
transport.head()
Out[944]:
year month transport_type trips
0 2018 Feb Marine 141840
1 2018 Feb Tram 528515
2 2018 Feb Bus 11111573
3 2018 Feb Metro 16915232
4 2018 Mar Marine 166561
  • Transforming the date column to pandas datetime format

  • Dropping constant-valued columns (city_name, timezone) and the redundant dt_iso string column


In [945]:
data.drop(['city_name','timezone','dt_iso'],axis=1,inplace=True)
In [946]:
def convert_time(timestamp):
    # datetime is imported under the alias dt, so qualify through the alias
    return dt.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')
In [947]:
data['dt'] = data['dt'].apply(convert_time)
data['dt'] = pd.to_datetime(data['dt'])
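Since `dt` holds Unix epoch seconds, the two steps above can also be done in one vectorized call. One caveat: `pd.to_datetime(..., unit='s')` interprets the values as UTC, whereas `datetime.fromtimestamp` uses the machine's local timezone, so the results can differ by the local UTC offset. A sketch on two sample epoch values (the second appears in the tail of the weather data above):

```python
import pandas as pd

# Synthetic epoch timestamps (seconds), like the raw 'dt' column.
ts = pd.Series([1514768400, 1584399600])

# One-step, vectorized conversion; unit='s' treats the values as UTC seconds.
converted = pd.to_datetime(ts, unit="s")
print(converted.iloc[0])  # 2018-01-01 01:00:00
print(converted.iloc[1])  # 2020-03-16 23:00:00
```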
In [948]:
data.head(2)
Out[948]:
lat lon main wind clouds weather dt rain
0 25.07501 55.188761 {'temp': 14.99, 'temp_min': 13, 'temp_max': 18... {'speed': 3.1, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 01:00:00 NaN
1 25.07501 55.188761 {'temp': 14.63, 'temp_min': 13, 'temp_max': 17... {'speed': 2.6, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 02:00:00 NaN
In [949]:
transport.head(2)
Out[949]:
year month transport_type trips
0 2018 Feb Marine 141840
1 2018 Feb Tram 528515

Feature Engineering

  • Using the date column to engineer new datetime features: month, year and weekday
In [950]:
data['month'] = data['dt'].dt.month
data['year'] = data['dt'].dt.year
data['weekdays'] = data['dt'].dt.weekday
In [951]:
data.head()
Out[951]:
lat lon main wind clouds weather dt rain month year weekdays
0 25.07501 55.188761 {'temp': 14.99, 'temp_min': 13, 'temp_max': 18... {'speed': 3.1, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 01:00:00 NaN 1 2018 0
1 25.07501 55.188761 {'temp': 14.63, 'temp_min': 13, 'temp_max': 17... {'speed': 2.6, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 02:00:00 NaN 1 2018 0
2 25.07501 55.188761 {'temp': 14.03, 'temp_min': 12, 'temp_max': 17... {'speed': 1.5, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 03:00:00 NaN 1 2018 0
3 25.07501 55.188761 {'temp': 13.78, 'temp_min': 12, 'temp_max': 17... {'speed': 2.1, 'deg': 180} {'all': 1} [{'id': 701, 'main': 'Mist', 'description': 'm... 2018-01-01 04:00:00 NaN 1 2018 0
4 25.07501 55.188761 {'temp': 14.28, 'temp_min': 12, 'temp_max': 18... {'speed': 2.6, 'deg': 160} {'all': 1} [{'id': 701, 'main': 'Mist', 'description': 'm... 2018-01-01 05:00:00 NaN 1 2018 0
  • Creating an id in the data to map against the transport data (via a create-id function)
  • Transforming the main, wind, clouds, weather and rain columns to extract their details into a usable format for analysis
In [952]:
data['main'].iloc[0]
Out[952]:
{'temp': 14.99,
 'temp_min': 13,
 'temp_max': 18,
 'feels_like': 13.7,
 'pressure': 1015,
 'humidity': 87}
In [953]:
main = data['main'].astype(str).str.strip('{}').str.split(', ', expand=True)
wind = data['wind'].astype(str).str.strip('{}').str.split(', ', expand=True)
weather = data['weather'].astype(str).str.strip('{}').str.split(', ', expand=True)
  • Renaming the Columns
In [954]:
main.columns = ['main_temp','Temp_min','Temp_max','Feels_like','Pressure','Humidity']
wind.columns = ['Speed','Deg']
weather.columns = ['id','Main','Description','icon','5','6','7','8']
weather.drop(['5','6','7','8','id'],axis=1,inplace=True)
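Worth noting: since the file was loaded with `read_json`, the main, wind and weather columns hold Python dicts, so an alternative to the string-splitting above is `pd.json_normalize`, which expands the dicts into properly typed columns in one call, with no regex cleanup needed afterwards. A sketch on a single synthetic record:

```python
import pandas as pd

# One synthetic 'main' record shaped like the weather data.
records = [
    {"temp": 14.99, "temp_min": 13, "temp_max": 18,
     "feels_like": 13.7, "pressure": 1015, "humidity": 87},
]

# json_normalize expands each dict into its own, already-numeric column.
main_df = pd.json_normalize(records)
print(main_df.loc[0, "temp"])  # 14.99
```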
In [955]:
data.head()
Out[955]:
lat lon main wind clouds weather dt rain month year weekdays
0 25.07501 55.188761 {'temp': 14.99, 'temp_min': 13, 'temp_max': 18... {'speed': 3.1, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 01:00:00 NaN 1 2018 0
1 25.07501 55.188761 {'temp': 14.63, 'temp_min': 13, 'temp_max': 17... {'speed': 2.6, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 02:00:00 NaN 1 2018 0
2 25.07501 55.188761 {'temp': 14.03, 'temp_min': 12, 'temp_max': 17... {'speed': 1.5, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 03:00:00 NaN 1 2018 0
3 25.07501 55.188761 {'temp': 13.78, 'temp_min': 12, 'temp_max': 17... {'speed': 2.1, 'deg': 180} {'all': 1} [{'id': 701, 'main': 'Mist', 'description': 'm... 2018-01-01 04:00:00 NaN 1 2018 0
4 25.07501 55.188761 {'temp': 14.28, 'temp_min': 12, 'temp_max': 18... {'speed': 2.6, 'deg': 160} {'all': 1} [{'id': 701, 'main': 'Mist', 'description': 'm... 2018-01-01 05:00:00 NaN 1 2018 0
In [956]:
data.drop(['main','wind','weather'],axis=1,inplace=True)
data = pd.concat([data,main,wind,weather],axis=1)
In [957]:
data.head(1)
Out[957]:
lat lon clouds dt rain month year weekdays main_temp Temp_min Temp_max Feels_like Pressure Humidity Speed Deg Main Description icon
0 25.07501 55.188761 {'all': 1} 2018-01-01 01:00:00 NaN 1 2018 0 'temp': 14.99 'temp_min': 13 'temp_max': 18 'feels_like': 13.7 'pressure': 1015 'humidity': 87 'speed': 3.1 'deg': 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
In [958]:
data['clouds'] = data['clouds'].astype(str).str.strip('{}').apply(lambda x:x.split(": ")[-1])
In [960]:
def replace_nan(value):
    # 1 if the hour has a rain record, 0 if the value is missing (NaN)
    if pd.isna(value):
        return 0
    else:
        return 1
In [961]:
data.rain = data.rain.apply(replace_nan)
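The same NaN-to-flag mapping can be written without `apply`: `notna()` marks the hours that have a rain record and `astype(int)` converts the booleans to 0/1. A sketch on a synthetic rain column:

```python
import numpy as np
import pandas as pd

# Synthetic 'rain' column: NaN means no rain was recorded for that hour.
rain = pd.Series([np.nan, 0.25, np.nan, 1.02])

# Vectorized equivalent of replace_nan: True where a value exists, then 0/1.
rain_flag = rain.notna().astype(int)
print(rain_flag.tolist())  # [0, 1, 0, 1]
```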
In [962]:
data.rain.value_counts()
Out[962]:
0    19231
1      113
Name: rain, dtype: int64
In [963]:
data.head(1)
Out[963]:
lat lon clouds dt rain month year weekdays main_temp Temp_min Temp_max Feels_like Pressure Humidity Speed Deg Main Description icon
0 25.07501 55.188761 1 2018-01-01 01:00:00 0 1 2018 0 'temp': 14.99 'temp_min': 13 'temp_max': 18 'feels_like': 13.7 'pressure': 1015 'humidity': 87 'speed': 3.1 'deg': 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
In [964]:
cols1 = ['main_temp',
      'Feels_like',
      'Speed']
cols2 = ['Temp_min','Temp_max','Pressure','Humidity','Deg']

# Note: r'(\d+\.\d+)' requires a decimal point, so whole-number readings
# (e.g. 'temp': 15) come out as NaN -- these rows are handled below.
for column in cols1:
    data[column] = data[column].str.extract(r'(\d+\.\d+)',expand=False)
for column in cols2:
    data[column] = data[column].str.extract(r'(\d+)',expand=False)
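For reference, the strict pattern above drops whole-number readings, while a pattern with an optional fractional part keeps them. A small comparison on synthetic strings:

```python
import pandas as pd

# Two synthetic readings: one with a fractional part, one whole number.
s = pd.Series(["'temp': 14.99", "'temp': 15"])

strict = s.str.extract(r"(\d+\.\d+)", expand=False)        # decimal point required
relaxed = s.str.extract(r"(\d+(?:\.\d+)?)", expand=False)  # fractional part optional

print(strict.tolist())   # ['14.99', nan]
print(relaxed.tolist())  # ['14.99', '15']
```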
In [965]:
data[data['main_temp'].isnull()].head(2)
Out[965]:
lat lon clouds dt rain month year weekdays main_temp Temp_min Temp_max Feels_like Pressure Humidity Speed Deg Main Description icon
51 25.07501 55.188761 1 2018-01-03 04:00:00 0 1 2018 2 NaN 13 17 14.34 1015 87 4.1 170 'main': 'Haze' 'description': 'haze' 'icon': '50n'}]
87 25.07501 55.188761 75 2018-01-04 16:00:00 0 1 2018 3 NaN 18 22 18.39 1015 52 4.1 270 'main': 'Clouds' 'description': 'broken clouds' 'icon': '04n'}]
In [966]:
data['main_temp'].fillna(0,inplace=True)
In [968]:
data.head()
Out[968]:
lat lon clouds dt rain month year weekdays main_temp Temp_min Temp_max Feels_like Pressure Humidity Speed Deg Main Description icon
0 25.07501 55.188761 1 2018-01-01 01:00:00 0 1 2018 0 14.99 13 18 13.7 1015 87 3.1 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
1 25.07501 55.188761 1 2018-01-01 02:00:00 0 1 2018 0 14.63 13 17 13.91 1015 93 2.6 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
2 25.07501 55.188761 1 2018-01-01 03:00:00 0 1 2018 0 14.03 12 17 13.89 1016 93 1.5 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
3 25.07501 55.188761 1 2018-01-01 04:00:00 0 1 2018 0 13.78 12 17 13.14 1016 93 2.1 180 'main': 'Mist' 'description': 'mist' 'icon': '50n'}]
4 25.07501 55.188761 1 2018-01-01 05:00:00 0 1 2018 0 14.28 12 18 13.45 1017 93 2.6 160 'main': 'Mist' 'description': 'mist' 'icon': '50d'}]
In [969]:
def temp_aver(temp):
    main_temp = temp['main_temp']
    temp_min = temp['Temp_min']
    temp_max = temp['Temp_max']
    # main_temp == 0 marks the readings whose temperature was missing and
    # imputed with 0 above, so fall back to the min/max midpoint there
    if main_temp == 0:
        return (temp_min + temp_max)/2
    else:
        return (temp_min + temp_max + main_temp)/3
In [971]:
data[['main_temp','Temp_min','Temp_max']] = data[['main_temp','Temp_min','Temp_max']].astype('float')
data['temp_average'] =data[['main_temp','Temp_min','Temp_max']].apply(temp_aver,axis=1)
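The row-wise `apply` can also be vectorized with `np.where`, implementing the intended rule (min/max midpoint when main_temp was imputed with 0, otherwise the three-way mean). A sketch on two synthetic rows:

```python
import numpy as np
import pandas as pd

# Synthetic readings: the second row's main_temp was missing and filled with 0.
df = pd.DataFrame({"main_temp": [14.99, 0.0],
                   "Temp_min": [13.0, 13.0],
                   "Temp_max": [18.0, 17.0]})

# Vectorized temperature average: midpoint where main_temp is missing (0),
# otherwise the mean of all three readings.
df["temp_average"] = np.where(
    df["main_temp"] == 0,
    (df["Temp_min"] + df["Temp_max"]) / 2,
    (df["main_temp"] + df["Temp_min"] + df["Temp_max"]) / 3,
)
print(df["temp_average"].round(2).tolist())  # [15.33, 15.0]
```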
In [973]:
data.drop(['main_temp','Temp_min','Temp_max','dt','icon','Description'],axis=1,inplace=True)
In [974]:
data.head(2)
Out[974]:
lat lon clouds rain month year weekdays Feels_like Pressure Humidity Speed Deg Main temp_average
0 25.07501 55.188761 1 0 1 2018 0 13.7 1015 87 3.1 150 'main': 'Clear' 15.330000
1 25.07501 55.188761 1 0 1 2018 0 13.91 1015 93 2.6 150 'main': 'Clear' 14.876667
In [976]:
data['Main'] = data['Main'].str.replace("'main':"," ")
In [977]:
data.head()
Out[977]:
lat lon clouds rain month year weekdays Feels_like Pressure Humidity Speed Deg Main temp_average
0 25.07501 55.188761 1 0 1 2018 0 13.7 1015 87 3.1 150 'Clear' 15.330000
1 25.07501 55.188761 1 0 1 2018 0 13.91 1015 93 2.6 150 'Clear' 14.876667
2 25.07501 55.188761 1 0 1 2018 0 13.89 1016 93 1.5 150 'Clear' 14.343333
3 25.07501 55.188761 1 0 1 2018 0 13.14 1016 93 2.1 180 'Mist' 14.260000
4 25.07501 55.188761 1 0 1 2018 0 13.45 1017 93 2.6 160 'Mist' 14.760000
In [978]:
data.Main.value_counts()
Out[978]:
  'Clear'           12551
  'Clouds'           4626
  'Dust'             1136
  'Haze'              557
  'Rain'              183
  'Mist'              158
  'Fog'                98
  'Thunderstorm'       31
  'Smoke'               4
Name: Main, dtype: int64
In [981]:
data.temp_average
Out[981]:
0        15.330000
1        14.876667
2        14.343333
3        14.260000
4        14.760000
           ...    
19339    22.950000
19340    22.450000
19341    21.506667
19342    21.013333
19343    20.436667
Name: temp_average, Length: 19344, dtype: float64
In [989]:
#Checking for missing information
data.isnull().mean()*100
Out[989]:
lat             0.000000
lon             0.000000
clouds          0.000000
rain            0.000000
month           0.000000
year            0.000000
weekdays        0.000000
Feels_like      0.987386
Pressure        0.000000
Humidity        0.000000
Speed           4.337262
Deg             0.000000
Main            0.000000
temp_average    0.000000
dtype: float64
In [990]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19344 entries, 0 to 19343
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   lat           19344 non-null  float64
 1   lon           19344 non-null  float64
 2   clouds        19344 non-null  object 
 3   rain          19344 non-null  int64  
 4   month         19344 non-null  int64  
 5   year          19344 non-null  int64  
 6   weekdays      19344 non-null  int64  
 7   Feels_like    19153 non-null  object 
 8   Pressure      19344 non-null  object 
 9   Humidity      19344 non-null  object 
 10  Speed         18505 non-null  object 
 11  Deg           19344 non-null  object 
 12  Main          19344 non-null  object 
 13  temp_average  19344 non-null  float64
dtypes: float64(3), int64(4), object(7)
memory usage: 2.1+ MB
  • Filling missing values: since both affected features have less than 5% missing, we fill them with the column mean
In [991]:
data['Speed'] = data['Speed'].astype('float')
data['Feels_like'] = data['Feels_like'].astype('float')
data['Speed'].fillna(data['Speed'].mean(),inplace=True)
data['Feels_like'].fillna(data['Feels_like'].mean(),inplace=True)
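Equivalently, both columns can be filled in a single `fillna` call with a column-to-value mapping. A sketch on synthetic columns:

```python
import numpy as np
import pandas as pd

# Synthetic columns with a few missing readings.
df = pd.DataFrame({"Speed": [3.1, np.nan, 2.6],
                   "Feels_like": [13.7, 13.9, np.nan]})

# One fillna call: each listed column is filled with its own mean.
df = df.fillna({"Speed": df["Speed"].mean(),
                "Feels_like": df["Feels_like"].mean()})
print(df.isnull().sum().sum())  # 0
```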
In [992]:
data.isnull().sum()
Out[992]:
lat             0
lon             0
clouds          0
rain            0
month           0
year            0
weekdays        0
Feels_like      0
Pressure        0
Humidity        0
Speed           0
Deg             0
Main            0
temp_average    0
dtype: int64

Exploratory Data Analysis

With the data cleaned and prepared, Exploratory Data Analysis (EDA) is performed to gather insights that can guide modelling and help improve model performance.

In [993]:
data.head()
Out[993]:
lat lon clouds rain month year weekdays Feels_like Pressure Humidity Speed Deg Main temp_average
0 25.07501 55.188761 1 0 1 2018 0 13.70 1015 87 3.1 150 'Clear' 15.330000
1 25.07501 55.188761 1 0 1 2018 0 13.91 1015 93 2.6 150 'Clear' 14.876667
2 25.07501 55.188761 1 0 1 2018 0 13.89 1016 93 1.5 150 'Clear' 14.343333
3 25.07501 55.188761 1 0 1 2018 0 13.14 1016 93 2.1 180 'Mist' 14.260000
4 25.07501 55.188761 1 0 1 2018 0 13.45 1017 93 2.6 160 'Mist' 14.760000