Big DataDatathon 2020 Solutions

Predicting weather disruption of public transport – provided by Ernst and Young

0
votes

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in s3://dss-www-production/uploads/2020/05/Capstone-Project-Predicting-Transport-Disruption-using-Weather-conditions.html, line: 13082 in /home/keepthef/datasciencesociety.net/wp-content/plugins/dss-core/dss-core.php on line 447

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in s3://dss-www-production/uploads/2020/05/Capstone-Project-Predicting-Transport-Disruption-using-Weather-conditions.html, line: 13082 in /home/keepthef/datasciencesociety.net/wp-content/plugins/dss-core/dss-core.php on line 447

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in s3://dss-www-production/uploads/2020/05/Capstone-Project-Predicting-Transport-Disruption-using-Weather-conditions-1.html, line: 13082 in /home/keepthef/datasciencesociety.net/wp-content/plugins/dss-core/dss-core.php on line 447

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in s3://dss-www-production/uploads/2020/05/Capstone-Project-Predicting-Transport-Disruption-using-Weather-conditions-1.html, line: 13082 in /home/keepthef/datasciencesociety.net/wp-content/plugins/dss-core/dss-core.php on line 447

Datathon2020 – Predicting weather disruption of public transport – provided by Ernst and Young

This Project was inspired from the Business Case of Data Science Society Global 2020 Hackathon hosted from May 15 - 17 , 2020

click here for details about the Business Case and the data dictionary

Data Sources :

The datasets used in this project was provided by the organizers, however the external data sourced for were obtained here

The analysis for this project will follow the CRISP-DM pipeline which are ;

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Data Modelling
  • Results
  • Deployment - Storytelling

Business Understanding

The summary of the project is to predict public transport service disruption in Dubai using the weather data analysis

  • Goal : Can you analyze the weather data to predict public transport service disruption in Dubai? How can we plan for less disruption in the wake of severe weather conditions and leverage the emergency management plan as well as providing uninterrupted services and products to citizens?

Data Understanding and Data Preprocessing

This stage involves loading the data and performing necessary data cleaning, preprocessing and feature engineering on the data to prepare it for analysis and modelling

  • Importing Necessary Libraries
In [56]:
import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import datetime as dt

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

plt.style.use('ggplot')
import plotly.graph_objects as go
  • Loading the datasets into a dataframe
In [939]:
data = pd.read_json('Dubai+Weather_20180101_20200316.txt')
transport = pd.DataFrame(data=None,columns=['year','month','transport_type','trips'])

for i in os.listdir('Transport'):
    month_data = pd.read_csv("Transport/" + i)
    transport = pd.concat([transport,month_data],axis=0)
In [940]:
data.shape
Out[940]:
(19344, 11)
In [941]:
data.tail(3)
Out[941]:
city_name lat lon main wind clouds weather dt dt_iso timezone rain
19341 Dubai 25.07501 55.188761 {'temp': 21.52, 'temp_min': 20, 'temp_max': 23... {'speed': 3.1, 'deg': 60} {'all': 0} [{'id': 800, 'main': 'Clear', 'description': '... 1584392400 2020-03-16 21:00:00 +0000 UTC 14400 NaN
19342 Dubai 25.07501 55.188761 {'temp': 21.04, 'temp_min': 19, 'temp_max': 23... {'speed': 3.1, 'deg': 70} {'all': 0} [{'id': 800, 'main': 'Clear', 'description': '... 1584396000 2020-03-16 22:00:00 +0000 UTC 14400 NaN
19343 Dubai 25.07501 55.188761 {'temp': 20.31, 'temp_min': 18, 'temp_max': 23... {'speed': 3.6, 'deg': 60} {'all': 0} [{'id': 800, 'main': 'Clear', 'description': '... 1584399600 2020-03-16 23:00:00 +0000 UTC 14400 NaN
  • Data Preprocessing and Data Cleaning
In [942]:
transport.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 0 to 1
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   year            95 non-null     object
 1   month           95 non-null     object
 2   transport_type  95 non-null     object
 3   trips           95 non-null     object
dtypes: object(4)
memory usage: 3.7+ KB
In [943]:
transport.reset_index(inplace=True)
transport.drop('index',axis=1,inplace=True)
In [944]:
transport.head()
Out[944]:
year month transport_type trips
0 2018 Feb Marine 141840
1 2018 Feb Tram 528515
2 2018 Feb Bus 11111573
3 2018 Feb Metro 16915232
4 2018 Mar Marine 166561
  • Transforming the date to pandas date format

  • Dropping columns with constant labels such as city_name and timezone

In [945]:
data.drop(['city_name','timezone','dt_iso'],axis=1,inplace=True)
In [946]:
def convert_time(timestamp):
    return datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')
In [947]:
data['dt'] = data['dt'].apply(convert_time)
data['dt'] = pd.to_datetime(data['dt'])
In [948]:
data.head(2)
Out[948]:
lat lon main wind clouds weather dt rain
0 25.07501 55.188761 {'temp': 14.99, 'temp_min': 13, 'temp_max': 18... {'speed': 3.1, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 01:00:00 NaN
1 25.07501 55.188761 {'temp': 14.63, 'temp_min': 13, 'temp_max': 17... {'speed': 2.6, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 02:00:00 NaN
In [949]:
transport.head(2)
Out[949]:
year month transport_type trips
0 2018 Feb Marine 141840
1 2018 Feb Tram 528515

Feature Engineering

  • Using the date column created to engineer new date time features such as Month and Year
In [950]:
data['month'] = data['dt'].dt.month
data['year'] = data['dt'].dt.year
data['weekdays'] = data['dt'].dt.weekday
In [951]:
data.head()
Out[951]:
lat lon main wind clouds weather dt rain month year weekdays
0 25.07501 55.188761 {'temp': 14.99, 'temp_min': 13, 'temp_max': 18... {'speed': 3.1, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 01:00:00 NaN 1 2018 0
1 25.07501 55.188761 {'temp': 14.63, 'temp_min': 13, 'temp_max': 17... {'speed': 2.6, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 02:00:00 NaN 1 2018 0
2 25.07501 55.188761 {'temp': 14.03, 'temp_min': 12, 'temp_max': 17... {'speed': 1.5, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 03:00:00 NaN 1 2018 0
3 25.07501 55.188761 {'temp': 13.78, 'temp_min': 12, 'temp_max': 17... {'speed': 2.1, 'deg': 180} {'all': 1} [{'id': 701, 'main': 'Mist', 'description': 'm... 2018-01-01 04:00:00 NaN 1 2018 0
4 25.07501 55.188761 {'temp': 14.28, 'temp_min': 12, 'temp_max': 18... {'speed': 2.6, 'deg': 160} {'all': 1} [{'id': 701, 'main': 'Mist', 'description': 'm... 2018-01-01 05:00:00 NaN 1 2018 0
  • Creating the id in the data to be used to map the Transport data using the Create id function
  • Transforming the Main , Wind, Clouds , weather and rain columns to extract the details into a proper format to be used for analysis
In [952]:
data['main'].iloc[0]
Out[952]:
{'temp': 14.99,
 'temp_min': 13,
 'temp_max': 18,
 'feels_like': 13.7,
 'pressure': 1015,
 'humidity': 87}
In [953]:
main = data['main'].astype(str).str.strip('{}').str.split(', ', expand=True)
wind = data['wind'].astype(str).str.strip('{}').str.split(', ', expand=True)
weather = data['weather'].astype(str).str.strip('{}').str.split(', ', expand=True)
  • Renaming the Columns
In [954]:
main.columns = ['main_temp','Temp_min','Temp_max','Feels_like','Pressure','Humidity']
wind.columns = ['Speed','Deg']
weather.columns = ['id','Main','Description','icon','5','6','7','8']
weather.drop(['5','6','7','8','id'],axis=1,inplace=True)
In [955]:
data.head()
Out[955]:
lat lon main wind clouds weather dt rain month year weekdays
0 25.07501 55.188761 {'temp': 14.99, 'temp_min': 13, 'temp_max': 18... {'speed': 3.1, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 01:00:00 NaN 1 2018 0
1 25.07501 55.188761 {'temp': 14.63, 'temp_min': 13, 'temp_max': 17... {'speed': 2.6, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 02:00:00 NaN 1 2018 0
2 25.07501 55.188761 {'temp': 14.03, 'temp_min': 12, 'temp_max': 17... {'speed': 1.5, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 03:00:00 NaN 1 2018 0
3 25.07501 55.188761 {'temp': 13.78, 'temp_min': 12, 'temp_max': 17... {'speed': 2.1, 'deg': 180} {'all': 1} [{'id': 701, 'main': 'Mist', 'description': 'm... 2018-01-01 04:00:00 NaN 1 2018 0
4 25.07501 55.188761 {'temp': 14.28, 'temp_min': 12, 'temp_max': 18... {'speed': 2.6, 'deg': 160} {'all': 1} [{'id': 701, 'main': 'Mist', 'description': 'm... 2018-01-01 05:00:00 NaN 1 2018 0
In [956]:
data.drop(['main','wind','weather'],axis=1,inplace=True)
data = pd.concat([data,main,wind,weather],axis=1)
In [957]:
data.head(1)
Out[957]:
lat lon clouds dt rain month year weekdays main_temp Temp_min Temp_max Feels_like Pressure Humidity Speed Deg Main Description icon
0 25.07501 55.188761 {'all': 1} 2018-01-01 01:00:00 NaN 1 2018 0 'temp': 14.99 'temp_min': 13 'temp_max': 18 'feels_like': 13.7 'pressure': 1015 'humidity': 87 'speed': 3.1 'deg': 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
In [958]:
data['clouds'] = data['clouds'].astype(str).str.strip('{}').apply(lambda x:x.split(": ")[-1])
In [960]:
def replace_nan(data):
    if pd.isna(data):
        return 0
    else:
        return 1
In [961]:
data.rain = data.rain.apply(replace_nan)
In [962]:
data.rain.value_counts()
Out[962]:
0    19231
1      113
Name: rain, dtype: int64
In [963]:
data.head(1)
Out[963]:
lat lon clouds dt rain month year weekdays main_temp Temp_min Temp_max Feels_like Pressure Humidity Speed Deg Main Description icon
0 25.07501 55.188761 1 2018-01-01 01:00:00 0 1 2018 0 'temp': 14.99 'temp_min': 13 'temp_max': 18 'feels_like': 13.7 'pressure': 1015 'humidity': 87 'speed': 3.1 'deg': 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
In [964]:
cols1 = ['main_temp',
      'Feels_like',
      'Speed']
cols2 = ['Temp_min','Temp_max','Pressure','Humidity','Deg']
   
for column in cols1:
    data[column] = data[column].str.extract(r'(\d+\.\d+)',expand=False)
for column in cols2: 
    data[column] = data[column].str.extract(r'(\d+)',expand=False)
In [965]:
data[data['main_temp'].isnull()].head(2)
Out[965]:
lat lon clouds dt rain month year weekdays main_temp Temp_min Temp_max Feels_like Pressure Humidity Speed Deg Main Description icon
51 25.07501 55.188761 1 2018-01-03 04:00:00 0 1 2018 2 NaN 13 17 14.34 1015 87 4.1 170 'main': 'Haze' 'description': 'haze' 'icon': '50n'}]
87 25.07501 55.188761 75 2018-01-04 16:00:00 0 1 2018 3 NaN 18 22 18.39 1015 52 4.1 270 'main': 'Clouds' 'description': 'broken clouds' 'icon': '04n'}]
In [966]:
data['main_temp'].fillna(0,inplace=True)
In [968]:
data.head()
Out[968]:
lat lon clouds dt rain month year weekdays main_temp Temp_min Temp_max Feels_like Pressure Humidity Speed Deg Main Description icon
0 25.07501 55.188761 1 2018-01-01 01:00:00 0 1 2018 0 14.99 13 18 13.7 1015 87 3.1 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
1 25.07501 55.188761 1 2018-01-01 02:00:00 0 1 2018 0 14.63 13 17 13.91 1015 93 2.6 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
2 25.07501 55.188761 1 2018-01-01 03:00:00 0 1 2018 0 14.03 12 17 13.89 1016 93 1.5 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
3 25.07501 55.188761 1 2018-01-01 04:00:00 0 1 2018 0 13.78 12 17 13.14 1016 93 2.1 180 'main': 'Mist' 'description': 'mist' 'icon': '50n'}]
4 25.07501 55.188761 1 2018-01-01 05:00:00 0 1 2018 0 14.28 12 18 13.45 1017 93 2.6 160 'main': 'Mist' 'description': 'mist' 'icon': '50d'}]
In [969]:
def temp_aver(temp):
    temp_min = temp[1]
    temp_max = temp[2]
    main_temp = temp[0]
    if temp_min ==0:
        return (temp_min + temp_max)/2
    else:
        return (temp_min + temp_max+main_temp)/3
In [971]:
data[['main_temp','Temp_min','Temp_max']] = data[['main_temp','Temp_min','Temp_max']].astype('float')
data['temp_average'] =data[['main_temp','Temp_min','Temp_max']].apply(temp_aver,axis=1)
In [973]:
data.drop(['main_temp','Temp_min','Temp_max','dt','icon','Description'],axis=1,inplace=True)
In [974]:
data.head(2)
Out[974]:
lat lon clouds rain month year weekdays Feels_like Pressure Humidity Speed Deg Main temp_average
0 25.07501 55.188761 1 0 1 2018 0 13.7 1015 87 3.1 150 'main': 'Clear' 15.330000
1 25.07501 55.188761 1 0 1 2018 0 13.91 1015 93 2.6 150 'main': 'Clear' 14.876667
In [976]:
data['Main'] = data['Main'].str.replace("'main':"," ")
In [977]:
data.head()
Out[977]:
lat lon clouds rain month year weekdays Feels_like Pressure Humidity Speed Deg Main temp_average
0 25.07501 55.188761 1 0 1 2018 0 13.7 1015 87 3.1 150 'Clear' 15.330000
1 25.07501 55.188761 1 0 1 2018 0 13.91 1015 93 2.6 150 'Clear' 14.876667
2 25.07501 55.188761 1 0 1 2018 0 13.89 1016 93 1.5 150 'Clear' 14.343333
3 25.07501 55.188761 1 0 1 2018 0 13.14 1016 93 2.1 180 'Mist' 14.260000
4 25.07501 55.188761 1 0 1 2018 0 13.45 1017 93 2.6 160 'Mist' 14.760000
In [978]:
data.Main.value_counts()
Out[978]:
  'Clear'           12551
  'Clouds'           4626
  'Dust'             1136
  'Haze'              557
  'Rain'              183
  'Mist'              158
  'Fog'                98
  'Thunderstorm'       31
  'Smoke'               4
Name: Main, dtype: int64
In [981]:
data.temp_average
Out[981]:
0        15.330000
1        14.876667
2        14.343333
3        14.260000
4        14.760000
           ...    
19339    22.950000
19340    22.450000
19341    21.506667
19342    21.013333
19343    20.436667
Name: temp_average, Length: 19344, dtype: float64
In [989]:
#Checking for missing information
data.isnull().mean()*100
Out[989]:
lat             0.000000
lon             0.000000
clouds          0.000000
rain            0.000000
month           0.000000
year            0.000000
weekdays        0.000000
Feels_like      0.987386
Pressure        0.000000
Humidity        0.000000
Speed           4.337262
Deg             0.000000
Main            0.000000
temp_average    0.000000
dtype: float64
In [990]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19344 entries, 0 to 19343
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   lat           19344 non-null  float64
 1   lon           19344 non-null  float64
 2   clouds        19344 non-null  object 
 3   rain          19344 non-null  int64  
 4   month         19344 non-null  int64  
 5   year          19344 non-null  int64  
 6   weekdays      19344 non-null  int64  
 7   Feels_like    19153 non-null  object 
 8   Pressure      19344 non-null  object 
 9   Humidity      19344 non-null  object 
 10  Speed         18505 non-null  object 
 11  Deg           19344 non-null  object 
 12  Main          19344 non-null  object 
 13  temp_average  19344 non-null  float64
dtypes: float64(3), int64(4), object(7)
memory usage: 2.1+ MB
  • Filling missing information, since both features with missing information have less than 5% missing values, we will fill with the mean
In [991]:
data['Speed'] = data['Speed'].astype('float')
data['Feels_like'] = data['Feels_like'].astype('float')
data['Speed'].fillna(data['Speed'].mean(),inplace=True)
data['Feels_like'].fillna(data['Feels_like'].mean(),inplace=True)
In [992]:
data.isnull().sum()
Out[992]:
lat             0
lon             0
clouds          0
rain            0
month           0
year            0
weekdays        0
Feels_like      0
Pressure        0
Humidity        0
Speed           0
Deg             0
Main            0
temp_average    0
dtype: int64

Exploratory Data Analysis

After cleaning the data and preparing it, Exploratory Data Analysis (EDA) will be performed to gather insights that will be useful for the model to learn from the data to help improve performance of the model

In [993]:
data.head()
Out[993]:
lat lon clouds rain month year weekdays Feels_like Pressure Humidity Speed Deg Main temp_average
0 25.07501 55.188761 1 0 1 2018 0 13.70 1015 87 3.1 150 'Clear' 15.330000
1 25.07501 55.188761 1 0 1 2018 0 13.91 1015 93 2.6 150 'Clear' 14.876667
2 25.07501 55.188761 1 0 1 2018 0 13.89 1016 93 1.5 150 'Clear' 14.343333
3 25.07501 55.188761 1 0 1 2018 0 13.14 1016 93 2.1 180 'Mist' 14.260000
4 25.07501 55.188761 1 0 1 2018 0 13.45 1017 93 2.6 160 'Mist' 14.760000
In [ ]:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Data Modelling
  • Results
  • Deployment – Storytelling

Datathon2020 – Predicting weather disruption of public transport – provided by Ernst and Young

This Project was inspired from the Business Case of Data Science Society Global 2020 Hackathon hosted from May 15 - 17 , 2020

click here for details about the Business Case and the data dictionary

Data Sources :

The datasets used in this project was provided by the organizers, however the external data sourced for were obtained here

The analysis for this project will follow the CRISP-DM pipeline which are ;

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Data Modelling
  • Results
  • Deployment - Storytelling

Business Understanding

The summary of the project is to predict public transport service disruption in Dubai using the weather data analysis

  • Goal : Can you analyze the weather data to predict public transport service disruption in Dubai? How can we plan for less disruption in the wake of severe weather conditions and leverage the emergency management plan as well as providing uninterrupted services and products to citizens?

Data Understanding and Data Preprocessing

This stage involves loading the data and performing necessary data cleaning, preprocessing and feature engineering on the data to prepare it for analysis and modelling

  • Importing Necessary Libraries
In [56]:
import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import datetime as dt

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

plt.style.use('ggplot')
import plotly.graph_objects as go
  • Loading the datasets into a dataframe
In [939]:
data = pd.read_json('Dubai+Weather_20180101_20200316.txt')
transport = pd.DataFrame(data=None,columns=['year','month','transport_type','trips'])

for i in os.listdir('Transport'):
    month_data = pd.read_csv("Transport/" + i)
    transport = pd.concat([transport,month_data],axis=0)
In [940]:
data.shape
Out[940]:
(19344, 11)
In [941]:
data.tail(3)
Out[941]:
city_name lat lon main wind clouds weather dt dt_iso timezone rain
19341 Dubai 25.07501 55.188761 {'temp': 21.52, 'temp_min': 20, 'temp_max': 23... {'speed': 3.1, 'deg': 60} {'all': 0} [{'id': 800, 'main': 'Clear', 'description': '... 1584392400 2020-03-16 21:00:00 +0000 UTC 14400 NaN
19342 Dubai 25.07501 55.188761 {'temp': 21.04, 'temp_min': 19, 'temp_max': 23... {'speed': 3.1, 'deg': 70} {'all': 0} [{'id': 800, 'main': 'Clear', 'description': '... 1584396000 2020-03-16 22:00:00 +0000 UTC 14400 NaN
19343 Dubai 25.07501 55.188761 {'temp': 20.31, 'temp_min': 18, 'temp_max': 23... {'speed': 3.6, 'deg': 60} {'all': 0} [{'id': 800, 'main': 'Clear', 'description': '... 1584399600 2020-03-16 23:00:00 +0000 UTC 14400 NaN
  • Data Preprocessing and Data Cleaning
In [942]:
transport.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 0 to 1
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   year            95 non-null     object
 1   month           95 non-null     object
 2   transport_type  95 non-null     object
 3   trips           95 non-null     object
dtypes: object(4)
memory usage: 3.7+ KB
In [943]:
transport.reset_index(inplace=True)
transport.drop('index',axis=1,inplace=True)
In [944]:
transport.head()
Out[944]:
year month transport_type trips
0 2018 Feb Marine 141840
1 2018 Feb Tram 528515
2 2018 Feb Bus 11111573
3 2018 Feb Metro 16915232
4 2018 Mar Marine 166561
  • Transforming the date to pandas date format

  • Dropping columns with constant labels such as city_name and timezone

In [945]:
data.drop(['city_name','timezone','dt_iso'],axis=1,inplace=True)
In [946]:
def convert_time(timestamp):
    return datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')
In [947]:
data['dt'] = data['dt'].apply(convert_time)
data['dt'] = pd.to_datetime(data['dt'])
In [948]:
data.head(2)
Out[948]:
lat lon main wind clouds weather dt rain
0 25.07501 55.188761 {'temp': 14.99, 'temp_min': 13, 'temp_max': 18... {'speed': 3.1, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 01:00:00 NaN
1 25.07501 55.188761 {'temp': 14.63, 'temp_min': 13, 'temp_max': 17... {'speed': 2.6, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 02:00:00 NaN
In [949]:
transport.head(2)
Out[949]:
year month transport_type trips
0 2018 Feb Marine 141840
1 2018 Feb Tram 528515

Feature Engineering

  • Using the date column created to engineer new date time features such as Month and Year
In [950]:
data['month'] = data['dt'].dt.month
data['year'] = data['dt'].dt.year
data['weekdays'] = data['dt'].dt.weekday
In [951]:
data.head()
Out[951]:
lat lon main wind clouds weather dt rain month year weekdays
0 25.07501 55.188761 {'temp': 14.99, 'temp_min': 13, 'temp_max': 18... {'speed': 3.1, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 01:00:00 NaN 1 2018 0
1 25.07501 55.188761 {'temp': 14.63, 'temp_min': 13, 'temp_max': 17... {'speed': 2.6, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 02:00:00 NaN 1 2018 0
2 25.07501 55.188761 {'temp': 14.03, 'temp_min': 12, 'temp_max': 17... {'speed': 1.5, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 03:00:00 NaN 1 2018 0
3 25.07501 55.188761 {'temp': 13.78, 'temp_min': 12, 'temp_max': 17... {'speed': 2.1, 'deg': 180} {'all': 1} [{'id': 701, 'main': 'Mist', 'description': 'm... 2018-01-01 04:00:00 NaN 1 2018 0
4 25.07501 55.188761 {'temp': 14.28, 'temp_min': 12, 'temp_max': 18... {'speed': 2.6, 'deg': 160} {'all': 1} [{'id': 701, 'main': 'Mist', 'description': 'm... 2018-01-01 05:00:00 NaN 1 2018 0
  • Creating the id in the data to be used to map the Transport data using the Create id function
  • Transforming the Main , Wind, Clouds , weather and rain columns to extract the details into a proper format to be used for analysis
In [952]:
data['main'].iloc[0]
Out[952]:
{'temp': 14.99,
 'temp_min': 13,
 'temp_max': 18,
 'feels_like': 13.7,
 'pressure': 1015,
 'humidity': 87}
In [953]:
main = data['main'].astype(str).str.strip('{}').str.split(', ', expand=True)
wind = data['wind'].astype(str).str.strip('{}').str.split(', ', expand=True)
weather = data['weather'].astype(str).str.strip('{}').str.split(', ', expand=True)
  • Renaming the Columns
In [954]:
main.columns = ['main_temp','Temp_min','Temp_max','Feels_like','Pressure','Humidity']
wind.columns = ['Speed','Deg']
weather.columns = ['id','Main','Description','icon','5','6','7','8']
weather.drop(['5','6','7','8','id'],axis=1,inplace=True)
In [955]:
data.head()
Out[955]:
lat lon main wind clouds weather dt rain month year weekdays
0 25.07501 55.188761 {'temp': 14.99, 'temp_min': 13, 'temp_max': 18... {'speed': 3.1, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 01:00:00 NaN 1 2018 0
1 25.07501 55.188761 {'temp': 14.63, 'temp_min': 13, 'temp_max': 17... {'speed': 2.6, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 02:00:00 NaN 1 2018 0
2 25.07501 55.188761 {'temp': 14.03, 'temp_min': 12, 'temp_max': 17... {'speed': 1.5, 'deg': 150} {'all': 1} [{'id': 800, 'main': 'Clear', 'description': '... 2018-01-01 03:00:00 NaN 1 2018 0
3 25.07501 55.188761 {'temp': 13.78, 'temp_min': 12, 'temp_max': 17... {'speed': 2.1, 'deg': 180} {'all': 1} [{'id': 701, 'main': 'Mist', 'description': 'm... 2018-01-01 04:00:00 NaN 1 2018 0
4 25.07501 55.188761 {'temp': 14.28, 'temp_min': 12, 'temp_max': 18... {'speed': 2.6, 'deg': 160} {'all': 1} [{'id': 701, 'main': 'Mist', 'description': 'm... 2018-01-01 05:00:00 NaN 1 2018 0
In [956]:
data.drop(['main','wind','weather'],axis=1,inplace=True)
data = pd.concat([data,main,wind,weather],axis=1)
In [957]:
data.head(1)
Out[957]:
lat lon clouds dt rain month year weekdays main_temp Temp_min Temp_max Feels_like Pressure Humidity Speed Deg Main Description icon
0 25.07501 55.188761 {'all': 1} 2018-01-01 01:00:00 NaN 1 2018 0 'temp': 14.99 'temp_min': 13 'temp_max': 18 'feels_like': 13.7 'pressure': 1015 'humidity': 87 'speed': 3.1 'deg': 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
In [958]:
data['clouds'] = data['clouds'].astype(str).str.strip('{}').apply(lambda x:x.split(": ")[-1])
In [960]:
def replace_nan(data):
    if pd.isna(data):
        return 0
    else:
        return 1
In [961]:
data.rain = data.rain.apply(replace_nan)
In [962]:
data.rain.value_counts()
Out[962]:
0    19231
1      113
Name: rain, dtype: int64
In [963]:
data.head(1)
Out[963]:
lat lon clouds dt rain month year weekdays main_temp Temp_min Temp_max Feels_like Pressure Humidity Speed Deg Main Description icon
0 25.07501 55.188761 1 2018-01-01 01:00:00 0 1 2018 0 'temp': 14.99 'temp_min': 13 'temp_max': 18 'feels_like': 13.7 'pressure': 1015 'humidity': 87 'speed': 3.1 'deg': 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
In [964]:
cols1 = ['main_temp',
      'Feels_like',
      'Speed']
cols2 = ['Temp_min','Temp_max','Pressure','Humidity','Deg']
   
for column in cols1:
    data[column] = data[column].str.extract(r'(\d+\.\d+)',expand=False)
for column in cols2: 
    data[column] = data[column].str.extract(r'(\d+)',expand=False)
In [965]:
data[data['main_temp'].isnull()].head(2)
Out[965]:
lat lon clouds dt rain month year weekdays main_temp Temp_min Temp_max Feels_like Pressure Humidity Speed Deg Main Description icon
51 25.07501 55.188761 1 2018-01-03 04:00:00 0 1 2018 2 NaN 13 17 14.34 1015 87 4.1 170 'main': 'Haze' 'description': 'haze' 'icon': '50n'}]
87 25.07501 55.188761 75 2018-01-04 16:00:00 0 1 2018 3 NaN 18 22 18.39 1015 52 4.1 270 'main': 'Clouds' 'description': 'broken clouds' 'icon': '04n'}]
In [966]:
data['main_temp'].fillna(0,inplace=True)
In [968]:
data.head()
Out[968]:
lat lon clouds dt rain month year weekdays main_temp Temp_min Temp_max Feels_like Pressure Humidity Speed Deg Main Description icon
0 25.07501 55.188761 1 2018-01-01 01:00:00 0 1 2018 0 14.99 13 18 13.7 1015 87 3.1 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
1 25.07501 55.188761 1 2018-01-01 02:00:00 0 1 2018 0 14.63 13 17 13.91 1015 93 2.6 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
2 25.07501 55.188761 1 2018-01-01 03:00:00 0 1 2018 0 14.03 12 17 13.89 1016 93 1.5 150 'main': 'Clear' 'description': 'sky is clear' 'icon': '01n'}]
3 25.07501 55.188761 1 2018-01-01 04:00:00 0 1 2018 0 13.78 12 17 13.14 1016 93 2.1 180 'main': 'Mist' 'description': 'mist' 'icon': '50n'}]
4 25.07501 55.188761 1 2018-01-01 05:00:00 0 1 2018 0 14.28 12 18 13.45 1017 93 2.6 160 'main': 'Mist' 'description': 'mist' 'icon': '50d'}]
In [969]:
def temp_aver(temp):
    temp_min = temp[1]
    temp_max = temp[2]
    main_temp = temp[0]
    if temp_min ==0:
        return (temp_min + temp_max)/2
    else:
        return (temp_min + temp_max+main_temp)/3
In [971]:
data[['main_temp','Temp_min','Temp_max']] = data[['main_temp','Temp_min','Temp_max']].astype('float')
data['temp_average'] =data[['main_temp','Temp_min','Temp_max']].apply(temp_aver,axis=1)
In [973]:
data.drop(['main_temp','Temp_min','Temp_max','dt','icon','Description'],axis=1,inplace=True)
In [974]:
data.head(2)
Out[974]:
lat lon clouds rain month year weekdays Feels_like Pressure Humidity Speed Deg Main temp_average
0 25.07501 55.188761 1 0 1 2018 0 13.7 1015 87 3.1 150 'main': 'Clear' 15.330000
1 25.07501 55.188761 1 0 1 2018 0 13.91 1015 93 2.6 150 'main': 'Clear' 14.876667
In [976]:
data['Main'] = data['Main'].str.replace("'main':"," ")
In [977]:
data.head()
Out[977]:
lat lon clouds rain month year weekdays Feels_like Pressure Humidity Speed Deg Main temp_average
0 25.07501 55.188761 1 0 1 2018 0 13.7 1015 87 3.1 150 'Clear' 15.330000
1 25.07501 55.188761 1 0 1 2018 0 13.91 1015 93 2.6 150 'Clear' 14.876667
2 25.07501 55.188761 1 0 1 2018 0 13.89 1016 93 1.5 150 'Clear' 14.343333
3 25.07501 55.188761 1 0 1 2018 0 13.14 1016 93 2.1 180 'Mist' 14.260000
4 25.07501 55.188761 1 0 1 2018 0 13.45 1017 93 2.6 160 'Mist' 14.760000
In [978]:
data.Main.value_counts()
Out[978]:
  'Clear'           12551
  'Clouds'           4626
  'Dust'             1136
  'Haze'              557
  'Rain'              183
  'Mist'              158
  'Fog'                98
  'Thunderstorm'       31
  'Smoke'               4
Name: Main, dtype: int64
In [981]:
data.temp_average
Out[981]:
0        15.330000
1        14.876667
2        14.343333
3        14.260000
4        14.760000
           ...    
19339    22.950000
19340    22.450000
19341    21.506667
19342    21.013333
19343    20.436667
Name: temp_average, Length: 19344, dtype: float64
In [989]:
#Checking for missing information
data.isnull().mean()*100
Out[989]:
lat             0.000000
lon             0.000000
clouds          0.000000
rain            0.000000
month           0.000000
year            0.000000
weekdays        0.000000
Feels_like      0.987386
Pressure        0.000000
Humidity        0.000000
Speed           4.337262
Deg             0.000000
Main            0.000000
temp_average    0.000000
dtype: float64
In [990]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19344 entries, 0 to 19343
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   lat           19344 non-null  float64
 1   lon           19344 non-null  float64
 2   clouds        19344 non-null  object 
 3   rain          19344 non-null  int64  
 4   month         19344 non-null  int64  
 5   year          19344 non-null  int64  
 6   weekdays      19344 non-null  int64  
 7   Feels_like    19153 non-null  object 
 8   Pressure      19344 non-null  object 
 9   Humidity      19344 non-null  object 
 10  Speed         18505 non-null  object 
 11  Deg           19344 non-null  object 
 12  Main          19344 non-null  object 
 13  temp_average  19344 non-null  float64
dtypes: float64(3), int64(4), object(7)
memory usage: 2.1+ MB
  • Filling missing information, since both features with missing information have less than 5% missing values, we will fill with the mean
In [991]:
data['Speed'] = data['Speed'].astype('float')
data['Feels_like'] = data['Feels_like'].astype('float')
data['Speed'].fillna(data['Speed'].mean(),inplace=True)
data['Feels_like'].fillna(data['Feels_like'].mean(),inplace=True)
In [992]:
data.isnull().sum()
Out[992]:
lat             0
lon             0
clouds          0
rain            0
month           0
year            0
weekdays        0
Feels_like      0
Pressure        0
Humidity        0
Speed           0
Deg             0
Main            0
temp_average    0
dtype: int64

Exploratory Data Analysis

After cleaning the data and preparing it, Exploratory Data Analysis (EDA) will be performed to gather insights that will be useful for the model to learn from the data to help improve performance of the model

In [993]:
data.head()
Out[993]:
lat lon clouds rain month year weekdays Feels_like Pressure Humidity Speed Deg Main temp_average
0 25.07501 55.188761 1 0 1 2018 0 13.70 1015 87 3.1 150 'Clear' 15.330000
1 25.07501 55.188761 1 0 1 2018 0 13.91 1015 93 2.6 150 'Clear' 14.876667
2 25.07501 55.188761 1 0 1 2018 0 13.89 1016 93 1.5 150 'Clear' 14.343333
3 25.07501 55.188761 1 0 1 2018 0 13.14 1016 93 2.1 180 'Mist' 14.260000
4 25.07501 55.188761 1 0 1 2018 0 13.45 1017 93 2.6 160 'Mist' 14.760000
In [ ]:

Share this

3 thoughts on “Predicting weather disruption of public transport – provided by Ernst and Young

  1. 0
    votes

    ☑️ ☑️DO YOU WANT TO RECOVER YOUR LOST FUNDS ON CAPITAL INVESTMENTS, BITCOIN INVESTMENTS, BINARY OPTIONS AND OTHER SCAM TRADING INVESTMENTS ??? TAKE YOUR TIME TO READ🔘

    ☑️ The COMPOSITE CYBER SECURITY SPECIALISTS have received numerous complaints of fraud associated with websites that offers opportunities for Capital Investments, bitcoin investments and Trading on an Internet-based trading platforms. Most Of The complaints falls into these Two categories:
    1. 🔘Refusal to credit customers accounts or reimburse funds to customers:
    These complaints typically involve customers who have deposited money into their trading or investment account and who are then encouraged by “brokers” over the telephone to deposit additional funds into the customer account. When customers later attempt to withdraw their original deposit or the return they have been promised, the trading platforms allegedly cancels customers’ withdrawal requests, refuse to credit their accounts, or ignore their telephone calls and emails.

    2. 🔘Manipulation of software to generate losing trades:
    These complaints alleged that the Internet-based Investment and trading platforms manipulate the trading software to distort the Trading prices and payouts in order to ensure that the trade results in a Loss. For example, when a customer’s trade is “winning,” the countdown to expiration is extended arbitrarily until the trade becomes a loss.

    ☑️ Most people have lost their hard earned money through Scam Investments And Trading, yet they would go and meet fake recovery Experts unknowingly to help them recover their money and they would end up losing more money in the process. This Is Basically why we (COMPOSITE CYBER SECURITY SPECIALISTS) have come to y’all victim’s rescue. The clue is most of these Brokers have weak Database security, and their vulnerabilities can be exploited easily with the Help of our Special HackTools, Root HackTools And Technical Hacking Strategies because they wouldn’t wanna conduct Bug bounty Programs which would have helped strengthen and protect their Database from Unauthorized accesses, So all our specialists do is to hack into the Broker’s Database Using Technical Hacking Methods and Strategies, Decrypt your Transaction Details, Trace the Routes of your Deposited Funds, Then some Technical Hacking Algorithms & Execution follows then you have your money recovered In Bitcoins. 💰 ✔️

    ☑️All our Specialists are well experienced in their various niches with Great Skills, Technical Hacking Strategies And Positive Online Reputations And Recommendations🔘
    They hail from a proven track record and have cracked even the toughest of barriers to intrude and capture all relevant data needed by our Clients.
    We have Digital Forensic Specialists, Certified Ethical Hackers, Software Engineers, Cyber Security Experts, Private investigators and more. Our Goal is to make your digital life secure, safe and hassle free by Linking you Up With these great Professionals such as JACK CABLE, ARNE SWINNEN, SEAN MELIA, DAWID CZAGAN, BEN SADEGHIPOUR And More. These Professionals are Well Reserved Professionals who are always ready to Handle your job with great energy and swift response so that your problems can be solved very quickly.
    All You Need to Do is to send us a mail and we’ll Assign any of these specialists to Handle your Job immediately.

    ☑️ CONTACT:
    ••• Email:
    [email protected]

    🔘2020 © composite cybersecurity specialists
    🔘Want faster service? Contact us!
    🔘All Rights Reserved ®️

  2. 0
    votes

    INTERNET SCAM ALERT‼️

    The internet today is full of SCAM ADS, mostly in comments of various sites and blogs. A large number of individuals have been victims of scam and lost a lot of money to SCAMMERS. Most of the common scam you can see
    ❌BANK LOAN SCAM. ❌BINARY OPTIONS SCAM.
    ❌LOTTERY SCAM. ❌HACKING SCAM. and lost more……

    ✳️The big Question is “Can someone Recover their money lost to Internet Scam⁉️
    I will say yes, and will tell you how.

    The only way to Recovery your is to hire a Hacker who will help you take back your money from this Scammers and today is your lucky day, you just met the guys perfect for the job.

    WHO ARE WE❔
    We are PYTHONAX, A group of skilled Hackers and have dedicated our time to help individuals get back thier money from INTERNET SCAMMERS. There is a research was carried one to calculate the amount of money lost through Scam, and it was confirmed that more than USD $3billion annually.

    HOW DO WE OPERATE❔
    We first of all study the scammer brought to us by hacking the person device(phone or computer) to get information of How, Where, this person keeps money he/she as defrauded people of ( so many of this scammers don’t actually save the money in banks, they mostly stack the money in a Bitcoin wallet, that way it is safe and untraceable to authorities) and we work on a strategy to get back the money and give it back to whom they have defrauded.

    Contacting us is simple, just give us a message through the email below.
    Email-: [email protected]

    If you a victim of internet scam or you know someone who is, make contact to us immediately. You are 💯 % safe to contact us, our email is very secure.

    We also provide Legit Hacking Services such as-:
    🔸Phone Hacking/Cloning
    🔸Email Hacking & Password Recovery
    🔸Social Media Hacking & Passowrd Recovery
    🔸Deleted Files Recovery 🔸Mobil Tracking
    🔸Virus detection & Elimination. e.t.c

    Email-: [email protected]

    Pythonax.
    2020 © All Right Reserved.

Leave a Reply