Datathon 2020 SolutionsDatathons Solutions

Weather Disruption of Public Transport Analysis Using Python

The Weather Dataset provided has been preprocessed the traffic data ha been appended after preprocessing.The aim is to find the intersection dates available from both the datasets and do a predictive analsyis after combining traffic and weather datasets.so if future weather conditions are given or predicted by time series analysis ,public trasnport disruption could be interpreted using machine learning models.

(Just a small try by an undergrad engineering student,Hope you like it ๐Ÿ™‚ ).

0
votes

semifinal qualification progress:

  • Approach:

datasets have been downloadedย  for public transport(RTA dubai pulse) for the dates which were common with the weather dataset.(01-Jan-2018 to 31-Mar-2018) in a file and have preprocessed the datasets and got the hourly traffic flow numbers on each of the days (between 01-Jan-2018 to 31-Mar-2018) these hourly traffic flows were appended to the weather dataset.

  • Analysis aim:

To analyse how traffic would increase or decrease during various weather conditions by applyingย  various machine learning and time series models.

python code:

In [12]:
import numpy as np
import pandas as pd
import statsmodels as sm
import sklearn as sk
import matplotlib.pyplot as plt
import seaborn as sns
import os
from datetime import datetime
sns.set()

Weather dataset preprocessing

keeping only the necessary features to predict how traffic is affected by various weather conditions. future aim: using machine learning models and time series analysis

In [2]:
main_data=pd.read_csv('C:\\Users\\Taha\\Desktop\\weather_data (2).csv')
data=main_data
data
Out[2]:
city_name lat lon main/temp main/temp_min main/temp_max main/feels_like main/pressure main/humidity wind/speed ... weather/0/icon dt dt_iso timezone rain/1h weather/1/id weather/1/main weather/1/description weather/1/icon rain/3h
0 Dubai 25.07501 55.188761 14.99 13.0 18.00 13.70 1015 87 3.1 ... 01n 1514764800 2018-01-01 00:00:00 +0000 UTC 14400 NaN NaN NaN NaN NaN NaN
1 Dubai 25.07501 55.188761 14.63 13.0 17.00 13.91 1015 93 2.6 ... 01n 1514768400 2018-01-01 01:00:00 +0000 UTC 14400 NaN NaN NaN NaN NaN NaN
2 Dubai 25.07501 55.188761 14.03 12.0 17.00 13.89 1016 93 1.5 ... 01n 1514772000 2018-01-01 02:00:00 +0000 UTC 14400 NaN NaN NaN NaN NaN NaN
3 Dubai 25.07501 55.188761 13.78 12.0 17.00 13.14 1016 93 2.1 ... 50n 1514775600 2018-01-01 03:00:00 +0000 UTC 14400 NaN NaN NaN NaN NaN NaN
4 Dubai 25.07501 55.188761 14.28 12.0 18.00 13.45 1017 93 2.6 ... 50d 1514779200 2018-01-01 04:00:00 +0000 UTC 14400 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19339 Dubai 25.07501 55.188761 22.85 21.0 25.45 22.19 1015 64 3.6 ... 01n 1584385200 2020-03-16 19:00:00 +0000 UTC 14400 NaN NaN NaN NaN NaN NaN
19340 Dubai 25.07501 55.188761 22.35 21.0 24.00 21.17 1015 68 4.6 ... 01n 1584388800 2020-03-16 20:00:00 +0000 UTC 14400 NaN NaN NaN NaN NaN NaN
19341 Dubai 25.07501 55.188761 21.52 20.0 23.36 21.43 1015 72 3.1 ... 01n 1584392400 2020-03-16 21:00:00 +0000 UTC 14400 NaN NaN NaN NaN NaN NaN
19342 Dubai 25.07501 55.188761 21.04 19.0 23.36 21.19 1014 77 3.1 ... 01n 1584396000 2020-03-16 22:00:00 +0000 UTC 14400 NaN NaN NaN NaN NaN NaN
19343 Dubai 25.07501 55.188761 20.31 18.0 23.28 19.83 1014 77 3.6 ... 01n 1584399600 2020-03-16 23:00:00 +0000 UTC 14400 NaN NaN NaN NaN NaN NaN

19344 rows × 25 columns

In [3]:
data.describe()
Out[3]:
lat lon main/temp main/temp_min main/temp_max main/feels_like main/pressure main/humidity wind/speed wind/deg clouds/all weather/0/id dt timezone rain/1h weather/1/id rain/3h
count 1.934400e+04 1.934400e+04 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 19344.000000 1.934400e+04 19344.0 28.000000 3.000000 85.000000
mean 2.507501e+01 5.518876e+01 28.102823 26.661868 29.810532 27.684793 1009.416098 52.495089 3.879056 188.379239 13.751964 791.048904 1.549582e+09 14400.0 0.415357 674.333333 0.828706
std 1.291090e-11 2.732107e-11 7.329419 7.580049 7.240840 8.309911 8.017003 21.660800 2.098738 106.258945 26.479664 41.348948 2.010339e+07 0.0 0.456707 150.111070 0.615405
min 2.507501e+01 5.518876e+01 10.890000 7.000000 12.000000 6.340000 972.000000 4.000000 0.300000 0.000000 0.000000 200.000000 1.514765e+09 14400.0 0.110000 501.000000 0.130000
25% 2.507501e+01 5.518876e+01 22.030000 20.920000 23.840000 20.750000 1003.000000 35.000000 2.315000 100.000000 0.000000 800.000000 1.532174e+09 14400.0 0.167500 631.000000 0.380000
50% 2.507501e+01 5.518876e+01 28.060000 26.670000 30.000000 27.315000 1011.000000 53.000000 3.600000 180.000000 1.000000 800.000000 1.549582e+09 14400.0 0.215000 761.000000 1.000000
75% 2.507501e+01 5.518876e+01 33.880000 32.810000 35.122500 34.890000 1016.000000 69.000000 5.100000 290.000000 19.000000 800.000000 1.566991e+09 14400.0 0.400000 761.000000 1.000000
max 2.507501e+01 5.518876e+01 45.940000 45.360000 48.000000 47.890000 1026.000000 100.000000 14.900000 360.000000 100.000000 804.000000 1.584400e+09 14400.0 2.030000 761.000000 3.810000
In [4]:
list(data.columns.values)
data.notna().sum()
Out[4]:
city_name                19344
lat                      19344
lon                      19344
main/temp                19344
main/temp_min            19344
main/temp_max            19344
main/feels_like          19344
main/pressure            19344
main/humidity            19344
wind/speed               19344
wind/deg                 19344
clouds/all               19344
weather/0/id             19344
weather/0/main           19344
weather/0/description    19344
weather/0/icon           19344
dt                       19344
dt_iso                   19344
timezone                 19344
rain/1h                     28
weather/1/id                 3
weather/1/main               3
weather/1/description        3
weather/1/icon               3
rain/3h                     85
dtype: int64
In [5]:
data=data.drop(['city_name','dt','lat','lon','main/temp_min','main/temp_max','weather/0/id','weather/1/id','weather/0/icon',
                'weather/1/main','weather/1/description','timezone','weather/1/icon'],axis=1)
In [6]:
(data.columns.values)
data['rain/1h'].fillna(0, inplace=True)
data['rain/3h'].fillna(0, inplace=True)
In [7]:
column_names = ['dt_iso','main/temp', 'main/feels_like', 'main/pressure', 'main/humidity',
       'wind/speed', 'wind/deg', 'clouds/all', 'rain/1h', 'rain/3h', 'weather/0/main',
       'weather/0/description']

data = data.reindex(columns=column_names)
data=data.rename(columns={"main/temp":"temp",
                   "main/feels_like":"temp_feels_like",
                   "main/pressure":"main_pressure",
                   "main/humidity":"humidity",
                   "wind/speed":"wind_speed",
                   "wind/deg":"wind_deg",
                   "clouds/all":"clouds",
                   "rain/1h":"rain_1h",
                   "rain/3h":"rain_3h", 
                   "weather/0/main":"weather_main",
                   "weather/0/description":"weather_description"})
In [8]:
data['dt_iso']=data['dt_iso'].str[0:19]
data
Out[8]:
dt_iso temp temp_feels_like main_pressure humidity wind_speed wind_deg clouds rain_1h rain_3h weather_main weather_description
0 2018-01-01 00:00:00 14.99 13.70 1015 87 3.1 150 1 0.0 0.0 Clear sky is clear
1 2018-01-01 01:00:00 14.63 13.91 1015 93 2.6 150 1 0.0 0.0 Clear sky is clear
2 2018-01-01 02:00:00 14.03 13.89 1016 93 1.5 150 1 0.0 0.0 Clear sky is clear
3 2018-01-01 03:00:00 13.78 13.14 1016 93 2.1 180 1 0.0 0.0 Mist mist
4 2018-01-01 04:00:00 14.28 13.45 1017 93 2.6 160 1 0.0 0.0 Mist mist
... ... ... ... ... ... ... ... ... ... ... ... ...
19339 2020-03-16 19:00:00 22.85 22.19 1015 64 3.6 50 0 0.0 0.0 Clear sky is clear
19340 2020-03-16 20:00:00 22.35 21.17 1015 68 4.6 60 0 0.0 0.0 Clear sky is clear
19341 2020-03-16 21:00:00 21.52 21.43 1015 72 3.1 60 0 0.0 0.0 Clear sky is clear
19342 2020-03-16 22:00:00 21.04 21.19 1014 77 3.1 70 0 0.0 0.0 Clear sky is clear
19343 2020-03-16 23:00:00 20.31 19.83 1014 77 3.6 60 0 0.0 0.0 Clear sky is clear

19344 rows × 12 columns

In [9]:
data["traffic"] = np.nan
data['dt_iso'] = data['dt_iso'].astype('str')
In [13]:
data.plot(kind='scatter',x='temp',y='weather_main',color='magenta')
plt.show()
data.plot(kind='scatter',x='temp',y='weather_description',color='orange')
plt.show()

data.plot(kind='scatter',x='temp',y='wind_speed',color='green')
plt.show()
data.plot(kind='scatter',x='temp',y='clouds',color='blue')
plt.show()

corrmatrix=data.corr()
sns.heatmap(corrmatrix, annot=True)
plt.show()

Traffic data Preprocessing

i have downloaded the datasets for public transport(RTA dubai pulse) for the dates which were common with the weather dataset.(01-Jan-2018 to 31-Mar-2018) in a file and have preprocessed the datasets and got the hourly traffic flow numbers on each of the days (between 01-Jan-2018 to 31-Mar-2018) these hourly traffic flows were appended to the weather dataset.

Future aim:To analyse how traffic would increase or decrease during various weather cconditions using machine learning models

In [11]:
with os.scandir("C:\\Users\\Taha\\Desktop\\traffic data\\") as files:
    for file in files:
        print("Accessing file"+':'+file.name)
        date=str(file.name[22:24])
        month=str(file.name[19:21])
        date='2018-'+month+'-'+date+' '
        
        
        #filter the data with only the dates of the filename and sort them by timestamp.
        week_day = pd.read_csv("C:\\Users\\Taha\\Desktop\\traffic data\\"+file.name)
        weekday=week_day[['txn_date','txn_time']]
        weekday["datetime"] = weekday["txn_date"] +' ' +weekday["txn_time"]
        weekday.drop(['txn_time','txn_date'],axis=1,inplace=True)
        weekday.sort_values(by=['datetime'],inplace=True)
        weekday=weekday.reset_index(drop=True)
        weekday1 =weekday['datetime'].str.contains(date) 
        weekday1=weekday[weekday1]
        weekday1= weekday1.reset_index(drop=True)
        weekday1['datetime'] = pd.to_datetime(weekday1['datetime'])
        weekday1= weekday1.reset_index(drop=True)
            
        #finding the traffic flow during each hour on each of the common days between weather and transport datasets 
        #and appending it to the weather dataset for analysis    
        for i in range(0,24):
            start_s=str(date)+str(i)+':00:00'
            end_s=str(date)+str(i)+':59:59'
            start= datetime.strptime(start_s, '%Y-%m-%d %H:%M:%S')
            end=datetime.strptime(end_s,'%Y-%m-%d %H:%M:%S')
            traffic=weekday1[(weekday1['datetime'] >=start) & (weekday1['datetime'] <=end )].count()
            print(traffic[0])
            data.loc[data['dt_iso'] == str(start_s), ['traffic']]=traffic[0]
Accessing file:bus_ridership_2018-03-01_00-00-00.csv
C:\Users\Taha\anaconda3\lib\site-packages\ipykernel_launcher.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
C:\Users\Taha\anaconda3\lib\site-packages\pandas\core\frame.py:3997: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
C:\Users\Taha\anaconda3\lib\site-packages\ipykernel_launcher.py:14: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
9191
2798
647
462
898
9169
38606
67596
73220
49366
31752
28982
29140
31227
34172
36461
44691
60996
75503
71249
56532
44094
36042
21997
Accessing file:bus_ridership_2018-03-02_00-00-00.csv
9135
2267
928
689
772
3380
12366
19534
27079
30441
35311
38338
36236
34513
42622
46964
49466
53371
54603
52645
47327
43425
36686
22007
Accessing file:bus_ridership_2018-03-03_00-00-00.csv
5908
1303
855
638
992
6965
24315
45150
53888
44934
32168
29534
30453
32576
33725
33755
40109
50116
58855
51977
42437
35127
29168
15921
Accessing file:bus_ridership_2018-03-04_00-00-00.csv
5235
1225
704
446
977
9787
39560
69791
74657
50226
32799
29745
29987
31251
30820
32879
42009
59589
74186
66900
49134
36553
28826
15496
Accessing file:bus_ridership_2018-03-05_00-00-00.csv
C:\Users\Taha\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (3) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
5040
1224
503
467
980
9293
39603
68373
75173
49792
32187
28566
31701
34231
34431
34149
41091
59585
74783
65631
47630
36266
28674
15773
Accessing file:bus_ridership_2018-03-06_00-00-00.csv
5370
1204
592
410
969
9455
39291
69299
74194
49324
31834
28698
29344
30333
30128
32121
40753
59310
75111
65596
47820
35111
28854
15484
Accessing file:bus_ridership_2018-03-07_00-00-00.csv
5250
1303
639
473
999
9361
39648
68649
74203
50642
31784
28511
28832
29611
30068
31746
39516
59037
75056
64705
47434
35446
28682
15593
Accessing file:bus_ridership_2018-03-08_00-00-00.csv
9421
2659
587
481
996
9049
38997
67938
72984
50263
31565
28840
29062
31429
33659
36037
44125
60117
76562
70345
56569
44714
34910
21285
Accessing file:bus_ridership_2018-03-09_00-00-00.csv
10175
2567
979
741
817
3348
12266
19279
26007
30188
33828
36286
33962
32987
41190
46026
49833
53017
56040
54775
50146
45843
38698
24205
Accessing file:bus_ridership_2018-03-10_00-00-00.csv
15
11
944
740
978
6853
24327
44966
53045
43190
30840
27332
27187
27996
27403
25660
27188
29789
26717
17994
9317
4562
1802
404
Accessing file:bus_ridership_2018-03-11_00-00-00.csv
5122
1172
0
12
22
170
630
1257
1714
1786
1551
2134
2731
3453
4247
5300
9316
21436
38543
41944
36271
30241
26295
14739
Accessing file:bus_ridership_2018-03-12_00-00-00.csv
5135
1156
556
410
923
9409
40170
69586
75011
50401
32731
28939
28733
30557
29978
32311
40697
58456
75005
65223
47564
35112
28013
15600
Accessing file:bus_ridership_2018-03-13_00-00-00.csv
5383
1193
598
506
856
9645
40557
69861
74568
51856
32860
29798
29061
29622
29650
32046
40369
58116
74616
65800
47630
35271
28397
16034
Accessing file:bus_ridership_2018-03-14_00-00-00.csv
5181
1181
560
446
1006
9468
39966
66915
71386
51677
32438
29004
29228
30181
29852
31364
39807
58005
74563
65760
46206
35296
28431
15557
Accessing file:bus_ridership_2018-03-15_00-00-00.csv
9798
2662
623
428
952
9097
39360
68083
73596
50275
30771
28590
28860
31497
33082
35769
44098
60061
75441
72034
57098
43496
35691
21486
Accessing file:bus_ridership_2018-03-16_00-00-00.csv
10088
2710
1033
680
789
3463
12806
20187
26558
30801
34102
36868
33733
33805
40743
45678
49866
53171
57066
55622
50567
45694
38855
24017
Accessing file:bus_ridership_2018-03-17_00-00-00.csv
5755
1343
987
694
886
7025
24932
44608
53622
44222
31669
29662
29516
32044
32724
32697
39331
48432
57315
50443
40761
33198
27572
15404
Accessing file:bus_ridership_2018-03-18_00-00-00.csv
37
0
635
448
1024
9518
39672
69095
74064
48424
32528
27369
27470
26978
25627
25599
27432
32316
34365
22044
10446
6058
2148
575
Accessing file:bus_ridership_2018-03-19_00-00-00.csv
12
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
4
4
24
13
10
0
7
Accessing file:bus_ridership_2018-03-20_00-00-00.csv
5
0
0
0
1
13
99
157
156
135
34
23
33
23
26
51
49
111
257
284
161
143
123
70
Accessing file:bus_ridership_2018-03-21_00-00-00.csv
5245
1233
0
0
6
190
703
1379
1580
1279
1132
2084
2275
3317
4272
5104
8152
17606
35351
43421
37568
30357
26171
14970
Accessing file:bus_ridership_2018-03-22_00-00-00.csv
426
289
557
423
1051
9368
38895
66969
71208
48471
30517
26802
26926
28383
28349
29349
35751
43469
43377
25690
11806
5206
2527
1073
Accessing file:bus_ridership_2018-03-23_00-00-00.csv
9296
2184
177
101
175
130
109
407
545
725
1189
1850
2304
3423
6341
10366
14658
18747
24989
30789
33949
36011
33514
22033
Accessing file:bus_ridership_2018-03-24_00-00-00.csv
59
4
779
606
902
6268
22662
40534
48931
40127
27851
25022
24495
25944
24709
23735
24623
27330
26580
17902
9665
4749
1571
266
Accessing file:bus_ridership_2018-03-25_00-00-00.csv
4841
1283
0
0
21
105
839
1355
1790
1899
2038
2451
2900
3616
4618
6173
11723
23099
39728
41195
35423
30904
26027
14492
Accessing file:bus_ridership_2018-03-26_00-00-00.csv
5446
1362
577
419
1026
9043
38385
69558
74533
49427
31945
28807
28823
29638
28786
31063
39156
57402
74066
63146
46965
34263
28302
15482
Accessing file:bus_ridership_2018-03-27_00-00-00.csv
5217
1314
613
407
1058
8936
37986
69164
73865
50050
32154
28974
28069
29450
28748
30706
39313
57766
75062
63824
46809
35232
29444
15523
Accessing file:bus_ridership_2018-03-28_00-00-00.csv
5552
1248
567
429
796
8985
38145
69193
73674
50330
31773
28411
28928
29404
28284
30168
39044
56837
72358
64103
47820
35854
29526
16802
Accessing file:bus_ridership_2018-03-29_00-00-00.csv
9274
2670
553
435
922
8538
36631
67421
72673
49360
32011
28766
28878
30585
32071
34678
43323
61262
75312
70337
58606
44131
34960
21794
Accessing file:bus_ridership_2018-03-30_00-00-00.csv
10622
2506
941
683
708
3346
12541
21380
27436
30922
33861
34819
31094
32187
38606
41197
46633
54360
53340
52670
50334
43545
37627
25036
Accessing file:bus_ridership_2018-03-31_00-00-00 .csv
6658
1814
997
639
1029
6896
24430
44110
54957
44820
31796
28666
28609
31067
31982
32548
38633
50905
58619
53090
42159
34960
30798
19848

Share this

6 thoughts on “Weather Disruption of Public Transport Analysis Using Python

  1. 1
    votes

    The same segments repeat across article.
    The entire focus of the article is based on data analysis where we are missing all models which are aligning two datasets together and finds appropriate correlation and causality in data.
    I know that time was short, so I would recommend teaming with someone else next time so that work can be split among team members.

    1. 1
      votes

      Also, I would advise using some additional datasets which were not part of the initial dataset, like aggregated daily traffic estimates on an hourly basis provided by some navigation applications because that can additionally help with model precision. We all know that bus driers should be professionals but the majority of โ€œnormalโ€ non-bus driers are not and they are heavily impacted in distracting sensor inputs (thunderstorm, rain, people cutting in, or even forgetting how to drive when weather condition changes). – I’m adding my last sentence about additional dataset to all teams focusing on this problem because no one did even consider it and that is something you can always do on any project – focus not on internal/provided data but find something to augment it ๐Ÿ˜‰

      1. 1
        votes

        my aim was to combine hourly analysis(which i got by processing the dubai traffic datatsets) of traffic to the weather data and later apply machine learning models and time series models to it.but time fell too short for me, since this was my first time ever.
        anyways enjoyed the journey and yes…lesson learnt,always team up !.

        1. 0
          votes

          the long rails of numbers u see in my article in the middle are the hourly dsitribution of traffic for each day of each month (eventhough the intersection dates were only for 3 months between the weather data and traffic data on the dubaipulse site). ๐Ÿ˜

          1. 0
            votes

            Hi taha-junaid3000, the approach was good and I could read through the long rails of numbers, but here when I talk about the same segments repeating, I talk about the page itself. If you do a search (find on page) for the chapter “Traffic data Preprocessing”, you will find it three times with completely and exactly the same text – or at least it is how I see it on my browser ๐Ÿ™

  2. 1
    votes

    Hi, taha-junaid3000 ๐Ÿ™‚
    tomislavk is right… Splitting the work with someone would help you to achive better results and to learn much more while collaborating with others ๐Ÿ™‚
    Keeping in mind your work I would focus more on the analysis and conclusions regarding the data quality, variables for modelling, etc. This would be helpful to make next steps.

Leave a Reply