
Datathon Sofia Air Solution – Team Duckelicious Case Telelink


Business Understanding

In Sofia, air pollution norms were exceeded 70 times in the heating period from October 2017 to March 2018, citizens’ initiative AirBG.info says. The day with the worst air pollution in Sofia was January 27, when the norm was exceeded six times over. Things got so out of control that even the European Court of Justice ruled against Bulgaria in a case brought by the European Commission against the country over its failure to implement measures to reduce air pollution. The two main reasons for the air pollution are believed to be solid fuel heating and motor vehicle traffic.

Particulate matter (PM) is the term used for a mixture of solid particles and liquid droplets found in the air. In particular, fine particles with diameter less than 10µm are called PM10. Prediction of PM10 is an important issue in control and reduction of pollutants in the air.

Predictive models for PM10 vary from extremely simple to extremely complex, but the ability to accurately forecast the PM10 concentration index remains elusive.

 

Data Understanding

The research focuses on Bulgaria, for which several public data sources are available: meteorological data (weather, humidity, wind), traffic data, PM10 measurements, etc. Any additional data that might be significant for more accurate prediction of PM10 concentration levels and forecasting of PM10 pollution can be added.

Available data:

  • Official air quality measurements (5 stations in the city) – as per EU guidelines on air quality monitoring
  • Citizen science air quality measurements, incl. temperature, humidity and pressure (many stations) and topography (gridded data)
  • Meteorological measurements (1 station): Temperature; Humidity; Wind speed; Pressure; Rainfall; Visibility
  • Topography data

Additional data could be used:

  • All sensor values in the last 5 minutes from the registered stations as JSON files, which are updated every minute
  • Values from the last 5 minutes of a particular sensor

 

Data Preparation

Our first goal is to calculate the distance between the official stations and the non-official ones. Later in the analysis we can compare the values of the non-official stations with those of their nearest official station and try to correct them. So, we have started with the first two data sets – Official air quality measurements and Citizen science air quality measurements.

Official air quality measurements

The first thing we can see is that there is a lot of missing data:

Official stations
Month Druzhba Hipodruma IAOS/Pavlovo Mladost Nadezhda Orlov Most Grand Total
2013-01 31 31 25 31 31 149
2013-02 28 28 28 28 28 140
2013-03 31 31 31 31 31 155
2013-04 30 30 30 30 30 150
2013-05 31 30 15 31 31 138
2013-06 26 30 30 30 24 140
2013-07 29 31 31 31 26 148
2013-08 30 31 31 31 123
2013-09 30 30 30 30 20 140
2013-10 31 31 31 29 31 153
2013-11 30 30 30 30 30 150
2013-12 30 31 31 31 31 154
2014-01 31 31 31 31 30 154
2014-02 28 28 28 26 24 134
2014-03 31 31 31 31 27 151
2014-04 30 30 30 30 29 149
2014-05 31 29 31 31 31 153
2014-06 30 30 29 27 29 145
2014-07 30 27 29 31 31 148
2014-08 31 27 31 31 27 147
2014-09 30 23 30 30 21 134
2014-10 31 29 31 31 31 153
2014-11 30 30 30 30 30 150
2014-12 31 29 31 30 31 152
2015-01 28 12 31 31 31 133
2015-02 27 28 28 28 17 128
2015-03 31 31 31 31 31 155
2015-04 30 30 30 30 30 150
2015-05 29 31 31 27 31 149
2015-06 29 30 30 30 30 149
2015-07 31 31 26 31 31 150
2015-08 31 31 31 27 31 151
2015-09 29 30 30 30 30 149
2015-10 31 31 24 31 1 118
2015-11 30 30 28 30 118
2015-12 30 31 31 31 123
2016-01 39 42 66 42 189
2016-02 29 29 29 29 116
2016-03 31 31 31 31 124
2016-04 41 30 30 30 131
2016-05 59 31 54 45 189
2016-06 30 59 30 39 158
2016-07 31 31 49 111
2016-08 50 37 31 42 160
2016-09 39 30 30 25 124
2016-10 31 31 31 16 109
2016-11 30 30 30 128 218
2016-12 53 71 53 86 263
2017-11 56 56 56 54 222
2017-12 720 720 719 697 2856
2018-01 666 692 690 688 693 3429
2018-02 646 649 649 649 649 3242
2018-03 718 744 744 744 724 3674
2018-04 750 750 750 704 749 3703
2018-05 759 758 759 743 759 3778
2018-06 693 719 720 720 713 3565
2018-07 735 736 742 730 719 3662
2018-08 689 720 356 704 716 3185
2018-09 308 328 38 322 328 1324
Grand Total 8280 8378 7743 6004 8393 917 39715

It appears that the missing data is due to broken equipment, occasional accidental pollution issues and so on.

There are three types of averaging time – day, hour and var. The answer from the industry expert about the “var” value is:

“var” refers to “variable sample”, which doesn’t really tell us anything. I suspect it’s an issue with the EU EEA automatic QA procedure – if the country has reported a timestamp that’s not rounded to the whole hour, it may classify it as “variable sample” because they don’t know what to do with it.

We saw that we could use it as hourly data, so we transformed it.

After observing the data, we have decided to work with shorter but cleaner data. In order to have the most recent data for the prediction, we have selected the period from 2018-01-01 to 2018-08-18. There are still missing values and aggregation to be done, but considering that the non-official data covers 2017–2018, the selected period should work best for now.

In the next table we can see the missing hourly data (in each cell the number should be 24). The tables for the following months look similar:

Date Druzhba Hipodruma IAOS/Pavlovo Mladost Nadezhda
2018-01-01 23 23 23 23 23
2018-01-02 23 23 23 23 23
2018-01-03 23 23 23 23 23
2018-01-04 23 23 23 23 23
2018-01-05 20 23 23 23 23
2018-01-06 5 23 23 23 23
2018-01-07 23 23 23 23 23
2018-01-08 24 24 24 24 24
2018-01-09 24 24 24 24 24
2018-01-10 24 24 24 24 24
2018-01-11 23 23 23 23 23
2018-01-12 23 23 23 23 23
2018-01-13 23 23 23 23 23
2018-01-14 22 23 23 21 23
2018-01-15 19 19 18 18 19
2018-01-16 23 23 23 23 23
2018-01-17 23 23 23 23 23
2018-01-18 23 23 23 23 23
2018-01-19 21 22 22 22 23
2018-01-20 21 23 23 23 23
2018-01-21 23 23 23 23 23
2018-01-22 23 23 22 22 23
2018-01-23 18 18 18 18 18
2018-01-24 22 23 23 23 23
2018-01-25 19 19 19 19 19
2018-01-26 23 23 23 23 23
2018-01-27 23 23 23 23 23
2018-01-28 23 23 23 23 23
2018-01-29 20 20 20 20 20
2018-01-30 16 16 16 16 16
2018-01-31 23 23 23 23 23

 

There are many different ways and methods for dealing with missing data, but considering that we work with meteorological data, we decided to fill a missing hour with the average of the previous and the next hour. If many consecutive hours are missing, we calculate the day value as the average of the previous and the next day.

First we add the missing hours for every station and then make the calculation.
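The gap-filling rule above can be sketched as follows. This is an illustrative Python version, not our actual code (the function and variable names are ours); it fills each missing hour with the average of the nearest measured hours on either side:

```python
from datetime import datetime, timedelta

def fill_missing_hours(series):
    """Fill gaps in an hourly {datetime: PM10} series with the average
    of the surrounding measured hours (illustrative sketch)."""
    hours = sorted(series)
    filled = dict(series)
    t = hours[0]
    while t < hours[-1]:
        t += timedelta(hours=1)
        if t not in filled:
            # nearest measured hour before and after the gap
            prev_t = max(h for h in hours if h < t)
            next_t = min(h for h in hours if h > t)
            filled[t] = (series[prev_t] + series[next_t]) / 2
    return filled
```

For a single missing hour this is exactly the average of the previous and next hour; for longer gaps every missing hour gets the average of the two bracketing measurements.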

After that we are finally ready to aggregate the data on daily basis.

With a little bit of SQL code, the result is:

COMMONNAME AIRQUALITYSTATION LATITUDE LONGITUDE YEAR MONTH DAY DAILY_AVG_PM10
Druzhba STA-BG0052A 42.666508 23.400164 2018 2018-01 2018-01-01 78.29125
Druzhba STA-BG0052A 42.666508 23.400164 2018 2018-01 2018-01-02 60.395625
Druzhba STA-BG0052A 42.666508 23.400164 2018 2018-01 2018-01-03 17.32291667
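The SQL step boils down to grouping hourly values by station and date and averaging them. A minimal Python equivalent (the data layout here is hypothetical, shown only to make the aggregation explicit):

```python
from collections import defaultdict
from statistics import mean

def daily_average(hourly):
    """Aggregate hourly PM10 readings to daily means.
    `hourly` maps (station, "YYYY-MM-DD", hour) -> value;
    our actual step was a GROUP BY in SQL."""
    buckets = defaultdict(list)
    for (station, day, _hour), value in hourly.items():
        buckets[(station, day)].append(value)
    return {key: mean(vals) for key, vals in buckets.items()}
```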

Second source – Citizen science air quality measurements

It contains data about every non-official station in the country.

Variable Valid N % Valid obs. Mean Minimum Maximum Lower Quartile Upper Quartile Quartile Range Std.Dev. Coef.Var.
P1 2958654 100.0000 28.94 0.0 2000.0 8.00 24.00 16.000 86.22 297.8988
P2 2958654 100.0000 16.09 0.0 1000.0 5.00 15.00 10.000 44.01 273.5143
temperature 2958654 100.0000 14.69 -5573.0 435.0 7.00 23.00 16.000 19.47 132.5535
humidity 2958654 100.0000 59.00 -999.0 898.0 44.00 77.00 33.000 24.68 41.8241
pressure 2958654 100.0000 85267.46 -20148.0 254165.0 93646.00 98304.00 4658.000 31150.87 36.5331

We have merged all stations (see a little further below) and took the stations in Sofia. Their descriptive statistics are almost the same as above.

In this source we see that there is a lot of messy data, for example:

time geohash P1 P2 temperature humidity pressure
2018-02-10T17:00:00Z sx8dev34hpv 0 0 -5573 -999 0
2018-02-13T01:00:00Z sx8dev34hpv 0 0 -5573 -999 0
2018-02-19T12:00:00Z sx8dev34hpv 0 0 -5573 -999 0
2018-02-20T22:00:00Z sx8dev34hpv 0 0 -5573 -999 0
2018-04-25T22:00:00Z sx8dfx1j55w 0 0 -5573 10 0
2018-05-05T12:00:00Z sx8dfqtv6rw 0 0 -5573 -999 0
2018-05-06T18:00:00Z sx8dfx1j55w 0 0 -5573 10 0
2018-04-24T18:00:00Z sx8dfx1j55w 0 0 -4174 10 0
2018-02-02T05:00:00Z sx3xspnmqnf 141 119 257 870 0
2018-02-01T18:00:00Z sx3xspnmqnf 114 97 277 756 0
2018-02-01T17:00:00Z sx3xspnmqnf 126 108 322 719 0

We have to clean this data, again add the missing hours for every station, and then make the calculation.
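A simple plausibility filter is enough to drop rows like the ones above. The thresholds below are our own assumptions, chosen to reject obvious sensor errors (temperature −5573, humidity −999, pressure 0) while keeping realistic Sofia readings:

```python
def is_valid(row):
    """Plausibility filter for citizen-station readings.
    Thresholds are assumptions, not official limits."""
    return (-50 <= row["temperature"] <= 60
            and 0 <= row["humidity"] <= 100
            and 80000 <= row["pressure"] <= 110000)  # Pa

rows = [
    {"temperature": -5573, "humidity": -999, "pressure": 0},   # broken sensor
    {"temperature": 9, "humidity": 72, "pressure": 94366},     # plausible
]
clean = [r for r in rows if is_valid(r)]
```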

 

Meanwhile, we used PHP code to translate the geohashes to coordinates. Then, using the Google Maps API, we have plotted the stations on the map:

All stations available in the source files:

Topographical data was used to show only the stations in Sofia (additional preparation is needed – they do not look perfectly arranged yet):

 

We used SQL to calculate the distance between the official and non-official stations:

LATITUDE LONGITUDE GEOHASH NADEZHDA HIPODRUMA DRUZHBA ORLOVMOST PAVLOVO MLADOST Nearest_station
42.58000039 23.35099988 sx8d5r7wmxr 17.29272939 12.06055417 10.45375829 12.36376703 12.08845278 8.822471801 MLADOST
42.62199976 23.33200045 sx8d6zjg5h8 12.41629331 7.138501612 7.474525802 7.627686229 7.456171279 5.62316649 MLADOST
42.58999966 23.39600064 sx8dk3k2wr6 17.32773998 12.96715914 8.536473108 12.22070509 13.73633083 7.375362447 MLADOST
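The distances above are great-circle distances in kilometers; the SQL amounts to the haversine formula, sketched here in Python (the example pair is the first citizen station in the table and the Druzhba official station):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    R = 6371.0  # mean Earth radius in km
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

# ≈ 10.4 km, close to the DRUZHBA value in the first table row above
d = haversine_km(42.58000039, 23.35099988, 42.666508, 23.400164)
```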

Now we correct the map and voilà:

 

After we aggregate the station data, we have to compare each non-official station to its corresponding official one. Then we combine all the variables that we have and start the prediction.

Because of the time left, we decided to work with only one region – Hipodruma. We take the data from that official station and its corresponding small stations. Looking at the small stations’ data, we see another issue – duplicate entries:

time geohash P1 P2 temperature humidity pressure
2018-03-26T16:00:00Z sx8dfhvjfsp 39 21 9 72 94366
2018-03-26T16:00:00Z sx8dfhvjfsp 38 21 9 73 94376
2018-03-26T17:00:00Z sx8dfhvjfsp 42 22 0 0 0
2018-03-26T17:00:00Z sx8dfhvjfsp 40 23 9 72 94407

These 3 stations – sx8dfhvjfsp, sx8ddjthfex, sx8dcctv18x – have double the amount of entries per day for almost the entire period.

For further analysis we take only these non-official stations: sx8ddw08x86, sx8ddsfjw1v, sx8dcbjy3tr, sx8ddmr30yx, sx8df6dzvpj, sx8df1q937g, sx8dduums3u. They have almost complete data for the selected period.

We aggregated the data the same way we already did with the official stations.

 

Part two

We wanted to start with a classification prediction – will there be pollution or not. So we created a flag based on the PM10 value: above 50 µg/m³ there is pollution, below 50 µg/m³ there is not.
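The flag construction is a one-liner; a minimal sketch (names are ours, the 50 µg/m³ threshold is the EU daily PM10 limit mentioned above):

```python
POLLUTION_THRESHOLD = 50.0  # EU daily limit value for PM10, in µg/m³

def pollution_flag(daily_avg_pm10):
    """1 if the daily average PM10 exceeds the limit, else 0."""
    return 1 if daily_avg_pm10 > POLLUTION_THRESHOLD else 0
```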

 

 


3 thoughts on “Datathon Sofia Air Solution – Team Duckelicious Case Telelink”

  1.

    Hi! Good to see your progress, looking forward to seeing the end-result. Some comments in the interim:

    – In the 1st table (official stations #observations), note that gaps in Mladost and Orlov most are due to the fact that the latter station is currently out of operation. Mladost is the station that replaced it. Hence, some of the ‘data gaps’ you see are not gaps per se.
    – Good to focus on data quality and gaps – make sure to show this in the final write-up!
    – When filling missing values for citizen science stations – there are perhaps some stations which lack many measurements. How do you fill in missing values there? Or perhaps I am misunderstanding and you are simply filling in the missing timestamps with NaNs?

  2.

    Working with data for 2018 only is a good solution so as to deliver a consistent representation more quickly. Also, focusing on one main station is a good choice with respect to the timing of the task. I like the presented maps. I would like to see at least part of your code.
