Business Understanding
In Sofia, air pollution norms were exceeded 70 times during the heating period from October 2017 to March 2018, according to the citizens’ initiative AirBG.info. The day with the worst air pollution in Sofia was January 27, when the norm was exceeded six times over. The situation got so out of control that the European Court of Justice ruled against Bulgaria in a case brought by the European Commission over the country’s failure to implement measures to reduce air pollution. The two main causes of the air pollution are believed to be solid-fuel heating and motor vehicle traffic.
Particulate matter (PM) is the term used for a mixture of solid particles and liquid droplets found in the air. In particular, fine particles with a diameter of less than 10 µm are called PM10. Predicting PM10 is an important issue in the control and reduction of air pollutants.
Predictive models for PM10 range from extremely simple to extremely complex, but the ability to accurately forecast the PM10 concentration index remains elusive.
Data Understanding
The research focuses on Bulgaria, where several public data sources are available, consisting of: meteorological data (weather, humidity, wind), traffic data, PM10 measurements, and more. Teams may add any further data they consider significant for more accurate prediction of PM10 concentration levels and forecasting of PM10 pollution.
Available data:
- Official air quality measurements (5 stations in the city) – as per EU guidelines on air quality monitoring
- Citizen science air quality measurements, incl. temperature, humidity and pressure (many stations) and topography (gridded data)
- Meteorological measurements (1 station): Temperature; Humidity; Wind speed; Pressure; Rainfall; Visibility
- Topography data
Additional data could be used:
- All sensor values in the last 5 minutes from the registered stations as JSON files, which are updated every minute
- Values from the last 5 minutes of a particular sensor
Data Preparation
Our first goal is to calculate the distance between the official stations and non-official. Further in the analysis we could compare and try to correct the values of the non-official with its nearest official station values. So, we have started with the first two data sets – Official air quality measurements and Citizen science air quality measurements.
Official air quality measurements
The first thing we notice is that a lot of data is missing:
Official stations (number of observations per month):
Month | Druzhba | Hipodruma | IAOS/Pavlovo | Mladost | Nadezhda | Orlov Most | Grand Total
2013-01 | 31 | 31 | 25 | 31 | 31 | 149 | |
2013-02 | 28 | 28 | 28 | 28 | 28 | 140 | |
2013-03 | 31 | 31 | 31 | 31 | 31 | 155 | |
2013-04 | 30 | 30 | 30 | 30 | 30 | 150 | |
2013-05 | 31 | 30 | 15 | 31 | 31 | 138 | |
2013-06 | 26 | 30 | 30 | 30 | 24 | 140 | |
2013-07 | 29 | 31 | 31 | 31 | 26 | 148 | |
2013-08 | 30 | 31 | 31 | 31 | 123 | ||
2013-09 | 30 | 30 | 30 | 30 | 20 | 140 | |
2013-10 | 31 | 31 | 31 | 29 | 31 | 153 | |
2013-11 | 30 | 30 | 30 | 30 | 30 | 150 | |
2013-12 | 30 | 31 | 31 | 31 | 31 | 154 | |
2014-01 | 31 | 31 | 31 | 31 | 30 | 154 | |
2014-02 | 28 | 28 | 28 | 26 | 24 | 134 | |
2014-03 | 31 | 31 | 31 | 31 | 27 | 151 | |
2014-04 | 30 | 30 | 30 | 30 | 29 | 149 | |
2014-05 | 31 | 29 | 31 | 31 | 31 | 153 | |
2014-06 | 30 | 30 | 29 | 27 | 29 | 145 | |
2014-07 | 30 | 27 | 29 | 31 | 31 | 148 | |
2014-08 | 31 | 27 | 31 | 31 | 27 | 147 | |
2014-09 | 30 | 23 | 30 | 30 | 21 | 134 | |
2014-10 | 31 | 29 | 31 | 31 | 31 | 153 | |
2014-11 | 30 | 30 | 30 | 30 | 30 | 150 | |
2014-12 | 31 | 29 | 31 | 30 | 31 | 152 | |
2015-01 | 28 | 12 | 31 | 31 | 31 | 133 | |
2015-02 | 27 | 28 | 28 | 28 | 17 | 128 | |
2015-03 | 31 | 31 | 31 | 31 | 31 | 155 | |
2015-04 | 30 | 30 | 30 | 30 | 30 | 150 | |
2015-05 | 29 | 31 | 31 | 27 | 31 | 149 | |
2015-06 | 29 | 30 | 30 | 30 | 30 | 149 | |
2015-07 | 31 | 31 | 26 | 31 | 31 | 150 | |
2015-08 | 31 | 31 | 31 | 27 | 31 | 151 | |
2015-09 | 29 | 30 | 30 | 30 | 30 | 149 | |
2015-10 | 31 | 31 | 24 | 31 | 1 | 118 | |
2015-11 | 30 | 30 | 28 | 30 | 118 | ||
2015-12 | 30 | 31 | 31 | 31 | 123 | ||
2016-01 | 39 | 42 | 66 | 42 | 189 | ||
2016-02 | 29 | 29 | 29 | 29 | 116 | ||
2016-03 | 31 | 31 | 31 | 31 | 124 | ||
2016-04 | 41 | 30 | 30 | 30 | 131 | ||
2016-05 | 59 | 31 | 54 | 45 | 189 | ||
2016-06 | 30 | 59 | 30 | 39 | 158 | ||
2016-07 | 31 | 31 | 49 | 111 | |||
2016-08 | 50 | 37 | 31 | 42 | 160 | ||
2016-09 | 39 | 30 | 30 | 25 | 124 | ||
2016-10 | 31 | 31 | 31 | 16 | 109 | ||
2016-11 | 30 | 30 | 30 | 128 | 218 | ||
2016-12 | 53 | 71 | 53 | 86 | 263 | ||
2017-11 | 56 | 56 | 56 | 54 | 222 | ||
2017-12 | 720 | 720 | 719 | 697 | 2856 | ||
2018-01 | 666 | 692 | 690 | 688 | 693 | 3429 | |
2018-02 | 646 | 649 | 649 | 649 | 649 | 3242 | |
2018-03 | 718 | 744 | 744 | 744 | 724 | 3674 | |
2018-04 | 750 | 750 | 750 | 704 | 749 | 3703 | |
2018-05 | 759 | 758 | 759 | 743 | 759 | 3778 | |
2018-06 | 693 | 719 | 720 | 720 | 713 | 3565 | |
2018-07 | 735 | 736 | 742 | 730 | 719 | 3662 | |
2018-08 | 689 | 720 | 356 | 704 | 716 | 3185 | |
2018-09 | 308 | 328 | 38 | 322 | 328 | 1324 | |
Grand Total | 8280 | 8378 | 7743 | 6004 | 8393 | 917 | 39715 |
It appears that data is missing because of broken equipment, occasional accidental pollution issues, and so on.
There are three types of averaging time: day, hour and var. An industry expert explained the “var” value as follows:
“var” refers to “variable sample”, which doesn’t really tell us anything. I suspect it’s an issue with the EU EEA automatic QA procedure: if the country reported a timestamp that’s not rounded to the whole hour, it may classify it as a “variable sample” because they don’t know what to do with it.
We saw that we could treat it as hourly data, so we transformed it.
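The transformation of “var” samples into hourly data can be sketched in pandas as follows; the column names here are hypothetical, not the actual names in the EEA export:

```python
import pandas as pd

# Hypothetical example frame; real column names in the EEA export may differ.
df = pd.DataFrame({
    "DatetimeBegin": ["2018-01-01 13:47:00", "2018-01-01 14:00:00"],
    "AveragingTime": ["var", "hour"],
    "Concentration": [41.2, 38.5],
})
df["DatetimeBegin"] = pd.to_datetime(df["DatetimeBegin"])

# Treat "var" samples as hourly by flooring their timestamps to the hour,
# so they line up with the regular hourly records.
mask = df["AveragingTime"] == "var"
df.loc[mask, "DatetimeBegin"] = df.loc[mask, "DatetimeBegin"].dt.floor("h")
df.loc[mask, "AveragingTime"] = "hour"
```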
After examining the data, we decided to work with a shorter but cleaner data set. In order to use the most recent data for the prediction, we selected the period from 2018-01-01 to 2018-08-18. There are still missing values and aggregation to be done, but considering that the non-official data covers 2017–2018, the selected period should work best for now.
The next table shows the missing hourly data (each cell should contain 24). The tables for the following months look similar:
Day | Druzhba | Hipodruma | IAOS/Pavlovo | Mladost | Nadezhda |
2018-01-01 | 23 | 23 | 23 | 23 | 23 |
2018-01-02 | 23 | 23 | 23 | 23 | 23 |
2018-01-03 | 23 | 23 | 23 | 23 | 23 |
2018-01-04 | 23 | 23 | 23 | 23 | 23 |
2018-01-05 | 20 | 23 | 23 | 23 | 23 |
2018-01-06 | 5 | 23 | 23 | 23 | 23 |
2018-01-07 | 23 | 23 | 23 | 23 | 23 |
2018-01-08 | 24 | 24 | 24 | 24 | 24 |
2018-01-09 | 24 | 24 | 24 | 24 | 24 |
2018-01-10 | 24 | 24 | 24 | 24 | 24 |
2018-01-11 | 23 | 23 | 23 | 23 | 23 |
2018-01-12 | 23 | 23 | 23 | 23 | 23 |
2018-01-13 | 23 | 23 | 23 | 23 | 23 |
2018-01-14 | 22 | 23 | 23 | 21 | 23 |
2018-01-15 | 19 | 19 | 18 | 18 | 19 |
2018-01-16 | 23 | 23 | 23 | 23 | 23 |
2018-01-17 | 23 | 23 | 23 | 23 | 23 |
2018-01-18 | 23 | 23 | 23 | 23 | 23 |
2018-01-19 | 21 | 22 | 22 | 22 | 23 |
2018-01-20 | 21 | 23 | 23 | 23 | 23 |
2018-01-21 | 23 | 23 | 23 | 23 | 23 |
2018-01-22 | 23 | 23 | 22 | 22 | 23 |
2018-01-23 | 18 | 18 | 18 | 18 | 18 |
2018-01-24 | 22 | 23 | 23 | 23 | 23 |
2018-01-25 | 19 | 19 | 19 | 19 | 19 |
2018-01-26 | 23 | 23 | 23 | 23 | 23 |
2018-01-27 | 23 | 23 | 23 | 23 | 23 |
2018-01-28 | 23 | 23 | 23 | 23 | 23 |
2018-01-29 | 20 | 20 | 20 | 20 | 20 |
2018-01-30 | 16 | 16 | 16 | 16 | 16 |
2018-01-31 | 23 | 23 | 23 | 23 | 23 |
There are many different methods for dealing with missing data, but considering that we work with meteorological data, we decided to fill each missing hour with the average of the previous and the next hour. If many consecutive hours are missing, we instead calculate the day’s value as the average of the previous and the next day.
First we add the missing hours for every station and then perform the calculation.
After that we are finally ready to aggregate the data on a daily basis.
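The hour-level filling and daily aggregation can be sketched in pandas (the longer day-level fills described above are omitted; the values here are made up for illustration). For a single missing hour, linear interpolation is exactly the average of the previous and next hour:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly PM10 series for one station.
idx = pd.date_range("2018-01-01", periods=6, freq="h")
pm10 = pd.Series([78.0, np.nan, 60.0, 55.0, np.nan, 50.0], index=idx)

# 1) Make sure every hour of the period is present (missing hours become NaN).
full_idx = pd.date_range(idx.min(), idx.max(), freq="h")
pm10 = pm10.reindex(full_idx)

# 2) A single missing hour becomes the average of the previous and next hour;
#    linear interpolation restricted to 1-step gaps does exactly that.
pm10 = pm10.interpolate(method="linear", limit=1, limit_area="inside")

# 3) Aggregate to daily averages (the step done with SQL in the write-up).
daily = pm10.resample("D").mean()
```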
With a little bit of SQL code, the result is:
COMMONNAME | AIRQUALITYSTATION | LATITUDE | LONGITUDE | YEAR | MONTH | DAY | DAILY_AVG_PM10 |
Druzhba | STA-BG0052A | 42.666508 | 23.400164 | 2018 | 2018-01 | 2018-01-01 | 78.29125 |
Druzhba | STA-BG0052A | 42.666508 | 23.400164 | 2018 | 2018-01 | 2018-01-02 | 60.395625 |
Druzhba | STA-BG0052A | 42.666508 | 23.400164 | 2018 | 2018-01 | 2018-01-03 | 17.32291667 |
Second source – Citizen science air quality measurements
It contains data about every non-official station in the country.
Variable | Valid N | % Valid obs. | Mean | Minimum | Maximum | Lower Quartile | Upper Quartile | Quartile Range | Std.Dev. | Coef.Var. |
P1 | 2958654 | 100.0000 | 28.94 | 0.0 | 2000.0 | 8.00 | 24.00 | 16.000 | 86.22 | 297.8988 |
P2 | 2958654 | 100.0000 | 16.09 | 0.0 | 1000.0 | 5.00 | 15.00 | 10.000 | 44.01 | 273.5143 |
temperature | 2958654 | 100.0000 | 14.69 | -5573.0 | 435.0 | 7.00 | 23.00 | 16.000 | 19.47 | 132.5535 |
humidity | 2958654 | 100.0000 | 59.00 | -999.0 | 898.0 | 44.00 | 77.00 | 33.000 | 24.68 | 41.8241 |
pressure | 2958654 | 100.0000 | 85267.46 | -20148.0 | 254165.0 | 93646.00 | 98304.00 | 4658.000 | 31150.87 | 36.5331 |
We have merged all stations (see a little further below) and took only the stations in Sofia. Their descriptive statistics are almost the same as above.
In this source we see a lot of messy data, for example:
time | geohash | P1 | P2 | temperature | humidity | pressure |
2018-02-10T17:00:00Z | sx8dev34hpv | 0 | 0 | -5573 | -999 | 0 |
2018-02-13T01:00:00Z | sx8dev34hpv | 0 | 0 | -5573 | -999 | 0 |
2018-02-19T12:00:00Z | sx8dev34hpv | 0 | 0 | -5573 | -999 | 0 |
2018-02-20T22:00:00Z | sx8dev34hpv | 0 | 0 | -5573 | -999 | 0 |
2018-04-25T22:00:00Z | sx8dfx1j55w | 0 | 0 | -5573 | 10 | 0 |
2018-05-05T12:00:00Z | sx8dfqtv6rw | 0 | 0 | -5573 | -999 | 0 |
2018-05-06T18:00:00Z | sx8dfx1j55w | 0 | 0 | -5573 | 10 | 0 |
2018-04-24T18:00:00Z | sx8dfx1j55w | 0 | 0 | -4174 | 10 | 0 |
2018-02-02T05:00:00Z | sx3xspnmqnf | 141 | 119 | 257 | 870 | 0 |
2018-02-01T18:00:00Z | sx3xspnmqnf | 114 | 97 | 277 | 756 | 0 |
2018-02-01T17:00:00Z | sx3xspnmqnf | 126 | 108 | 322 | 719 | 0 |
We have to clean this data, again add the missing hours for every station, and then perform the calculation.
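One way to clean such readings is to mark physically impossible values (like the −5573 °C and −999 % rows above) as missing, so they can then be filled like ordinary missing hours. The plausibility thresholds below are our assumption, not part of the original data description:

```python
import numpy as np
import pandas as pd

# A few of the messy rows from the table above, simplified.
df = pd.DataFrame({
    "temperature": [9.0, -5573.0, 25.0],
    "humidity": [72.0, -999.0, 55.0],
    "pressure": [94366.0, 0.0, 95210.0],
})

# Plausibility limits per variable; these thresholds are our assumption.
limits = {
    "temperature": (-40.0, 60.0),    # °C
    "humidity": (0.0, 100.0),        # %
    "pressure": (80000.0, 110000.0), # Pa (Sofia lies at roughly 550 m altitude)
}

# Replace out-of-range readings with NaN, keeping plausible values untouched.
for col, (lo, hi) in limits.items():
    df[col] = df[col].where(df[col].between(lo, hi), np.nan)
```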
Meanwhile, we used PHP code to translate the geohashes to coordinates. Then, using the Google Maps API, we plotted the stations on the map:
All stations available in the source files:
Filtered with the topographical data to show only the stations in Sofia (additional preparation is needed, as the layout does not look perfectly arranged yet : ) ):
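The geohash-to-coordinates step was done in PHP; an equivalent pure-Python sketch of the standard geohash decoding algorithm looks like this:

```python
# Pure-Python geohash decoder; returns the centre of the geohash cell
# as (latitude, longitude). Standard base32 alphabet, bits alternating
# between longitude and latitude, starting with longitude.
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_decode(geohash: str) -> tuple[float, float]:
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # the first bit refines longitude
    for ch in geohash:
        bits = _BASE32.index(ch)
        for shift in range(4, -1, -1):
            on = (bits >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if on:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if on:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2

# Station sx8d6zjg5h8 appears in the distance table at (42.622, 23.332):
lat, lon = geohash_decode("sx8d6zjg5h8")
```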
We used SQL to calculate the distance between the official and non-official stations:
LATITUDE | LONGITUDE | GEOHASH | NADEZHDA | HIPODRUMA | DRUZHBA | ORLOVMOST | PAVLOVO | MLADOST | Nearest_station |
42.58000039 | 23.35099988 | sx8d5r7wmxr | 17.29272939 | 12.06055417 | 10.45375829 | 12.36376703 | 12.08845278 | 8.822471801 | MLADOST |
42.62199976 | 23.33200045 | sx8d6zjg5h8 | 12.41629331 | 7.138501612 | 7.474525802 | 7.627686229 | 7.456171279 | 5.62316649 | MLADOST |
42.58999966 | 23.39600064 | sx8dk3k2wr6 | 17.32773998 | 12.96715914 | 8.536473108 | 12.22070509 | 13.73633083 | 7.375362447 | MLADOST |
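The distance computation was done in SQL; the same great-circle (haversine) formula in Python is sketched below. It reproduces the first row of the table above to within a few tens of metres (the small residual suggests the SQL function may use a slightly different Earth radius):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km = mean Earth radius

# First row of the distance table: citizen station sx8d5r7wmxr vs. the
# Druzhba official station (STA-BG0052A, coordinates from the daily table).
d = haversine_km(42.58000039, 23.35099988, 42.666508, 23.400164)
```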
Now we correct the map and voilà:
After aggregating the station data, we have to compare each non-official station to its corresponding official one, then combine all the variables we have and start the prediction.
Because of the time left, we decided to work with only one region: Hipodruma. We took the data from that official station and its corresponding small stations. Looking at the small stations’ data, we see another gap:
time | geohash | P1 | P2 | temperature | humidity | pressure |
2018-03-26T16:00:00Z | sx8dfhvjfsp | 39 | 21 | 9 | 72 | 94366 |
2018-03-26T16:00:00Z | sx8dfhvjfsp | 38 | 21 | 9 | 73 | 94376 |
2018-03-26T17:00:00Z | sx8dfhvjfsp | 42 | 22 | 0 | 0 | 0 |
2018-03-26T17:00:00Z | sx8dfhvjfsp | 40 | 23 | 9 | 72 | 94407 |
These three stations – sx8dfhvjfsp, sx8ddjthfex, sx8dcctv18x – have double the number of entries per day for almost the entire period.
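If one wanted to keep such stations rather than drop them, a simple option (our suggestion, not the team’s approach, and only after clearing invalid zero readings first) is to average rows that share the same station and timestamp:

```python
import pandas as pd

# Duplicate hourly entries, modelled on the sx8dfhvjfsp rows above.
df = pd.DataFrame({
    "time": pd.to_datetime(
        ["2018-03-26T16:00:00Z"] * 2 + ["2018-03-26T17:00:00Z"] * 2
    ),
    "geohash": ["sx8dfhvjfsp"] * 4,
    "P1": [39.0, 38.0, 42.0, 40.0],
    "P2": [21.0, 21.0, 22.0, 23.0],
})

# Collapse duplicates: one row per (station, timestamp), averaging the readings.
deduped = df.groupby(["geohash", "time"], as_index=False).mean()
```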
For further analysis we take only these non-official stations: sx8ddw08x86, sx8ddsfjw1v, sx8dcbjy3tr, sx8ddmr30yx, sx8df6dzvpj, sx8df1q937g, sx8dduums3u. They have almost complete data for the selected period.
We aggregated the data the same way we already did with the official stations.
Part two
We wanted to start with a classification prediction: will there be pollution or not. So we created a flag based on the PM10 value: above 50 µg/m³ there is pollution, and below 50 µg/m³ there is not.
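The flag described above is a one-liner; the sample values below are the Druzhba daily averages from the earlier table:

```python
import pandas as pd

# Daily PM10 averages (Druzhba, first three days of 2018, from the table above).
daily_avg_pm10 = pd.Series([78.29125, 60.395625, 17.32291667])

# Binary classification target: 1 if the daily average exceeds the 50 µg/m³
# limit (pollution), 0 otherwise.
pollution_flag = (daily_avg_pm10 > 50).astype(int)
```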
3 thoughts on “Datathon Sofia Air Solution – Team Duckelicious Case Telelink”
Hi! Good to see your progress, looking forward to seeing the end-result. Some comments in the interim:
– In the 1st table (official stations’ numbers of observations), note that the gaps in Mladost and Orlov Most are due to the fact that the latter station is currently out of operation; Mladost is the station that replaced it. Hence, some of the ‘data gaps’ you see are not gaps per se.
– Good to focus on data quality and gaps – make sure to show this in the final write-up!
– When filling missing values for the citizen science stations: there are perhaps some stations which lack many measurements. How do you fill missing values there? Or perhaps I am misunderstanding and you are simply filling in the missing timestamps with NaNs?
Working with data for 2018 only is a good solution, so as to deliver a consistent representation more quickly. Also, focusing on one main station is a good choice with respect to the timing of the task. I like the presented maps. I would like to see at least part of your code.
Prototyping for one region is a good plan of action. You should have thought more about scaling the solution, though.