Business Understanding for Sofia Air Pollution Prediction
Background and Motivation
In Sofia, Bulgaria, air pollution levels repeatedly exceeded regulatory norms in the period from October 2017 to March 2018. Things got so far out of control that the European Court of Justice ruled against Bulgaria in a case brought by the European Commission over the country's failure to implement measures to reduce air pollution.
The two main causes of the air pollution are believed to be:
- solid fuel heating, and
- motor vehicle traffic
Sofia has 5 metropolitan weather stations that capture weather data at hourly intervals. In addition, a new initiative was launched to help gather a lot more data from many more locations across Bulgaria.
The AirBG.info project was launched on April 5, 2017, founded by Nikolay Luchev and Stefan Dimitrov, and is part of the global Luftdaten.info project, which started in Stuttgart in 2015. It is an independent civic project aimed at bringing to light the following issues:
- excessive contamination with PM;
- gasification of settlements;
- missing or insufficient measurement of air quality by the authorities;
- insufficient measures taken by the legislative, municipal and executive authorities.
The focus of this study and project is specifically air pollution from particulate matter. Particulate matter (PM) is the term used for a mixture of solid particles and liquid droplets found in the air. Particles with a diameter of 10µm or less are called PM10, while fine particles with a diameter of 2.5µm or less are referred to as PM2.5.
Prediction of PM10 in the atmosphere over Sofia is an important issue in the control and reduction of air pollutants. The European Commission has defined a threshold of 50 µg/m³ as the critical level above which the air over Sofia is considered highly polluted.
Business Objective
The goal of this project is to forecast the concentration of the PM10 pollutant over the next 24-hour period, with a high degree of accuracy (90%+), given the data from the citizen AirBG.info project and the official weather data from official sites.
Business Success Criteria:
The desired result is a map that can cover any location in the city and forecast the PM10 concentration over the next 24 hours with a high degree of accuracy (90%+). An example map from AirBG.info is shown below.
Assumptions:
While not specifically called out, the following assumptions are prudent to make:
- The main sources of pollution are assumed to be:
  - solid fuel heating, and
  - motor vehicle traffic
- Measuring PM10 is sufficient to capture the pollution effects of both solid fuel heating and motor vehicle traffic.
- There may be gaps in the official meteorological data, making it insufficient on its own to inform the bigger picture of air pollution across Bulgaria.
- The accuracy of the metrics received from the citizen weather stations does not conform to the strict standards of the equipment used in official stations.
- More data is better than higher-quality data.
- The cost-to-quality trade-off is acceptable to inform the larger picture.
Constraints:
This project is being run under the constraints of a data science challenge, which is time limited and involves other commitments governed by its learning goals.
The project is being run across multiple teams, with one or more people per team, who may interpret the problem differently and may make assumptions that are not directly validated with the business users.
The equipment available to work on the larger data sets may not be sufficient to process and run algorithms in a timely fashion. While sensor data is available in large volumes, the ability to read, store, process, and use that data may be limited by the constraints imposed by the challenge.
Defining The Data Mining Problem
The project provides two distinct data sets that could represent multiple versions of the truth.
- Official Meteorological Data
The official data is used for lawsuits, policy creation, and the like. Despite these far-reaching implications, the official data is gathered from only 5 stations, named after neighborhoods, and provides meteorological measurements such as temperature, humidity, and pressure. This data has a longer history, but it is not spread out across the country. AirBG.info calls the quality of this data into question, suggesting it may have missing values and that the authorities' measurements are insufficient to fully represent Sofia's air pollution problem.
- Citizen Meteorological Data
The citizen data is gathered through the AirBG.info initiative, which is not government funded and is run by volunteers and citizens of Bulgaria. Each citizen who wishes to participate builds a weather monitoring kit from standardized parts. These citizen weather stations upload data every 5 minutes via onboard WiFi connectivity, so the data is voluminous in nature. This data has a shorter history but is spread across far more than 5 stations.
In addition, the project provides topography data covering the Sofia urban area plus some areas nominally external to the city (toward the mountains; note the large elevation numbers). No particular effort has been made to cover the entirety of Sofia Capital's area as per its administrative boundaries. This topographical data includes latitude/longitude and elevation for several areas in and around Sofia.
Last but not least, the project provides access to APIs that allow it to gather, inspect, and mine data from the citizen weather station sensors.
Data Understanding
This section focuses on understanding the data, uncovering data quality issues, discovering first insights, and detecting interesting subsets of the data.
Data Sources
Data was originally distributed from the following data source, which is used as the single version of truth for the raw data set:
https://storage.cloud.google.com/global-datathon-2018/sofia-air/air-sofia.zip
Additional data available for consideration:
- All sensor values in the last 5 minutes from the registered stations as JSON files, which are updated every minute -> http://api.luftdaten.info/static/v1/data.json
- Values from the last 5 minutes of a particular sensor -> http://api.luftdaten.info/v1/sensor/sensorid/
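As a quick illustration, the sketch below pulls the latest readings from the first endpoint using the requests library and filters them down to PM10 values. The JSON field names used here (sensordatavalues, value_type, P1, location) are assumptions based on my understanding of the Luftdaten payload and should be verified against the live API.

```python
import requests

# Latest sensor values (past 5 minutes) from all registered stations.
URL = "http://api.luftdaten.info/static/v1/data.json"

response = requests.get(URL, timeout=30)
response.raise_for_status()
records = response.json()  # a list of measurement objects

# Collect PM10 ("P1") readings; the field names below are assumptions
# about the Luftdaten JSON layout and should be verified.
pm10_readings = []
for record in records:
    for value in record.get("sensordatavalues", []):
        if value.get("value_type") == "P1":
            pm10_readings.append({
                "timestamp": record.get("timestamp"),
                "latitude": record.get("location", {}).get("latitude"),
                "longitude": record.get("location", {}).get("longitude"),
                "pm10": value.get("value"),
            })

print(f"Fetched {len(pm10_readings)} PM10 readings")
```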
The zip file consists of several different sources of data, gathered in different ways.
Figure: Directory structure of the root folder after unzipping the source data
The Air Tube Data
This data is sourced from the citizen initiative. Sofia citizens use non-portable equipment to report the following information in these columns:
- time
- geohash
- P1 – PM10 concentration
- P2 – PM2.5 concentration
- humidity
- pressure
- temperature
The citizen data is available in two files, one for each year: 2017 and 2018. Presumably the initiative is new, and its data reflects adoption among the citizens of Sofia, including its ramp-up, dispersion, and so on.
File Name: ./Air Tube/data_bg_2018.csv ==> citizen data gathered in 2018
File Name: ./Air Tube/data_bg_2017.csv ==> citizen data gathered in 2017
File Name: ./Air Tube/sample_data_bg_2018.csv ==> sample data file
Across these files there are 1254 unique geohashes, each representing a location for the citizen data. Since these units are not portable, we can assume that this represents the sample size: the number of Sofia citizens participating in the initiative. While this is significantly better than the 5 officially maintained meteorological measurement sites, it remains to be seen whether these citizen-generated data points are reliable.
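For reference, counts like these can be reproduced with a few lines of pandas; a minimal sketch, assuming the files have been extracted into the ./Air Tube/ directory listed above:

```python
import pandas as pd

# Load both years of citizen data from the extracted archive.
df_2017 = pd.read_csv("./Air Tube/data_bg_2017.csv")
df_2018 = pd.read_csv("./Air Tube/data_bg_2018.csv")

for year, df in [(2017, df_2017), (2018, df_2018)]:
    print(f"--- data_bg_{year}.csv ---")
    print("shape:", df.shape)
    print("null/NaN counts:")
    print(df.isna().sum())
    print("unique geohashes:", df["geohash"].nunique())
    print("unique timestamps:", df["time"].nunique())
```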
Notable Point on Reliability
How much variance do they have from the norm? If the variance is high, does that indicate bad instrument design or poor placement of the instruments, or is it a valid measurement that correctly captures the PM10 pollutant? What if a citizen measurement device was placed near a fireplace, in a high-humidity environment where wood is being burnt? A simple placement choice for such a device could inject alarming readings into the data set that are perhaps not entirely correct.
Data Characteristics
After reindexing the data on time, the following shows the size and the number of records in each of the files:
File Name: ./Air Tube/data_bg_2018.csv

Pre geohash-to-lat/long conversion (raw data):
- Data shape: (2958654, 7)
- Null/NaN counts per column: time 0, geohash 4, P1 0, P2 0, temperature 0, humidity 0, pressure 0
- Total geohashes: 2,958,654; unique geohashes: 1,254
- Total timestamps: 2,958,654; unique timestamps: 5,461
- data_wrangle_citizen_data: shape: (2958654, 7)

Post geohash-to-lat/long conversion (wrangled data):
- Data shape: (2958654, 9)
- Null/NaN counts per column: time 0, geohash 4, P1 0, P2 0, temperature 0, humidity 0, pressure 0, Latitude 0, Longitude 0
- Total geohashes: 2,958,654 (consistent with raw); unique geohashes: 1,254 (consistent with raw)
- Total timestamps: 2,958,654 (consistent with raw); unique timestamps: 5,461 (consistent with raw)
File Name: ./Air Tube/data_bg_2017.csv

Pre geohash-to-lat/long conversion (raw data):
- Data shape: (651492, 7)
- Null/NaN counts per column: time 0, geohash 0, P1 0, P2 0, temperature 0, humidity 0, pressure 0
- Total geohashes: 651,492; unique geohashes: 383
- Total timestamps: 651,492; unique timestamps: 2,774
- data_wrangle_citizen_data: shape: (651492, 7)

Post geohash-to-lat/long conversion (wrangled data):
- Data shape: (651492, 9)
- Null/NaN counts per column: time 0, geohash 0, P1 0, P2 0, temperature 0, humidity 0, pressure 0, Latitude 0, Longitude 0
- Total geohashes: 651,492; unique geohashes: 383
- Total timestamps: 651,492; unique timestamps: 2,774
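The "post" numbers above come from a geohash-to-coordinate conversion step. Below is a minimal sketch of what that step might look like using the pygeohash library mentioned later in this document; the function name data_wrangle_citizen_data mirrors the log output above, but the body is an illustrative reconstruction, not the project's exact code.

```python
import pandas as pd
import pygeohash as pgh

def data_wrangle_citizen_data(df: pd.DataFrame) -> pd.DataFrame:
    """Add Latitude/Longitude columns decoded from the geohash column."""

    def safe_decode(gh):
        try:
            return pgh.decode(gh)
        except (KeyError, ValueError, TypeError):
            # Unparseable geohash (see the Data Quality section below).
            return (None, None)

    # Decode each unique geohash once, then map back onto the frame;
    # far cheaper than decoding every one of the ~3M rows individually.
    decoded = {gh: safe_decode(gh) for gh in df["geohash"].dropna().unique()}
    df["Latitude"] = df["geohash"].map(lambda gh: decoded.get(gh, (None, None))[0])
    df["Longitude"] = df["geohash"].map(lambda gh: decoded.get(gh, (None, None))[1])
    print(f"data_wrangle_citizen_data: shape: {df.shape}")
    return df
```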
Insights: Growth of the Citizen Program & Reliability
227% growth in Citizen Stations:
From the data, it seems really positive that the citizen program has grown from 383 unique reporting points in 2017 to 1,254 stations in 2018, a 227% increase. That is phenomenal adoption by the citizens of Sofia. However, it is not yet clear whether the data from the program is reliable.
354% growth in data signals
The total number of data signals recorded across all geohashes and all timestamps grew from 651,492 to 2,958,654, an explosive 354% increase. Some of the variance in the number of data signals may correspond to when the program started.
Early adoption starting in Sept 2017
The first data point recorded in 2017 is: 2017-09-06T20:00:00Z
The first recorded data signal falls on September 6th. We can infer that the program started late in the year and was in its infancy, with early adoption of 383 remote citizen stations. Based on personal experience, a program of this magnitude may face challenges around onboarding, training, and technical glitches before it reaches a steady state.
49% fewer unique timestamps in the 2017 data
Unique timestamps reported were 2,774 in the 2017 file versus 5,461 in the 2018 file.
While it is hard to say without looking more closely at the data, this gap is smaller than it may look: the program was running for many more months in 2018 than in 2017, so far fewer unique hourly timestamps could have been recorded in 2017.
A longer project would perhaps analyze the "reliability" of the data signals from each originating geohash by examining features such as consistency of reporting, data gaps, and variance from the norm. This would identify a set of geohashes that maintain high-quality data gathering and could perhaps be rewarded by the government, while geohashes with significant data gaps or intermittent, inconsistent reporting patterns could be placed in a tier-2 reliability bucket. The raw five-minute data is perhaps more appropriate for this type of analysis. A rough sketch of the idea follows.
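As one illustration of that idea, the sketch below scores each geohash by reporting completeness (hours actually reported versus hours between its first and last report). The scoring metric and the 90% tier-1 cutoff are my own illustrative assumptions, not part of the project's data.

```python
import pandas as pd

def reliability_tiers(df: pd.DataFrame, tier1_cutoff: float = 0.9) -> pd.DataFrame:
    """Score each geohash by completeness of its hourly reporting.

    Completeness = hours actually reported / hours between the station's
    first and last report. The 0.9 tier-1 cutoff is an illustrative choice.
    """
    df = df.copy()
    df["time"] = pd.to_datetime(df["time"])

    stats = df.groupby("geohash")["time"].agg(["min", "max", "nunique"])
    expected_hours = (stats["max"] - stats["min"]) / pd.Timedelta(hours=1) + 1
    stats["completeness"] = stats["nunique"] / expected_hours
    stats["tier"] = (stats["completeness"] >= tier1_cutoff).map({True: 1, False: 2})
    return stats.sort_values("completeness", ascending=False)
```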
Data Quality
In this section, I examine the quality of the data provided by exploring for bad data and gaps in the data, and use the findings to inform next steps.
On using pandas to explore this data more closely, a few gaps emerged.
Unparseable/bad geohash
File: AirTube\data_bg_2017 – decoding geohashes using the pygeohash library
index: 133972: Encoded GeoHash: sxevuw2d6z6 Decoded Tuple: (43.233, 27.972)
index: 133973: Encoded GeoHash: sx8ddncmwhr Decoded Tuple: (42.665, 23.293)
index: 133974: Encoded GeoHash: sx8df3rr0yp Decoded Tuple: (42.679, 23.312)
index: 133975: Encoded GeoHash: sx8dcsn8b8n Decoded Tuple: (42.693, 23.278)
The next index throws a KeyError with "-" as the key, likely indicating unavailable data or a data error.
Since Excel and Apple Numbers load only 65K rows of data, finding the exact data point required iteration.
Iterating in pandas located the key, as shown below:
index: 133974: Encoded GeoHash: sx8df3rr0yp
index: 133975: Encoded GeoHash: sx8dcsn8b8n
index: 133976: Encoded GeoHash: m-2105171
index: 133977: Encoded GeoHash: sx8dsqct9h7
On inputting this geohash into a converter, it was obvious that this was not a good data point.
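For completeness, here is a minimal sketch of the kind of iteration described above: walk the geohash column and record every value that pygeohash cannot decode. The exact error types caught are assumptions; in the observed case the failure was a KeyError on "-".

```python
import pandas as pd
import pygeohash as pgh

df = pd.read_csv("./Air Tube/data_bg_2017.csv")

# Walk every geohash and collect the ones pygeohash cannot decode;
# "m-2105171" at index 133976 surfaces here via a KeyError on "-".
bad_rows = []
for index, geohash in df["geohash"].items():
    try:
        pgh.decode(geohash)
    except (KeyError, ValueError, TypeError):
        bad_rows.append((index, geohash))

for index, geohash in bad_rows:
    print(f"index: {index}: unparseable geohash: {geohash!r}")
```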