Datathon Air Sofia Solution – Telelink Televised by Teleloonies


Technology and methods used:

R – plyr, dplyr, tidyverse, stringr, data.table, geohash, ggmap, maps, robustbase, geosphere, pracma, Hmisc, ggplot2, tidyquant, reshape2, pastecs

Python – s3fs, pandas, numpy, matplotlib, plotly, geohash2, folium, geopy

OLS Regression, Ridge Regression, Decision Trees


Air pollution beyond the norms is a common problem in many locations. Examining its causes and being able to predict it would help control and reduce pollution, help solve environmental issues and provide a better life for citizens.

Business Understanding

Air pollution seems to be a recurring problem in our city, Sofia. However, modern technology can help us predict and control it. Air pollution is measured via particulate matter (PM); here the term refers to fine particles with a diameter of less than 10µm. PM levels can be affected by climate, topography and anthropogenic factors. Human behavior can be moderated and controlled, but in order to do so the natural factors should be examined first.

Based on data from official and citizen measurement stations, our aim is to find predictive signs for the pollution issue.

Data Understanding

The data we are provided with consists of:

  •  Data on temperature, humidity, pressure and PM (P1) pollution from September 6th 2017 to August 16th 2018 for different citizen science stations. The location of each station is given as a geohash. We noticed that those stations were not only in Sofia city but are spread across different places in Bulgaria. Observations are provided on an hourly basis. This set consists of around 3.600 million rows.
  •  Data from official measurement stations from January 1st 2013 to September 14th 2018. There are 5 stations, but one of them (Orlov most) stops functioning in 2015. The data contains two timestamps, one for the beginning of the observation and one for the end. Some measurements cover a day, others an hour. There are also observations whose duration matches neither of these two generic types. The dataset has 39,715 rows and 17 columns.
  •  Official meteorological data from 2012 to 2018 for Sofia city.
  •  Topological data – containing GPS coordinates for different points in Sofia.

Data Preparation

After merging the two main datasets – official and citizen – the next necessary step was to unify the main dimensions: location and time. Station locations in the citizen dataset were encoded as geohashes, so we had to turn them into GPS coordinates.

Geohash transformation
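
The conversion itself is small enough to sketch without a library. The following is a minimal, dependency-free decoder for the standard geohash encoding. The function name `decode_geohash` is ours, for illustration only; the team's own code may well have used a library call (the `geohash2` package is listed above) instead.

```python
# Standard geohash base-32 alphabet (no "a", "i", "l", "o").
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def decode_geohash(geohash):
    """Return the (latitude, longitude) of the centre of a geohash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude
    for char in geohash.lower():
        value = _BASE32.index(char)  # raises ValueError on invalid characters
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2
```

For example, the textbook geohash `ezs42` decodes to roughly (42.6, -5.6). Longer geohashes give smaller cells and therefore more precise coordinates.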

This was the moment when we realized that the citizen stations are located not only in Sofia but in multiple places across Bulgaria, so a more precise selection was needed at this stage. Using the Earth’s radius (we had to go big at some point 🙂 ), we measured the distance from the center of Sofia to each point in our dataframe. That is how we shrank the dataset: we cut all locations that were further than 20km from the city center (a threshold chosen based on Sofia’s city area of 492km2). Only 749 stations were left.
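
One common way to compute such a distance from the Earth's radius is the haversine formula. The sketch below assumes that formula, a mean Earth radius of 6371 km and approximate city-center coordinates; the exact formula and center point the team used are not stated, and the names `haversine_km` and `within_radius` are illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption)

# Approximate coordinates of Sofia's center, for illustration only.
SOFIA_CENTER = (42.6977, 23.3219)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_radius(stations, center=SOFIA_CENTER, max_km=20.0):
    """Keep only the stations (id -> (lat, lon)) within max_km of center."""
    return {
        station_id: (lat, lon)
        for station_id, (lat, lon) in stations.items()
        if haversine_km(center[0], center[1], lat, lon) <= max_km
    }
```

Calling `within_radius` on a dictionary of decoded station coordinates returns the subset inside the 20km circle; changing `max_km` makes it easy to test other cut-offs.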


The next step is to bring the data to a unified time unit, namely the day. Since most of the data from the official stations (excluding 2018) is daily, we average the remaining observations to obtain one value per day. We also average the data from the citizen science stations on a daily basis: when there is more than one observation for a station on a given date, we take the mean of all of them (the fastest and a reasonably sufficient method). Thus we obtain a dataset with a single observation per station and date, both for the citizen science stations and for the official air quality data.
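
The daily aggregation can be sketched in a few lines. This version uses only the standard library and assumes rows of the hypothetical form `(station_id, ISO timestamp, value)`; with pandas, which is part of the stack listed above, a group-by on station and date would serve the same purpose.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_means(rows):
    """Average readings per (station, calendar day).

    rows: iterable of (station_id, iso_timestamp, value) tuples.
    Returns a dict mapping (station_id, date) -> mean value.
    """
    buckets = defaultdict(list)
    for station_id, timestamp, value in rows:
        day = datetime.fromisoformat(timestamp).date()
        buckets[(station_id, day)].append(value)
    return {key: mean(values) for key, values in buckets.items()}
```

Replacing `mean` with `median` from the same module gives the median-based variant, which is less sensitive to outlying readings.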


After all of this we needed to form groups around the official stations so that we could compare them with the citizen data. To do so, we computed the distance between each official station and each citizen station, again using our “radical” approach from above.

Then we checked what the data looks like on a map:


However, the official stations we use are situated in Druzhba, Nadezhda, Hipodruma, Pavlovo and Mladost, which makes some of the points on the map incomparable with the official measurements. This is where we decided to identify the citizen stations nearest to the official ones and use them for our future model. At first, we were thinking about setting 3km as the maximum distance (well, Nadezhda is a big district after all). We then decided to be more reasonable and stayed within 500 meters of the official stations. After performing some tests, we decided to assign every citizen station to its nearest official station. This was done with the following transformations in Python:

Define Nearest stations
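
A minimal sketch of such an assignment, assuming haversine distances and dictionaries of `id -> (lat, lon)`, could look as follows. The function names and the data layout are illustrative and are not taken from the team's code.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres (mean Earth radius assumed)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


def assign_nearest(citizen, official):
    """Map each citizen station id to (nearest official id, distance in km).

    citizen, official: dicts of station id -> (lat, lon).
    """
    assignment = {}
    for cid, (clat, clon) in citizen.items():
        distances = (
            (oid, haversine_km(clat, clon, olat, olon))
            for oid, (olat, olon) in official.items()
        )
        assignment[cid] = min(distances, key=lambda pair: pair[1])
    return assignment
```

Because the distance is returned alongside the station id, a cut-off such as 500 meters or 3km can still be applied afterwards by filtering on the second element.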

Afterwards, we pooled the data from those “nearest” citizen stations and the official stations. Those are the observations that we will use later in our model.

But first, we checked what our data looks like plotted.

The climate factors measured by the citizen stations and the particulate matter measured by both official and citizen stations were combined in one dataset. Based on the correlations and histograms that we generated, we can make a few observations.

  • Colder weather definitely means more pollution (based on the temperature/PM correlation).
  • The data from citizen stations is relatively close to the data from the official stations.
  • Temperature and pressure are also correlated (obviously some physical laws enter the playground).
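
Statements like these can be backed with a number. As a minimal sketch, assuming the Pearson coefficient is the correlation in question (the post does not say which one was used), it can be computed for two equally long series like this:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A value near -1 for daily temperature against daily PM would support the first bullet, while a value near +1 for citizen against official PM would support the second.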


1. Removing Bias: Distance from official air quality stations to citizen science stations

As we know, there is some bias in the data from citizen science stations. To benchmark the citizen science results and examine the possible bias, we had already calculated the distance between each citizen station and each official measurement point. Relying on some academic sources, we first compared the trends of the two datasets by plotting the PM concentration over time from both sources for Nadezhda.


As seen on the graph, the observations from the two sources follow a similar trend over time. However, some serious differences occur. That is why we plotted the differences between the air pollution measurements from both types of sources.

Most of the citizen measurements appear to be reasonable: 50% have a difference within 150 points. But we needed to remove the bias from the remaining 50%.

To do that, we fit a regression that relates the temperature, humidity, pressure and P1 from the citizen station data to the concentration from the official air quality data. We do this to control for the bias in the citizen stations. The fitted values from the regression serve as the corrected P1 values for each citizen station.

The P1 values corrected by the OLS regression have smaller outliers, but at the same time their mean is greater than the mean of the official concentration data. The 25th, 50th and 70th percentiles of the corrected P1 values are higher than those of both the concentration values and the raw P1 values. The results we obtain with ridge regression are almost identical. We also tested whether scaling the data would change the corrected P1 and concluded that it does not. The mean of every corrected P1 version is much higher than the concentration mean. A regression tree produces a corrected P1 whose mean is closest to the concentration mean, but its IQR is the largest of all the correction methods we used.
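
The OLS and ridge corrections can be sketched with NumPy alone. The code below assumes that the official concentration is the response and that the citizen temperature, humidity, pressure and P1 are the predictors; the function names and the penalty parameter `lam` are illustrative, and the team's actual implementation may differ.

```python
import numpy as np


def fit_ridge(X, y, lam=0.0):
    """Fit y ~ intercept + X @ beta with an L2 penalty lam on beta.

    lam=0.0 gives ordinary least squares. The intercept is not penalised
    because the data is centred before solving.
    X: (n_samples, n_features) citizen predictors (assumed layout).
    y: (n_samples,) official concentration (assumed response).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + lam * np.eye(X.shape[1])
    beta = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta


def corrected_p1(X, intercept, beta):
    """Fitted values of the regression, used as the bias-corrected P1."""
    return intercept + np.asarray(X, dtype=float) @ beta
```

Fitting once with `lam=0.0` and once with a positive `lam` makes the OLS and ridge variants directly comparable. Note that the fitted coefficients of unpenalised OLS do not depend on how the predictors are scaled, which is consistent with the observation that scaling left the corrected P1 unchanged.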

Descriptive statistics of the fitted P1 are available below.

An idea for future research is to check whether aggregating the data with medians rather than means would improve the corrected P1 results. Another possible improvement is to include lags of the variables in the regression and to remove outliers (or substitute them with appropriate values).

2. ARIMA – can we forecast pollution

We decided that, after the data cleaning and wrangling, it is smart to keep it simple when approaching forecasting. We used ARIMA to predict the next 24h of air pollution for the citizen stations. First we applied auto ARIMA to check how the time series behave. Our first conclusion was that the residuals are large and most of them appear to be random. This can be explained by seasonality, human influences and other factors not captured by the model.


3. Map of the air pollution in our city

Based on the bias-corrected variables, we created a map that visualizes the air pollution in Sofia.


Our ARIMA models are in the oven…The best is yet to come 🙂

(“Ordinary things done consistently produce extraordinary results”  Keith Cunningham)


4 thoughts on “Datathon Air Sofia Solution – Telelink Televised by Teleloonies”

  1. 1

    I am thoroughly amused by your “radial” approach – occasionally, we don’t need fancy packages & many dependencies, just regular old math does the trick. Go big or go home :)!

    Good exploratory analysis so far. If I understand correctly, this is performed on the entire pooled dataset?

    Would be good to see some quantified indicators of agreement between official/citizen science stations & for meteo influence on any differences. Also, please do make sure to note how you define “closest to official” – not a trivial issue 🙂

  2. 0

    I like that you’ve included a section on utilized Technology and Methods 🙂

    The aim in the Business Understanding section is stated clearly.

    The Data Understanding section outlines well the key characteristics of the available datasets. Probably merging the official and the citizen datasets at the very beginning of the research consumed too much of your time for data prep. Application of several filtering rules prior to merging datasets might have helped.

    Modeling: the presented graph and the bullets of findings below it are a promising beginning. Looking forward to reading your full paper!
