Prediction systems

Datathon Sofia Air Solution – Telelink Case Solution


Telelink Case Solution
Team Dimas

The Team

– apetkov
– desinik
– rdimitrov
– melania-berbatova
– vrategov

Github Repo:

The main workflow happens over at our github page. You can read the latest version of this article here:

## Content

0. Data

We were given the following 4 datasets:

Air – data from personal sensors with PM10 and PM2.5 and weather measurements (whole Bulgaria)
EEA  – data from 6 official stations in Sofia with PM10 measures – historical meteorological data from Sofia – topological data for points around Sofia

 1. Business Understanding

Air pollution is one of the most serious problems in the world. It refers to the contamination of the atmosphere by harmful chemicals or biological materials. To solve the problem of air pollution, it’s necessary to understand the issues and look for ways to counter it.

Air pollution can cause health effects in both short-term and long-term health effects.

Air pollution causes damage to crops, animals, forests, and bodies of water. It also contributes to the depletion of the ozone layer, which protects the Earth from the sun’s UV rays.
In Bulgaria, EEA collects air pollution statistics. It’s important to study these statistics because they show how polluted the air has become in various places around the country.

In Sofia, air pollution norms were exceeded a lot of times in the past few years. The day with the worst air pollution in Sofia the norm was exceeded six times over. The two main reasons for the air pollution are believed to be solid fuel heating and motor vehicle traffic.

The main objectives are to predict the air pollution in Sofia for the next 24 hour period and to predict the chance of exceedance of 50 µg/m3 average for the day (EU air quality limit).

2. Data Understanding

Official stations – EEA_DATA

We were provided with PM10 daily measurements from 5 stations in Sofia for the time years 2013-2018 from 5 stations in Sofia. Unfortunately, the data was quite inconsistent. Some of the inconsistencies were:
– for the station with id #9484 we have measurements for only the first 3 years and for station #60881 only for 2018. We decided to exclude these stations, so not to add additional bias to the dataset.
– for 2013, 2014 and 2015 the granularity of observations was per day, whereas for the new years it was per hour. There were between 2 and 48 observations per day. We aggregated the hourly observations gathering the minimum, maximum, mean and  variance of the observations per day per one station
– we had on average around 15 missing days for years 2013 – 2016. We filled the missing values using linear interpolation.
– for stations #9421, #9572, #9616, #9642 observations for year 2017 started on November 27, totally missing the values from the previous days. For year 2018, observed day are up to September 14.

Daily mean values of PM vs the number of observations per day for station 9421 can be explored in the following graphics:

Personal sensors – Air Tube

Files: several csv-s with identical structure.

Granularity: hourly, for each measurement station

  • date
  • geohash
  • p1 – whole number
  • p2 – whole number
  • humidity
  • pressure

The data from Air tube datasets is hourly from 6 September 2017 20:00 UTC to 16 August 2018 12:00 UTC. We suspect there are a lot of either measurment errors or special values used when measurement is missing.

Visualization on PM10 on the map of Sofia

The following visualization is showing the stations in a radius around the center of Sofia. This visualization is for PM10 and the colors are the following:
– no color: there are any measured values
– green: there is data with low values
– yellow: there are values with middle values
– red: there are values, which are really high

 Sofia PM10 Pollution Heat Map Throughout Time

2. Data Understanding/DU_400-pm10-heat-map-over-time.gif

After using the threshold of 50 µg/m3, the red color is equivalent to every value, which exceeds the EU air quality limit. See in Chart DU_200 02 below.

 PM10 Pollution Above Treshold of 50 µg/m3.

The following are the Graphics of the different parameters. They show that the data has a big variety in the values, so it needs a lot of preparation before being used in the modeling.


The data is scraped from Wunderground. Has its own Readme file with a description.

Meteorological data may have a lot of missing values.

Granularity: year, month, day, all in the same city – Sofia Airport.

Idea: get additional measurements via an API – e.g. DarkSkyAPI

The data is from 1.Jan. 2012 until 17.Sep.2018. All of the missing values (-9999) are in the data for Daily minimum and maximum precipitation amount, and in Daily average visibility, so they will not be used for the prediction.

Daily temperature:


Daily dew point temperature

Daily relative humidity


Daily wind speed

Daily surface pressure

TOPOLOGICAL DATA for every station – TOPO-data

In the Topological data there is information about the latitude,  longitude and elevation for 196 points. No time component.

 3. Data Preparation

Data selection

The custom stations were located all over Bulgaria. To select only the relevant stations, we decoded each station’s geohash to coordinates, we drove an imaginary circle with the radius from the center of Sofia to German village and selected only the stations which coordinates were falling into the circle.

Data aggregation

We decided to not differentiate between official stations and personal sensors. All data was aggregated to the same format of unique (station_id, date) tuple. Personal sensors were given surrogate ids.
As for some stations, the measurements were daily and for other hourly, where the granularity was per hour, we aggregated data by minimum, maximum, average and standard deviation.
Official stations did not have weather data, so we merged their data with the weather data from the Sofia Airport meteo-station.

Feature engineering

We used the dates of the records to extract additional time information as if the corresponding day is a weekday or weekend, if it falls in the official holiday calendar and weather falls into a long weekend (between a holiday and a weekend). We expect that a lot of people leave Sofia during holidays/long weekends and PM levels decrease.

Data validation

Our algorithm for validating the results from the custom station was:

  1. For every custom station, computing the distance from it to all official station
  2. Filter only the stations that have an official station with distance in the first quartile.
  3. Select the nearest official station.
  4. Check if the PM10 from the sensor is no more than 3 times bigger than the official one:
  • if it falls in the limit – take the data as valid;
  • if not – replace with the official stations.

Data enrichment.

We added missing historical values of the dataset using linear interpolation. We also looked if we can add traffic data to our data set, but was unable to find relative information for the time period 2013-2018).


4. Modeling

The main challenge of the case is that we have two dimensions of the observations – time and geographical position. Having more than 650 sensors, a traditional Time Series approach such as ARIMA cannot properly model the data. One possible solution can be to build 650models predicting the average PM10 values for the following day for every station and then aggregate the using ensemble techniques such as voting to gather a total prediction for Sofia. Alternatively, we can employ a different type of model which can incorporate all weather station data into a single predictive model.

Model Baseline

A baseline model is useful to determine how much a more advanced model can contribute to improving the overall prediction accuracy. The exact type of model dpeends on which approach to aggregating the model is used.

Traditional time series models can make use of a “naive” model. The simplest approach is to create a random walk model, which translates to an ARIMA(0, 1, 0) with an optional drift component. Seasonal ARIMA models can be modeled similarly by employing an ARIMA(0, 0, 0)(0, 1, 0)m where m is the seasonal component.

MOdels, which incorporate spatial and temporal components into a single prediction, do allow a simple baseline to be computed. For these models, we only evaluate the model accuracy as per the evaluation metrics, mentioned in section 5. Evaluation.

Ensemble approach

The ensemble approach requires that the dataset be split into separate datasets for each unique measurement station. We then compute different time series models for comparison purposes. The end result is that for each model type we need the sam enumber of models as there are valid stations in the dataset. Time series models include the following:

  1. Naive model – random walk
  2. ARIMA – Auto-Regressive Integrated Moving Average
  3. ARIMAX – Auto-Regressive Integrated Moving Average with external non-temporal variables

To arrive at a single prediction for the entire city of Sofia we then aggregate the predictions of all models based on a simple arithmetic mean of the predictied values. One possible enhancement of the ensemble is to apply weights to each individual model that are proportional to that model’s overall accuracy. For example, we can divide the current model’s Root Mean-Squared Error by the sum of Root Mean-Squared Errors of all individual models.

Single-model approach

We can avoid the hurdle of aggregating individual models by applying a different model type that incorporates both spacial and temporal data into its algorithm. Traditional statistics models that allow this revolve around panel regression, specifically using Fixed effects or Random effects models, where each individual station is tracked over time and temporal observations are compared to an average value of that specific station over time. The final result arrives at a single set of regressor weights that are valid for all stations at the same time.

Simple linear regression

We have tried simple linear regression, too. Further data transformation was performed. We have transformed datetime values to their sine and cosine equivalents. By this, we tried to control for seasonality in the results. We have introduced also Year. Further meteo data was used to model the PM10 avg values. These were the daily average temperature, average precipitation, average surface pressure, and average wind speed. We have achieved adjusted R^2 0.2818.

 5. Evaluation

We split the data into a training set and a testing set. For the testing data set, we use a fixed period (e.g. 20%) of the most recent time series data for each station.

For the linear regression model, R-Squared and Root-Mean-Squared-Error are used as metrics to evaluate the model.

For the time series prediction models, the following metrics are used for evaluation of the model:

  • MAE – Mean Absolute Error
  • MAPE – Mean Absolute Percentage Error
  • MASE – Mean Absolute Scaled Error
  • RMSE – Root Mean Squared Error

 6. Deployment and future improvements

Getting weather forecast

If our model gets into production, we can use AccuWether’s API to get daily weather forecast data for the next 24 hours. Here is the data we can get from the API:

Retraining the model

Once the model is into production, it can be retrained with new data every 3 months, so it can be monitored and validated over time.

Future improvements

Articles for similar cases for other big cities as Madrid, Spain and Beijing, China show that it is useful to add traffic information as predictive values in the model. Right now, we could not find publicly available historical data for traffic in Sofia. We found data for the number of daily vehicle accidents in Sofia, which can be used as an approximation for the traffic, but unfortunately, it is only available for years 203-2014.

Our solution for the problem of the lack of the traffic data will be to start gathering hourly information for the traffic in several regions in Sofia, using the Google Distance Matrix API.



7. Documentation


[1] Air Pollution: Understanding the Problem and Ways to Help Solve It ,
[2] CHRISTOPHER NICASTRO, Bulgaria’s Air Pollution Problem: Could it get any Worse?,
[3] Dan Wei, Predicting air pollution level in a specific city,,%20Predicting%20air%20pollution%20level%20in%20a%20specific%20city.pdf
[4] Predicting Air Pollution in Madrid,


Share this

5 thoughts on “Datathon Sofia Air Solution – Telelink Case Solution

  1. 0

    Hi all!

    Good focus on data quality so far and cool use of maps for exploratory analysis. Well-spotted on missing values in the ‘official’ (EEA) dataset!

    For data enriching, did you have to deal with cases where there are large chunks of missing hours/days – e.g. how would you fill values missing values between 2PM and 8PM when all values are missing? Does your approach aim to handle this?

    Good notes on future improvements to be made. Esp. like the potential use of Google’s traffic data – too often folk focus on reinventing the wheel with primary measurements, when more and more often these days, our benevolent corporate overlords have already done the legwork for us!

  2. 0

    Hi, guys, good job!

    Business Understanding: the text is relevant and the research objectives are stated clearly.

    Data Understanding: I like very much application of heatmaps so as to visualize the air pollution information contained in the citizen’s data set. You’ve did a good analysis on the issues related to the data in the set with official measurements. Yet, pay attention that citizen’s dataset spans from 2017 to 2018. Therefore, taking years 2013-2016 as training set and 2017-2018 as a test set would work only if you are to predict air pollution at the main stations. However, the objective is to deliver forecasts for citizen stations.

    Including a section on future improvements and a list of references is an advantage.

  3. 0

    @all I do not completely understand this part:

    Chech if the PM10 from the sensor is no more than 3 times bogger than the official one:

    if it falls in the limit – take the data ai valid;
    if not – replace with the official stations.

    What do you mean by “replace”?

    1. 0

      Hi, there are some typos here.

      This is regarding the task to validate if the citizens data are valid and trusted. So, we have decoded the geohashes and ended up with the locations of the citizens’ stations. After that we calculated the distances between a station and all official stations. We grouped by datee and station and checked if the mean measure for a day at a particular station is 3 times bigger than the official mean measure. If it is, then we assumed that there is a measurement error and we replace this value with the official measurement value. Here, the cutoff is somewhat subjective. We were thinking to compare with the 3 standard deviations intervals but aproach will not be appropriate as we are not working with a normal distribution (we did not run a formal tests, but the data cannot be with a value below 0, so another distribution should be better, probably gamma). So this constant, 3, is just arbitrary.

Leave a Reply