Sofia Air Case

Air pollution is one of the most serious problems in the world. To counter it, precautionary measures must be taken.

The main objectives of the Sofia Air Case were to predict air pollution in Sofia, Bulgaria, for the next 24-hour period and to predict the chance of the daily average exceeding 50 µg/m3 (the EU air quality limit).



The main objective of the Sofia Air Case, an initiative launched by our Data Science Society, was to predict the air pollution levels in Sofia, Bulgaria, over a 24-hour period. The case also attempted to predict the chance of these levels exceeding the 50 µg/m3 EU limit for acceptable air quality.


Recently, the European Court of Justice ruled against Bulgaria in a case brought by the European Commission over the country's failure to implement measures to reduce air pollution. In Sofia, air pollution limits are frequently exceeded: in the most worrying example yet, on the day with the worst measured air pollution, levels exceeded the recommended norm six times. Air pollution can cause a range of health issues in humans and also affects crops, animal life, forests, and water basins. It also contributes to the depletion of the ozone layer, which protects the Earth from the sun’s UV rays. Air pollution is one of the most serious problems facing the world today.

Case Summary

To attempt to solve the problem of air pollution, we must begin by understanding its causes and effects in more depth, and look for feasible ways to counter it.

With this in mind, the Data Science Society contacted several institutions and businesses to build the air pollution case on a sufficient amount of data from various sources. The data we collected can be found in Datathon Air Sofia Case – One step closer to a better air quality and city.

The data above was taken on 17 September 2018 from a civil initiative website that measures air pollution in Sofia and visualises it on maps, using hundreds of sensors around the capital of Bulgaria.

With the guidance of industry experts Ekaterina Marinova (Data Science Strategist at Telelink) and Ivan Paspaldzhiev (Consultant at Denkstatt), the data was collected and grouped into a data set comprising four files.

These files contain some of the most important information needed for this approach:

  • data from personal meteorological sensors with PM2.5 and PM10 and weather measurements from all parts of Bulgaria. (PM2.5 and PM10 denote particulate matter with a diameter of up to 2.5 and up to 10 micrometers (μm) respectively; particles between 2.5 and 10 μm are known as coarse particles)
  • data from 6 official stations in Sofia with PM10 measures
  • historical meteorological data from Sofia
  • topographical data for points around Sofia

The participants were challenged to predict air pollution in Sofia for the next 24-hour period and to predict the chance of the daily average exceeding 50 µg/m3 (the EU air quality limit).
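To make the exceedance target concrete, here is a minimal sketch of how a daily average could be computed from individual readings and compared with the 50 µg/m3 limit. The function name `daily_exceedance` and the sample readings are hypothetical and only illustrate the definition; they are not taken from the case data or from any team's solution.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

EU_DAILY_LIMIT = 50.0  # µg/m3, the daily average limit named in the case


def daily_exceedance(readings):
    """Map each date to (daily mean PM10, True if the mean exceeds the limit)."""
    by_day = defaultdict(list)
    for timestamp, pm10 in readings:
        by_day[timestamp.date()].append(pm10)
    result = {}
    for day, values in sorted(by_day.items()):
        average = mean(values)
        result[day] = (average, average > EU_DAILY_LIMIT)
    return result


# Hypothetical readings for illustration only (not from the case data).
sample = [
    (datetime(2018, 9, 10, 8), 40.0),
    (datetime(2018, 9, 10, 20), 70.0),
    (datetime(2018, 9, 11, 8), 20.0),
    (datetime(2018, 9, 11, 20), 30.0),
]
result = daily_exceedance(sample)
for day, (average, exceeded) in result.items():
    print(day, average, exceeded)
```

A model that outputs a probability of exceedance would be trained against the boolean flag, while a regression model would target the daily mean itself.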

Solutions and approach

The Case was open to the global community as a prediction challenge presented at two of the Data Science Society initiatives – Global Datathon 2018 and October Data Monthly Challenge.

After receiving the raw data, the participants were tasked with analyzing and visualizing it, finding missing information and choosing the best features to help them predict the levels of pollution on the following day.

[Figure: PM10 heat map over time]

This visualization shows the private meteorological stations within a radius around the center of Sofia. It displays PM10 readings, and the colors represent the following:

– no color: no data available

– green: areas with low levels of pollution

– yellow: areas with medium levels of pollution

– red: areas with very high levels of pollution

The next step for the participants was to find the best way to predict air pollution on the following day based on the data from the previous days.

The main idea was to use the data from 1-10 September to predict what would happen on the 11th, verify that prediction against the actual values for the 11th, then form an expectation for the 12th based on the updated 11-day dataset, and so on.
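The rolling scheme described above is often called walk-forward (expanding-window) evaluation. The sketch below shows one possible way to express it; the names `walk_forward` and `last_value`, and the daily values, are hypothetical placeholders, and any forecasting function could be plugged in instead.

```python
def walk_forward(series, initial_window, forecaster):
    """Return (actual, predicted) pairs from an expanding-window evaluation."""
    pairs = []
    for t in range(initial_window, len(series)):
        history = series[:t]             # everything observed before day t
        predicted = forecaster(history)  # one-step-ahead forecast for day t
        pairs.append((series[t], predicted))
    return pairs


def last_value(history):
    """Placeholder forecaster: tomorrow equals today."""
    return history[-1]


# Hypothetical daily PM10 averages for 1-13 September (illustrative values).
daily_pm10 = [30, 32, 35, 31, 40, 45, 50, 48, 42, 39, 41, 44, 47]
pairs = walk_forward(daily_pm10, initial_window=10, forecaster=last_value)
for actual, predicted in pairs:
    print(actual, predicted)
```

With an initial window of 10 days, the first forecast is for the 11th, the second uses 11 days of data to forecast the 12th, and so on.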

After the algorithm was verified, a separate round of tests was run. The tests used the same time window as the verification, but this time the results were used to calculate the accuracy of the model and to present the final results.

The data was split into a training set and a testing set. The testing set covered a fixed period (e.g. the last 20% of all data), consisting of the most recent time series data for each station.
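A chronological split of this kind could look like the sketch below. It assumes each station's readings are already sorted by time; the function name `chronological_split`, the station names and the values are hypothetical.

```python
def chronological_split(series_by_station, test_fraction=0.2):
    """Split each station's time-ordered readings into (train, test) lists.

    The most recent `test_fraction` of readings becomes the test set,
    so the model is never evaluated on data older than its training data.
    """
    split = {}
    for station, readings in series_by_station.items():
        n_test = max(1, round(len(readings) * test_fraction))
        split[station] = (readings[:-n_test], readings[-n_test:])
    return split


# Hypothetical stations and values for illustration only.
data = {
    "station_a": list(range(10)),
    "station_b": [5, 6, 7, 8, 9],
}
split = chronological_split(data)
for station, (train, test) in split.items():
    print(station, len(train), len(test))
```

Unlike a random split, this keeps the time order intact, which matters for time series because shuffling would leak future information into training.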


For the linear regression model, the following metrics were used for evaluation:

  • R-Squared
  • Root Mean Squared Error (RMSE)
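For reference, both regression metrics can be computed from paired actual and predicted values with their standard definitions. The sketch below uses made-up numbers and hypothetical function names.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root of the mean squared difference between actual and predicted."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(round(rmse(actual, predicted), 3))
print(round(r_squared(actual, predicted), 3))
```

RMSE is expressed in the same unit as the target (here µg/m3), while R-Squared is unitless and measures the share of variance the model explains.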

For the time series prediction models, the following metrics were used for evaluation of the model:

  • MAE – Mean Absolute Error
  • MAPE – Mean Absolute Percentage Error
  • MASE – Mean Absolute Scaled Error
  • RMSE – Root Mean Squared Error
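The first three metrics in this list can be sketched as follows, using their standard definitions. The function names and values are hypothetical. MASE is computed here in its common form, scaling the forecast MAE by the in-sample MAE of a one-step naive forecast on the training data.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error; undefined if any actual value is 0."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)


def mase(actual, predicted, training):
    """MAE scaled by the in-sample MAE of the one-step naive forecast."""
    naive_errors = [abs(training[i] - training[i - 1])
                    for i in range(1, len(training))]
    naive_mae = sum(naive_errors) / len(naive_errors)
    return mae(actual, predicted) / naive_mae


# Made-up values for illustration only.
training = [30.0, 34.0, 32.0, 38.0]
actual = [40.0, 50.0]
predicted = [44.0, 45.0]
print(mae(actual, predicted), mape(actual, predicted),
      mase(actual, predicted, training))
```

A MASE below 1 means the model beats the naive forecast on average; a value above 1 means it does worse.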

Time series models included the following:

  • Naive model – random walk
  • ARIMA – Auto-Regressive Integrated Moving Average
  • ARIMAX – Auto-Regressive Integrated Moving Average with exogenous (external) variables
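To illustrate the difference between the naive baseline and an autoregressive model, the sketch below implements a random-walk forecast and an AR(1) model fitted by least squares. AR(1) corresponds to ARIMA(1,0,0), the simplest member of the ARIMA family. This is a stand-in for illustration, not the teams' code; a full ARIMA or ARIMAX fit would normally rely on a statistics library. All names and values are hypothetical.

```python
def naive_forecast(history, steps=1):
    """Random-walk forecast: every future value equals the last observation."""
    return [history[-1]] * steps


def fit_ar1(history):
    """Least-squares estimate of (c, phi) in y[t] = c + phi * y[t-1] + e[t]."""
    x = history[:-1]
    y = history[1:]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    return mean_y - phi * mean_x, phi


def ar1_forecast(history, steps=1):
    """Iterate the fitted AR(1) equation forward from the last observation."""
    c, phi = fit_ar1(history)
    forecasts, last = [], history[-1]
    for _ in range(steps):
        last = c + phi * last
        forecasts.append(last)
    return forecasts


# Noise-free series generated by y[t] = 5 + 0.5 * y[t-1] (illustrative only).
series = [2.0, 6.0, 8.0, 9.0, 9.5]
print(naive_forecast(series, steps=2))
print(ar1_forecast(series, steps=2))
```

The naive forecast stays flat at the last value, while the AR(1) forecast moves towards the long-run level implied by the fitted coefficients.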


147 participants took part in the Datathon, 19 of whom attempted to solve the Sofia Air Case challenge. Working in teams, they produced 6 solutions using different approaches.

Interest in the case continued to gather pace at the Monthly Challenge, where 176 people wished to participate and 18 teams learned how to predict air pollution levels. While experimenting and learning by doing, the participants improved the solutions already developed at the Datathon, using new ideas to take their models further.

The Datathon is a one of a kind experience. It gives data enthusiasts the opportunity to grow in the ecosystem nurtured by interesting cases and real-world problems. This is an environment where talents can grow and where market maturity accelerator is being formed – the knowledge.

– Ekaterina Marinova, a Data Strategist at Telelink

The participants were required to write an article describing the steps they performed on the challenge, supported by all the code used in the process. Check all solutions here.

The best articles for each case competed against each other to win the competition. The best teams at the Datathon presented a solution using the ARIMA time series model, which achieved approximately 40% accuracy in predicting the air pollution over the next few hours or days.

As part of the event, we – Telelink – as a company could see the great benefit of supporting and reaching great “dataists” and together with them solve the current society challenges. This autumn we decided to spend some time together and focus on one of the biggest community challenges – the quality of the air we breathe in every day. The case is important as it is aiming to give the real explanation behind the case, and clear and accurate forecast of the pollution levels within a 24h time frame. By working on this solution we want to get one step closer to a better air quality and city together.
We believe that together business and data science society can create a mature market where talents can grow.

– Ekaterina Marinova, a Data Strategist at Telelink

These solutions can be further optimized to work alongside sensors not only in Sofia, but in any country or city in the world where stations are equipped to measure air pollution levels. Implementing these solutions, especially in the world’s megacities, could lead to substantial reductions in the damaging effects of air pollutants on health in both the short and long term.

Currently, a community interest group is working on a user interface for visualizing the air pollution predictions!

Best solutions during the Global Datathon

Best solutions during the Monthly Challenge
