Datathon cases

Datathon Air Sofia Case – One step closer to a better air quality and city

Dear Society, you should register for the Global Datathon 2018 – in order to see the case descriptions! 🙂


The Telelink Case – One step closer to a better air quality and city

Mentors’ guidelines for this case.

A study to predict air pollution for the next 24 hour period

Imagine waking up in Sofia on a winter morning. The light of the morning sun holds the promise of a pleasant day. Before going out, you check the weather forecast – 10% chance of rain or snow, a temperature of 5 °C and no wind – and leave your umbrella at home. What could possibly go wrong? As soon as you find yourself on the street and take your first breaths of the chilly winter air, you may feel it:

Bulgaria’s issue with air pollution …is not new! In the early 90s and before, there was a big issue with sulphur oxide (SOx) emissions from industry. With the die-down of large industry since then, the issue has moved toward particulate matter (PM). It has gained public attention in recent years (mostly in Sofia, but it is relevant everywhere!).

Source: European Environmental Agency: Bulgaria – air pollution country fact sheet 2017




Business problem formulation

In Sofia, air pollution norms were exceeded 70 times in the heating period from October 2017 to March 2018, a citizens’ initiative says. The day with the worst air pollution in Sofia was January 27, when the norm was exceeded six times over. Things got so out of control that even the European Court of Justice ruled against Bulgaria in a case brought by the European Commission over the country’s failure to implement measures to reduce air pollution. The two main causes of the air pollution are believed to be solid fuel heating and motor vehicle traffic.

AirBg measures the air pollution in Sofia using maps such as this one:

The units in the table on the right are called particulate matter (PM). This is the term used for a mixture of solid particles and liquid droplets found in the air. In particular, fine particles with diameter less than 10µm are called PM10. Prediction of PM10 is an important issue in control and reduction of pollutants in the air.

Source: US EPA



Research problem specification:

Forecast PM Pollution and Predict High Peaks of PM Concentration


Urban air quality is a complicated topic, because it depends on:

– Local meteorology (the weather):

  • Temperature, pressure, rainfall, humidity, wind…
  • Complex interactions (physical, but also chemical w/ other air pollutants)
  • Also upper-atmosphere (prevailing meteorological conditions)

– Local topography: …Sofia is in a valley

– Transboundary pollution: …some pollution is blown in from elsewhere

– National-level policies:

  • Household fuel subsidies
  • Vehicle standards

– Local-level policies and behaviour

  • Fuel use in households
  • Vehicle fleet breakdown


Predictive models for PM10 range from extremely simple to extremely complex, but accurately forecasting PM10 concentration remains elusive. To improve accuracy, the dataset should be chosen carefully and extended with the data that research suggests is relevant (Shen, 2018).

This case aims to predict high peaks of PM10 concentration and to forecast the pollution level. To be as accurate as possible and to deliver maximum value, we would like to predict those peaks and concentration levels within a 24-hour period. The research focuses on Bulgaria, for which publicly available data is provided: meteorological data (weather, humidity, wind), traffic data, PM10 data, etc. You may add any other data that you consider significant for more accurate predictions of PM10 concentration levels and forecasts of PM10 pollution.

The case has three levels of difficulty. To start solving the next level, participants need to complete the previous one:

PART ONE: Bias correction of citizen science measurements

  • “Official” measurements comply with the EU directives on air quality monitoring and can be used for regulatory purposes
  • …but are limited in number (only 5 in the whole city)
    • Funny case: last winter property prices noticeably dropped in the neighbourhoods which had “official” stations named after them!
  • Citizen science stations have very good coverage of the city
  • …but may carry instrumental biases – due to different measurement methods, different interaction with meteorology, etc…
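One simple way to approach the bias correction described above is to fit a linear relationship between a citizen science station and a nearby official station over the same hours, then apply it to the citizen readings. The sketch below is only an illustration under that assumption: the function names and the paired readings are hypothetical, and a real correction would likely also need humidity and temperature as inputs.

```python
from statistics import mean


def fit_linear_correction(citizen, official):
    """Fit official ≈ intercept + slope * citizen by ordinary least squares."""
    if len(citizen) != len(official) or len(citizen) < 2:
        raise ValueError("need at least two paired readings")
    mx, my = mean(citizen), mean(official)
    sxx = sum((x - mx) ** 2 for x in citizen)
    if sxx == 0:
        raise ValueError("citizen readings are constant; slope is undefined")
    sxy = sum((x - mx) * (y - my) for x, y in zip(citizen, official))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def apply_correction(values, intercept, slope):
    """Apply the fitted correction; concentrations are clipped at zero."""
    return [max(0.0, intercept + slope * v) for v in values]


# Made-up paired hourly PM10 readings in µg/m³ (not taken from the case data).
citizen_pm10 = [20.0, 40.0, 60.0, 80.0]
official_pm10 = [15.0, 25.0, 35.0, 45.0]

intercept, slope = fit_linear_correction(citizen_pm10, official_pm10)
corrected = apply_correction([100.0], intercept, slope)
```

With these made-up numbers the citizen sensor reads roughly twice the official increment, so the fitted slope is 0.5. Whether a single linear fit is adequate for the real sensors is something to check against the data.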

PART TWO: Next-day (24h average) forecast of PM10

  • Based on meteorological parameters (same as what you’d receive from a weather forecast)
  • …predict the 24h average PM10 concentration at each station
  • What is the chance of exceedance of 50 µg/m³ average for the day (EU air quality limit)?
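The exceedance question above can be turned into a probability once a forecast has an error estimate. The sketch below assumes, purely for illustration, that forecast errors are normally distributed with a known standard deviation; the 18-hour completeness rule in `daily_mean` is also an assumption of this example, not a requirement stated in the case.

```python
import math
from statistics import mean

EU_DAILY_LIMIT = 50.0  # µg/m³, the daily PM10 limit quoted in the case


def daily_mean(hourly_pm10, min_hours=18):
    """24h average of hourly values; None if too many hours are missing.

    min_hours is an assumed completeness threshold for this sketch.
    """
    valid = [v for v in hourly_pm10 if v is not None]
    if len(valid) < min_hours:
        return None
    return mean(valid)


def exceedance_probability(forecast, error_sd, limit=EU_DAILY_LIMIT):
    """P(true daily mean > limit), assuming errors ~ Normal(0, error_sd)."""
    if error_sd <= 0:
        raise ValueError("error_sd must be positive")
    z = (limit - forecast) / error_sd
    # 1 - Phi(z), written with the error function from the standard library.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


# A forecast exactly at the limit gives an even chance of exceedance.
p_at_limit = exceedance_probability(50.0, error_sd=10.0)
```

In practice `error_sd` could be estimated from the residuals of the model on held-out days. If the residuals are skewed, which is plausible for pollution peaks, an empirical or quantile-based estimate may be a better fit than the normal assumption.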

PART THREE: A gridded forecast for Sofia

  • Transfer station-level forecasts to locations not covered by network
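As a baseline for moving station-level forecasts onto a grid, inverse distance weighting (IDW) is one common and simple option. The sketch below is an assumption of this example rather than the method the organizers expect: the coordinates are made up, and it ignores topography, which the case provides as gridded data and which probably matters in a valley city.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def idw_interpolate(stations, lat, lon, power=2.0):
    """Estimate a value at (lat, lon) from (lat, lon, value) station tuples."""
    if not stations:
        raise ValueError("at least one station is required")
    num = den = 0.0
    for s_lat, s_lon, value in stations:
        d = haversine_km(lat, lon, s_lat, s_lon)
        if d < 1e-9:
            return value  # the grid point coincides with a station
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den


# Two hypothetical stations on the same latitude with forecast PM10 in µg/m³.
stations = [(42.70, 23.30, 40.0), (42.70, 23.34, 60.0)]
midpoint_value = idw_interpolate(stations, 42.70, 23.32)
```

A grid forecast is then a loop over grid cell centres calling `idw_interpolate`. Adding elevation as a covariate, or switching to a geostatistical method, are natural next steps to evaluate.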

Data description

Available data:

  • Citizen science air quality measurements, incl. temperature, humidity and pressure (many stations) and topography (gridded data) (as of 17th Sept):

Download the full dataset HERE

Additional data could be used:

– All sensor values in the last 5 minutes from the registered stations as JSON files, which are updated every minute ->

– Values from the last 5 minutes of a particular sensor ->



See the discussion for this case in the Data.Chat HERE

The industry experts for the Datathon

Ekaterina Marinova, Data Science Strategist at TELELINK

Starting with predicting customer behavior and sales, Ekaterina was excited to explore the world of Big Data during her Master’s degree at Rotterdam School of Management. There, she had the opportunity to work with companies on developing predictive analytics models, marketing retargeting strategy and the analytics process from A to Z using R, Alteryx, R Studio and Tableau. By exploring different statistical models and visualization tools, and by starting a project from scratch, she confirmed that this was a passion and something to work on in the future.

Currently, as part of the team of TELELINK, Ekaterina ensures that TELELINK as a company provides its customers with the opportunity to take advantage of the best data platform and technology to digitally transform. As a key player in architecting and building the information infrastructure of our customers, TELELINK is looking forward to finding interesting cases and solving them together with the greatest data scientists in the world.

With a passion for big data, Ekaterina is extremely interested in solutions, models and data applications in the Data and AI world.

You may contact Ekaterina regarding the case on our Data.Chat: @ekaterinamarina


Ivan Paspaldzhiev – Consultant at Denkstatt Bulgaria

Ivan is a Consultant at Denkstatt Bulgaria – part of Denkstatt Group and the premier sustainability consultancy in Central and Eastern Europe. His work as a sustainability consultant routinely transgresses disciplinary boundaries, being somewhere at the intersection between data & modelling, business decision making and environmental policy at the Bulgarian and EU level.

Whether it’s Python, R, GIS or just good old Excel, Ivan likes to view data crunching, models and statistics as just tools of the trade for answering interesting questions about the real world.

A taste includes:

  • Quantitative assessment of net (economic, environmental and social) impacts of business activities – for large Bulgarian and global players in the retail, food & beverage and mining sectors
  • The climate forecasting behind Sofia Municipality’s climate adaptation strategy
  • Analysis of the life-cycle impacts of plastics products & their substitutes for the European Commission’s Strategy on Plastics
  • Methodology development for integrating the biophysical impacts of land-use change in life-cycle assessment (under LUC4C FP7)

Prior to Denkstatt, Ivan was a Climate Impacts Scientist at the Met Office Hadley Centre, where he was a developer for the JULES land surface model (used for conducting global climate forecasts). Ivan holds a BSc (with First Class Honours) in Ecological and Environmental Sciences from the University of Edinburgh.

You may contact Ivan regarding the case on our Data.Chat: @paspaldzhiev

Expected Output and Paper

The result should be a clearly presented model that can be realistically implemented.

Article instructions

The main focal point for presenting each team’s results from the Datathon is the written article. It will be considered by the jury and will show how well the team has done the job.

Considering the limited time and resources typical of Big Data analysis, it is essential to follow a time-tested and many-project-tested methodology: CRISP-DM. You could read more at
The organizing team has tried to do most of the work on phases “1. Business Understanding” and “2. Data Understanding”, while the teams are expected to focus on phases 3, 4 and 5 (“Data Preparation”, “Modeling” and “Evaluation”), so the best solutions should have the best results in phase 5, Evaluation.
Phase “6. Deployment” mostly stays in the hands of the companies providing the case studies, as we aim to continue the process after the event. So stay tuned and follow the updates on the website of the event.

1. Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard can be used.

2. Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

3. Data Preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
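For this case, a typical preparation step might be to drop implausible sensor readings and aggregate the rest to hourly means. The sketch below shows one way to do that; the record layout (ISO timestamp, PM10 value) and the plausibility ceiling of 1000 µg/m³ are assumptions made for the example and should be adjusted to the actual dataset.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_means(readings, max_plausible=1000.0):
    """Aggregate (ISO timestamp, PM10) pairs into a dict of hourly means.

    Missing values and values outside [0, max_plausible] are skipped.
    max_plausible is an assumed ceiling, not a value given in the case.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        if value is None or not (0.0 <= value <= max_plausible):
            continue
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(vals) for hour, vals in sorted(buckets.items())}


# Made-up raw readings: one negative value and one missing value are dropped.
raw = [
    ("2018-01-27T08:05:00", 100.0),
    ("2018-01-27T08:35:00", 200.0),
    ("2018-01-27T08:50:00", -5.0),
    ("2018-01-27T09:10:00", None),
    ("2018-01-27T09:20:00", 80.0),
]
hourly = hourly_means(raw)
```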

4. Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.

5. Evaluation
At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
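For a forecast like the one in this case, evaluation could combine an overall error measure with a check of how well exceedance days are caught. The sketch below computes RMSE plus precision and recall for days above the 50 µg/m³ limit; the choice of metrics and the sample numbers are assumptions of this example, and the held-out days should come after the training period in time.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def exceedance_scores(actual, predicted, limit=50.0):
    """Precision and recall for predicting days with a mean above the limit."""
    hits = misses = false_alarms = 0
    for a, p in zip(actual, predicted):
        if a > limit and p > limit:
            hits += 1
        elif a > limit:
            misses += 1
        elif p > limit:
            false_alarms += 1
    recall = hits / (hits + misses) if hits + misses else None
    precision = hits / (hits + false_alarms) if hits + false_alarms else None
    return {"precision": precision, "recall": recall}


# Made-up daily mean PM10 values (µg/m³) for four held-out days.
actual = [30.0, 60.0, 70.0, 40.0]
predicted = [35.0, 55.0, 45.0, 52.0]

error = rmse(actual, predicted)
scores = exceedance_scores(actual, predicted)
```

Here one exceedance day is caught, one is missed and one is a false alarm, so both precision and recall are 0.5 even though the RMSE looks moderate. This is why peak detection is worth scoring separately from average error.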

6. Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.
