
Datathon Sofia Air 2.0 Case


Sofia Air 2.0

A study of the static factors that affect air pollution levels

Note: Because of the high methodological complexity and the sheer size of the data, the expert team authoring this case decided to postpone the traffic data to one of our next cases. The current case (which is complex enough on its own) therefore focuses on the static polluting factors: households, industrial complexes, and building sites.

Imagine waking up in Sofia on a winter morning. The news is on … and again the levels of air pollution are higher today. You then check the European Environment Agency and see the following picture:

Such statistics make us think about how we can protect ourselves from breathing the poor-quality air outside. However, there is another point that is loudly discussed, poorly proven, and rarely acted on by us, the citizens: the factors that affect air quality index levels.

Business problem formulation

According to the World Health Organization (WHO), 4.2 million people die every year as a result of exposure to ambient (outdoor) air pollution, 3.8 million die due to household exposure to smoke from dirty cookstoves and fuels, and 91% of the world’s population lives in places where air pollution levels exceed WHO guideline limits.

In Sofia, air pollution norms were exceeded 70 times in the heating period from October 2017 to March 2018, according to a citizens’ initiative. The day with the worst air pollution in Sofia was January 27, when the norm was exceeded six times over.

Things got so out of control that even the European Court of Justice ruled against Bulgaria in a case brought by the European Commission against the country over its failure to implement measures to reduce air pollution. The two main reasons for air pollution are believed to be solid fuel heating and motor vehicle traffic.

The Air Quality Index is used to determine the level of air pollution across different regions worldwide. As part of it, the levels of particulate matter (PM) are measured as well. This is the term used for a mixture of solid particles and liquid droplets found in the air. In particular, particles with a diameter of less than 10 µm are called PM10.

Source: US EPA

Research problem specification:

Air pollution is a hot topic today. The municipality is investing a lot of effort and resources in measuring the exact values of gases and particulate matter in the air in order to determine its quality. Last time, we focused on supporting this effort by unifying more data sources and by calibration and clustering, so that we could have a clearer view of air pollution levels up to 24 hours ahead.

The topic does not end there, though. Those air pollution levels are affected by variables and factors that we can guess at, but for which we have no clear definition or validated estimate. To obtain one, we will investigate the different factors that might be affecting air pollution levels.

The research focuses on Bulgaria, which is also the area the data supports. The public data available to you consists of meteorological data (weather, humidity, wind), traffic data, air pollutant information collected from official and unofficial information streams, and the population’s heating choices.

The goal of the assignment is an advanced exploration of the dependencies, correlations, and factors that define air pollution. This is the next step towards a complete, holistic, data-driven account of the social topic of Sofia’s air quality.

Urban air quality is a complicated topic because it depends on:

– Local meteorology (the weather):

Temperature, pressure, rainfall, humidity, wind…

Complex interactions (physical, but also chemical w/ other air pollutants)

Also, upper-atmosphere (prevailing meteorological conditions)

– Local topography: …Sofia is in a valley

– Transboundary pollution: …some pollution is blown in from elsewhere

– National-level policies:

· Household fuel subsidies

· Industrial and building standards

– Local-level policies and behaviour

· Fuel use in households

· Industrial pollutants

· Buildings and constructions


The research focuses on Bulgaria, and the public data available to you consists of meteorological data (weather, humidity, wind), traffic data, PM10 data, etc. Add any further data that you consider significant for more accurate prediction and forecasting of PM10 concentration levels.

The task

The task is to model the factors that are explanatory to PM10. Please, be advised that each and every factor is important and interesting to the model and each factor has a preparation step and complexity that you should take into consideration before beginning.

There are three static sources of air pollution considered in the current case:

Complexity level 1. Industrial pollution
Approximate number of pollutants: 70

Methodology to determine pollution: as given by the data

Complexity level 2. Household pollution
Approximate number of pollutants: 40000

Methodology to determine pollution: 3 types of fuel, optional more precise HDD methodology

Complexity level 3. Building sites pollution
Approximate number of pollutants: 1000

Methodology to determine pollution: EMEP/EEA Inventory Guidebook methodology

Your factor modelling should result in a geolocation of the resulting pollution. For this purpose you can adopt any type of grid, clustering, or other approach for geolocation and visualization at each point in time during the inspected period.

The proposed approach has the following steps (you should probably go in order of complexity 3 times through the list):

  1. Factors modelling
    1. Use the Weather stability profile data to determine the parameters for the Dispersion model
    2. Use X, Y, Z, and emission data for each emitting device to determine the pollution intensity by the Dispersion model
    3. Add to the model other specifics such as: working/non-working day; topography; meteorological data
  2. Geolocation of factors
    1. Use longitude and latitude to determine geolocation of each emitting device
    2. Use the results of the Dispersion model to map the pollution of each device (grouped in types) over Sofia, for each day of the set
    3. Make a prediction model for each pollutant factor using the official measurements data at the four stations.
  3. Visualization and research description
    1. Describe all your algorithms, code, analysis and results in the team article
    2. Visualize the industrial pollution on a Map of Sofia for each day (preferably in animation)
  4. Leaderboard submission
    1. Based on your model, predict the level of pollution by each factor for the dates of the test set.
    2. Accumulate the predictions of the emissions of all pollutants of the same type at the geolocations of the four official stations for every day of the test set.
    3. Collect the results from Levels 1 – 3 in one general result matrix, following the methodology guidelines.
    4. Submit the result matrix to the leaderboard of the case and see your current score and current rank.


Methodological information

  1. Dispersion model (needed for Level 1 – 3)

In order to solve Levels 1, 2 and 3, you need a dispersion model to approximate the spread of PM10 around each emission device.

Emissions from most emission devices (e.g. industrial installations, households using solid fuels, building sites) are given at stack height (e.g. what comes out of the factory chimney). Industrial stacks are positioned high up so that some of the pollution they emit is diffused away from the ground (so it doesn’t reach anyone who could be affected). So, the emissions at the ground (reaching people) need to be estimated.

This can be achieved via the Gaussian dispersion equation, commonly used in air quality models:

Basically, this equation tells us how a plume of emissions spreads. We don’t have historic daily wind direction data to provide, so assume that the plume spreads radially from the emissions point.
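The equation image is missing from this copy of the case. For reference, the textbook Gaussian plume equation for a continuous point source with ground reflection is usually written as below. This is the standard form, not necessarily the exact figure the case authors used. The symbols Q (emission rate), u (wind speed), H (emission height) and σz (vertical spread) are the usual textbook names and are assumptions here, so check how they map onto the σy and σx named in this case.

```latex
C(x, y, z) = \frac{Q}{2\pi\, u\, \sigma_y \sigma_z}
  \exp\!\left(-\frac{y^2}{2\sigma_y^2}\right)
  \left[
    \exp\!\left(-\frac{(z - H)^2}{2\sigma_z^2}\right)
    + \exp\!\left(-\frac{(z + H)^2}{2\sigma_z^2}\right)
  \right]
```

Here C is the concentration at downwind distance x, crosswind offset y and height z, and the spread parameters grow with x according to the stability class. One possible reading of the radial-spread assumption is to set y = 0 and take x as the distance from the emitter to the point of interest; at ground level (z = 0) the two exponentials in the bracket become equal.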

σy and σx are to be looked up based on the Pasquill stability class of the atmosphere. Basically, the Pasquill class tells us what the state of the atmosphere is in terms of turbulence, which is directly related to its ability to disperse pollutants from the ground.

σy and σx values based on Pasquill stability are given as follows:

Where the stability class can be determined as follows:

In the above table, atmospheric stability is correlated with the vertical temperature gradient of the atmosphere only – this can be calculated from the University of Wyoming radiosonde data (change in temperature with height). This is a simplification given that we only have average wind speed data available. A more precise estimate also depends on the deviation of wind direction during the day. If anyone has ideas on where to scrape historic wind direction data… Otherwise, use the info already provided. Pasquill class G is a subdivision of class F – consider these together.

Data on the state of the upper atmosphere from radiosonde probes over Sofia (collected daily at 12Z), from the University of Wyoming, is used to calculate the temperature profile of the atmosphere with height (dT/dm, where T is temperature and m is height).
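As a minimal sketch of that calculation, the temperature gradient for one day can be estimated as the least-squares slope of temperature against height. The function name and the choice to fit all supplied levels are assumptions; which layer depth to use, and how the gradient maps to a Pasquill class, should follow the stability table of the case.

```python
def temperature_gradient(profile):
    """Least-squares slope dT/dm (degrees C per metre) of (height_m, temp_c) pairs."""
    n = len(profile)
    if n < 2:
        raise ValueError("need at least two radiosonde levels")
    mean_h = sum(h for h, _ in profile) / n
    mean_t = sum(t for _, t in profile) / n
    # Covariance of height and temperature, and variance of height.
    cov = sum((h - mean_h) * (t - mean_t) for h, t in profile)
    var = sum((h - mean_h) ** 2 for h, _ in profile)
    if var == 0:
        raise ValueError("all levels have the same height")
    return cov / var


# The three levels shown in the radiosonde data sample for 2016-11-01.
sample_profile = [(595, 9.6), (663, 7.6), (844, 5.4)]
gradient_per_100m = 100 * temperature_gradient(sample_profile)
```

For the sample levels the gradient is negative (temperature falls with height), at roughly 1.6 °C per 100 m.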

The output of the methodology is PM10 emissions in g. To make it comparable with measured values from air quality stations (in micrograms per m3 of air), divide by the density of air (1225 g/m3 at standard temperature and sea level).

2. Model household emissions (needed for Level 2)

The use of different types of heating fuel can be converted to PM10 emissions by multiplying by the following emission factors, calculated by denkstatt (based on 2015 data, but usable as representative of an average year):

Fuel for heating    PM10 emissions (grams/household/year)
Solid fuels         1039
Liquid fuels        3682
Gaseous fuels       4017
Biomass             1478

Assume height of all emission devices = 10 m

The emission factors given estimate the total yearly PM10 emissions of a household. In order to use these to estimate daily emissions:

  • Simple approach – divide by 190 (length of the heating season for Sofia) to obtain an average daily emissions factor. Sofia’s heating season lasts from the 15th of October to the 23rd of April
  • More detailed (and accurate!) approach – use data on daily temperatures to calculate heating degree days (HDDs – a measure of energy demand based on outside temperatures) and distribute emissions proportionally.

Calculating HDDs involves comparing the outside temperature with a base temperature Tbase (think of it as an average temperature inside a building). The following algorithm is used, with Tbase = 19°C for Sofia, where Tmin and Tmax are the daily minimum and maximum temperatures:

If Tmin > Tbase: HDD = 0
Else if (Tmax + Tmin)/2 > Tbase: HDD = (Tbase - Tmin)/4
Else if Tmax >= Tbase: HDD = (Tbase - Tmin)/2 - (Tmax - Tbase)/4
Else (Tmax < Tbase): HDD = Tbase - (Tmax + Tmin)/2

The yearly emissions of a household can then be allocated on a daily basis proportionally to the HDDs.
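The HDD rules and the proportional allocation can be sketched in a few lines of Python. The function names are illustrative, and the three-day input below is only the weather data sample; in practice the allocation would run over every day of the year so that the daily shares sum to the yearly total.

```python
T_BASE = 19.0  # base temperature for Sofia, as given in the case


def heating_degree_days(t_min, t_max, t_base=T_BASE):
    """Daily HDD from minimum and maximum temperature, following the four rules."""
    if t_min > t_base:
        return 0.0
    if (t_max + t_min) / 2 > t_base:
        return (t_base - t_min) / 4
    if t_max >= t_base:
        return (t_base - t_min) / 2 - (t_max - t_base) / 4
    return t_base - (t_max + t_min) / 2


def allocate_by_hdd(yearly_emissions_g, daily_min_max):
    """Split a yearly total over days in proportion to each day's HDD."""
    hdds = [heating_degree_days(t_min, t_max) for t_min, t_max in daily_min_max]
    total = sum(hdds)
    if total == 0:
        return [0.0] * len(hdds)
    return [yearly_emissions_g * hdd / total for hdd in hdds]


# (TASMIN, TASMAX) for the three days in the weather data sample.
sample_days = [(0.00, 12.78), (-1.67, 15.56), (3.33, 13.33)]
daily_share_g = allocate_by_hdd(1039, sample_days)  # solid-fuel household factor
```

Colder days receive a larger share, which is the point of preferring HDDs over the flat division by 190.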

3. Construction Emission estimation (needed for Level 3)

Construction emissions are to be estimated as per the EMEP/EEA Inventory Guidebook 2016 Tier 1 methodology (pages 5 to 12) (download it here: 2-a-5-b-construction).

Use the Tier 1 algorithm (really, it’s just an equation) to derive estimates of PM10 emissions. Assume sandy soils, as we don’t have a separate dataset for this (last term in the Tier 1 equation).

The Thornthwaite precipitation-evaporation index (PE) can be calculated with the given equation from the meteorological data provided (preferred option), or can be assumed based on the average climate of Sofia.

The cadastral dataset provided does not have info on construction operations’ footprints (i.e. area affected), so use recommended values from section 3.2.4.
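As a rough sketch, the Tier 1 equation can be wrapped in a single function. The form used here (emission factor × affected area × duration × (1 − control efficiency) × 24/PE × silt content/9%) is written from memory of the Guidebook chapter and should be checked against pages 5 to 12 before use. All parameter names are illustrative, and no Guidebook table values are hard-coded: the emission factor, the area from section 3.2.4 and the silt content for sandy soil must be looked up there.

```python
def tier1_construction_pm10_kg(
    ef_kg_per_m2_year,   # emission factor for the construction type (Guidebook table)
    area_m2,             # affected area (recommended values in section 3.2.4)
    duration_years,      # duration of the construction activity
    control_efficiency,  # fraction of emissions removed by controls, 0..1
    pe_index,            # Thornthwaite precipitation-evaporation index
    silt_percent,        # soil silt content in percent (sandy soil assumed by the case)
):
    """PM10 emissions in kg, assuming the Tier 1 form described in the lead-in."""
    if pe_index <= 0:
        raise ValueError("PE index must be positive")
    if not 0 <= control_efficiency <= 1:
        raise ValueError("control efficiency must be between 0 and 1")
    return (
        ef_kg_per_m2_year
        * area_m2
        * duration_years
        * (1 - control_efficiency)
        * (24.0 / pe_index)
        * (silt_percent / 9.0)
    )
```

With PE = 24 and a silt content of 9% the two correction terms equal 1, so the result reduces to emission factor × area × duration, which is a convenient sanity check.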

A more in-depth approach would attempt, where possible, to differentiate between types of building projects (residential, non-residential, etc.) based on the information provided in the cadastral dataset.
The output of the methodology is PM10 emissions in kg. To make it comparable with measured values from air quality stations (in micrograms per m3 of air), divide by the density of air (1.225 kg/m3 at standard temperature and sea level).

Assume the following heights for the dispersion model

Type              Height (m)
Small housing     10
Big housing       15
Non-residential   15
Infrastructure    5

4. Leaderboard instructions

Info on the four official air quality measurement stations:

AirQualityStationEoICode CommonName Longitude Latitude
BG0040A Nadezhda 23.310972 42.732292
BG0050A Hipodruma 23.296786 42.680558
BG0052A Druzhba 23.400164 42.666508
BG0073A IAOS/Pavlovo 23.268403 42.669797
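To accumulate pollution at the stations you need the distance from each emitting device to each station. A minimal sketch using the haversine great-circle formula is shown below. The helper names and the mean Earth radius of 6,371 km are assumptions, and any projection or grid of your choice would work equally well. Note that the table lists longitude first, while the dictionary stores (latitude, longitude).

```python
import math

# (latitude, longitude) of the four official stations, from the table above.
STATIONS = {
    "BG0040A": (42.732292, 23.310972),  # Nadezhda
    "BG0050A": (42.680558, 23.296786),  # Hipodruma
    "BG0052A": (42.666508, 23.400164),  # Druzhba
    "BG0073A": (42.669797, 23.268403),  # IAOS/Pavlovo
}


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * radius_m * math.asin(math.sqrt(a))


def distances_to_stations(lat, lon):
    """Distance in metres from one emitter to every official station."""
    return {
        code: haversine_m(lat, lon, s_lat, s_lon)
        for code, (s_lat, s_lon) in STATIONS.items()
    }
```

These distances can then be fed into the dispersion model as the source-to-receptor distance for each device.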

You should prepare your result matrix, which includes all your scores (see the table below). Note that even if you have not completed all levels yet, you CAN STILL submit the result matrix and see your score. If you do not have a score for a given level, input 0 in the corresponding fields.

The resulting score from the leaderboard accounts for accuracy of your result but also for the level of complexity of the problem. The score penalizes overfitting.

Also, about two hours before the end of the Datathon a special test set will be released with all data supplied here but for different dates. You must use it to calculate your final leaderboard submission, which would determine your official score for the Datathon. Please, be advised that before the jury makes any decision regarding who qualifies for the final, they would read through the article of each leading team.

A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4


A1 – industrial pollution at station BG0040A

A2 – industrial pollution at station BG0050A

A3 – industrial pollution at station BG0052A

A4 – industrial pollution at station BG0073A

B1 – household pollution at station BG0040A

B2 – household pollution at station BG0050A

B3 – household pollution at station BG0052A

B4 – household pollution at station BG0073A

C1 – building site pollution at station BG0040A

C2 – building site pollution at station BG0050A

C3 – building site pollution at station BG0052A

C4 – building site pollution at station BG0073A

0 – date 1 of the test set

1 – date 2 of the test set

2 – date 3 of the test set

3 – date 4 of the test set

4 – date 5 of the test set
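The layout above can be assembled with a small helper like the one below. It is only a sketch: the input structure, the level names and the CSV output are assumptions, since the exact file format expected by the leaderboard is not stated here. Levels that are not done yet are filled with 0, as instructed.

```python
import csv
import io

STATION_ORDER = ["BG0040A", "BG0050A", "BG0052A", "BG0073A"]  # columns 1..4
LEVEL_PREFIX = {"industrial": "A", "household": "B", "building": "C"}
COLUMNS = [f"{prefix}{i}" for prefix in "ABC" for i in range(1, 5)]


def build_result_matrix(predictions, n_dates=5):
    """predictions: {level: {station_code: [value per test date]}}.

    Returns one dict per test date (row index 0..n_dates-1) keyed by A1..C4.
    Missing levels or stations are filled with 0.
    """
    rows = []
    for date_index in range(n_dates):
        row = {}
        for level, prefix in LEVEL_PREFIX.items():
            for i, station in enumerate(STATION_ORDER, start=1):
                series = predictions.get(level, {}).get(station)
                row[f"{prefix}{i}"] = series[date_index] if series is not None else 0
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialise the matrix with a header row, columns in A1..C4 order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Example: only Level 1 is finished, and only for one station.
example = build_result_matrix({"industrial": {"BG0040A": [1.5, 2.0, 0.5, 0.1, 3.0]}})
```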


Data description

Access the data here…

And here…

And test set here…


Weather data (needed for Level 1 – 3 )

Data sample:

2016 11 1 12.78 6.67 0.00 2.22 -1.11 -3.33 87.00 58.50 30.00 24.14 12.07 0.00 1026.75 1024.89 1023.03 0.00 5.47
2016 11 2 15.56 6.67 -1.67 2.78 0.56 -2.78 100.00 68.00 36.00 11.27 5.63 0.00 1026.41 1020.83 1015.24 0.00 8.05
2016 11 3 13.33 8.33 3.33 7.22 3.33 -1.11 100.00 71.00 42.00 28.97 14.48 0.00 1023.37 1019.13 1014.90 5.08 7.56

Meteorological measurements (1 station): Temperature; Humidity; Wind speed; Pressure; Rainfall; Visibility


Latitude: 42.6537 (decimal degree)

Longitude: 23.3829 (decimal degree)

Elevation: 595 metres (Google 592 metres; Google DEM not entirely accurate)

year – Year of measurement

Month – Month of measurement

day – Day of measurement

TASMAX – Daily maximum temperature degrees C

TASAVG – Daily average temperature degrees C

TASMIN – Daily minimum temperature degrees C

DPMAX – Daily maximum dew point temperature degrees C

DPAVG – Daily average dew point temperature degrees C

DPMIN – Daily minimum dew point temperature degrees C

RHMAX – Daily maximum relative humidity %

RHAVG – Daily average relative humidity %

RHMIN – Daily minimum relative humidity %

sfcWindMAX – Daily maximum wind speed km/h

sfcWindAVG – Daily average wind speed km/h

sfcWindMIN – Daily minimum wind speed km/h

PSLMAX – Daily maximum surface pressure hPa

PSLAVG – Daily average surface pressure hPa

PSLMIN – Daily minimum surface pressure hPa

PRCPMAX – Daily maximum precipitation amount mm

PRCPAVG – Daily average precipitation amount mm

PRCPMIN – Daily minimum precipitation amount mm

VISIB – Daily average visibility km

All data is measured at a standard 2-meter height. Pressure data is adjusted to sea-level.


Data have undergone QA and should be without error. In case of suspected inconsistencies,

Topography data (needed for Level 1 – 3 )

Data sample:

Lat Lon Elev
42.62 23.22 1184
42.62 23.2335714286 1333
42.62 23.2471428571 1505

Based on NASA SRTM digital elevation model.
Approx 20m horizontal accuracy, 10m vertical accuracy (as per satellite mission declared parameters)
Includes Sofia urban area + some areas nominally external to the city (toward Vitosha mountain, note large elevation numbers)
No particular effort has been made to include entirety of Sofia Capital’s area as per administrative boundaries

Lat – latitude in decimal degrees. This corresponds to Y on a regular 2D grid.
Lon – longitude in decimal degrees. This corresponds to X on a regular 2D grid.
Elev – Elevation in meters

Weather stability profile data (needed for Level 1 – 3)

Data sample:

Date HGHT(m) TEMP(C)
2016-11-01 595 9.6
2016-11-01 663 7.6
2016-11-01 844 5.4

Data on the state of the upper atmosphere from radiosonde probes of Sofia (collected daily at 12Z) – from the University of Wyoming

Date – date of measurement
HGHT(m) – height of current measurement
TEMP(C) – temperature of current measurement in degrees Celsius

Industrial pollution data (needed for Level 1)

Data on emissions from industrial installations, collected from emissions permits. Provided by UCTM Sofia.

Data sample:

X* Y* m t/y
“42°44’16.66″”N” “23°14’28.82″”E” 8 0.38
“42°39’46.01″”N” “23°23’19.70″”E” 15 0.03
“42°39’46.47″”N” “23°23’19.27″”E” 15 0.2

X* – latitude in degrees, minutes, seconds (DMS)

Y* – longitude in degrees, minutes, seconds (DMS)

m – height of emission device above ground level in meters

t/y – annual PM10 emission rate (debit) in tonnes per year
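Before the industrial data can be placed on a map, the DMS strings need converting to decimal degrees and the annual rate to a daily one. The sketch below shows one way to do this. The regular expression is written to tolerate the stray quote characters visible in the data sample, and spreading the annual total evenly over 365 days is an assumption, as the case does not say how industrial emissions vary through the year.

```python
import re

# Degrees, minutes, seconds and hemisphere, ignoring any quote marks in between.
_DMS_PATTERN = re.compile(r"(\d+)\D+?(\d+)\D+?(\d+(?:\.\d+)?)\D*?([NSEW])")


def dms_to_decimal(text):
    """Convert a DMS string such as 42°44’16.66″N to decimal degrees."""
    match = _DMS_PATTERN.search(text)
    if match is None:
        raise ValueError(f"not a DMS coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # Southern and western hemispheres are negative by convention.
    return -value if hemisphere in "SW" else value


def tonnes_per_year_to_grams_per_day(tonnes_per_year, days_per_year=365):
    """Uniform daily emission in grams (ASSUMES no seasonal variation)."""
    return tonnes_per_year * 1_000_000 / days_per_year


# First row of the data sample, copied with its original quoting.
latitude = dms_to_decimal("“42°44’16.66″”N”")
longitude = dms_to_decimal("“23°14’28.82″”E”")
daily_g = tonnes_per_year_to_grams_per_day(0.38)
```

The first sample row lands at about 42.738 N, 23.241 E, which is inside the Sofia area covered by the topography grid.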

Heating data (needed for Level 2)

Data sample:

X Y NJ16_eq_1 NJ16_eq_2 NJ16_eq_3 NJ17_eq_1 NJ17_eq_3 NJ17_eq_4 NJ17_eq_6 NJ17_eq_7 NJ17_eq_4i NJ17_eq_8 NJ17_eq_9 NN_Jilisht NBROI_LICA
23.3839782 42.6931802 1 0 1 0 1 1 4
23.3839782 42.6931802 1 0 1 0 1 1 2
23.3821195 42.6939119 1 0 1 0 1 1 4

Data is supplied by Green Sofia. It corresponds to a heating map of households by declaration.

J16_eq_1 Number of dwellings in the building by availability of heating installation – 1. YES, CENTRAL SOURCE
J16_eq_2 Number of dwellings in the building by availability of heating installation – 2. YES, OWN SOURCE
J16_eq_3 Number of dwellings in the building by availability of heating installation – 3. NO
J17_eq_1 Number of dwellings in the building by heating source – 1. CENTRAL SOURCE
J17_eq_3 Number of dwellings in the building by heating source – 3. ELECTRICITY
J17_eq_4 Number of dwellings in the building by heating source – 4. DIESEL
J17_eq_6 Number of dwellings in the building by heating source – 6. COAL
J17_eq_7 Number of dwellings in the building by heating source – 7. WOOD
J17_eq_4i6i7 Number of dwellings in the building by heating source – 4,6,7. DIESEL, COAL, WOOD
J17_eq_8 Number of dwellings in the building by heating source – 8. OTHER
J17_eq_9 Number of dwellings in the building by heating source – 9. UNUSED HOUSE
N_Jilishta Number of dwellings in the building
BROI_LICA Number of persons in the building
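A minimal sketch of the simple Level 2 approach for one building follows. It rests on several assumptions that are not stated in the case: that one dwelling counts as one household, that diesel maps to liquid fuels, coal to solid fuels and wood to biomass, and that the combined J17_eq_4i6i7 column is ignored to avoid double counting. The emission factors and the 190-day heating season are the ones given in the methodology.

```python
# PM10 emission factors in grams per household per year, from the methodology.
EMISSION_FACTORS_G_PER_YEAR = {
    "solid": 1039,
    "liquid": 3682,
    "gaseous": 4017,
    "biomass": 1478,
}
HEATING_SEASON_DAYS = 190

# ASSUMED mapping from heating-source columns to fuel categories.
COLUMN_TO_FUEL = {
    "J17_eq_4": "liquid",   # diesel
    "J17_eq_6": "solid",    # coal
    "J17_eq_7": "biomass",  # wood
}


def building_daily_pm10_g(dwelling_counts):
    """Average daily PM10 (grams) for one building during the heating season.

    dwelling_counts maps column names to the number of dwellings; columns
    that are absent or not in COLUMN_TO_FUEL contribute nothing.
    """
    yearly = sum(
        dwelling_counts.get(column, 0) * EMISSION_FACTORS_G_PER_YEAR[fuel]
        for column, fuel in COLUMN_TO_FUEL.items()
    )
    return yearly / HEATING_SEASON_DAYS


# Hypothetical building: two coal-heated dwellings and one wood-heated dwelling.
example_daily_g = building_daily_pm10_g({"J17_eq_6": 2, "J17_eq_7": 1})
```

Swapping the flat division for the HDD-based allocation only changes the last step, so the mapping can be reused for the more detailed approach.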

Constructions data (needed for Level 3)

Data for all the construction that has happened in the period of interest, based on official cadastral documents

id start date type district locality address
100 07.1.2016 non-residential OVCHA KUPEL SEKULITSA SUHOL
101 11.1.2016 non-residential KRASNA POLYANA TRUDOVI KAZARMI street SUHOLSKA
102 11.1.2016 non-residential MLADOST MLADOST 2

id – record id
start date – starting date of the construction
type – type of the building*
district – district of the building
locality – locality
address – address

*Types are equivalent to:
non-residential = non-residential construction
small housing = single-family housing
big housing = apartment housing
infrastructure = road construction

In order to obtain geolocations from the addresses, localities, and districts, use the geocode API of Google at

Official air quality data (needed for Evaluation and Leaderboard)

Data sample:

Date STA-BG0052A STA-BG0050A STA-BG0073A STA-BG0040A
2016-11-01 692.88 823.44 624 876.24
2016-11-02 1632.96 1756.56 1516.56 2382.288
2016-11-03 953.28 978.48 1086 680.736

Official air quality measurements (4 stations in the city) – as per EU guidelines on air quality monitoring

Date – date of measurement
STA-BG0052A – average daily pollution in micrograms per cubic meter (µg/m3) of station 1
STA-BG0050A – average daily pollution in micrograms per cubic meter (µg/m3) of station 2
STA-BG0073A – average daily pollution in micrograms per cubic meter (µg/m3) of station 3
STA-BG0040A – average daily pollution in micrograms per cubic meter (µg/m3) of station 4

The industry experts for the Datathon

  • Case describing and leading – Telelink and denkstatt – Ivan Paspaldzhiev & Ekaterina Marinova
  • Sofia Municipality – Teodora Polimerova
  • prof Georgi Gadzhev – Institute of Geophysics at BAS
  • Vision for Sofia – Rashid Rashid
  • Green Sofia – Elitsa Panayopova

Expected Output and Paper

The result should be a clearly presented model and corresponding visualisations (as instructed in the case) that can be realistically implemented.

Article instructions

The main focal point for presenting each team’s results from the Datathon is the written article. It will be considered by the jury and will show how well the team has done the job.

Considering the limited time and resources in the world of Big Data Analysis, it is essential to follow a time-tested methodology proven on many projects: CRISP-DM. You could read more at

The organizing team has tried to do most of the work on phases “1. Business Understanding” and “2. Data Understanding”, while the teams are expected to focus on phases 3, 4 and 5 (“Data Preparation”, “Modeling” and “Evaluation”), so that the best solutions have the best results in phase “5. Evaluation”.

Phase “6. Deployment” mostly stays in the hand of the case-study providing companies as we aim at a continuation of the process after the event. So stay tuned and follow the updates on the website of the event.

1. Business Understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard can be used.

2. Data Understanding

The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

3. Data Preparation

The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

4. Modeling

In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.

5. Evaluation

At this stage in the project, you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

6. Deployment

Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process. In many cases, it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.
