Datathons Solutions

Datathon – Sofia Air 2.0 – Solution – Predikt (Sofia Air 2.0) Github: scopyro

GitHub Accounts: KarimEid1, Marcel344, scopyro, @boudy87

Air pollution is a pressing topic today. The municipality is investing considerable effort and resources in measuring the concentrations of gases and particulate matter in the air in order to assess its quality.

This work is the next step towards a complete, holistic and data-driven account of a social topic: unveiling what lies behind the data on Sofia's air quality.

This research identifies the main sources of pollution in Sofia and tries to predict the rate at which this pollution is growing, in order to raise awareness of the danger and to show its growth in numbers.


========> Code :Archive <=========

Business Understanding

The purpose of this case is to understand the different factors causing pollution, in Sofia in particular. The scope is to measure how pollution travels from each source to the different measuring stations, and finally to attempt to predict future pollution levels. This would help convey the real danger of the pollution and show how it is likely to grow. The main goal is to support action against this trend by identifying its main causes.

Data Understanding

The data were collected from different weather stations, together with information on the sources of pollution.

The Sofia topography data was helpful for identifying clusters of polluting sources and their elevations.


Data Preparation

We used the data specified for Level 1 and joined the datasets to obtain the following:

  • Average wind speed per day
  • Different pollution sources
  • Elevation of each source
  • Yearly average pollution per source (the daily average was derived from it)
  • Humidity
  • Distance from source to measuring stations
  • Stability of the weather


The Gaussian plume model was implemented. Its output, along with the day and the pollution source, served as input to our predictive model.
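To make the plume step concrete, here is a minimal sketch of the textbook steady-state Gaussian plume formula with ground reflection. It is not the team's implementation (that is in the archive). The function names and the power-law dispersion coefficients (ay, by, az, bz) are placeholder assumptions; a real run would pick them according to the weather stability class.

```python
import math


def dispersion_sigma(x_m, a, b):
    """Power-law dispersion coefficient sigma = a * x**b, in metres."""
    return a * x_m ** b


def plume_concentration(q_g_s, u_m_s, x_m, y_m, z_m, h_m,
                        ay=0.08, by=0.9, az=0.06, bz=0.85):
    """Concentration (g/m^3) at a receptor downwind of a point source.

    q_g_s : emission rate of the source (g/s)
    u_m_s : wind speed along the x axis (m/s)
    x_m   : downwind distance from source to receptor (m)
    y_m   : crosswind offset of the receptor (m)
    z_m   : receptor height above ground (m)
    h_m   : effective release height of the source (m)
    ay, by, az, bz : hypothetical dispersion coefficients (assumption)
    """
    if u_m_s <= 0:
        raise ValueError("wind speed must be positive")
    if x_m <= 0:
        return 0.0  # the model puts no pollution upwind of the source

    sigma_y = dispersion_sigma(x_m, ay, by)
    sigma_z = dispersion_sigma(x_m, az, bz)

    # Crosswind spread of the plume.
    lateral = math.exp(-y_m ** 2 / (2 * sigma_y ** 2))
    # Vertical spread, plus a mirror source to model ground reflection.
    vertical = (math.exp(-(z_m - h_m) ** 2 / (2 * sigma_z ** 2))
                + math.exp(-(z_m + h_m) ** 2 / (2 * sigma_z ** 2)))

    return q_g_s / (2 * math.pi * u_m_s * sigma_y * sigma_z) * lateral * vertical


# Example: a station 1 km downwind of a 30 m high source emitting 100 g/s.
on_axis = plume_concentration(100.0, 3.0, 1000.0, 0.0, 2.0, 30.0)
off_axis = plume_concentration(100.0, 3.0, 1000.0, 200.0, 2.0, 30.0)
```

One such value per source, station and day could then be fed to the predictive model as a feature, which matches how the plume output is described above.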

The predictive model:

The input data is compressed into however many neurons are desired, and the network is then forced to rebuild the initial data; this is what the autoencoder does. This forces the model to extract key elements of the data, which we can interpret as features. One key thing to note is that this model falls under unsupervised learning: there are no separate input-output pairs, because the input and the target output are the same.

We tried different models, but the LSTM model was used to make the prediction because it gave us the best results. It gets its predictive ability from its cell state, which allows it to learn longer-term trends in the data. This suited our case well, because we needed to predict the weather for the next day from the data of the previous 20 days.
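The 20-day lookback implies reshaping the daily series into overlapping windows, each paired with the value of the following day. The sketch below shows that step under simplifying assumptions: a single univariate series, a hypothetical helper name (make_windows), and placeholder data instead of the datathon files.

```python
def make_windows(series, lookback=20):
    """Split a daily series into (previous `lookback` days, next day) pairs."""
    if lookback < 1:
        raise ValueError("lookback must be at least 1")
    inputs, targets = [], []
    for end in range(lookback, len(series)):
        inputs.append(series[end - lookback:end])  # the 20 days before
        targets.append(series[end])                # the day to predict
    return inputs, targets


# Placeholder data: 25 consecutive daily values.
daily_values = [float(day) for day in range(25)]
windows, next_day = make_windows(daily_values, lookback=20)
```

A recurrent layer such as the Keras LSTM expects input of shape (samples, timesteps, features), so each window would additionally be given a trailing feature dimension before training.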

The ADAgrad optimizer essentially uses a different learning rate for every parameter and every time step. The reasoning behind ADAgrad is that parameters that are updated infrequently should have larger learning rates, while parameters that are updated frequently should have smaller ones. In other words, the stochastic gradient descent update for ADAgrad becomes a per-parameter update in which the global learning rate is divided by the square root of the sum of that parameter's past squared gradients.
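For reference, the ADAgrad update is usually written as follows. The symbols are the standard ones from the optimization literature, not taken from the team's code: \theta_{t,i} is parameter i at step t, g_{t,i} its gradient, \eta the global learning rate, G_{t,ii} the running sum of squared gradients for parameter i, and \epsilon a small constant that avoids division by zero.

```latex
g_{t,i} = \nabla_{\theta} J(\theta_{t,i}), \qquad
G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^{2}, \qquad
\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}
```

Because G_{t,ii} only ever grows, the effective learning rate of each parameter shrinks over time.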

  1. The learning rate is different for every parameter and every iteration.
  2. The learning rate does not diminish as it does with ADAgrad.
  3. The gradient update uses the moments of the distribution of weights, allowing for a more statistically sound descent.
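The three properties listed above match those commonly attributed to the Adam optimizer, although the post does not name it. Assuming that is what is meant, the sketch below compares one ADAgrad step with one Adam step in plain Python. Adam keeps running estimates of the first and second moments of the gradients. The function names and hyperparameter values are illustrative, not taken from the team's code.

```python
import math


def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One ADAgrad update; `accum` holds the sum of squared gradients."""
    accum = [a + g * g for a, g in zip(accum, grad)]
    theta = [t - lr * g / math.sqrt(a + eps)
             for t, g, a in zip(theta, grad, accum)]
    return theta, accum


def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (1-based), with bias-corrected moments."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


def step_sizes(n_steps=100):
    """Record the size of each update under a constant gradient of 1.0."""
    theta_a, accum = [0.0], [0.0]
    theta_m, m, v = [0.0], [0.0], [0.0]
    ada, adam = [], []
    for t in range(1, n_steps + 1):
        grad = [1.0]
        new_a, accum = adagrad_step(theta_a, grad, accum)
        new_m, m, v = adam_step(theta_m, grad, m, v, t)
        ada.append(abs(new_a[0] - theta_a[0]))
        adam.append(abs(new_m[0] - theta_m[0]))
        theta_a, theta_m = new_a, new_m
    return ada, adam


ada_sizes, adam_sizes = step_sizes(100)
```

With a constant gradient, the ADAgrad step size keeps shrinking as squared gradients accumulate, while the Adam step size stays close to the base learning rate. This is the behaviour described in point 2.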

All of the analysis above can be implemented with relative ease thanks to Keras and its functional API.


The evaluation was done using the provided data set, along with a randomly generated test set created by the team.

In our evaluation, the model came close to 86% accuracy.



3 thoughts on “Datathon – Sofia Air 2.0 – Solution – Predikt (Sofia Air 2.0) Github: scopyro”

  1. 0

    Please, upload your .ipynb file here. Also write in the abstract of the article your github usernames. How can you evaluate with “randomly generated test set”? Give more details about this.
    Be more specific on the modeling part: why did you choose LSTM, what was the NN architecture and why … etc.

    1. 0

      Hello, I uploaded the code in a zip file (Archive). However, the code is in Python with the .py extension, since we used our own IDE instead of Colab. The archive contains the data you provided, both train and test, and the results of our program.

  2. 0

    Please use some visual aids to communicate the state and process of building and evaluating models. Also, I agree with Pepe's comments about describing the process and why you chose some algorithms over others (which other algorithms did you try before selecting LSTM?). As a general comment, clearer visual communication can only help with your future work.
