
Datathon – Sofia Air 2.0 – Solution – (Virtual) forests to reduce PM10 pollution in Sofia

Sofia is a city with significant concentrations of particulate matter less than 10 micrometres in diameter (PM10). A high concentration of PM10 is harmful to life and the climate. The purpose of this project is to predict the concentration of PM10 on a particular day given the climatic conditions, which is important for making policies to reduce pollution in the city. Our contribution is a random forest regressor that achieves this purpose with 70 to 80% accuracy.


About the team and the choice of project

Our team (RASS, for Roumieh Advanced Software Society) consists of five engineering students at the Lebanese University, Faculty of Engineering II. We are all first-year students and members of RASS. We love programming and analysing data, and we think the field has a great future.

We decided to work on the Sofia Air 2.0 case because air pollution is a very important problem to solve, and we wanted to put our effort into something that could help address such a large environmental issue.

Understanding the data

Once we received the data, we examined it to determine what kind of features would be useful to our model. It was important to understand the atmospheric and topographical data in order to understand the dispersion model of the plumes. We did some research in fluid dynamics and climatology to help us select the features we wanted. One of the papers we stumbled upon, and which was very helpful to us, discussed the relationship between the different climatic factors and the concentration of PM10 in the air.

Data Cleaning

The data wasn’t exactly convenient to work with, so we cleaned it by removing all the missing pieces (including some whole fields). We also determined that interpolating the needed values from the topographical data was far too inaccurate for our purposes (especially in the mountainous areas of Sofia, where the variation in ground level is very large), so we decided to use the Google Maps API to get more accurate elevation measurements. To make sure that Google Maps was indeed an accurate source, we compared the topographical data we had with what Google Maps returned, and verified that it was accurate enough. (We later removed the Google Maps API key for safety reasons, but the retrieved data was preserved.)
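A minimal sketch of the two cleaning steps, assuming a pandas DataFrame and the public Elevation API endpoint; `fetch_elevations`, `clean`, and the 50% missing-value threshold are illustrative assumptions, not necessarily the exact choices made:

```python
import json
import urllib.parse
import urllib.request

import pandas as pd

ELEVATION_URL = "https://maps.googleapis.com/maps/api/elevation/json"

def fetch_elevations(points, api_key):
    """Query the Google Maps Elevation API for (lat, lon) points.

    Endpoint and response shape taken from the public API; the key is
    supplied by the caller and should not be committed to the repo.
    """
    locations = "|".join(f"{lat},{lon}" for lat, lon in points)
    query = urllib.parse.urlencode({"locations": locations, "key": api_key})
    with urllib.request.urlopen(f"{ELEVATION_URL}?{query}") as resp:
        payload = json.load(resp)
    return [r["elevation"] for r in payload["results"]]

def clean(df):
    """Drop columns that are mostly missing ("whole fields"), then any
    rows that still contain gaps."""
    df = df.dropna(axis=1, thresh=int(0.5 * len(df)))  # assumed threshold
    return df.dropna(axis=0)
```

The retrieved elevations can then be merged back into the cleaned frame as an extra feature column.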

Choosing the model

Given that the value to predict is continuous (and its dependence on the parameters did not seem complex enough to require deep learning and neural networks), we opted for a regression algorithm. Our candidates included a simple linear regression model and a random forest regressor, among others. After experimenting with these algorithms, we found that random forests gave better results than linear regression, so that was the path we followed.
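The comparison can be sketched with scikit-learn on synthetic data (the real features are the cleaned weather and elevation measurements, which are not reproduced here); the nonlinear target is our assumption, chosen to illustrate why trees can outperform a linear fit:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned features (temperature, humidity,
# pressure, wind, elevation, ...).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# A nonlinear target: linear regression cannot capture the squared terms,
# while the axis-aligned splits of a tree ensemble can.
y = X[:, 0] ** 2 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=200, random_state=0)):
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{type(model).__name__}: mean CV R^2 = {r2:.2f}")
```

On data like this the random forest's cross-validated R² is clearly higher, mirroring what we observed on the station data.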


As is common with all data science problems, it was necessary to evaluate our model against real test data that it had not yet seen. The model learned well from the training data and was still able to generalize to the test data. We achieved accuracies ranging between roughly 70% and 80% (depending on the choice of station) that were consistent across the test data as well as the training data. Our predictions are given in the table below.

Day   | STA-BG0052A | STA-BG0050A | STA-BG0073A | STA-BG0040A
Day 1 | 1261.272    | 1597.152    | 1607.808    | 1201.7064
Day 2 | 1550.592    | 1854.168    | 2120.52     | 1940.9784
Day 3 | 1699.08     | 1948.272    | 2019.096    | 1266.612
Day 4 | 1801.08     | 2261.064    | 2083.032    | 2588.148
Day 5 | 1612.44     | 2681.616    | 1890.240    | 2501.7912
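The train/test consistency check described above can be sketched as follows; `X` and `y` are again synthetic stand-ins for the real station features and PM10 readings:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))                      # placeholder features
y = X[:, 0] ** 2 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)

# Hold out a quarter of the data that the model never sees during fitting.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)

# Similar R^2 on both splits indicates the model generalizes rather than
# memorizing the training set.
print("train R^2:", round(model.score(X_tr, y_tr), 3))
print("test  R^2:", round(model.score(X_te, y_te), 3))
```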


And here is the code on Google Colab:
