About the team and the choice of project
Our team (RASS, for Roumieh Advanced Software Society) consists of five engineering students at the Lebanese University, Faculty of Engineering II. We are all first-year students and members of RASS. We love programming and data analysis, and we believe the field has a great future.
We decided to work on the Sofia Air 2.0 case because air pollution is a very important problem to solve, and we wanted to put our efforts into something that could help address such a large environmental issue.
Understanding the data
Once we received the data, we examined it to determine which features would be useful to our model. Understanding the atmospheric and topographical data was important for understanding how the pollution plumes disperse. We did some research in fluid dynamics and climatology to help us select the features we wanted. One paper we stumbled upon that proved very helpful discusses the relationship between various climatic factors and the concentration of PM10 in the air: https://www.academia.edu/33390324/Dew_Point_indirect_Particulate_Matter_Pollution_Indicator_in_the_Ciuc_Basin_Harghita_Romania
Data Cleaning
The data wasn’t exactly convenient to work with, so we cleaned it by removing all the missing pieces (including some entire fields). We also determined that interpolating the needed values from the topographical data was far too inaccurate for our purposes (especially in the mountainous areas around Sofia, where the variation in ground level is very large), so we decided to use the Google Maps API to obtain more accurate elevation measurements. To make sure that Google Maps was indeed an accurate source, we compared the topographical data we had with what Google Maps returned, and verified that it was accurate enough. (We later removed the Google Maps API key for security reasons, but the retrieved data was preserved.)
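The cleaning step described above can be sketched roughly as follows. This is a minimal illustration with pandas, not our exact pipeline: the column names, the missing-value threshold, and the helper names are made up for the example, and the Elevation API key is a placeholder (just as it was removed from our published notebook).

```python
import numpy as np
import pandas as pd

def clean_measurements(df, max_missing_frac=0.5):
    """Drop columns that are mostly missing (whole broken fields),
    then drop any rows that still have gaps. Threshold is illustrative."""
    keep = df.columns[df.isna().mean() <= max_missing_frac]
    return df[keep].dropna().reset_index(drop=True)

def elevation_url(lat, lng, api_key):
    # Google Maps Elevation API endpoint; api_key is a placeholder here.
    return ("https://maps.googleapis.com/maps/api/elevation/json"
            f"?locations={lat},{lng}&key={api_key}")

# Toy data: one sensor column is entirely missing, two rows have gaps.
raw = pd.DataFrame({
    "P1": [25.0, np.nan, 30.0, 28.0],        # PM10 reading
    "temperature": [12.0, 13.5, np.nan, 11.0],
    "broken_sensor": [np.nan] * 4,           # a whole field with no data
})
clean = clean_measurements(raw)
```

After cleaning, the fully-empty column is gone and only complete rows remain; station coordinates can then be fed to `elevation_url` to query elevations.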
Choosing the model
Given that the value to predict is continuous (and its dependence on the parameters was not complex enough to require deep learning and neural networks), we opted for a regression algorithm. Our candidates included a simple linear regression model and a random forest regressor, among others. After experimenting with the algorithms, we found that random forests gave better results than linear regression, so that was the path we followed.
Evaluation
As with all data science problems, it was necessary to evaluate our model against real test data that it had not yet seen. The model learned well from the training data and was still able to generalize to the test data. We achieved accuracies ranging between roughly 70% and 80% (depending on the choice of station) that were consistent across the test data as well as the training data. Our predictions are given in the table below.
Day   | STA-BG0052A | STA-BG0050A | STA-BG0073A | STA-BG0040A
Day 1 | 1261.272    | 1597.152    | 1607.808    | 1201.7064
Day 2 | 1550.592    | 1854.168    | 2120.52     | 1940.9784
Day 3 | 1699.08     | 1948.272    | 2019.096    | 1266.612
Day 4 | 1801.08     | 2261.064    | 2083.032    | 2588.148
Day 5 | 1612.44     | 2681.616    | 1890.240    | 2501.7912
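The consistency check described under Evaluation (comparing scores on data the model trained on versus held-out data) can be sketched like this. Again the dataset is synthetic and the hyperparameters are illustrative, not our actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# A learnable synthetic target: linear term plus a mild interaction.
X = rng.uniform(0, 1, size=(1000, 4))
y = 3 * X[:, 0] + X[:, 1] * X[:, 2] + 0.05 * rng.normal(size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
model = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_tr, y_tr)

# A model that generalizes shows a small gap between these two scores.
train_r2 = model.score(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print(f"train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

A large gap between the two scores would indicate overfitting; scores that are close, as we observed on the real station data, indicate the model generalizes.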
And here is the code on Google Colab: https://colab.research.google.com/drive/14lNTy6gaCaTDSwA-UglTZnpjbAK_Dy5p