Sofia Air Case 2.0 (Level 1)
Business Understanding:
Sofia, the capital city of Bulgaria, has drawn Europe-wide attention for its pollution levels, and its citizens have long been suffering from the polluted atmosphere hanging over the city. Sofia Air Case 2.0 aims to pinpoint which factor affects that pollution the most. We were given weather data as well as geographical data to help us better visualize how the system behaves. Our team only had time to work on Level 1, so here is our objective: for Level 1, in addition to all the weather and atmospheric data, we receive an additional dataset listing all the industrial sites located on Sofia's territory. Our goal is to figure out which factor most affects the PM10 pollution Sofia suffers from. In the next paragraphs we explain our methodology and why we picked and processed specific values the way we did.
Data Understanding:
The data was provided in the form of CSV files. In our case, we had four datasets to work with:
- Weather data, containing everything related to the weather conditions on a given day (day, month, year, temperature, humidity, pressure, wind speed, ...).
- Topography data (latitude, longitude, elevation), which came in handy for calculating the elevation of every industrial site as well as of the four stations we have.
- Weather stability profile data (date, height, temperature), used to calculate the vertical temperature gradient so we can determine the atmospheric stability class of our system.
- Industrial pollution data (latitude, longitude, chimney height, annual PM10 output in tonnes per year), which gives the location of every industrial site in Sofia along with the height of its chimney and the amount of PM10 it emits throughout the year.
Data Preparation:
We first spent a couple of hours visualizing and plotting each dataset in order to find the relations between them and decide where and how to start. We may have introduced some redundant information in that phase, but that is acceptable as long as we do not end up with missing values in the final tables. We loaded each dataset into a separate Pandas DataFrame and cleaned the missing values (encoded as -9999), so that no garbage data would be fed to the model later on. We then noticed that, in order to predict what the concentration of pollutants would be in a specific area on the next day, we first need to calculate the concentration of pollutants in the data we already have. In other words, our data consists of daily readings of weather and atmospheric values over the course of 20 days. So the first step was to calculate the PM10 concentration at each of our four stations, and then let the model find a link between that concentration and all the features in our datasets, so that it can predict the concentration on the next day and reveal the features that affect it the most.
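A minimal sketch of this loading and cleaning step, assuming hypothetical file names (the actual CSV names in the Datathon package may differ):

```python
import numpy as np
import pandas as pd

# Hypothetical file names; the real CSVs from the Datathon package may differ.
weather = pd.read_csv("weather.csv")
topography = pd.read_csv("topography.csv")
stability_profile = pd.read_csv("stability_profile.csv")
industries = pd.read_csv("industrial_pollution.csv")

# Missing values are encoded as -9999 in the provided files: turn them into
# NaN and drop the incomplete rows so no garbage data reaches the model.
for df in (weather, topography, stability_profile, industries):
    df.replace(-9999, np.nan, inplace=True)
    df.dropna(inplace=True)
```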
Modeling:
Working on the model was a bit tricky, as it required a hefty amount of research on the topic and deciding which assumptions were acceptable and which would introduce a large amount of error. The pre-modeling phase probably took most of the project's time, because we had to come up with a model from scratch that suited our data and gave the most accurate results. Our model computes a concentration function for every industrial site and evaluates it at each of the four stations located in Sofia. We then sum up all the per-industry concentrations affecting a given station, ending up with four values per day, where each value is the PM10 concentration at one of the stations. To make it clearer, our system looked something like this:
|              | STA-BG0040A         | STA-BG0050A         | STA-BG0052A         | STA-BG0073A         |
|--------------|---------------------|---------------------|---------------------|---------------------|
| Industry 1   | C(1, STA-BG0040A)   | C(1, STA-BG0050A)   | C(1, STA-BG0052A)   | C(1, STA-BG0073A)   |
| Industry 2   | C(2, STA-BG0040A)   | C(2, STA-BG0050A)   | C(2, STA-BG0052A)   | C(2, STA-BG0073A)   |
| ...          | ...                 | ...                 | ...                 | ...                 |
| Industry n-1 | C(n-1, STA-BG0040A) | C(n-1, STA-BG0050A) | C(n-1, STA-BG0052A) | C(n-1, STA-BG0073A) |
| Industry n   | C(n, STA-BG0040A)   | C(n, STA-BG0050A)   | C(n, STA-BG0052A)   | C(n, STA-BG0073A)   |
For every day, we then sum up all the rows under each column to get the following:
|               | STA-BG0040A              | STA-BG0050A              | STA-BG0052A              | STA-BG0073A              |
|---------------|--------------------------|--------------------------|--------------------------|--------------------------|
| 01-Oct-2016   | C(STA-BG0040A) on day 1  | C(STA-BG0050A) on day 1  | C(STA-BG0052A) on day 1  | C(STA-BG0073A) on day 1  |
| 02-Oct-2016   | C(STA-BG0040A) on day 2  | C(STA-BG0050A) on day 2  | C(STA-BG0052A) on day 2  | C(STA-BG0073A) on day 2  |
| ...           | ...                      | ...                      | ...                      | ...                      |
| 19-Oct-2016   | C(STA-BG0040A) on day 19 | C(STA-BG0050A) on day 19 | C(STA-BG0052A) on day 19 | C(STA-BG0073A) on day 19 |
| 20-Oct-2016   | C(STA-BG0040A) on day 20 | C(STA-BG0050A) on day 20 | C(STA-BG0052A) on day 20 | C(STA-BG0073A) on day 20 |
We repeat this process for all 20 days in our datasets; a short code sketch of this aggregation step follows below.
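A minimal sketch of the daily aggregation, assuming a hypothetical `pm10_concentration(industry_row, station, day)` helper standing in for the per-industry dispersion formula described above (the real formula lives in our notebook) and a small placeholder `industries` table:

```python
import pandas as pd

# Placeholder for the per-industry dispersion formula from the notebook;
# only here so the aggregation structure below is runnable.
def pm10_concentration(industry_row, station, day):
    return 0.0  # replace with the real concentration function

stations = ["STA-BG0040A", "STA-BG0050A", "STA-BG0052A", "STA-BG0073A"]
days = pd.date_range("2016-10-01", "2016-10-20", freq="D")

# Placeholder industrial dataset: one row per industrial site.
industries = pd.DataFrame({"chimney_height": [25.0, 40.0],
                           "annual_pm10": [3.2, 1.1]})

# For each day and station, sum the contribution of every industry,
# i.e. collapse the industry x station matrix column-wise.
daily_concentration = pd.DataFrame(
    {
        station: [
            sum(pm10_concentration(row, station, day)
                for _, row in industries.iterrows())
            for day in days
        ]
        for station in stations
    },
    index=days,
)
```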
Following this process, it was time to create a model and train it on our data. We used scikit-learn for that and implemented a regression model (LinearRegression) for the training.
Evaluation:
To evaluate the model, we split our data before training and tested it on the held-out part. In our Jupyter Notebook, we added a lot of data visualization so we could see what each step contributes to the study.
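A minimal sketch of this split-and-score step, with synthetic placeholder data standing in for the real feature matrix (daily weather and atmospheric values) and target (the computed PM10 concentration at one station):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 20 days of 5 features and a PM10 target.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)

# Hold out part of the days for testing, fit on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out days:", model.score(X_test, y_test))
```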
Deployment:
At the end of our notebook, we extracted all the features and plotted them on a horizontal bar chart to see which factors affect the PM10 concentration the most.
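A minimal sketch of that final chart, reusing the `model` fitted in the sketch above and hypothetical feature names (the notebook uses the real column names of the feature matrix):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical feature names; in the notebook these are the actual columns
# of the feature matrix (temperature, humidity, pressure, wind speed, ...).
feature_names = [f"feature_{i}" for i in range(model.coef_.shape[0])]

# Sort the regression coefficients and draw them as a horizontal bar chart:
# the larger the magnitude, the stronger the factor's influence on PM10.
coefficients = pd.Series(model.coef_, index=feature_names).sort_values()
coefficients.plot.barh(title="Factors driving the PM10 concentration")
plt.tight_layout()
plt.show()
```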
Improvements we could not implement due to lack of data and time:
- The Google geolocation API is paid. If we had access to it, we could have calculated locations, distances, and altitudes more accurately.
- Wind direction data is missing. If we had it, we could have clustered each station with the industries that affect it, using the concentration function to bound the elliptical propagation of the plume.
- The amount of data is small. With more than 20 days of data for training, we could have trained our model on a larger input and therefore achieved greater accuracy.
5 thoughts on “Datathon – Sofia Air 2.0 – Solution – Internet of Kings”
You state that you have trained a model but don't specify which one. Please upload your .ipynb file with the "media upload" button when editing the article.
Can you please check if it’s working?
Can you elaborate on why you chose LinearRegression for solving this case? Did you consider using other methods and algorithms? I like the "grid" approach of estimating pollution over the map. It shows congestion zones visually, but more work on custom shapes and spread would tell the story even better.
Well, first because it's easy and fast, and it's also applicable since we were trying to find a linear relationship between the weather and the PM10 concentration at each station. We wanted to build a dense neural network, but we couldn't do that with so little data. So I guess we were limited by the amount of data (20 days).
Also, dT/dz (the derivative of temperature with respect to height) ranged from -1.71 to 1.66 in the data that was provided (20 days), so in theory any trained model would not be able to recognize the case where the air stability is at Class A (Pasquill stability class), which corresponds to dT/dz < -1.9.
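For context, a minimal sketch of how dT/dz can be estimated from the stability-profile file, assuming Date, Height and Temperature columns and a hypothetical file name (only the Class A threshold quoted above is checked; the other class boundaries are not shown):

```python
import numpy as np
import pandas as pd

# Hypothetical file name; the profile holds one temperature reading per
# height level per day.
profile = pd.read_csv("stability_profile.csv")

def lapse_rate(day_profile: pd.DataFrame) -> float:
    """dT/dz in degrees C per 100 m, from a linear fit of one day's profile."""
    slope_per_metre = np.polyfit(day_profile["Height"],
                                 day_profile["Temperature"], 1)[0]
    return slope_per_metre * 100.0

dt_dz = profile.groupby("Date").apply(lapse_rate)

# Pasquill Class A corresponds to dT/dz < -1.9; it never occurs in the
# 20 days provided, where dT/dz stayed between -1.71 and 1.66.
is_class_a = dt_dz < -1.9
```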