
Monthly Challenge – Sofia Air – Team Yagoda – Solution


Business understanding

The vibrant and dynamic lifestyle of a big city forces its citizens to deal with particular drawbacks. In the case of the capital of Bulgaria, Sofia, unsatisfactory air quality has become one of the most discussed topics not only among local environmental agencies but also among European regulatory institutions: the European Commission has recently taken Bulgaria to court over its failure to take measures to reduce air pollution. Indeed, according to a survey by a citizens’ initiative, air pollution norms in Sofia were exceeded 70 times during the heating period from October 2017 to March 2018.

In fact, the capital is mainly exposed to air-polluting sources such as improper heating (with coal and wood) and emissions from older vehicles. Another contributor is believed to be industrial emissions, which can be expected in any developing country. In Sofia’s case, the city’s geography makes things even worse.

Data Understanding

Taking into consideration the described issue, the core objective of the further analysis is formulated as follows:

To build a sustainable predictive model for the expected level of air pollution in Sofia in the next 24 hours

But first, we should get familiar with the data provided for the case. It can be grouped in the following way:

  • The official data collected by the European Environment Agency, which captures the dynamics of measurements from 5 stations in Sofia. It is located in the folder EEA data and consists of 28 data sets and a metadata file.
  • Data supplied by citizens, consisting of two data sets – data_bg_2017.csv and data_bg_2018.csv. This data is volunteer-collected and is not recognized by the European Commission, so we assume there is a chance of encountering some unreliable measurements during the modelling process.
  • Meteorological measurements for temperature, humidity, pressure, etc. It is believed that weather conditions have a potential impact on air quality.
  • Topography data for Sofia, which is expected to be useful in determining the area of interest in this survey.

Week 1: Preliminary Analysis I

At this stage of the process we need to import the necessary data sets into the modelling software. We decided to use R for this purpose.

1. Importing data

The first step of our analysis consists of importing three data sets – air17, air18 (citizen air quality measurements) and sofiatopo (topography data for Sofia).

2. Inspecting elements of the data sets and checking for inconsistencies

The analysis begins with a check of the variables’ classes, with the outcome for all 3 data sets aggregated into a list. We noticed that the time variable is of class character, which should be converted to “POSIXct” with the help of the lubridate library.






library(lubridate)

# the time class is character - it should be converted
air17$time=as_datetime(air17$time, tz="UTC")
air18$time=as_datetime(air18$time, tz="UTC")

3. Identifying unique stations and merging the two data sets on citizen science air quality measurements

With the help of the dplyr library we identified 383 unique stations (grouped by geohash) in air17 and 1254 in air18 that are present in only one of the two data sets. Concluding that there is no point in evaluating such stations, we omitted them, together with their observations, from the data sets. At the end of this step a merged, filtered data set for 2017 and 2018 was created, which allowed us to remove some auxiliary variables and keep our code cleaner.


# Geohashes present in only one of the two years (hypothetical
# reconstruction - aux was defined earlier in the original script)
aux=c(setdiff(uniquestations$geo17$geohash, uniquestations$geo18$geohash),
      setdiff(uniquestations$geo18$geohash, uniquestations$geo17$geohash))

# Getting differences in the geohashes for both years
sum(uniquestations$geo17$obs[uniquestations$geo17$geohash %in% aux])

# Cleaning differences (only stations present both in 2017 and 2018 left)
air17=air17[!(air17$geohash %in% aux),]
air18=air18[!(air18$geohash %in% aux),]

# Merging data sets and cleaning up auxiliary variables
merged=rbind(air17, air18)
rm(aux)

4. Summarizing availability of merged data sets

At this step our purpose is to check whether there are any NAs left in the citizens’ data for 2017 and 2018. After removing such inconsistencies we built a table aggregating information about the stations’ operation, where every station is denoted by its “geohash”, the columns “tmin” and “tmax” correspond to the beginning and the end of the observation period, respectively, and the column “days” holds the number of days recorded by each station. The data frame looks as follows:
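The availability table described here can be sketched with dplyr along the following lines. The toy data frame and the name availability are ours; only the column names follow the text above.

```r
library(dplyr)

# Toy stand-in for the merged citizen data set (three readings, two stations)
merged = data.frame(
  geohash = c("sx8dfr", "sx8dfr", "sx2rku"),
  time = as.POSIXct(c("2017-12-01 10:00", "2017-12-05 10:00",
                      "2018-01-10 08:00"), tz = "UTC"))

# One row per station: first/last observation and the period covered in days
availability = merged %>%
  group_by(geohash) %>%
  summarise(tmin = min(time), tmax = max(time), obs = n(),
            days = as.numeric(difftime(max(time), min(time), units = "days")))
```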

5. Summarizing stations and visualizing them on a map

Then, in order to obtain some points of contact with reality, we decided to illustrate the distribution of the stations across the country. With the help of the geohash library our data was prepared for plotting by decoding the geohashes into coordinates, producing a map of Bulgaria scattered with points – the actual station geohashes provided by citizens. The area with a high concentration of brown pigment indicates stations operating within the capital’s boundaries.
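A minimal sketch of the decoding step, assuming the geohashTools package and its gh_decode() function (the toy geohashes below are ours, not the real stations):

```r
library(geohashTools)

# Toy geohashes standing in for citizen stations (illustrative only)
stations = data.frame(geohash = c("sx8dfr", "sx8dfx", "sx2rku"))

# Decode geohashes into latitude/longitude
coords = gh_decode(stations$geohash)
stations$lat = coords$latitude
stations$lon = coords$longitude

# A plain scatter plot stands in for the map of Bulgaria
plot(stations$lon, stations$lat, pch = 16, col = "brown",
     xlab = "Longitude", ylab = "Latitude", main = "Citizen station locations")
```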

6. Filtering stations by space and time

At this stage of our survey we should determine a filtering rule in order to decide which stations to exclude from the further analysis due to their unreliability. Plotting a histogram and a box plot helped us get a feeling for the stations’ performance.

It is clearly visible that there are plenty of stations with fewer than 50 days recorded, while only a small number have been operating for the whole year, or at least for more than 250 days. Still, we believe it is sensible to look at some basic statistics before making a final decision.

With the help of the fBasics library it was discovered that the number of recorded days varies from 0 to 343, with a mean value of 138 days. We would rather focus, however, on the quantile measurements presented below:
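This inspection can be reproduced in miniature with fBasics; the days vector below is illustrative, not the real station data.

```r
library(fBasics)

# Illustrative numbers of recorded days per station
days = c(5, 16, 50, 138, 250, 343)

# Distribution overview: histogram, box plot and summary statistics
hist(days, main = "Days recorded per station", xlab = "Days")
boxplot(days, horizontal = TRUE)
basicStats(days)
quantile(days, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))
```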

As we can see in the 5% column, the numbers of recorded observations and days are 238 and 16, respectively. This means that our filter thresholds should be chosen at or below these measured values in order to keep the data statistically significant. Therefore, our filtering rule excludes from the data set every station with ≤10 available days and ≤168 recorded observations.


stationsf10d=stationsf %>%
  dplyr::filter(day<=10 & obs<=168)
stations=stationsf[!(stationsf$geohash %in% stationsf10d$geohash),]
stfinal=merged[merged$geohash %in% stations$geohash,]

Week 2: Preliminary Analysis II

1. Creating the final list of geo-units for the predictive analysis

In the first stage of the new week we assume that air quality could vary depending on elevation. For the purposes of the further analysis the topography data set with 196 points in the city of Sofia is used. Guided by certain considerations, we adopted a mentor’s approach to the topography data: first, we created an auxiliary table holding the distance between each station (in rows) and each of Sofia’s topo points (in columns). Then we added a column indicating the nearest topography point for every station in Sofia.
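The nearest-point assignment can be sketched as follows. The toy coordinates and the plain Euclidean distance are our assumptions (a great-circle distance could be used instead):

```r
# Toy coordinates for two stations and three topography points
stations = data.frame(lat = c(42.65, 42.70), lon = c(23.35, 23.30))
sofiatopo = data.frame(pnt = 1:3,
                       lat = c(42.60, 42.68, 42.72),
                       lon = c(23.32, 23.36, 23.28))

# Auxiliary table: distances with stations in rows and topo points in columns
d = outer(1:nrow(stations), 1:nrow(sofiatopo),
          function(i, j) sqrt((stations$lat[i] - sofiatopo$lat[j])^2 +
                              (stations$lon[i] - sofiatopo$lon[j])^2))

# "pnt" marks the nearest topography point for every station
stations$pnt = sofiatopo$pnt[apply(d, 1, which.min)]
```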

After the two data sets had been merged by the feature “pnt” and the columns ordered in a logical sequence, we arrived at the table below:

A graphic representation of the results is always useful to ensure we are on the right track. Therefore, here is what the stations’ distribution (blue inverted triangles) within the boundaries of Sofia (grey points) looks like:

2. Merging cluster information and summarizing the data sets

At this stage we summarized the parameters for each of the 196 points defined in the previous steps and gave proper names to the columns:


df=merge(stations[,c(1:7)], stfinal, by="geohash")

# actually clustering and summarizing parameters
dfgeo=df %>%
  group_by(pnt) %>%
  summarise(elev=unique(Elev), st.num=length(unique(geohash)), obs=n(),
            t.min=min(time), t.max=max(time),
            p1.min=min(P1), p1.max=max(P1), p2.min=min(P2), p2.max=max(P2),
            temp.min=min(temperature), temp.max=max(temperature),
            press.min=min(pressure), press.max=max(pressure),
            hum.min=min(humidity), hum.max=max(humidity))


3. Cleaning data of mismeasurements and inconsistencies, comparing them with the official data

We assume that fixing inconsistencies is crucial for obtaining relevant results in any analysis. That is why we first made sure that there were no NAs left in the data set. We then moved our focus to the 5 official European stations, after removing the files for earlier years from the data folder:


# Importing official data from the EEA for 2017 and 2018
setwd("C:\\Users\\Daria\\Desktop\\business analytics\\tuesday\\data for the air case\\euromeasurements")
eeu=list.files(path="C:\\Users\\Daria\\Desktop\\business analytics\\tuesday\\data for the air case\\euromeasurements")
deu=lapply(eeu,read.csv,na.string=c("","NA"," "), stringsAsFactors = F, fileEncoding="UTF-16LE")

# Giving names to lists from the folder
for(i in 1:length(eeu)){
  names(deu)[i]=eeu[i]
}

Next it came to dealing with the meteorology data. After the necessary data set had been imported and filtered, we created a special vector (cc) in which the minimum and maximum values of the weather features and of the P10 concentration were recorded.


# Importing the official meteorology data
setwd("C:\\Users\\Daria\\Desktop\\business analytics\\tuesday\\data for the air case\\METEO-data")
#setwd("D:\\Preslava\\uni\\Business Analytics\\Tuesday class\\Monthly Challenge Oct\\METEO-data")
weather=read.csv("lbsf_20120101-20180917_IP.csv",na.string=c("","NA"," ","-9999"), stringsAsFactors = F)
weather = weather[weather$year>=2017,]

# Creating a separate array for evaluation
cc=c()

# Minimum and maximum of the features
cc["P1min"]=min(sapply(deu,function(x,y) min(y[,x]), x="Concentration")) #0.09
cc["P1max"]=max(sapply(deu,function(x,y) max(y[,x]), x="Concentration")) #689.65

With the help of the editrules library we formulated a filtering rule for outliers, using the extremum values from the cc vector as benchmarks, in order to localize potential outliers. There turned out to be 70259 mismeasurements in the column “P1” of the auxiliary data set.

We found that 46 stations had to be removed from the data set due to overall mismeasurements (by the “perc” value). Then came a repeated error localization, this time by the features of our interest, i.e. “P1”, “temperature”, “humidity” and “pressure”. The quantile measurements for pressure performed particularly poorly.


# Filtering current data by newly received extremum
edited=editset(c("P1 >= 0.09", "P1 <= 689.65","temperature >= -20.5572", "temperature <= 38.892","humidity >= 5", "humidity <= 100","pressure >= 99018", "pressure <= 105113.5"))

#Localizing errors within the data (localizeErrors from the editrules library)
aux=localizeErrors(edited, df[,c("P1","temperature","humidity","pressure")])

# Checking number of mismeasurements in P1
sum(aux$adapt[,"P1"]) #[1] 70259

# Share of mismeasurements per station (mm assumed to flag error rows)
df$mm=rowSums(aux$adapt)>0
dmm=df %>%
  group_by(geohash) %>%
  summarise(freq=n(),mm=sum(mm),perc=mm/freq)

#Removing 46 stations with overall mismeasurements (by perc value)
auxlist=dmm$geohash[dmm$perc==1]
df=df[!(df$geohash %in% auxlist),]

#Localizing errors again
aux=localizeErrors(edited, df[,c("P1","temperature","humidity","pressure")])
sum(aux$adapt[,"P1"]) #20962
quantile(df$pressure, probs=c(0.05,0.25,0.50,0.75,0.95))
prout=boxplot.stats(df$pressure, coef=2,do.conf = F, do.out=F)

The final point of this step was excluding all the outliers lying outside the boundaries set by the cc vector. We believe that after this action we obtained a data set with reliable measurements.


#Removing all outliers outside the boundaries set by the cc vector
# (hypothetical reconstruction: drop every row flagged by the error localization)
df=df[rowSums(aux$adapt)==0,]


4. Aggregating data by geo-units

At this step our purpose was to set up an hourly time series vector for every feature (“clust.ID”, “Elev”, “time”, “P1”, “temperature”, “humidity”, “pressure”) of every geohash, by creating two separate aggregated lists and merging them. In short, we think we have done everything needed to move on to inspecting the statistical characteristics.


#Creating a time series vector on an hourly basis for every geo-unit
timeser=list()
for (i in 1:length(k)){
  timeser[[i]]=data.frame(time=seq(from=min(geounits[[i]]$time),
                                   to=max(geounits[[i]]$time), by="hour"))
}

#Aggregating the features by hour (hypothetical reconstruction of the lost body)
agg=list()
for (i in 1:length(k)){
  agg[[i]]=geounits[[i]] %>%
    mutate(time=floor_date(time, "hour")) %>%
    group_by(time) %>%
    summarise(P1=mean(P1), temperature=mean(temperature),
              humidity=mean(humidity), pressure=mean(pressure))
}

#Merging the geo-units with the aggregated list
for (i in 1:length(k)){
  timeser[[i]]=merge(timeser[[i]], agg[[i]], by="time", all.x=TRUE)
}

5. Inspecting and summarizing important characteristics

The final stage of this week started with some basic analysis of statistical coefficients. The results for the skewness and kurtosis parameters are presented as follows:

The lack of symmetry is clearly visible: all the skewness values are positive and lie between 3.46 and 3.61, so the asymmetric tail is located on the right. We take this to mean that during the observed time interval there were periods when the concentration of particulate matter was above the norm.
Kurtosis, for its part, shows inflated values as well, varying from 13.34 to 14.05. This tells us that our data is far from normally distributed.
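For illustration, the two coefficients can be computed with fBasics on a simulated right-skewed series (the log-normal draw below is ours, not the project data):

```r
library(fBasics)

# Simulated heavy-right-tailed concentration series
set.seed(1)
p1 = rlnorm(1000, meanlog = 3, sdlog = 0.8)

skewness(p1)   # positive value: the asymmetric tail is on the right
kurtosis(p1)   # large excess kurtosis: far from a normal distribution
```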

We then moved on to a stationarity test for our clustered time series, in this case the Augmented Dickey-Fuller test (ADF test). It was necessary to check whether our series are stationary, because the further use of data with unit roots present may result in a spurious regression.


# Performing the stationarity test per cluster (adf.test from the tseries library)
DFT=numeric(length(timeser))
for (i in 1:length(timeser)){
  DFT[i]=adf.test(na.omit(timeser[[i]]$P1))$p.value
}

plot(DFT, main = "ADF Stationarity test for clustered data", ylab = "DFT p-value", xlab = "Cluster number", type = "p")
points(DFT, col = "orange", pch = 21, bg = "yellow", cex = 1.55, lwd = 2)

which(DFT>0.05) #[1]  98 102

The test showed satisfactory results: only two clusters, the 98th and the 102nd, have p-values above the 0.05 significance level. Still, this does no harm to the overall picture.


Week 3: Adjustment of PM10 concentration at citizen science stations for instrumental biases 

1. Considering the data sets on official air quality measurements

That week we returned to our 5 official European stations in order to fix the variables’ classes where necessary. We were interested in the columns “DatetimeBegin”, “DatetimeEnd” and “Concentration”.


#Making a check whether variables' classes are correct
sapply(deu[[1]], class)

# Fixing time's class
for (i in 1:length(deu)){
  deu[[i]]$DatetimeEnd=ymd_hms(deu[[i]]$DatetimeEnd, tz="Europe/Athens")
  deu[[i]]$DatetimeBegin=ymd_hms(deu[[i]]$DatetimeBegin, tz="Europe/Athens")
}

#Classes fixed

The time series vector for the official P10 measurements was created following the same scheme as in step 4 of the previous week, but adding a “Concentration” column to the time series in parallel. A few technical steps then remained: binding the data sets for the two available years and clearing up the duplicated values.
After all the filtering and cleaning procedures it turned out that the time interval of our interest lies between the end of November 2017 and the middle of September 2018.


# Creating time series vector for official P10 measurements
# (assuming the station time stamps were stored in a "time" column)
timesereu=list()
for (i in 1:length(deu)){
  timesereu[[i]]=data.frame(time=seq(from=min(deu[[i]]$time),
                                     to=max(deu[[i]]$time), by="hour"))
}

# Making a list for P10 Concentration for every station
for (i in 1:length(deu)){
  timesereu[[i]]=merge(timesereu[[i]], deu[[i]][,c("time","Concentration")],
                       by="time", all.x=TRUE)
}
names(timesereu)=names(deu)   # assuming deu was renamed to the station codes

#Binding the official stations' datasets for 2017 and 2018
timesereu=timesereu[c("st9421", "st9572","st9616","st9642","st60881")]
for (i in 1:length(timesereu)){
  timesereu[[i]]=timesereu[[i]][!duplicated(timesereu[[i]]$time),]
}

The results above gave us a basis for visualizing the missing values at the official stations. The graphic illustrates the data distribution both before and after interpolation. It is noticeable that the highest concentration of missing gaps was in February, August and September 2018, which had to be fixed by linear interpolation.
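A minimal sketch of the interpolation step, assuming the zoo package’s na.approx() (the series below is illustrative, not the official data):

```r
library(zoo)

# Hourly P10 series with internal gaps (illustrative values)
conc = c(24, NA, NA, 30, 28, NA, 35)

# Linear interpolation fills the internal gaps
filled = na.approx(conc, na.rm = FALSE)
filled # 24.0 26.0 28.0 30.0 28.0 31.5 35.0
```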


Unfortunately, this stage marks the limit of our research team’s competence. But we believe our colleagues will succeed in determining the optimal forecasting model to address such a topical issue as air pollution.


13 thoughts on “Monthly Challenge – Sofia Air – Team Yagoda – Solution”

    1.

      We tried, but it wasn’t successful. We are not sure how to use the Jupyter Notebook platform: we tried to upload our file with the .R extension, but there is a message that R files aren’t supported and couldn’t be uploaded. We will try again next time. 🙂

  1.

    Banana team: Heeey guys & girls :),
    I like that your article is brief, has a straight line in its structure, and at first sight you can understand every step you took while working on this phase. I also like that you don’t follow exactly the same steps as in the instruction article, because this way you put your own vision of how the analysis should be constructed.
    Another thing I really, really love is that your code is simple and easy to read, so every reader can figure out what is happening, no matter whether he is a Master of code (and the Force is with him) or a very beginner writing Padawan 😀
    However, I have several questions:
    On one hand, why didn’t you check for NAs (I don’t think I saw such a check) and get rid of them, and on the other (I’m sorry that I’m so curious, but there is no hope for me), why did you give up so easily on stations that have observations only in 2017?
    Work hard and don’t forget to have fun ^.^
    P.S. (I promise that’s the last thing) It would be good to include pictures of the plots & outputs for some more clarity.

  2.

    + The article is structured and there is a paragraph for each step you make.
    + Steps are explained.
    + The task for Week 1 is completed.
    – I would like to see the result of the code. It would be much better if you include some tables (part of) or figures (maps).

  3.

    Dear Banana Team,
    I think you have done a great job with your article and your script; both are very clear. As a person who has a difficult time writing code, I have no problem understanding yours, which is very good: keep it simple where possible. Follow the steps and don’t forget to use your imagination. I can tell that you are cautious, so my advice is not to be afraid of making mistakes, since you learn the most from them. It is a good idea to add pictures and plots to make your article more interactive and catchy.

  4.

    Hi team, thanks for this article. It would be great to copy-paste your code into a Jupyter notebook and upload it as a media file (Jupyter comes for free with the Anaconda Distribution, so you can go ahead and install that).
    I was curious about the clustering: on what basis did you come up with 196 clusters? Furthermore, how would you estimate the temperature, pressure, etc. for these clusters when you reach the prediction stage; are you going to take some sort of an average?

  5.

    Hi Team,
    Great to see your week1 and week2 work. I am not seeing your week3/4 sections, perhaps there is an error. Please let me know when you have it uploaded.

