
Monthly Challenge – Sofia Air – Solution – Kiwi Team


 I. Business understanding

The fast-paced modern lifestyle has a dramatic effect on quality of life. Bulgaria is an example of a developing country that is not yet self-sustainable, and due to a variety of factors it experiences frequent episodes of air pollution.

Open areas often experience more wind and are therefore more likely to have better air quality. Sofia does not have this advantage, and the lifestyle of its citizens does further damage: two of the main problems are older vehicles and non-ecological types of heating. Our goal is to build a successful model that predicts the air pollution expected over the next 24 hours.


II. Data mining 

Our data research encompasses the first two stages of data modelling. It analyses four types of data:

  • Official air quality measurements from five stations in the capital, Sofia;
  • Citizen air quality measurements, including parameters such as temperature, humidity, pressure and topography;
  • Meteorological measurements – one station measuring temperature, humidity, wind speed, pressure, rainfall and visibility;
  • Topography data.

The indicators from these components matter to our research because they are correlated with pollution levels: windy weather disperses pollution, whilst still weather allows pollution to build up.

First we need to get to know the data we are going to work with. To do that, we import it into our modelling software – R. We direct our attention to the citizen science air quality measurements (by Air Tube) and the topography data for Sofia. The first problem we face when reading the data is that all files coming from Air Tube are zipped. We unzip them using the R package R.utils and read them, making sure that all strings are imported as strings rather than as factors.

setwd("") # put the path to your working directory here
if (!require(R.utils)) {
  install.packages("R.utils")
  require(R.utils)
}
if (file.exists("./Air Tube/data_bg_2017.csv.gz")) {
  gunzip("./Air Tube/data_bg_2017.csv.gz")
}
if (file.exists("./Air Tube/data_bg_2018.csv.gz")) {
  gunzip("./Air Tube/data_bg_2018.csv.gz")
}
data_bg_2017 <- read.csv("./Air Tube/data_bg_2017.csv", stringsAsFactors = FALSE)
data_bg_2018 <- read.csv("./Air Tube/data_bg_2018.csv", stringsAsFactors = FALSE)

III. Data cleaning 

Now that we have our data imported, it’s time to “feel” the data. We start by checking a sample of the citizen science air quality measurements and create a summary of the important statistical metrics for both 2017 and 2018. Here is the summary for 2017:

     time             geohash                P1                P2           temperature         humidity         pressure
 Length:651492      Length:651492      Min.   :   0.00   Min.   :   0.00   Min.   :-148.000   Min.   :  0.00   Min.   :     0
 Class :character   Class :character   1st Qu.:  10.00   1st Qu.:   7.00   1st Qu.:   4.000   1st Qu.: 51.00   1st Qu.: 94072
 Mode  :character   Mode  :character   Median :  22.00   Median :   8.00   Median :   9.000   Median : 67.00   Median : 95187
                                       Mean   :  43.98   Mean   :  16.09   Mean   :   9.103   Mean   : 62.49   Mean   : 85333
                                       3rd Qu.:  47.00   3rd Qu.:  27.00   3rd Qu.:  14.000   3rd Qu.: 78.00   3rd Qu.: 96416
                                       Max.   :2000.00   Max.   :1000.00   Max.   :  63.000   Max.   :100.00   Max.   :176159

As can be seen, significant outliers are present in the “temperature” feature – it has a minimum of -148 and a maximum of 63, which seem physically impossible. Its mean and median are close, which suggests a roughly symmetric distribution. The P1 and P2 features have very high maximum values. We also flag the maximum “pressure” value as a potential outlier – it should be looked into at a later stage.

     time              geohash                P1                P2           temperature        humidity       pressure
 Length:2958654     Length:2958654     Min.   :   0.00   Min.   :   0.00   Min.   :-5573.00   Min.   :-999   Min.   :-20148
 Class :character   Class :character   1st Qu.:   8.00   1st Qu.:   5.00   1st Qu.:    7.00   1st Qu.:  44   1st Qu.: 93646
 Mode  :character   Mode  :character   Median :  13.00   Median :   8.00   Median :   17.00   Median :  62   Median : 94874
                                       Mean   :  28.94   Mean   :  16.09   Mean   :   14.69   Mean   :  59   Mean   : 85267
                                       3rd Qu.:  24.00   3rd Qu.:  15.00   3rd Qu.:   23.00   3rd Qu.:  77   3rd Qu.: 98304
                                       Max.   :2000.00   Max.   :1000.00   Max.   :  435.00   Max.   : 898   Max.   :254165

Analogously, we perform the same analysis for 2018. Again, the “temperature” feature contains outliers. Here the distribution does not appear symmetrical: there is a significant difference between the 1st and 3rd quartile values. As in 2017, the P1 and P2 features have very high maximum values, and the maximum “pressure” value remains a potential outlier to be examined at a later stage.

Next, we focus on the classes of the variables. All variables are of the appropriate class except “time”, which we fix using the R package “lubridate” and its function ymd_hms().
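A minimal sketch of that fix, assuming the timestamps are character strings of the form "YYYY-MM-DD HH:MM:SS":

```r
if (!require(lubridate)) {
  install.packages("lubridate")
  require(lubridate)
}
# parse the character timestamps into POSIXct date-times
data_bg_2017$time <- ymd_hms(data_bg_2017$time)
data_bg_2018$time <- ymd_hms(data_bg_2018$time)
```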

We follow the same steps in the topography data analysis. There is no class inconsistency – all features are of class “numeric”.
# All good - now let's check the topo data
for (i in 1:length(names(sofia_topo))) {
  print(class(sofia_topo[, i])) # every column should be "numeric"
}

Let’s find all unique stations in 2017 using the unique() function – there are 383. We do the same for the 2018 observations – 1254 unique stations. It is not reasonable to make predictions for stations that were not observed in 2018, so we find the stations observed only in 2017 with the setdiff() function. 11 stations have data only for 2017, so we take them out of the dataset.

Now that we are sure that all stations observed in 2017 are also observed in 2018, we can bind the two datasets using rbind().

only_in_2017 <- setdiff(data_bg_2017$geohash, data_bg_2018$geohash)
length(only_in_2017) # 11 stations appear only in 2017
data_bg_2017_new <- subset(data_bg_2017, !(data_bg_2017$geohash %in% only_in_2017))
data_bg_full <- rbind(data_bg_2017_new, data_bg_2018)

We need to make sure that our new dataset is ready for further analysis. A likely first problem is “NA” or missing values, so we check and find 4 observations with an empty geohash, which we remove. We now have a dataset of 1,253 unique stations.

Next we summarize the data in a new dataframe of significant features: the first column holds the unique geohashes; the second the number of available observations; the third and fourth the date and time when the station was first and last observed; and the fifth the difference (in days) between the last and the first observation. The last column is our starting point, as stations observed over a very short period are not of interest for our purposes.

if (!require(dplyr)) {
  install.packages("dplyr")
  require(dplyr)
}
sum(data_bg_full$geohash == "") # 4 observations with an empty geohash
data_bg_clean <- data_bg_full %>%
  filter(geohash != "")
unique_full_set <- data.frame(unique(data_bg_clean$geohash))
freq <- data.frame(table(data_bg_clean$geohash))
min <- aggregate(time ~ geohash, data = data_bg_clean, min)
max <- aggregate(time ~ geohash, data = data_bg_clean, max)

minmax <- merge(min, max, by = "geohash")

colnames(freq)[colnames(freq) == 'Var1'] <- 'geohash'
summary <- merge(freq, minmax, by = "geohash")
summary$days <- difftime(summary$time.y, summary$time.x, units = "days")
colnames(summary)[colnames(summary) == 'time.x'] <- 'tmin'
colnames(summary)[colnames(summary) == 'time.y'] <- 'tmax'
colnames(summary)[colnames(summary) == 'Freq'] <- 'obs'
summary <- filter(summary, obs > 4)

IV. Data exploration

It is time to place the stations on a map. The “geohash” feature contains encoded latitude and longitude, which we decode using gh_decode() from the geohash library. Now we are ready to plot the observations on a map. We use the leaflet library and add a cluster option so the observations are grouped by region.
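For intuition, decoding a single geohash looks like this – gh_decode() returns the latitude/longitude of the cell centre plus error bounds (the hash below is a made-up example, not one of our stations):

```r
library(geohash)
# decode one example geohash string
gh_decode("sx8dfr")
# returns a data.frame with columns lat, lng, lat_error, lng_error
```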

As can be seen, the observations are not concentrated only in Sofia. Our solution aims to predict air pollution in Sofia, so we need to keep only the observations in the capital.

Here is the map of all stations for which we have observations in 2017 and 2018.

if (!require(geohash)) {
  install.packages("geohash")
  require(geohash)
}
geohash_decoded <- gh_decode(summary$geohash)
summary_decoded <- data.frame(summary, geohash_decoded)
if (!require(leaflet)) {
  install.packages("leaflet")
  require(leaflet)
}
summary_decoded %>%
  leaflet() %>%
  addTiles() %>%
  addMarkers(clusterOptions = markerClusterOptions())
colnames(sofia_topo)[colnames(sofia_topo) == 'Lat'] <- 'lat'
colnames(sofia_topo)[colnames(sofia_topo) == 'Lon'] <- 'lng'
# keep only stations inside the bounding box of the Sofia topography grid
sofia_summary <- summary_decoded[which(summary_decoded$lat < max(sofia_topo$lat) &
                                       summary_decoded$lat > min(sofia_topo$lat) &
                                       summary_decoded$lng < max(sofia_topo$lng) &
                                       summary_decoded$lng > min(sofia_topo$lng)), ]
sofia_summary %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(weight = 0.1, color = "red") %>%
  addRectangles(lat1 = (min(sofia_summary$lat) - 0.01), lng1 = (max(sofia_summary$lng) - 0.18),
                lat2 = (min(sofia_summary$lat) + 0.13), lng2 = (min(sofia_summary$lng) + 0.18),
                fillColor = "transparent")

We have already filtered our observations to have frequency greater than 10 and to be observed for more than 7 days.


V. Feature engineering

Now it is time to analyze the data by station and make a final list of geo-units. These 113 geo-units will be part of our subsequent steps for building a predictive model for air pollution in Sofia.

sofia_topo$point <- 1:nrow(sofia_topo)
if (!require(geosphere)) {
  install.packages("geosphere")
  require(geosphere)
}
# distance from every station to every topo grid point
dist_mat <- distm(sofia_summary[, c('lng','lat')], sofia_topo[, c('lng','lat')], fun = distVincentyEllipsoid)
# assign each station the elevation and id of its nearest topo point
sofia_summary$Elev <- sofia_topo$Elev[max.col(-dist_mat)]
sofia_summary$pnt <- sofia_topo$point[max.col(-dist_mat)]
pal <- colorFactor("viridis", domain = sofia_summary$pnt)

Here is the distribution of the observations by cluster in Sofia. This plot continues our plotting approach from Week 1.

We also decided to take a look at the cluster distribution by elevation in a 3D graph.

First of all we need to clean our stations and decide which of them make up the final list of geo-units. From the Sofia topography data we plot the stations based on the clusters and merge them with the citizen science air quality measurements. The new consolidated set contains clusters, elevation, observations, and the minimum and maximum temperature, pressure and humidity values.

We clear the data of mismatches and inconsistencies by localizing all errors and cleaning our geo-units. For this step we use our own rules to localize potential errors, together with two libraries that are new to us – editrules and deducorrect.

First, we check the minimum and maximum value of every feature and see that there are some serious outliers – physically impossible records in all of them.

Feature     Temperature   Humidity   Pressure     P10
Minimum         -5 573        110    164 990    2 000
Maximum             63       -999          0        0

Using the descriptive statistics from official stations as benchmark, we create a rule that indicates all the possible errors in our clusters.

E <- editmatrix(c(
  "P1 >= 0.09", "P1 <= 690",
  "temperature >= -25", "temperature <= 40",
  "pressure >= 99018", "pressure <= 105113.5",
  "humidity >= 5", "humidity <= 100"
))
# Localize all errors in the data
err <- localizeErrors(E, z4, verbose = TRUE)


Most errors come from pressure – over 90% of its values are mismeasured – so we decide it is inappropriate to include it as an explanatory variable in our model; its data would instead be fully interpolated from the official stations’ data in the next step.
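The per-variable error shares can be read off the localizeErrors() result: its $adapt slot is a logical matrix flagging, for every record, which variables would need to be changed. A short sketch:

```r
# share of records flagged per variable; pressure should dominate (>90%)
colMeans(err$adapt)
```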

After removing the errors we construct time series of air pollution, temperature, humidity and pressure for every geo-unit, so that aggregated data belongs to the same unit station.
We create an empty list object to store the time series data for each cluster. There are 113 clusters, and we perform the operation in a loop.
Within each geo-unit we aggregate the data for all stations by time, taking the mean value of each variable.

Our next step is to inspect and summarize the most important statistical characteristics of the resulting time series. For this purpose we need to install libraries such as tseries and fBasics.

geo_units <- unique(z4$pnt) # z4 FROM STEP 2 IS USED
length(geo_units) # [1] 113
ts_list <- list() # empty list to collect one time series per cluster
for (i in 1:length(geo_units)) {
  # take the observations for the current geo-unit
  cluster_temp <- z4 %>% filter(pnt == geo_units[i])
  # introduce a complete hourly time sequence (complete() is from tidyr)
  cluster_temp <- cluster_temp %>%
    complete(time = seq(min(cluster_temp$time), max(cluster_temp$time), by = "hour"))
  # aggregate every variable by the hour, taking the mean across stations
  P1_temp <- aggregate(cluster_temp$P1, by = list(cluster_temp$time), FUN = mean, na.rm = TRUE)
  colnames(P1_temp) <- c("time", "P1")
  P2_temp <- aggregate(cluster_temp$P2, by = list(cluster_temp$time), FUN = mean, na.rm = TRUE)
  colnames(P2_temp) <- c("time", "P2")
  temperature_temp <- aggregate(cluster_temp$temperature, by = list(cluster_temp$time), FUN = mean, na.rm = TRUE)
  colnames(temperature_temp) <- c("time", "temperature")
  humidity_temp <- aggregate(cluster_temp$humidity, by = list(cluster_temp$time), FUN = mean, na.rm = TRUE)
  colnames(humidity_temp) <- c("time", "humidity")
  pressure_temp <- aggregate(cluster_temp$pressure, by = list(cluster_temp$time), FUN = mean, na.rm = TRUE)
  colnames(pressure_temp) <- c("time", "pressure")
  ts_list[[i]] <- assign(paste0("cluster_", geo_units[i]),
                         data.frame(join_all(list(P1_temp, P2_temp, temperature_temp,
                                                  humidity_temp, pressure_temp), by = "time")))
  rm(list = paste0("cluster_", geo_units[i]))
  rm(cluster_temp, P1_temp, P2_temp, temperature_temp, humidity_temp, pressure_temp)
}


By now we have created our time series for the 113 clusters and are ready to compute their descriptive statistics.


VI. Predictive modeling

Our next step is to load the dataset of official stations for 2017 and 2018 only, since the unofficial data covers that time frame.
There are 5 official stations – 4 of them have observations for the whole period, and one is new for 2018.
We merge the two sets for 2017 and 2018 and fix the class of the variables using the lubridate library where needed – for the DatetimeBegin, DatetimeEnd, Validity and Verification features.

The period for which we have data from the official stations runs from 2017-11-28 13:00:00 UTC to 2018-09-14 21:00:00 UTC, which means that if we want to use this dataset’s variables as predictors, we have to restrict ourselves to the period November 2017 – August 2018.
We then add the metadata for the official stations, so we now know their exact locations.

As our next step, we create hourly time series for every official station. First, we inspect the Nadezhda station. Comparing the plots before and after interpolation, there are some missing values in January, which we fill using linear interpolation.


# introducing complete time series
nadezhda <- nadezhda %>%
  complete(time = seq(min(nadezhda$time), max(nadezhda$time), by = "hour"))
# Check for missing values (and visualise them)
sapply(nadezhda, function(x) sum( # 258
plot(nadezhda$time, nadezhda$P1_official, type = "l") # there is some missing data in January
# Interpolation of missing values
if (!require(imputeTS)) {
  install.packages("imputeTS")
  require(imputeTS)
}
nadezhda$P1_official <- na.interpolation(nadezhda$P1_official, option = "linear")

We do that for every station. Here are our results:

  • Mladost station

There is missing data in February, March and September; it is filled by linear interpolation.

  • Pavlovo station

There are significant gaps in January, February and almost the whole of September that need to be fixed using linear interpolation.

  • Druzhba station

Here, some values are missing in January, February, July and September.

  • Hipodruma station

We address the issue for missing values in January and September.

Next, we include meteorological data for 2017 and 2018 only. Here is a summary for the observations:

Next we create daily time series, fix the classes of the year, month and day variables, and check for NAs.
Observations in the meteorological dataset start on 2017-01-01 and end on 2018-09-17, so we have data for the whole period.
One issue is that the data is per day, not per hour; we deal with that in our next steps.
The good news is that there are no NAs, so no interpolation is needed.
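One simple way to deal with the daily resolution later – an assumption on our part, not necessarily what we ended up doing – is to repeat each daily value across its 24 hours (meteo_daily is a hypothetical name for the meteorological dataframe):

```r
# expand daily meteorological records to hourly ones by repetition;
# meteo_daily is a hypothetical name, one row per calendar day
meteo_hourly <- meteo_daily[rep(1:nrow(meteo_daily), each = 24), ]
meteo_hourly$hour <- rep(0:23, times = nrow(meteo_daily))
```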

Our next step is to plot official stations next to our geo units:

Here’s how the 113 clusters look on the map:

It can be noted that the official stations are not evenly distributed across the territory of Sofia and therefore cannot be used as cluster centers.

Next, we perform exploratory analysis on the statistical characteristics of the response variable for each geo-unit. Here are the results:

stat_test <- function(df) {
  p <- ncol(df)
  df_multi <- data.frame(var = names(df),
    # Ljung-Box test for autocorrelation
    box.pvalue = sapply(df, function(v) Box.test(ts(v), lag = 20, type = "Ljung-Box")$p.value),
    # ADF and KPSS tests for stationarity
    adf.pvalue = sapply(df, function(v) adf.test(ts(v), alternative = "stationary")$p.value),
    kpss.pvalue = sapply(df, function(v) kpss.test(ts(v))$p.value)
  )
  df_multi$box <- df_multi$box.pvalue < 0.05
  df_multi$adf <- df_multi$adf.pvalue < 0.05
  df_multi$kpss <- df_multi$kpss.pvalue > 0.05
  row.names(df_multi) <- c()
  df_multi
}

Now we delete all geo-units with more than 20% missing values – our NA-threshold treatment. After that we end up with 75 clusters; 38 were removed because their P1 variable has more than 20% missing values.
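A sketch of the threshold itself, assuming ts_list holds the per-cluster time series built above:

```r
# share of missing P1 values per cluster
na_share <- sapply(ts_list, function(df) mean($P1)))
# keep only clusters with at most 20% missing values
ts_list <- ts_list[na_share <= 0.20]
```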

Next, we interpolate the missing observations linearly. We then check the correlation between the response variable and the remaining variables with correlation_threshold <- 0.75. Three clusters have no variable correlated with the response above the threshold, so we remove them, leaving a final list of 72 clusters. They have coverage across the city.
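The correlation screen can be sketched as follows (variable names assumed; ts_list holds the remaining cluster time series):

```r
correlation_threshold <- 0.75
keep <- sapply(ts_list, function(df) {
  # correlation of each candidate predictor with the response P1
  cors <- cor(df[, c("P2", "temperature", "humidity")], df$P1, use = "complete.obs")
  any(abs(cors) >= correlation_threshold)
})
# drop clusters where nothing clears the threshold
ts_list <- ts_list[keep]
```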

Here is our correlation plot:

For our predictive model we divide the set into training and test parts. We test the model on the values from the last day in the dataset, 14th September 2018:
# just for verification

dim(train_list[[1]]) # [1] 10422 3

dim(test_list[[1]]) # [1] 24 3

Next, we create an ARIMA model for each of our 72 clusters.
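The post does not show the fitting call itself; a minimal sketch using auto.arima() from the forecast package (daily seasonality and column names are our assumptions) could look like:

```r
if (!require(forecast)) {
  install.packages("forecast")
  require(forecast)
}
# fit one model on the hourly P1 series of the first cluster
fit <- auto.arima(ts(train_list[[1]]$P1, frequency = 24))
fc <- forecast(fit, h = 24) # forecast the next 24 hours
```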


These are the results for the first cluster:

Let’s build a predictive model for the chance of exceeding the 50 µg/m3 day-average adjusted PM10 concentration at every geo-unit. Below is the comparison summary for our clusters.
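The exceedance check itself is simple: average the 24 hourly forecasts and compare against the 50 µg/m3 limit. A self-contained sketch with made-up forecast values:

```r
# hypothetical hourly PM10 forecasts for one geo-unit and one day
p1_forecast <- c(rep(8, 12), rep(10, 12))
daily_avg <- mean(p1_forecast) # 9
daily_avg > 50 # FALSE: no exceedance expected
```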

Job well done: the actual air pollution for 13th September 2018 is 8.848, compared to our model’s estimate of 9.232.


Find more about the project and the code HERE.


14 thoughts on “Monthly Challenge – Sofia Air – Solution – Kiwi Team”

  1.

    Please upload your code here in the article – we have great capabilities on this platform to import and render Jupyter notebooks, or at least to include code snippets in quotes

  2.

    Thank you for sharing with us your solution. The maps are really helpful and they give us a clear vision of the situation. The leaflet library was a smart decision. 🙂 – Team Yagoda

  3.

    Fantastic work! I got a lot of insight about the data which I never had before!

    I was thinking of fixing the error values in temperature (and the rest) by taking the values before and after the erroneous one and using their mean, instead of treating the whole column. For example, at 1pm the temperature is 17, at 2pm it is 53 and at 3pm it is 19; I’d set a maximum-temperature threshold that catches 53 as an anomaly and then replace 53 with 18, the average of 17 and 19. The operation would be done after grouping the geohashes.

    It is just a suggestion and I am just a beginner so this suggestion may not be helpful :p

    Good luck !

  4.

    This is great work, in my opinion. The code is very clean and functional, along with the analysis you did beforehand and in the process, it helped understand the task better. And as everyone else said, the final result is just beautiful and very informative. Great work!

  5.

    Excellent write-up and great progress from last week. I enjoyed your method of clustering the geohashes into regions to filter down to the geohashes within the Sofia boundaries.

    I liked how you visualized the data in 3D to bring in the perspective of elevation.
    All of your temperature, pressure and humidity boundaries seem like good trade-offs. Given that the problem is scoped to Sofia, is there a reason to keep the data points that are much higher in elevation than points inside the city boundaries? Given Sofia is at about 581 m, should data from elevations above 1,000 meters be included at all?

    Overall, fantastic approach. I really like how your team is pulling this together in a well thought-out, reasoned way. Good luck for next week.

  6.

    Great progress so far! As I am not familiar with R, I cannot say anything about the code. The visualizations look great though 🙂
    However, I have some questions:
    How did you come up with the limits for the temp, pressure and humidity? What about the errors in the P10 measurements?
    What is the point of aggregating the data by geo unit? Why don’t we model the data at each sensor location? What do we gain from the aggregation?

  7.

    Hello Kiwi, congratulations on this relevant work so far! I particularly appreciate the data viz and the explanations surrounding your code. From my understanding, you removed the 2017 stations not present in 2018? This is indeed interesting, since it avoids dropping most of 2018, which would happen if the opposite were done (removing 2018 stations not present in 2017); I’m still hesitating over what’s best on this point. I’m not sure I understand what your localizeErrors function does and how to interpret the subsequent figures. Looking forward to your updates for week 3! Best
