## Business Understanding¶

The amount of particulate matter (PM) in the air is an indicator of air quality which is of interest to ecologists, the European commission, different environmental agencies, and most of all, to the citizens, who are affected on many levels by the extensive amounts of particulate matter and the bad air quality. The question has gathered head in Sofia, as the city has fallen into the category of “dangerous places to breathe.” A study from 2013 finds that Bulgaria’s air is the dirtiest in Europe (*The New York Times, 2013*). And to top it all off, very few people care about environmental issues, which aggravates the problem even further.

The purpose of this study is to find a robust method for prediction of the amount of PM in the next 24 hours in Sofia, based on the citizen science air quality measurements.

## Data Understanding¶

The data from citizens are supplied by AirBG.info. They consists of two datasets – `data_bg_2017.csv`

and `data_bg_2018.csv`

and they are not officially submitted to the European Comission. The measurements are carried out by volunteer citizens with their own equipment which could lead to imperfect measurements, instrumental biases and so on. We were not able to find metadata description with explanation for every columns i.e. what it measures and what unit is being used for every column.

The official data are collected by the European Environmental Agency It is a subject to EU guidelines on air quality monitoring. That is why the data are mostly free of instrumental biases and less error-prone, thus very reliable. It consists of 28 datasets and metadata.

There are also meteorological and topography measurements, which are standardized and reliable.

The data exploration process showed that there is quite a difference between the way the data are provided in the official and the citizen datasets. The official ones include useful information about the units used but lack additional information like temperature, humidity and pressure which are present in the citizens’. During the data preparation process we’d need to unify the data.

# Preliminary Analysis (Week 1)¶

At this stage of the process, we need to get familiar with the data, identify missing observations, fix the inconsistencies, if we find any, and prepare our data for further modelling in R.

### Importing the datasets¶

As a first step, we load the libraries in R, which we are going to use later in our analysis. These are `tidyverse`

, `geohash`

, `geoR`

, `ggmap`

, `DT`

, `knitr`

, `rgdal`

, `lubridate`

, and `maps`

. Then, we import the first three datasets: the two datasets on citizen science air quality measurements from 2017 and 2018, and the topography data for Sofia. We are going to concentrate on them during the first stage of the project, and we are going to utilize the unused available information later on.

```
install.packages("rgeos",repos = "https://ftp.uni-sofia.bg/CRAN/")
```

```
library(tidyverse)
library(geohash)
library(geoR)
library(ggmap)
library(DT)
library(knitr)
library(rgdal)
library(lubridate)
library(maps)
library(corrplot)
library(fBasics)
library(geosphere)
library(rgeos)
library(sp)
```

```
#Import citizen data for 2017
cit17 = read.csv("G:\\Misc\\documents\\stopansko\\masters\\1_semester\\Monthly Challenge\\Air Tube\\data_bg_2017.csv", na.strings=c("",".","NA"," "), stringsAsFactors = FALSE)
cit18 = read.csv("G:\\Misc\\documents\\stopansko\\masters\\1_semester\\Monthly Challenge\\Air Tube\\data_bg_2018.csv", na.strings=c("",".","NA"," "), stringsAsFactors = FALSE)
topo = read.csv("G:\\Misc\\documents\\stopansko\\masters\\1_semester\\Monthly Challenge\\TOPO-DATA\\sofia_topo.csv", na.strings=c("",".","NA"," "), stringsAsFactors = FALSE)
colnames(topo)[1:2] = c("lat", "long")
```

### Inspect the data structure¶

After we make sure that there is nothing alarming in the way the program has interpreted our variables, and string variables are stored as character, and numeric variables – as numeric, we change the class of the time variable to ‘POSIXct’ and ‘POSIXt’, so that we can work easily with it.

```
VarClass17 = data.frame(names(cit17))
VarClass17[,2] = rapply(cit17, class)
VarClass18 = data.frame(names(cit18))
VarClass18[,2] = rapply(cit18, class)
VarClassTopo = data.frame(names(topo))
VarClassTopo[,2] = rapply(topo, class)
cit17$time = ymd_hms(cit17$time) #Changes the class to POSIXct and POSIXt
cit18$time = ymd_hms(cit18$time)
```

### Identify unique stations and merge the two datasets on citizen science air quality measurements¶

After we find out that there are observations on 383 unique stations in 2017, and on 1254 in 2018, we bind the rows of the two datasets, as we leave the stations, which were observed only in 2017 out. It makes little sense to make predictions for stations with no observations from 2018. We do not have any information on whether these stations still exist; their observations do not capture important recent changes in the climate and the air quality, and that is why, we choose to concentrate our analysis on stations, which were observed in 2018.

```
length(unique(cit17[,2])) #383 unique stations in 2017
length(unique(cit18[,2])) #1254 unique geoh in 2018
diff = setdiff(cit17$geohash, cit18$geohash) #finds geohash from '17 not in '18
length(unique(diff)) #11 values
cit17 = cit17[!cit17$geohash %in% diff,] #removes all obs from cit17 w/ geohash in diff
rm(diff)
cit_merged = bind_rows(cit17, cit18) #merge the two datasets together
rm(cit17, cit18)
```

### Summarize availability of merged data by stations¶

We produce a summary table of our observations, grouped by geohash. It shows the earliest and the latest observations, the number of observations, as well as the number of days between the first and the last observation. We then add this information to our new dataset and identify the locations, where fewer than 100 observations were made, or their observation window is shorter than 7 days. These locations’ value of information is not high enough, and therefore, we drop them.

```
sum(is.na(cit_merged$geohash)) #4 missings geohash
cit_merged = cit_merged[!is.na(cit_merged$geohash),] #remove missings
Summary_cit_merged = cit_merged %>% group_by(geohash) %>% summarise(n(),min(time), max(time), max(time)-min(time))
colnames(Summary_cit_merged) = c("geohash", "Freq","tmin", "tmax","days" )
cit_merged = merge(cit_merged, Summary_cit_merged, by = 'geohash')
rm(Summary_cit_merged)
cit_merged$tmin = NULL
cit_merged$tmax = NULL
cit_merged = cit_merged[(cit_merged$days > 7 & cit_merged$Freq >100),] #we delete all obs. w/ fewer than 7 day long obs. window
cit_merged = cit_merged[order(cit_merged$Freq, cit_merged$time, decreasing = TRUE),]
```

### Summarize on the map¶

Now, we want to see where our stations are located in Bulgaria, as we need the observations to be only in Sofia.

```
cit_merged = transform(cit_merged, latlng = gh_decode(cit_merged$geohash))
colnames(cit_merged)[10:13] = c("lat", "long", "lat_err", "lon_err")
Bulgaria = map_data(map = "world",region="Bulgaria")
BulgariaMap = ggplot() + geom_polygon(data = Bulgaria, aes(x=long, y = lat, group = group)) + coord_fixed(1.3)
#get unique stations' coordinates
StationsCoords = data.frame(unique(cit_merged$geohash), stringsAsFactors = FALSE)
colnames(StationsCoords) = "geohash"
StationsCoords = transform(StationsCoords, latlng = gh_decode(StationsCoords$geohash))
colnames(StationsCoords)[2:5] = c("lat", "long", "lat_err", "lng_err")
StationsMapBG = BulgariaMap +
geom_point(data = StationsCoords, aes(x = long, y = lat), color = "white", size = 2) +
geom_point(data = StationsCoords, aes(x = long, y = lat), color = "yellow", size = 1)
SofiaMapBG = StationsMapBG +
geom_point(data = topo, aes(x = long, y = lat), color = "blue", size = 2) +
geom_point(data = topo, aes(x = long, y = lat), color = "red", size = 1)
SofiaMap =ggplot(data = topo) +
geom_point(aes(x = long, y = lat), color = "black", size = 6) +
geom_point(aes(x = long, y = lat), color = "green", size = 4) +
coord_fixed(1.3)
```

```
plot(StationsMapBG)
```

```
plot(SofiaMap)
```

It is visible that the dots form an area with an almost perfect square shape, which means that if we use the maximum and minimum longitude and latitude of these dots, we are going to have a good idea which the stations located in Sofia are. We create two different datasets; one contains the observations from the stations which are located exactly in the area, isolated by the above map, and the other has a margin, which allows for some stations, which are located a little further, to enter the dataset:

```
SofiaStationsCoords = StationsCoords[((StationsCoords$lat+StationsCoords$lat_err) > min(topo$lat) &
(StationsCoords$lat-StationsCoords$lat_err) < max(topo$lat) &
(StationsCoords$long+StationsCoords$lng_err) > min(topo$long) &
(StationsCoords$long-StationsCoords$lng_err) < max(topo$long)), ]
SofiaStationsMap = SofiaMap +
geom_point(data = SofiaStationsCoords, aes(x = long, y = lat), color = "#666666", size = 5.5) +
geom_point(data = SofiaStationsCoords, aes(x = long, y = lat), color = "#FFFF66", size = 3.5)
citSofia = cit_merged[cit_merged$geohash %in% SofiaStationsCoords$geohash,]
```

```
plot(SofiaStationsMap)
```

```
#Creating a dataset with the observations within some margin around Sofia
SofiaStationsCoordsMargin = StationsCoords[((StationsCoords$lat+StationsCoords$lat_err) > (min(topo$lat)-0.025) &
(StationsCoords$lat-StationsCoords$lat_err) < (max(topo$lat)+0.025) &
(StationsCoords$long+StationsCoords$lng_err) > (min(topo$long)-0.025) &
(StationsCoords$long-StationsCoords$lng_err) < (max(topo$long)+0.025)), ]
SofiaStationsMapMargin = SofiaMap +
geom_point(data = SofiaStationsCoordsMargin, aes(x = long, y = lat), color = "#666666", size = 5.5) +
geom_point(data = SofiaStationsCoordsMargin, aes(x = long, y = lat), color = "#FFFF66", size = 3.5)
citSofiaMargin = cit_merged[cit_merged$geohash %in% SofiaStationsCoordsMargin$geohash,]
corr = cor(citSofia[,c(3:7)])
corrplot(corr,method = "square", type = "upper", addCoef.col = "black")
```

# Week 2¶

### Find the closest point from the topography dataset for Sofia to the stations¶

Since the topography dataset contains information on the elevation of the 196 points in Sofia, and we believe elevation might turn out to be a key factor when predicting the particulate matter in the air, we find each station’s closest point and assume that the station’s elevation would not differ too much from the point’s one. As we are doing this, we make clusters of stations which share the same closest point, and we use these clusters in order to proceed with our analysis.

```
sp.SofiaStationsCoords = SofiaStationsCoords
coordinates(sp.SofiaStationsCoords) = ~long+lat
for(i in 1:nrow(topo)){
topo$pnt[i]=i
}
sp.topo = topo
coordinates(sp.topo) = ~long+lat
Distance = gDistance(spgeom1 = sp.SofiaStationsCoords, spgeom2 = sp.topo, byid=T)
minDistance = apply(Distance, 2, function(x) order(x, decreasing=F)[1] )
SofiaStationsCoordsElev = cbind(SofiaStationsCoords, topo[minDistance,], apply(Distance,2, function(x) sort(x, decreasing = F)[1] ))
colnames(SofiaStationsCoordsElev) = c(colnames(SofiaStationsCoords), 'n_lat', 'n_long', 'Elev', 'pnt', 'distance')
SofiaStationsCoordsElev$lat_err = NULL
SofiaStationsCoordsElev$lng_err = NULL
colnames(topo)[1:2] = c('n_lat', 'n_long')
SofiaStationsCoordsElev$distance = NULL
citSofia_elev = merge(citSofia, SofiaStationsCoordsElev, by=c("geohash","lat","long"))
citSofia_elev$lat_err = NULL
citSofia_elev$lon_err = NULL
citSofia_elev = citSofia_elev[,c(1, 15, 14, 2:13)]
citSofia_elev = citSofia_elev[order(citSofia_elev$n_lat, citSofia_elev$n_long, citSofia_elev$geohash),]
colnames(citSofia_elev)[2] = 'Clust.ID'
clusters = SofiaMap +
geom_point(data = SofiaStationsCoordsElev, aes(x = long, y = lat), color = "white", size = 5.5) +
geom_point(data = SofiaStationsCoordsElev, aes(x = long, y = lat), color = SofiaStationsCoordsElev$pnt, size = 3.5)
rm(cit_merged, citSofia, Distance, SofiaStationsCoords, sp.SofiaStationsCoords, clusters, sp.topo, topo, i, minDistance)
```

```
plot(SofiaMap +
geom_point(data = SofiaStationsCoordsElev, aes(x = long, y = lat), color = "white", size = 5.5) +
geom_point(data = SofiaStationsCoordsElev, aes(x = long, y = lat), color = SofiaStationsCoordsElev$pnt, size = 3.5))
```

### Summarize the information by clusters¶

Checking out whether we have any odd values, or even whole clusters with odd observations is very important at such an early stage. That is why, we make a summary table which shows key statistics for the key variables, such as min, max, first quartile, third quartile, etc. From the table, it is visible that we face problem of outliers in almost every band, be it the values of the minimum or maximum temperature, pressure, or humidity. In the next step, we are going to take care of those outliers.

```
Summary_clust = citSofia_elev %>% group_by(Clust.ID) %>% summarise(obs = n(), Elev = mean(Elev), stations = length(unique(geohash)),
min.p1 = min(P1), first.quart.p1 = quantile(P1,.25), third.quart.p1 = quantile(P1,.75), max.p1 = max(P1),
min.p2 = min(P2),first.quart.p2 = quantile(P2,.25), third.quart.p2 = quantile(P2,.75), max.p2 = max(P2),
first.obs = min(time), last.obs = max(time), days = max(time)-min(time),
min.temp = min(temperature),first.quart.temp = quantile(temperature,.25), third.quart.temp = quantile(temperature,.75), max.temp = max(temperature),
min.hum = min(humidity),first.quart.hum = quantile(humidity,.25), third.quart.hum = quantile(humidity,.75), max.hum = max(humidity),
min.press = min(pressure),first.quart.press = quantile(pressure,.25), third.quart.press = quantile(pressure,.75), max.press = max(pressure))
```

```
Summary_clust[1:10,]
```

### Detect and treat outliers¶

In order to detect the outliers of our most important variable - PM10, we are going to use the dataset containing the observations of the official stations and we are going to use them as a benchmark. The meteorology dataset is going to serve as a benchmark for the other variables of interest. We create a vector, which will store the values, indicating the boundaries of our variables: P1, temperature, humidity, and pressure. We localize the outliers in the citizen science air quality measurements, using the readily available formula of boxplot.stats, and then check whether some of the detected outliers are in fact within the boundaries, set by the official meteorology dataset, and we keep them. For the pressure, we allow for more values to enter the final dataset, as the default options of the boxplot.stats are very restricting, and almost all of the values are detected as outliers. Then, we set the outliers in the dataset equal to NA, and we are going to interpolate their values further in the analysis, based on the observations with meaningful values. We decide that it would be best to remove the stations which produce consistently wrong measurements of P10, as their values are unreliable.

```
#Import official European data
setwd("G:\\Misc\\documents\\stopansko\\masters\\1_semester\\Monthly Challenge\\EEA Data")
eu = list.files(path="G:\\Misc\\documents\\stopansko\\masters\\1_semester\\Monthly Challenge\\EEA Data", pattern = "*.csv")
eudatasets = lapply(eu, read.csv, na.string = c("", "NA"," ","-999"),
stringsAsFactors = FALSE, fileEncoding="UTF-16LE")
for (i in 1:length(eu)){
eu[i]=gsub("BG_5_","st", eu[i])
eu[i]=gsub("_timeseries.csv","",eu[i])
names(eudatasets)[i]=eu[i]
}
rm(eu,i)
#Import meteorology data
meteo = read.csv("G:\\Misc\\documents\\stopansko\\masters\\1_semester\\Monthly Challenge\\METEO-data\\lbsf_20120101-20180917_IP.csv", na.strings=c("",".","NA"," ",-9999), stringsAsFactors = FALSE)
#Drop all obs. where year < 2017 because our citizen observations start in 2017
meteo = meteo[meteo$year>=2017,]
#Create a vector, which is going to serve as a "rules vecotor"
rules = rep(NA, 8)
names(rules)=c("P1min", "P1max", "Tmin", "Tmax", "Hmin", "Hmax", "Pmin","Pmax")
rules["P1min"] = min(sapply(eudatasets, function(x,y) min(y[,x]), x = "Concentration"))
rules["P1max"] = max(sapply(eudatasets, function(x,y) max(y[,x]), x = "Concentration"))
rules["Tmin"] = min(meteo$TASMIN)
rules["Tmax"] = max(meteo$TASMAX)
rules["Hmin"] = min(meteo$RHMIN)
rules["Hmax"] = max(meteo$RHMAX)
rules["Pmin"] = min(meteo$PSLMIN)
rules["Pmax"] = max(meteo$PSLMAX)
rm(meteo)
#All values outside 1.5*IQR are treated as outliers
outlier_temp_dd = data.frame(boxplot.stats(citSofia_elev$temperature)$out)
colnames(outlier_temp_dd)="value"
outlier_press_dd = data.frame(boxplot.stats(citSofia_elev$pressure, coef = 2.75)$out)
colnames(outlier_press_dd)="value"
outlier_hum_dd = data.frame(boxplot.stats(citSofia_elev$humidity)$out)
colnames(outlier_hum_dd) = "value"
#Subset only unique outliers to save time in next steps
outlier_temp_dd = subset(outlier_temp_dd, !duplicated(outlier_temp_dd$value))
outlier_press_dd = subset(outlier_press_dd,!duplicated(outlier_press_dd$value))
outlier_hum_dd = subset(outlier_hum_dd, !duplicated(outlier_hum_dd))
#Check whether some of these detected outliers are in fact within the boundaries, determined by the "rules vector"
outlier_temp_dd$check = NA
for(i in 1:nrow(outlier_temp_dd)){
if(outlier_temp_dd$value[i] > rules["Tmin"] &
outlier_temp_dd$value[i] < rules["Tmax"]){
outlier_temp_dd$check[i] = TRUE
}
else outlier_temp_dd$check[i] = FALSE
}
sum(outlier_temp_dd$check) #5 - 5 outliers that are not true outliers: KEEP THEM
outlier_temp_dd=outlier_temp_dd[outlier_temp_dd$check!= TRUE,]
outlier_press_dd$check = NA
for(i in 1:nrow(outlier_press_dd)){
if(outlier_press_dd$value[i] > rules["Pmin"]*100 &
outlier_press_dd$value[i] < rules["Pmax"]*100){
outlier_press_dd$check[i] = TRUE
}
else outlier_press_dd$check[i] = FALSE
}
sum(outlier_press_dd$check) #1140 - 1140 outliers that are not true outliers: KEEP THEM
outlier_press_dd=outlier_press_dd[outlier_press_dd$check!= TRUE,]
outlier_hum_dd$check = NA
for(i in 1:nrow(outlier_hum_dd)){
if(outlier_hum_dd$value[i] > rules["Hmin"] &
outlier_hum_dd$value[i] < rules["Hmax"]){
outlier_hum_dd$check[i] = TRUE
}
else outlier_hum_dd$check[i] = FALSE
}
sum(outlier_hum_dd$check) #0 - all outliers are true outliers
outlier_P1_dd = citSofia_elev[(citSofia_elev$P1<rules["P1min"] |
citSofia_elev$P1>rules["P1max"]),]
outlier_P1_dd = data.frame(outlier_P1_dd[,7])
colnames(outlier_P1_dd) = "value"
outlier_P1_dd = subset(outlier_P1_dd, !duplicated(outlier_P1_dd$value))
#Replace all outliers in the dataset with NA
outlier_hum = as.vector(outlier_hum_dd[,1]) #<= -13
outlier_P1 = as.vector(outlier_P1_dd[,1]) #according to official measurements
outlier_press = as.vector(outlier_press_dd[,1]) # <= 89,802 $ >=105,116
outlier_temp = as.vector(outlier_temp_dd[,1]) #<= -21 & >= 46
rules["Pmin"] = 90000
rules["Pmax"] = 105116
rules["Tmin"] = -21
rules["Tmax"] = 46
rm(outlier_temp_dd, outlier_hum_dd, outlier_P1_dd, outlier_press_dd)
citSofia_clean = citSofia_elev
citSofia_clean$P1 = ifelse((citSofia_clean$P1 <rules["P1min"] | citSofia_clean$P1>rules["P1max"]), NA, citSofia_clean$P1)
sum(is.na(citSofia_clean$P1)) #70,032
citSofia_clean$temperature <- ifelse((citSofia_clean$temperature<rules["Tmin"] | citSofia_clean$temperature >rules["Tmax"]), NA, citSofia_clean$temperature)
sum(is.na(citSofia_clean$temperature)) #11,161
citSofia_clean$pressure = ifelse((citSofia_clean$pressure <= rules["Pmin"] | citSofia_clean$pressure >= rules["Pmax"]),NA,citSofia_clean$pressure)
sum(is.na(citSofia_clean$pressure)) #248,330
citSofia_clean$humidity = ifelse((citSofia_clean$humidity<rules["Hmin"]|citSofia_clean$humidity>rules["Hmax"]), NA, citSofia_clean$humidity)
sum(is.na(citSofia_clean$humidity)) #101,616
#Remove all stations, showing systematic errors w/ P1
citSofia_clean$mism = ifelse(is.na(citSofia_clean$P1), 1, 0)
Summary_clean = citSofia_clean %>%
group_by(geohash) %>%
summarise(freq=n(), mism = sum(mism), perc = mism/freq) %>%
arrange(desc(perc))
list_mism = Summary_clean$geohash[Summary_clean$perc == 1]
citSofia_clean = citSofia_clean[!citSofia_clean$geohash %in% list_mism,]
rm(citSofia_elev, i, list_mism, outlier_hum, outlier_press, outlier_P1, outlier_temp, rules, Summary_clean, Summary_clust)
```

### Aggregate data by geo-units¶

Now that our data is cleaned from outliers, we are ready to aggregate the observations by cluster and by time, as we make sure that the unit of measurement is consistent across the time series - they are observed on an hourly basis, and where NA’s are present, we are going to interpolate in the next step.

```
t = list()
gu = list()
k = unique(citSofia_clean$Clust.ID)
for(i in 1:length(k)){
gu[[i]] = citSofia_clean[citSofia_clean$Clust.ID == k[i],]
t[[i]] = as.data.frame(seq.POSIXt(from = min(gu[[i]]$time), to = max(gu[[i]]$time), by="hour"))
colnames(t[[i]])[1] = "time"
}
gua = list()
for(i in 1:length(k)){
gua[[i]] = aggregate(gu[[i]][,c("Clust.ID", "Elev", "time", "P1", "P2",
"temperature", "humidity", "pressure")],
by = list(gu[[i]]$time),FUN = mean)
}
for(i in 1:length(k)){
t[[i]] = merge(t[[i]], gua[[i]], all.x = T)
t[[i]] = t[[i]][,-2]
}
rm(gu, gua, k)
```

### Summarize the most important statistics by geo-units¶

We check some basic statistics of all variables: percent NA’s, skewness, and kurtosis. P1 is the only variable that turns out to have very fat tails with kurtosis >3, as well as a non-symmetric distribution. We are going to address those issues later on.

We are also interested in the autocorrelation in the variables (mostly P1), therefore, we run a series of autocorrelation tests:

```
#NA's, Skewness and Kurtosis
DS = list()
DS$P1 = list()
DS$temp = list()
DS$press = list()
DS$hum = list()
for(i in 1:length(t)){
DS$P1[[i]] = basicStats(t[[i]][,c("P1")])
DS$temp[[i]] = basicStats(t[[i]][,c("temperature")])
DS$press[[i]] = basicStats(t[[i]][,c("pressure")])
DS$hum[[i]] = basicStats(t[[i]][,c("humidity")])
}
```

```
par(mfrow=c(2,3))
DS$P1 = as.data.frame(DS$P1)
plot(t(DS$P1["NAs",])/t(DS$P1["nobs",]))
plot(t(DS$P1["Skewness",])) #assymetric
plot(t(DS$P1["Kurtosis",])) #fat tails
DS$temp = as.data.frame(DS$temp)
plot(t(DS$temp["NAs",])/t(DS$temp["nobs",]))
plot(t(DS$temp["Skewness",])) #rather symmetric
plot(t(DS$temp["Kurtosis",])) #<= |3| in most of the cases
DS$hum = as.data.frame(DS$hum)
plot(t(DS$hum["NAs",])/t(DS$hum["nobs",]))
plot(t(DS$hum["Skewness",])) #rather symmetric
plot(t(DS$hum["Kurtosis",])) #<= |3| in most of the cases
DS$press = as.data.frame(DS$press)
plot(t(DS$press["NAs",])/t(DS$press["nobs",])) #a lot of missings
plot(t(DS$press["Skewness",])) #rather symmetric
plot(t(DS$press["Kurtosis",])) #<= |3| in most of the cases
```

A very clear pattern, occurring every 24h is visible with all variables, but with pressure, which is intuitive, as meteorological conditions are consistent at similar times during the day.

We also check for stationarity our dependent variable, which, with two exceptions, proves to be stationary, which is beneficial for our further analysis.

### Bind the rows of official stations by station¶

As our official data is available in different datasets, depending on the year, we bind the rows, where the observations belong to the same stations, and we get five distinct datasets with the values of PM10 in Pavlovo, Druzhba, Nadezhda, Mladost, and Krasno selo, which are located on the already familiar map like that:

```
VarClassOfficTime = rep(NA,9)
for(i in 1:length(eudatasets)){
VarClassOfficTime[[i]] = class(eudatasets[[i]][,c("DatetimeEnd")])
} #character - we need it to be POSIXt
#Sample obs: "2018-01-03 03:00:00 +01:00"
for(i in 1:length(eudatasets)){
eudatasets[[i]]$DatetimeEnd = ymd_hms(eudatasets[[i]]$DatetimeEnd, tz="Europe/Athens")
}
#Sample obs: "2018-01-03 04:00:00 EET" - in Eastern-European Time
rm(VarClassOfficTime, i)
#In the EU dataset, leave only the columns we need, i.e. Concentration and DatetimeEnd
for(i in 1:length(eudatasets)){
eudatasets[[i]] = eudatasets[[i]][,c("Concentration","DatetimeEnd")]
}
st9421 = bind_rows(eudatasets["st9421_2017"], eudatasets["st9421_2018"])
colnames(st9421)=c("P1_st9421","time")
st9572 = bind_rows(eudatasets["st9572_2017"], eudatasets["st9572_2018"])
colnames(st9572)=c("P1_st9572","time")
st9616 = bind_rows(eudatasets["st9616_2017"], eudatasets["st9616_2018"])
colnames(st9616)=c("P1_st9616","time")
st9642 = bind_rows(eudatasets["st9642_2017"], eudatasets["st9642_2018"])
colnames(st9642)=c("P1_st9642","time")
st60881 = as.data.frame(eudatasets["st60881_2018"])
for(i in 1:ncol(st60881)){
colnames(st60881)[i] = gsub("st60881_2018.","", colnames(st60881)[i])
}
colnames(st60881)=c("P1_st60881","time")
rm(eudatasets)
t_st60881 = list()
t_st60881=as.data.frame(seq.POSIXt(from=min(st60881$time),to=max(st60881$time), by="hour"))
colnames(t_st60881) = "time"
t_st60881 = left_join(t_st60881, st60881, by="time")
sum(duplicated(t_st60881$time)) #7
t_st60881 = t_st60881[!(duplicated(t_st60881$time)==T),]
sum(is.na(t_st60881$P1_st60881)) / dim(t_st60881)[1] # 2.68% missings
t_st9421 = list()
t_st9421=as.data.frame(seq.POSIXt(from=min(st9421$time),to=max(st9421$time), by="hour"))
colnames(t_st9421) = "time"
t_st9421 = left_join(t_st9421, st9421, by="time")
sum(duplicated(t_st9421$time)) #92
t_st9421 = t_st9421[!(duplicated(t_st9421$time)==T),]
sum(is.na(t_st9421$P1_st9421)) / dim(t_st9421)[1] # 4.59% missings
t_st9572 = list()
t_st9572=as.data.frame(seq.POSIXt(from=min(st9572$time),to=max(st9572$time), by="hour"))
colnames(t_st9572) = "time"
t_st9572 = left_join(t_st9572, st9572, by="time")
sum(duplicated(t_st9572$time)) #92
t_st9572 = t_st9572[!(duplicated(t_st9572$time)==T),]
sum(is.na(t_st9572$P1_st9572)) / dim(t_st9572)[1] # 2.71% missings
t_st9616 = list()
t_st9616=as.data.frame(seq.POSIXt(from=min(st9616$time),to=max(st9616$time), by="hour"))
colnames(t_st9616) = "time"
t_st9616 = left_join(t_st9616, st9616, by="time")
sum(duplicated(t_st9616$time)) #92
t_st9616 = t_st9616[!(duplicated(t_st9616$time)==T),]
sum(is.na(t_st9616$P1_st9616)) / dim(t_st9616)[1] # 12.02% missings
plot(t_st9616$time, t_st9616$P1_st9616, col = "#333399") #looks linear
t_st9642 = list()
t_st9642=as.data.frame(seq.POSIXt(from=min(st9642$time),to=max(st9642$time), by="hour"))
colnames(t_st9642) = "time"
t_st9642 = left_join(t_st9642, st9642, by="time")
sum(duplicated(t_st9642$time)) #92
t_st9642 = t_st9642[!(duplicated(t_st9642$time)==T),]
sum(is.na(t_st9642$P1_st9642)) / dim(t_st9642)[1] # 3.7% missings
#Interpolate missings
t_st60881$P1_st60881int = na.interpolation(t_st60881$P1_st60881, option = "linear")
t_st9421$P1_st9421int = na.interpolation(t_st9421$P1_st9421, option = "linear")
t_st9572$P1_st9572int = na.interpolation(t_st9572$P1_st9572, option = "linear")
t_st9616$P1_st9616int = na.interpolation(t_st9616$P1_st9616, option = "linear")
t_st9642$P1_st9642int = na.interpolation(t_st9642$P1_st9642, option = "linear")
```

In addition, we remove duplicate values without further investigation, as they are only a small part of the merged datasets. We interpolate the missing values, after we create a time vector with values for every hour beginning from the first observation and ending with the last observation of each of the five datasets. We use linear interpolation, as we do not want to overfit and get nonsense values.

### Add this information to our citizen measurements and interpolate missing values¶

After we make sure that the time variable in the citizenship measurements datasets is also in EET and we are going to merge by the same time in the same timezone, we add the hourly observations of the official stations to our citizenship observations and we check the amount of missings of each of the variables of interest.

```
#Since we're going to merge by "time", we make sure t's elements' time is also EET
for(i in 1:length(t)){
t[[i]]$time = ymd_hms(t[[i]]$time, tz = "Europe/Athens")
t[[i]] = t[[i]][!is.na(t[[i]]$time)==T,] #Suumer/ Winter time missing
}
#Sample: "2018-03-09 01:00:00 EET"
#Make a left join, missings from official datasets to appear as NAs in the new dataset
for(i in 1:length(t)){
t[[i]] = merge(t[[i]],t_st9421, by="time", all.x=T)
t[[i]] = merge(t[[i]],t_st60881, by = "time", all.x=T)
t[[i]] = merge(t[[i]],t_st9572, by="time", all.x=T)
t[[i]] = merge(t[[i]],t_st9616, by="time", all.x=T)
t[[i]] = merge(t[[i]],t_st9642, by="time",all.x=T)
}
tep 3.2.2: Interpolate missing values---------------------------------------
#Make sure we have Cluster ID everywhere:
for(i in 1:length(t)){
for(j in 1:dim(t[[i]])[1]){
if(is.na(t[[i]]$Clust.ID[j]) == T){
t[[i]]$Clust.ID[j] = t[[i]]$Clust.ID[j-1]
}
}
}
rm(i,j)
#Make a table which specifies the % missings in each column for each cluster
Missings = as.data.frame(matrix(nrow = length(t), ncol = 11))
colnames(Missings) = c("Clust.ID", "nobs", "Missing P1", "Missing temp", "Missing hum", "Missing press", "Missing P1_st9421", "Missing P1_st60881", "Missing P1_st9572", "Missing P1_st9616", "Missing P1_st9642")
for(i in 1:dim(Missings)[1]){
Missings[i,1] = t[[i]]$Clust.ID[1]
Missings[i,2] = dim(t[[i]])[1]
Missings[i,3] = round((sum(is.na(t[[i]]$P1)) / Missings[i,2]),2)
Missings[i,4] = round((sum(is.na(t[[i]]$temperature)) / Missings[i,2]),2)
Missings[i,5] = round((sum(is.na(t[[i]]$humidity)) / Missings[i,2]),2)
Missings[i,6] = round((sum(is.na(t[[i]]$pressure)) / Missings[i,2]),2)
Missings[i,7] = round((sum(is.na(t[[i]]$P1_st9421int)) / Missings[i,2]),2)
Missings[i,8] = round((sum(is.na(t[[i]]$P1_st60881int)) / Missings[i,2]),2)
Missings[i,9] = round((sum(is.na(t[[i]]$P1_st9572int)) / Missings[i,2]),2)
Missings[i,10] = round((sum(is.na(t[[i]]$P1_st9616int)) / Missings[i,2]),2)
Missings[i,11] = round((sum(is.na(t[[i]]$P1_st9642int)) / Missings[i,2]),2)
}
rm(i)
#Plot the missings in P1 and decide which ones to interpolate
plot(Missings$Clust.ID, Missings$`Missing P1`, col="#333399") #some clusters - a lot of missings
#We don't wish to interpolate a lot of missings, so we check which ones are worrying
sum(with(Missings,`Missing P1` > 0.15 )) #Only 16 clusters with missings more than 15% --> drop them
IDs_to_drop = list()
IDs_to_drop = which(Missings$`Missing P1`>0.15)
for(i in 1:length(IDs_to_drop)){
IDs_to_drop[[i]] = Missings$Clust.ID[(IDs_to_drop[[i]])]
}
for(i in 1:length(test)){
names(t)[i] = test[[i]]$Clust.ID[1]
}
t = t[-(which(names(t) %in% IDs_to_drop))]
rm(i, IDs_to_drop)
#We are going to use linear interpolation to interpolate missing values in the dependent variable
for(i in 1:length(t)){
t[[i]]$P1int = na.interpolation(t[[i]]$P1, option = "linear")
}
rm(i)
```

Since we want to be sure that our interpolation will not change our results drastically, we drop the stations with more than 15% missings, which turn out to be only 16. We use a linear function to interpolate the missing values of our dependent variable: P1.

### Accounting for the instrumental bias¶

Since we know that the citizenship measurements might be suffering from an instrumental bias, we need to adjust for that. First of all, we add the distance from the citizenship station to each official station to our dataset, as we think that the closer an official station is, the closer the measurements of the citizen station would be. We make a correlation matrix between the citizenship measurements and the official ones, and we continue our analysis on the ones, which show correlation with at least one official station more than 75%. We also check whether the closest station is also the one which shows highest correlation and we run linear regressions with both the closest and the most highly correlated staton and we average the adjusted P1, where they are different. We have made an adjustment for the instrumental bias.

```
#Figure out the coordinates of the official stations
coords=rep(NA,5)
coords = as.data.frame(coords)
coords$coords = NULL
coords$stations = c("st60881", "st9421","st9572","st9616","st9642")
coords$name = c("Mladost", "Druzhba", "Krasno selo", "Pavlovo", "Nadezhda")
coords$lat = c(42.655488, 42.6665199,42.6823739, 42.6697911, 42.7323862 )
coords$long = c(23.383271, 23.3979793, 23.2940636 ,23.2677177, 23.3106282 )
OfficStat = SofiaMap +
geom_point(data = coords, aes(x = long, y = lat), color = "white", size = 9) +
geom_point(data = coords, aes(x = long, y = lat), color = "red", size = 8)
#We need the topography data again to determine the relative distance of each cluster to the official stations
topo = read.csv("D:\\SAVE DELL\\D\\SU\\Monthly Challenge\\TOPO-DATA\\sofia_topo.csv", na.strings=c("",".","NA"," "), stringsAsFactors = FALSE)
colnames(topo)[1:2] = c("lat", "long")
for(i in 1:nrow(topo)){
topo$Clust.ID[i]=i
}
sp.topo = topo
coordinates(sp.topo) = ~long+lat
sp.coords = coords
coordinates(sp.coords)= ~long+lat
Distance = gDistance(spgeom1 = sp.coords, spgeom2 = sp.topo, byid=T)
Distance = as.data.frame(Distance)
for(i in 1:nrow(Distance)){
Distance$Clust.ID[i]=i
}
colnames(Distance)[1:5] = c("d_st60881", "d_st9421", "d_st9572", "d_st9616","d_st9642")
for(i in 1:length(t)){
t[[i]]=merge(t[[i]], Distance, by ="Clust.ID", all.x = T)
}
rm(i,Distance, sp.coords, sp.topo, topo)
#Check the correlation between the P1 cit measurements and the official ones
CorrMatrix = data.frame(matrix(NA, length(t), 7))
colnames(CorrMatrix) = c("Clust.ID", "Corr st9421", "Corr st60881", "Corr st9572", "Corr st9616", "Corr st9642", "Max Corr")
for(i in 1:dim(CorrMatrix)[1]){
CorrMatrix[i,1] = names(t[i])
CorrMatrix[i,2:6] = cor(na.omit(t[[i]][,c(9,11,13,15,17,18)]))[1:5, 6]
}
for(i in 1:dim(CorrMatrix)[1]){
CorrMatrix$`Max Corr`[i] = which.max(CorrMatrix[i,2:6])
CorrMatrix$Max[i] = max(CorrMatrix[i,2:6])
}
sum(CorrMatrix$Max>0.75)
CorrMatrixSub = CorrMatrix[CorrMatrix$Max > 0.75,]
rm(t_st60881, t_st9421, t_st9572, t_st9616, t_st9642)
#Check for stationarity of the dependent variable
DFT=rep(NA,dim(CorrMatrixSub)[1])
for(i in 1:dim(CorrMatrixSub)[1]){
a=adf.test(na.omit(t[[CorrMatrixSub$Clust.ID[i]]][,"P1int"]))
DFT[i] = a$p.value
rm(a)
}
plot(DFT) #all stationary at 10%, and all but one stationary at 1% => we can assume stationarity
rm(DFT,i)
closest_offic = as.data.frame(matrix(nrow = dim(CorrMatrixSub)[1]))
colnames(closest_offic)="Clust.ID"
closest_offic$Clust.ID = CorrMatrixSub$Clust.ID
closest_offic$max_corr = CorrMatrixSub$`Max Corr`
for(i in 1:dim(closest_offic)[1]){
for(j in 1:length(t)){
if(names(t)[j] == closest_offic$Clust.ID[i]){
closest_offic$min_dist[i] = which.min(t[[j]][1,19:23])
}
}
}
closest_offic$max_corr = ifelse(closest_offic$max_corr == 1, "st9421", closest_offic$max_corr)
closest_offic$max_corr = ifelse(closest_offic$max_corr == 2, "st60881", closest_offic$max_corr)
closest_offic$max_corr = ifelse(closest_offic$max_corr == 3, "st9572", closest_offic$max_corr)
closest_offic$max_corr = ifelse(closest_offic$max_corr == 4, "st9616", closest_offic$max_corr)
closest_offic$max_corr = ifelse(closest_offic$max_corr == 5, "st9642", closest_offic$max_corr)
closest_offic$min_dist = ifelse(closest_offic$min_dist == 1, "st60881", closest_offic$min_dist)
closest_offic$min_dist = ifelse(closest_offic$min_dist == 2, "st9421", closest_offic$min_dist)
closest_offic$min_dist = ifelse(closest_offic$min_dist == 3, "st9572", closest_offic$min_dist)
closest_offic$min_dist = ifelse(closest_offic$min_dist == 4, "st9616", closest_offic$min_dist)
closest_offic$min_dist = ifelse(closest_offic$min_dist == 5, "st9642", closest_offic$min_dist)
closest_offic$same = ifelse(closest_offic$max_corr == closest_offic$min_dist, T, F)
sum(closest_offic$same) #31 w/ same min dist and max corr
eq_max_corr=list()
eq_min_dist=list()
for(i in 1:(dim(closest_offic)[1])){
data = na.omit(t[[which(names(t)==closest_offic$Clust.ID[i])]][,c("P1int", paste("P1_",closest_offic$max_corr[i],"int",sep=""), paste("P1_",closest_offic$min_dist[i],"int",sep="") )])
colnames(data) = c("P1", "max_corr", "min_dist")
eq_max_corr[[i]] = lm(P1~max_corr, data=data)
eq_min_dist[[i]] = lm(P1~min_dist,data=data)
rm(data)
}
names(eq_max_corr) = closest_offic$Clust.ID
names(eq_min_dist)=closest_offic$Clust.ID
eqsm_max_corr=data.frame(matrix(NA,length(eq_max_corr),3))
colnames(eqsm_max_corr)=c("Estimate","t value","Pr(>|t|)")
for (i in 1:dim(eqsm_max_corr)[1]){
eqsm_max_corr[i,]=summary(eq_max_corr[[i]])$coefficients[2,c("Estimate","t value","Pr(>|t|)")]
}
eqsm_max_corr$Clust.ID = names(eq_max_corr)
windows()
par(mfrow= c(1,2))
plot(eqsm_max_corr$Clust.ID, eqsm_max_corr$Estimate)
plot(eqsm_max_corr$Clust.ID, eqsm_max_corr$`Pr(>|t|)`)
eqsm_min_dist=data.frame(matrix(NA,length(eq_min_dist),3))
colnames(eqsm_min_dist)=c("Estimate","t value","Pr(>|t|)")
for (i in 1:dim(eqsm_min_dist)[1]){
eqsm_min_dist[i,]=summary(eq_min_dist[[i]])$coefficients[2,c("Estimate","t value","Pr(>|t|)")]
}
eqsm_min_dist$Clust.ID = names(eq_min_dist)
windows()
par(mfrow= c(1,2))
plot(eqsm_min_dist$Clust.ID, eqsm_min_dist$Estimate)
plot(eqsm_min_dist$Clust.ID, eqsm_min_dist$`Pr(>|t|)`)
for(i in 1:length(t)){
t[[i]]$P1adj_maxcorr = NA
t[[i]]$P1adj_mindist = NA
}
for (i in 1:dim(closest_offic)[1]){
a=which(is.na(t[[which(names(t)==closest_offic$Clust.ID[i])]][,paste("P1_",closest_offic$max_corr[i],"int",sep="")])==F)[1]
b=tail(which(is.na(t[[which(names(t)==closest_offic$Clust.ID[i])]][,paste("P1_",closest_offic$max_corr[i],"int",sep="")])==F), n=1)
t[[which(names(t)==closest_offic$Clust.ID[i])]]$P1adj_maxcorr=NA
t[[which(names(t)==closest_offic$Clust.ID[i])]]$P1adj_maxcorr[c(a:b)]=eq_max_corr[[which(names(eq_max_corr)==closest_offic$Clust.ID[i])]]$fitted.values
rm(a,b)
}
for (i in 1:dim(closest_offic)[1]){
a=which(is.na(t[[which(names(t)==closest_offic$Clust.ID[i])]][,paste("P1_",closest_offic$min_dist[i],"int",sep="")])==F)[1]
b=tail(which(is.na(t[[which(names(t)==closest_offic$Clust.ID[i])]][,paste("P1_",closest_offic$min_dist[i],"int",sep="")])==F), n=1)
t[[which(names(t)==closest_offic$Clust.ID[i])]]$P1adj_mindist=NA
t[[which(names(t)==closest_offic$Clust.ID[i])]]$P1adj_mindist[c(a:b)]=eq_min_dist[[which(names(eq_min_dist)==closest_offic$Clust.ID[i])]]$fitted.values
rm(a,b)
}
for(i in 1:length(t)){
t[[i]]$P1adj = ifelse(t[[i]]$P1adj_maxcorr == t[[i]]$P1adj_mindist, t[[i]]$P1adj_maxcorr, (t[[i]]$P1adj_maxcorr+t[[i]]$P1adj_mindist)/2)
}
for(i in 1:length(t)){
for(j in 1:dim(t[[i]])[1]){
t[[i]]$avg[j] = mean(c(t[[i]][j,9],t[[i]][j,11],t[[i]][j,13],t[[i]][j,15],t[[i]][j,17]))
}
}
windows()
plot((t[[76]]$avg))
points(t[[76]]$P1adj,col="red")
cor(t[["135"]]$avg, t[["135"]]$P1adj, use = "complete.obs") #94% corelation
corr = cor(t[[i]][,c("P1adj", "temperature", "humidity","pressure")], use="complete.obs")
```

# Week 4¶

## Building a simple predictive model¶

Our team does not have any substantial experience in time series modelling, that is why, we chose to implement a very simple approach to the problem and apply the **snaive** function, which accounts for the *seasonality* in the data. The forecast for time $T+H$ equals:

where $m$ is the term which represents the seasonality. We conclude that indeed, seasonality is preset in the data from the ACF graphs, where a clear pattern which persists every 24 hours is present.

Our training set is all observations, but the last 24 hours of observations, which we try to predict with our model. The results for some of the stations look like this (the blue line represents the true values of the observations, whereas the red line represents the predictions based on the snaive method):

```
for(i in 1:length(t)){
t[[i]]$time_date = as.Date(t[[i]]$time)
}
windows()
ggplot(t[[76]], aes(time_date, P1int)) + geom_line() + scale_x_date('day') + ylab("Daily P1 conc") +
xlab("")
count_ma = ts(na.omit(t[[76]]$P1int), frequency=24)
decomp = stl(count_ma, s.window="periodic")
deseasonal_cnt <- seasadj(decomp)
plot(decomp)
for(i in 1:length(t)){
t[[i]][,c(8:17, 19:25)] = NULL
}
for(i in 1:length(t)){
t[[i]]$ts_var = ts(data=t[[i]]$P1adj, start = 1, freq = 24)
}
t_model = list()
for(i in 1:length(t)){
if(names(t)[i] %in% closest_offic$Clust.ID){
t_model[[i]] = t[[i]][c(which.min(t[[i]]$P1adj!=0):dim(t[[i]])[1]),]
names(t_model)[i] = names(t)[i]
}
}
'%ni%' <- Negate('%in%')
for(i in 1:length(t_model)){
if(names(t_model)[i] %ni% closest_offic$Clust.ID){
t_model[[i]] = NULL
i = i - 1
}
} #you have to run it 4 times for it to work wtf
for(i in 1:length(t_model)){
t_model[[i]]$ts_var = ts(t_model[[i]]$ts_var, start = 1, freq = 24)
}
#set a training set with the last 24 hours left for test
t_train = list()
for(i in 1:length(t_model)){
t_train[[i]] = t_model[[i]][c(1:(dim(t_model[[i]])[1]-24)),]
t_train[[i]]$ts_var = ts(t_train[[i]]$ts_var, start = 1, freq = 24)
}
names(t_train) = names(t_model)
t_fit_snaive = list()
for(i in 1:length(t_train)){
t_fit_snaive[[i]] = snaive(t_train[[i]]$ts_var, h=24, biasadj=TRUE)
}
for(i in 1:length(t_fit_snaive)){
t_fit_snaive[[i]] = as.data.frame(t_fit_snaive[[i]])
}
for(i in 1:length(t_model)){
t_model[[i]]$ts_var_pred[c((dim(t_model[[i]])[1]-24):dim(t_model[[i]])[1])] = t_fit_snaive[[i]]$`Point Forecast`
t_model[[i]]$ts_var_pred = ts(t_model[[i]]$ts_var_pred, freq = 24)
}
windows()
autoplot((t_model[[1]]$ts_var)) +
autolayer(t_model[[1]]$ts_var_pred)
```

It is visible that the model far from ideal, and there are probably some other variables, which could be included, such as temperature and pressure, as well as some lags of the dependent variable, in addition to the seasonality, but it gives a good idea of the minimum and maximum daily values as it is now. However, we suspect that the deviations from the truth will be large during the winter, as the values of P10 are much higher.

We will include more predictors and refine our model, as we gain more experience with time series modelling :)

## 15 thoughts on “Monthly Challenge – Sofia Air – Solution – New!Bees”

You should upload the code here in a jupyter notebook

Will do 🙂

Your assignments to peer review (and give feedback below the coresponding articles) for week 1 of the Monthly challenge are the following teams:

https://www.datasciencesociety.net/sofia-air-quality-eda-exploratory-data-analysis/

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-kiwi-team/

https://www.datasciencesociety.net/sofia-air-week-1/

Brilliant work 🙂

The idea of relative mapping of the co-ordinates of Sofia never occurred to me !

Also. why would it..since I’m just a beginner :p

Thank you!

Over all good work. I liked the way you presented your approach explanation methodically and commented on the thinking behind your decisions. Like @everyoneishigh, I liked how you plotted Sofia topo data and established a grid and then filtered out the locations.

From business side, I would have liked to see the assumptions you are making about this project. Often, I have found people arrive at a project with different assumptions. Writing yours clearly will give someone an opportunity to disagree or correct them.

Is the data bias that the official data are very reliable a correct one (even though it’s accepted as the eu norm?). Is there a process by which data quality and data gaps are vetted in those stations? Something for you to explore and inform as you look at the AirBG.info mission and more data.

You chose to leave out the data available via API. I also thought your choice of excluding data from 2017 that didn’t have data from 2018 an interesting one. It may be statistically insignificant, but just because citizen data recorders that reported previously didn’t report now, doesn’t mean they reported wrong. That said, given the small amount of this data, it may not be worth the effort and instead use the more recent data from the API.

Over all good stuff – thanks for helping me learn.

Some of the meta data you are looking for is available in a round about way if you look at the official meteorological data. These meta data files include hyperlinks to more granular definitions. T

Thank you for your feedback! We agree that it is very important to explain every step thoroughly and we will try to do so with our next update. Good luck!

Pandas you have done great work! You have very bright ideas and imagination and seems that you realize everything perfect. Especially the efforts on your graphs, such as “All stations in Bulgaria from the “unofficial” datasets”. We think that your further steps will even more blow us all, so keep up the good work 🙂

Banana team

Thank you 🙂

Your assignments to peer review (and give feedback below the coresponding articles) for week 2 of the Monthly challenge are the following teams:

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-tomunichandback/

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-sky_data/

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-banana/

It is interesting to understand how many observations were removed from the data set based on your criteria. Removed observations will interrupted the time series. How will you deal with this?

How many clusters do you choose to work with?

A map with the clusters would be nice.

Hi, thank you for your feedback 🙂 Removed observations disrupt the time series indeed, that’s why we are going to interpolate the missing observations in week 3. We are also going to change a part of our strategy and use the observations from the official stations instead of the meteorology dataset, but all of this is about to happen this upcoming week. So stay tuned 😉

Your assignments to peer review (and give feedback below the coresponding articles) for week 3 of the Monthly challenge are the following teams:

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-lone-fighter/

https://www.datasciencesociety.net/air-quality-week-1/

https://www.datasciencesociety.net/air-sofia-pollution-case/

Your assignments to peer review (and give feedback below the coresponding articles) for week 4 of the Monthly challenge are the following teams:

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-kung-fu-panda/

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-jeremy-desir-weber/

https://www.datasciencesociety.net/the-pumpkins/

We can see that the tasks are completed in a very structured and interesting way. The explanations are easy to understand what have you done. Great job! 🙂 -Team Yagoda