I. Week 1
At this stage of the process, we need to get familiar with the data, identify missing observations, fix any inconsistencies we find, and prepare our data for further modelling.
1. Importing the datasets
The first step is to install the packages, load the libraries we'll need for the analysis, and then import the data: three datasets, AirQ17 and AirQ18 (citizen air quality measurements) and Sofia_Topo (topographic data for Sofia).
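The import code itself is not reproduced in this write-up; below is a minimal sketch, assuming CSV files read with data.table's fread() (the file names are placeholders):

library(data.table)   # fread() for fast CSV import, setcolorder() later on
library(dplyr)        # piping and per-station summaries
library(ggplot2)      # static maps
library(geohashTools) # gh_decode() for converting geohashes to coordinates
library(leaflet)      # interactive maps

AirQ17     = fread("air_quality_2017.csv")  # placeholder file name
AirQ18     = fread("air_quality_2018.csv")  # placeholder file name
Sofia_Topo = fread("sofia_topo.csv")        # placeholder file name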
2. Inspect the data structure
As a second step we inspect the classes of the elements and see that everything is in order, except the class of the time data. In order to continue our work we have to convert it to “POSIXct”:
# converting $time to class "POSIXct" (the exact input format may need to be specified)
Data_bg_2017$time = as.POSIXct(Data_bg_2017$time, tz = "UTC")
class(Data_bg_2017$time)  # "POSIXct" "POSIXt"
3. Identify unique stations and merge the two datasets on citizen science air quality measurements
By counting the unique geohashes in the two air quality datasets we find that there are 383 unique stations for AirQ17 and 1254 for AirQ18. Since there is no point in evaluating stations that were active in 2017 but not in 2018, we identify and drop them:
diff = setdiff(unique(Lg17$geohash), unique(Lg18$geohash))  # stations present in 2017 only
Lg17 = Lg17[!Lg17$geohash %in% diff]
We find 11 such stations and exclude them from our analysis. As a final step of this part we merge the two datasets for 2017 and 2018 by their common columns and measurements:
# Binding the rows of the two datasets
Lg = rbind(Lg17, Lg18)
4. Summarize availability of merged data by stations
The fourth step is to summarize the merged dataset. First, we check whether there are any NAs or empty geohashes. There are no NAs, but there are 4 empty geohashes, which we remove before summarizing the observations.
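The check itself is not shown in the write-up; a minimal sketch, assuming the merged table is the Lg object created above:

sum(is.na(Lg))            # returns 0, i.e. no missing values
sum(Lg$geohash == "")     # returns 4, the empty geohashes
Lg = Lg[Lg$geohash != ""] # drop the empty geohashes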
# per-station summary: observation count, first/last observation, coverage in days
sumLg = Lg %>%
  group_by(geohash) %>%
  summarise(obs = n(), MinTime = min(time), MaxTime = max(time), days = MaxTime - MinTime)
This creates a new summary table for the merged dataset, showing the number of observations per station, the earliest and latest observation, and the number of days between the first and the last observation.
We then add this information back to our dataset and identify the locations with fewer than 250 observations or an observation window shorter than 14 days. The value of the information from these locations is too low, so we drop them.
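A sketch of this filter, reusing the summary columns defined above (goodStations is an illustrative name):

# keep only stations with at least 250 observations spanning at least 14 days
goodStations = sumLg %>% filter(obs >= 250, as.numeric(days, units = "days") >= 14)
Lg = Lg[Lg$geohash %in% goodStations$geohash]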
5. Summarizing data on the map
In this step we want to plot the stations, first across Bulgaria and then in Sofia. Our first attempt at the visualization wasn't successful and looked like this:
BG = map_data(map = "world", region = "Bulgaria")
BGmap = ggplot() +
  geom_polygon(data = BG, aes(x = long, y = lat, group = group)) +
  coord_fixed(1.1)

# decode the stations' geohashes into coordinates (and their error margins)
Stations = data.frame(geohash = unique(Lg$geohash), stringsAsFactors = FALSE)
Stations = transform(Stations, coord = gh_decode(Stations$geohash, include_delta = TRUE))
colnames(Stations)[2:5] = c("lat", "long", "lat_err", "lng_err")

# plot the stations over the country map, then add the Sofia topography points
StationsMapBG = BGmap +
  geom_point(data = Stations, aes(x = long, y = lat), color = "white", size = 2) +
  geom_point(data = Stations, aes(x = long, y = lat), color = "yellow", size = 1)
SofiaMapBG = StationsMapBG +
  geom_point(data = SofiaTopo, aes(x = long, y = lat), color = "blue", size = 2) +
  geom_point(data = SofiaTopo, aes(x = long, y = lat), color = "red", size = 1)
Here we have to give credit to the Kiwi Team: by sharing their code on GitHub, they helped us find the leaflet library and use it to visualize our data, drawing a much nicer map with the following code:
leaflet(data = Sofia_stations) %>%
  addTiles() %>%  # base map tiles
  addCircleMarkers(weight = 0.1, color = "red") %>%
  addRectangles(lat1 = (min(Sofia_stations$lat) - 0.01), lng1 = (max(Sofia_stations$lng) - 0.18),
                lat2 = (min(Sofia_stations$lat) + 0.13), lng2 = (min(Sofia_stations$lng) + 0.18),
                fillColor = "transparent")
6. Filter the stations by space
In this last step we filter the stations, first by space and then by time, by subsetting the variables and tables we created in the previous steps.
SofiaMap = ggplot(data = SofiaTopo) +
  geom_point(aes(x = long, y = lat), color = "black", size = 5) +
  geom_point(aes(x = long, y = lat), color = "blue", size = 3) +
  coord_fixed(1.1)
SofiaStationsCoords = Stations[(Stations$lat + Stations$lat_err) > min(SofiaTopo$lat) &
                               (Stations$lat - Stations$lat_err) < max(SofiaTopo$lat) &
                               (Stations$long + Stations$lng_err) > min(SofiaTopo$long) &
                               (Stations$long - Stations$lng_err) < max(SofiaTopo$long), ]
SofiaStationsMap = SofiaMap +
  geom_point(data = SofiaStationsCoords, aes(x = long, y = lat), color = "black", size = 5) +
  geom_point(data = SofiaStationsCoords, aes(x = long, y = lat), color = "yellow", size = 3)
citSofia = Lg[Lg$geohash %in% SofiaStationsCoords$geohash,]
The code we had published here turned out to be very complicated, so we rewrote it in a simpler way, again with the help of the Kiwi Team's code.
II. Week 2
At this stage our main purpose is to define the final set of properly organized geo-units that will be subject to further predictive analysis. We proceeded through the suggested steps as follows:
1. Final list of geo-units
This step begins with clustering the 1188 stations in Sofia (the outcome of the Week 1 analysis, stored in Stations_decoded) by their latitude and longitude into 196 groups and defining these groups' centroids:
# clustering stations in Sofia by their latitude and longitude into 196 groups
# discovering centroids and merging clusters to stations (a sketch follows below)
setcolorder(Stations_upd, c('geohash','cluster_ID', 'obs','MinTime','MaxTime','days','lng','lat','lat_error','lng_error'))
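The clustering code behind the two comments above is not included in the write-up; below is a minimal sketch of one way to produce Stations_upd before the setcolorder() call, assuming kmeans (the text specifies a fixed number of groups, 196, together with their centroids, which kmeans yields directly) and the lng/lat column names listed above:

library(data.table)

set.seed(1)  # kmeans is stochastic; fix the seed for reproducibility
km = kmeans(Stations_decoded[, c("lng", "lat")], centers = 196)

Stations_upd = as.data.table(Stations_decoded)
Stations_upd[, cluster_ID := km$cluster]                # each station's cluster label
centroids = data.table(cluster_ID = 1:196, km$centers)  # one centroid (lng, lat) per cluster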
After these manipulations we obtain an updated dataset, Stations_upd, which looks as follows:
Then we need to find the closest point from the Sofia topography data (the default dataset, which in our case is called “Sofia_Topo”) to the clusters' centroids. We did this with the help of the “sp” and “geosphere” libraries.
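The matching code is likewise not reproduced here; a sketch using geosphere alone (distm() with distHaversine() builds the centroid-to-topography distance matrix; the centroids table comes from the clustering sketch above, the long/lat column names in Sofia_Topo are assumptions, and nearest_topo_point is an illustrative column name):

library(geosphere)

# pairwise distances in meters between cluster centroids and topography points;
# distm() expects coordinates in (longitude, latitude) order
dists = distm(centroids[, c("lng", "lat")],
              Sofia_Topo[, c("long", "lat")],
              fun = distHaversine)

# for each centroid, the index of the nearest Sofia_Topo point
centroids$nearest_topo_point = apply(dists, 1, which.min)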