Air Quality in Sofia
Air pollution in Sofia, Bulgaria is an ongoing problem: pollution exceeded normal levels 70 times last winter. The purpose of this project is to forecast 24 hours of particulate matter (PM) concentrations around the city. The problem and the data are described in more detail in the case description (https://www.datasciencesociety.net/the-telelink-case-one-step-closer-to-a-better-air-quality-and-city/).
The first step is to load the data and look for data quality issues. For now, I am focusing on the citizen science and topography data, and I use the tidyverse package to make this easier.
#### Air quality challenge ####
# Week 1
library(tidyverse)

# importing data
setwd("/Users/macbook/Documents/Data Science/Air Quality Monitoring Challenge/sofia-air/air-sofia")
data_bg_2017 <- read_csv("Air Tube/data_bg_2017.csv.gz")
data_bg_2018 <- read_csv("Air Tube/data_bg_2018.csv.gz")
sofia_topo <- read_csv("TOPO-DATA/sofia_topo.csv")
Once the data are imported, I look at their structure. There are many more unique stations in 2018 than in 2017, but almost all stations measured in 2017 continued into 2018. I remove the stations without ongoing measurements from my dataset, along with any data points missing location data.
# looking at data structure
head(data_bg_2018)
head(sofia_topo)
sapply(data_bg_2017, class)

# finding unique stations (geohashes)
length(unique(data_bg_2017$geohash))
length(unique(data_bg_2018$geohash))

# locations measured in 2017, but not 2018
unmeasured_18 <- setdiff(data_bg_2017$geohash, data_bg_2018$geohash)

# removing these locations
data_2017_new <- filter(data_bg_2017, !geohash %in% unmeasured_18)
setdiff(data_2017_new$geohash, data_bg_2018$geohash)

# combining both years and dropping rows without location data
full_data <- bind_rows(data_2017_new, data_bg_2018)
clean_data <- filter(full_data, !is.na(geohash))
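As a toy illustration of the station-matching logic above, here is the same `setdiff`/`%in%` pattern on invented station IDs (these geohashes are made up for illustration, not taken from the real data):

```r
# Made-up station IDs for two years
stations_2017 <- c("sx8dfr", "sx8def", "sx8qqq")
stations_2018 <- c("sx8dfr", "sx8def", "sx8xyz", "sx8nnn")

# Stations measured in 2017 but discontinued in 2018
discontinued <- setdiff(stations_2017, stations_2018)

# Keep only 2017 stations that continue into 2018
kept_2017 <- stations_2017[!stations_2017 %in% discontinued]

discontinued  # "sx8qqq"
kept_2017     # "sx8dfr" "sx8def"
```

Note that `setdiff` is asymmetric: it returns elements of the first vector that are absent from the second, which is exactly what we want when pruning 2017-only stations.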
Now I am interested in learning more about which stations will actually be useful for predicting future PM levels, so I create some summary statistics.
# creating a summary table
by_station <- group_by(clean_data, geohash)
station_summary <- by_station %>%
  summarise(obs = n(),
            tmin = min(time),
            tmax = max(time),
            days = tmax - tmin) %>%
  arrange(days)
head(station_summary)
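The same per-station time span can be computed in base R without dplyr; the tiny data frame below is a toy stand-in for `clean_data` (the values are invented):

```r
# Toy observations for two stations
toy <- data.frame(
  geohash = c("aaa", "aaa", "aaa", "bbb"),
  time = as.POSIXct(c("2018-01-01 00:00", "2018-01-02 00:00",
                      "2018-01-04 12:00", "2018-06-01 00:00"), tz = "UTC")
)

# Per-station span in days: base-R analogue of the dplyr summary above
spans <- aggregate(time ~ geohash, data = toy,
                   FUN = function(t) as.numeric(difftime(max(t), min(t),
                                                         units = "days")))
spans  # "aaa" spans 3.5 days, "bbb" spans 0 days
```

The `days` column in the dplyr version is a `difftime`; converting it with `as.numeric(..., units = "days")` as above avoids surprises when comparing against plain numbers later.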
| geohash | obs | tmin | tmax | days |
|---|---|---|---|---|
| sx8dc3dv5bs | 1 | 2018-06-12 16:00:00 | 2018-06-12 16:00:00 | 0.000000 days |
| sx8dc74gegu | 1 | 2018-07-26 18:00:00 | 2018-07-26 18:00:00 | 0.000000 days |
| sx6ntd094dt | 32 | 2018-03-25 04:00:00 | 2018-03-26 11:00:00 | 1.291667 days |
| sx86yynqthy | 34 | 2018-03-25 02:00:00 | 2018-03-26 11:00:00 | 1.375000 days |
| sx8dmum9hkd | 41 | 2018-08-10 21:00:00 | 2018-08-12 21:00:00 | 2.000000 days |
| sx8em0tftye | 58 | 2018-03-24 02:00:00 | 2018-03-26 11:00:00 | 2.375000 days |
Great, almost all of my stations have more than 1 observation! I'll remove those that don't in a minute. In the meantime, let's convert the geohashes to lat/long for plotting purposes, and check out where they are on a map of Bulgaria.
# convert geohashes into lat/long
library(geohash)
station_latlong <- bind_cols(station_summary, gh_decode(station_summary$geohash))

# look at locations on a map
library(ggmap)
library(rworldmap)
world_map <- getMap()
bulgaria <- map_data(world_map, region = "Bulgaria")
plot1 <- ggplot() +
  geom_polygon(data = bulgaria, aes(x = long, y = lat, group = group),
               fill = NA, col = "black") +
  coord_fixed(1.3) +
  geom_point(data = station_latlong, aes(x = lng, y = lat))

# plotting topographic data (proxy for location of Sofia) on top of other locations
plot1 + geom_point(data = sofia_topo, aes(x = Lon, y = Lat), col = "red")
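Under the hood, a geohash is just a base-32 string that repeatedly halves the longitude and latitude intervals, alternating between the two, one bit at a time. Here is a minimal decoder in base R, assuming the standard geohash alphabet; it is only a sketch of what `gh_decode` does for us (the package also reports error bounds):

```r
# Minimal geohash decoder (illustrative sketch, not the package implementation)
gh_decode_sketch <- function(gh) {
  base32 <- strsplit("0123456789bcdefghjkmnpqrstuvwxyz", "")[[1]]
  lat <- c(-90, 90)
  lng <- c(-180, 180)
  is_lng <- TRUE  # bits alternate, starting with longitude
  for (ch in strsplit(gh, "")[[1]]) {
    bits <- match(ch, base32) - 1          # 5-bit value of this character
    for (b in c(16, 8, 4, 2, 1)) {
      upper_half <- bitwAnd(bits, b) > 0   # 1 = keep upper half of interval
      if (is_lng) {
        lng <- if (upper_half) c(mean(lng), lng[2]) else c(lng[1], mean(lng))
      } else {
        lat <- if (upper_half) c(mean(lat), lat[2]) else c(lat[1], mean(lat))
      }
      is_lng <- !is_lng
    }
  }
  c(lat = mean(lat), lng = mean(lng))      # centre of the final cell
}

gh_decode_sketch("sx8d")  # centre of the "sx8d" cell: lat ~42.63, lng ~23.38
```

Each added character narrows the cell by another 5 bits, which is why the 11-character hashes in our data pin stations down to a few metres.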
Notice that many of the citizen science stations are outside of Sofia (the red dots on the map are the points from our topographic data, which we use as a proxy for the location of Sofia). So the next step is to remove the data that are not in the city, since the city is our area of interest.
# restrict citizen science data to Sofia using the extent of the topography grid
sofia_data <- filter(station_latlong,
                     between(lat, min(sofia_topo$Lat), max(sofia_topo$Lat)) &
                     between(lng, min(sofia_topo$Lon), max(sofia_topo$Lon)))
ggplot() +
  geom_polygon(data = bulgaria, aes(x = long, y = lat, group = group),
               fill = NA, col = "black") +
  coord_fixed(1.3) +
  geom_point(data = sofia_data, aes(x = lng, y = lat), col = "blue")
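The bounding-box idea is simple enough to show with made-up coordinates (none of these values come from the actual dataset): a point survives only if both its latitude and longitude fall within the topography grid's range.

```r
# Pretend extent of the topography grid (invented values)
topo_lat <- c(42.55, 42.75)
topo_lng <- c(23.20, 23.45)

# Three made-up stations; only the first lies inside the box
pts <- data.frame(
  station = c("in_sofia", "plovdiv", "varna"),
  lat = c(42.70, 42.14, 43.21),
  lng = c(23.32, 24.75, 27.92)
)

inside <- pts$lat >= topo_lat[1] & pts$lat <= topo_lat[2] &
          pts$lng >= topo_lng[1] & pts$lng <= topo_lng[2]
pts[inside, ]  # only "in_sofia" survives
```

This is exactly what the `between(...) & between(...)` filter above does; a rectangle is a crude proxy for the city boundary, but it is good enough to drop far-away stations.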
That looks better. Finally, let's remove any stations with less than a week's worth of data, since we'll want a solid history in order to predict future values.
# restrict data to only those stations with at least a week of data
long_sofia_data <- filter(sofia_data, days >= 7)