We awaited the opening of Datathon 2018 with bated breath.
The Telelink case is a genuine challenge: solving one of the most important problems facing modern society in urban areas.
Foreknowledge of areas with polluted air would allow many people to take preventive measures and protect themselves from negative effects.
Air pollution is a frequently discussed problem in Sofia. The main causes are not only transport, but also excessive construction and the shrinking of green areas. In recent years, special attention has also been paid to sources of fine particulate matter. Apart from the factories in the city's industrial areas, one of the main contributors to pollution is domestic heating with solid fuels.
Of course, the climate and the topography of the city also have a huge impact. Air pollution is measured in particulate matter (PM); PM10 refers to particles smaller than 10 micrometers in diameter, with concentrations reported per m3. We have data from national measurement stations. The purpose of this research is to predict areas with high concentrations of fine particulate matter. This information would help citizens take measures to limit sources of pollution and to protect themselves.
Libraries used in R for the Week 1 assignment:
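The library list itself is not reproduced here; judging from the functions referenced later in the text (ymd_hms(), filter(), group_by(), summarise()), a plausible setup block would be:

```r
# Packages inferred from the functions used later in this write-up
library(lubridate)  # ymd_hms() for parsing timestamp columns
library(dplyr)      # filter(), group_by(), summarise() for data wrangling
```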
As a first step, both datasets for Y2017 and Y2018 are imported. It is important to mention that strings are imported as character vectors rather than factors (the default), and empty strings are imported as missing values (NA):
d2017 <- read.csv("E:\\Business Analytics\\ 2017\\data_bg_2017.csv", stringsAsFactors = FALSE, na.strings = c(""))
d2018 <- read.csv("E:\\Business Analytics\\ 2018\\data_bg_2018.csv", stringsAsFactors = FALSE, na.strings = c(""))
Checking and cleaning the data for missing values (NA)
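A quick way to run this check is to count the NAs per column; this is a sketch, as the actual column names in the case files may differ:

```r
# Count missing values in every column of each data set
colSums(is.na(d2017))
colSums(is.na(d2018))

# Overall share of missing cells in the 2017 data
mean(is.na(d2017))
```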
The next step is to inspect the data structure and correct any inconsistencies:
As the extracts above show, the time columns should not be classified as "character". Applying the function ymd_hms() from the lubridate package fixes this.
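Assuming the timestamp column is called `time` (the actual name in the case data may differ), the conversion looks like:

```r
library(lubridate)

# str() showed the time column imported as character;
# ymd_hms() parses it into POSIXct date-times
d2017$time <- ymd_hms(d2017$time)
d2018$time <- ymd_hms(d2018$time)

str(d2017$time)  # now POSIXct instead of character
```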
After fixing the variables' classes, the next step is to obtain the unique stations (geohash) for both the Y2017 and Y2018 data sets, and then to eliminate the geo stations that have observations in Y2017 but not in Y2018. The functions unique() and setdiff() solve this. It turns out that 11 (eleven) geo stations present in Y2017 are missing from the more recent Y2018 data set. Those 11 geo stations account for 7 834 observations, which are cleaned out of the data set as they carry no up-to-date information. All those stations are excluded with a filter rule:
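A sketch of this step, assuming the station identifier column is `geohash`:

```r
library(dplyr)

# Unique stations per year
st2017 <- unique(d2017$geohash)
st2018 <- unique(d2018$geohash)

# Stations present in 2017 but absent in 2018
# (the text reports 11 such stations, 7 834 observations)
gone <- setdiff(st2017, st2018)

# Exclude those stations from the 2017 data
d2017 <- d2017 %>% filter(!(geohash %in% gone))
```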
Once the data cleaning for Y2017 and Y2018 is performed, both data sets are merged. Additionally, a sanity check for missing geo stations is carried out; no missing objects remain.
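The merge and the sanity check can be sketched as follows (the object and column names are assumptions, not taken from the case files):

```r
library(dplyr)

# Stack the two cleaned data sets into one
d_all <- bind_rows(d2017, d2018)

# Sanity check: after the cleaning above, every remaining
# 2017 station should also appear in 2018 (expect length 0)
length(setdiff(unique(d2017$geohash), unique(d2018$geohash)))
```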
So, the data is merged and ready for tweaking. Below, the geo stations are combined, grouped, and summarized by day and number of observations, prepared with the functions group_by() and summarise().
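Grouping by station and day can be sketched like this, assuming the merged data frame is `d_all` with columns `geohash` and `time`:

```r
library(dplyr)
library(lubridate)

# Number of observations per geo station per day
daily <- d_all %>%
  group_by(geohash, day = as_date(time)) %>%
  summarise(n_obs = n(), .groups = "drop")

head(daily)
```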