Monthly Challenge – Sofia Air – Solution – [Dirty Minds]



We awaited the opening of Datathon 2018 with bated breath.
The Telelink case is a real challenge: to tackle one of the most important problems of modern society in urban areas.
Foreknowledge of areas with polluted air would allow many people to take preventive measures and protect themselves from its negative effects.

Business Understanding

Air pollution is an often-discussed problem in Sofia. The main causes of pollution are not only transport, but also excessive construction and the reduction of green areas. In recent years, special attention has also been paid to sources of fine particulate matter. Apart from the factories in the industrial areas of the city, one of the main contributors is domestic heating with solid fuels.
Of course, the climate and the topography of the city also have a huge impact. Air pollution is measured in terms of particulate matter (PM); PM10 refers to particles smaller than 10 micrometers in diameter, with concentrations reported per m3. We have data from national measurement stations. The purpose of this research is to predict areas with high concentrations of fine particulate matter. This information would help citizens take measures to limit sources of pollution and to protect themselves.

Data Understanding

Libraries used in R for the Week 1 assignment:
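The library list itself appears only as an image in the original post; judging from the functions used later (ymd_hms(), filter(), group_by(), summarise()), a plausible minimal set would be:

```r
# Minimal set of libraries inferred from the functions used below;
# the original list was shown only as a screenshot
library(lubridate)  # ymd_hms() for parsing timestamp columns
library(dplyr)      # filter(), group_by(), summarise()
```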

As a first step, both datasets, for Y2017 and Y2018, are imported. It is important to mention that strings are imported as character vectors, not as factors (the default). Empty strings are likewise imported as missing values:

d2017 <- read.csv("E:\\Business Analytics\\ 2017\\data_bg_2017.csv", stringsAsFactors = FALSE, na.strings = c(""))

d2018 <- read.csv("E:\\Business Analytics\\ 2018\\data_bg_2018.csv", stringsAsFactors = FALSE, na.strings = c(""))

Checking and cleaning the data for missing values (NA)
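The NA check is shown only as a screenshot in the post; a minimal sketch of such a check, using a small made-up data frame in place of the real CSVs, could look like this:

```r
# Small made-up sample standing in for the real data_bg_2017.csv
d2017 <- data.frame(geohash = c("sx8dfr", NA, "sx8dfs"),
                    P1      = c(25.3, 17.8, NA),
                    stringsAsFactors = FALSE)

colSums(is.na(d2017))                    # NA count per column
d2017 <- d2017[complete.cases(d2017), ] # keep only complete rows
nrow(d2017)                              # one complete row remains
```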

The next step is to inspect the data structure and correct any inconsistencies:

It can be observed from the extracts above that the time columns should not be classified as "character". Applying the function ymd_hms() from the lubridate library fixes this issue.
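A short illustration of the conversion (the column name `time` is an assumption, as the actual structure is shown only as an image):

```r
library(lubridate)

# Character timestamps as they arrive from read.csv()
d <- data.frame(time = c("2017-09-06 14:00:00", "2017-09-06 15:00:00"),
                stringsAsFactors = FALSE)

d$time <- ymd_hms(d$time)  # now POSIXct instead of character
class(d$time)
```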

After fixing the classes of all variables, the next step is to obtain the unique stations (geohashes) for both the Y2017 and Y2018 data sets. Then, geo stations that have observations in Y2017 but not in Y2018 are eliminated. The functions unique() and setdiff() helped to solve this task. It was observed that 11 (eleven) geo stations present in Y2017 are absent from the more recent Y2018 data set. Those 11 geo stations comprise 7 834 observations, which are cleaned out of the data set as they carry no up-to-date information. All these stations are excluded with a filter rule:
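The filter rule itself is shown as an image in the post; a sketch of the same logic with unique(), setdiff() and dplyr::filter(), on tiny made-up vectors, might be:

```r
library(dplyr)

# Made-up geohashes standing in for the real station columns
st2017 <- c("sx8dfr", "sx8dfs", "sx8dg0", "sx8dfr")
st2018 <- c("sx8dfr", "sx8dg0")

gone <- setdiff(unique(st2017), unique(st2018))  # stations only in Y2017

d2017 <- data.frame(geohash = c("sx8dfr", "sx8dfs", "sx8dg0"),
                    P1      = c(25, 31, 18),
                    stringsAsFactors = FALSE)

d2017_clean <- filter(d2017, !(geohash %in% gone))  # drop vanished stations
```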

Once the data cleaning for Y2017 and Y2018 is performed, the two data sets are merged. Additionally, a sanity check is performed for any missing geo stations; as a result, no missing objects are found.
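A sketch of the merge and the sanity check, again on made-up frames (rbind() is an assumption, since the post does not show the actual call):

```r
d2017 <- data.frame(geohash = c("sx8dfr", "sx8dg0"), P1 = c(25, 18))
d2018 <- data.frame(geohash = c("sx8dfr", "sx8dg0"), P1 = c(30, 21))

all_data <- rbind(d2017, d2018)  # stack the two cleaned years

# Sanity check: every station of either year must appear in the merge
missing <- setdiff(unique(c(d2017$geohash, d2018$geohash)),
                   unique(all_data$geohash))
length(missing)  # 0 => no missing geo stations
```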

So the data is merged and ready for tweaking. Below, the geo stations are combined, grouped and summarised by day and by number of observations. The data has been prepared with the functions group_by() and summarise().
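The grouping step could be sketched like this (the column names `geohash` and `time` are assumptions):

```r
library(dplyr)
library(lubridate)

obs <- data.frame(
  geohash = c("sx8dfr", "sx8dfr", "sx8dg0"),
  time    = ymd_hms(c("2017-09-06 10:00:00",
                      "2017-09-06 11:00:00",
                      "2017-09-06 10:00:00")),
  stringsAsFactors = FALSE
)

# Number of observations per station per day
daily <- obs %>%
  group_by(geohash, day = as.Date(time)) %>%
  summarise(n_obs = n(), .groups = "drop")
```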

