
The pumpkins


1. Business Understanding

Particulate matter (PM) is considered the air pollutant of greatest concern for the health of urban populations. Research has shown that exposure to PM can lead to more days lost from work or school, emergency room visits, hospital stays, and deaths. Both short- and long-term exposure to PM can worsen heart and lung disease. It can also cause premature death, particularly among people who are at higher risk from particle pollution.

Particulate matter can be produced by burning materials, road dust, construction, and agriculture. One of the most significant sources is residential wood burning: wood smoke may come from a fireplace or wood stove in a home, from open burning of vegetative matter, or from backyard burning. Other sources include forest fires, specific industries, furnaces, tobacco smoke, and all motor vehicles, especially those with diesel engines. The harmful effects of tobacco smoke are well known, and as a result many countries have placed restrictions on smoking in public places.

As real urban air quality gurus, we aim to predict the PM concentration in our capital city, Sofia. We trust that the forecast can be used by the business for prevention. The local authorities can lower particulate matter pollution by reducing the amount of particulate matter produced through smoke and by reducing vehicle emissions.

Inspired by that, in this study, we focus on refined modeling for predicting daily pollutant concentrations from historical air pollution data.

2. Data Understanding

As a starting point, we focus on understanding the provided datasets, which are as follows:

  • Official air quality measurements
  • Citizen science air quality measurements
  • Meteorology data
  • Topography data

Considering the specifics of the data and its topological, geometric, and geographic properties, we start our data understanding journey with the following preliminary analysis (see the draft_nb PDF, which contains the script and some graphs).


Outcome of WEEK 1: a data set with 417 stations situated in Sofia, named “citySofia”. It was obtained from the total of 1265 stations situated throughout Bulgaria by geo and time filtering.
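The exact filtering rules are not spelled out here, so the following is only a minimal sketch of what a geo and time filter of this kind might look like in base R. The column names (`station_id`, `lat`, `lng`, `time`), the bounding box values, and the rule "keep stations observed in both 2017 and 2018" are all assumptions made for illustration, not the actual WEEK 1 code.

```r
# Toy observations; column names and values are hypothetical.
obs <- data.frame(
  station_id = c("a", "a", "b", "b", "c", "d", "d"),
  lat  = c(42.70, 42.70, 42.65, 42.65, 43.20, 42.72, 42.72),
  lng  = c(23.32, 23.32, 23.40, 23.40, 27.90, 23.30, 23.30),
  time = as.POSIXct(c("2017-11-01", "2018-02-01", "2017-12-01", "2018-03-01",
                      "2018-01-15", "2018-01-10", "2018-04-10"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Illustrative bounding box around Sofia (assumed values).
bbox <- list(lat_min = 42.60, lat_max = 42.80, lng_min = 23.20, lng_max = 23.50)

# Geo filter: keep observations whose coordinates fall inside the box.
in_box <- obs$lat >= bbox$lat_min & obs$lat <= bbox$lat_max &
  obs$lng >= bbox$lng_min & obs$lng <= bbox$lng_max
obs_geo <- obs[in_box, ]

# Time filter (assumed rule): keep stations with observations in two distinct years.
yr <- format(obs_geo$time, "%Y")
years_per_station <- tapply(yr, obs_geo$station_id, function(y) length(unique(y)))
keep_ids <- names(years_per_station)[years_per_station >= 2]

city_sketch <- obs_geo[obs_geo$station_id %in% keep_ids, ]
print(unique(city_sketch$station_id))
```

In this toy data, station `c` lies outside the box and station `d` reports in one year only, so only `a` and `b` survive both filters.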



Step 1 of WEEK 2 consists of 7 sub-steps, labeled with the Latin letters A to G in the code below. Only sub-steps A through E were implemented and tested. For sub-steps F and G, only the scope of the envisaged activities is available; it is explained in the comments within the code below.
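As an illustration of sub-steps A and C, here is a minimal sketch of clustering stations by latitude and longitude with base R's `kmeans()`, which is what the object name `km.out` suggests was used. The toy coordinates, the number of clusters `k = 3`, and the names `stations`, `km_sketch`, and `centroids` are assumptions for the example only.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for reproducibility

# Toy coordinates standing in for the Sofia stations (hypothetical values).
stations <- data.frame(
  lat = c(rnorm(20, 42.66, 0.005), rnorm(20, 42.70, 0.005), rnorm(20, 42.74, 0.005)),
  lng = c(rnorm(20, 23.28, 0.005), rnorm(20, 23.33, 0.005), rnorm(20, 23.38, 0.005))
)

k <- 3  # assumed number of clusters; the value used in the study is not stated

# A: cluster the stations on their coordinates; nstart retries several random starts.
km_sketch <- kmeans(stations[, c("lat", "lng")], centers = k, nstart = 25)

# C: record which cluster each station belongs to.
stations$cluster <- km_sketch$cluster

# Centroid coordinates, one row per cluster.
centroids <- data.frame(km_sketch$centers)
print(centroids)
```

Note that k-means on raw degrees treats latitude and longitude as plain Euclidean coordinates, which is only an approximation of ground distance; over an area the size of a city this is usually acceptable for grouping nearby stations.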

#Step1: Decide on the final list of geo-units that are subject to predictive analysis====
#A-Clustering stations by their latitude and longitude



#B-cluster plot
library(factoextra) # provides fviz_cluster()
fviz_cluster(km.out, data = Sfstcl)
kcent <- data.frame(km.out$centers) # coordinates of the cluster centroids

#C-add a column to the 417-station dataset showing which cluster each station belongs to

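#D-calculate the distances between the cluster centroids and the dta_topo stations
library(fields) # assumption: rdist() comes from the fields package
distmat <- data.frame(rdist(kcent[, -c(3)], dta_topo)) # one row per cluster, one column per dta_topo station
#E-min distance between each cluster centroid and the closest station from dta_topo
mindist <- data.frame(apply(distmat, 1, FUN = min))
colnames(mindist) <- "mindist"
#F-identify which point from dta_topo the min distance from the previous step belongs to
df2 <- cbind(mindist, distmat) # assumption: df2 is not defined in this excerpt; column 1 is the min distance, the rest are the per-station distances
is.num <- sapply(df2, is.numeric)
df2[is.num] <- lapply(df2[is.num], round, 8)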

#-a loop should stand here that compares, for each row of df2, the value in the first column with the values in all the remaining columns of the same row
#-where the values are equal, the index of the column holding the matching value (among the columns after the first one) is written to a newly created vector
#-the same exercise should be repeated for all rows of df2
#-the values in the vector represent the index number of the observation station from the dataset dta_topo (see the code for WEEK 1 for dataset dta_topo) that is closest to the respective cluster, whereas the cluster index number equals the row number in df2
#-the vector created this way is used in the next step of the code

#G-cbind the information from the previous step to the dataset citySofia1
#-the vector from the previous sub-step (F) is bound with the elev information from the dataset dta_topo
#-the dataframe created by the above line is added to the dataset citySofia1, and with that Step 1 of WEEK 2 is implemented
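Sub-steps F and G are only described in the comments above, so the following is a hedged sketch of one way they could be realized, not the team's implementation. Instead of the envisaged loop that rounds and compares values for equality, it uses `which.min()` on each row of the distance matrix, which returns the index of the nearest topography point directly. All object names (`kcent_sketch`, `topo_sketch`, `stations_sketch`, `pairwise_dist`) and all values are hypothetical, and `merge()` is used in place of `cbind()` so that rows are matched by cluster number rather than by position.

```r
# Toy inputs (hypothetical values): 2 cluster centroids and 3 topography points.
kcent_sketch <- data.frame(lat = c(42.66, 42.74), lng = c(23.28, 23.38))
topo_sketch  <- data.frame(lat  = c(42.75, 42.65, 42.70),
                           lng  = c(23.39, 23.27, 23.33),
                           elev = c(540, 610, 575))

# Euclidean distance from every centroid (rows) to every topo point (columns).
pairwise_dist <- function(a, b) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  sqrt(outer(a[, 1], b[, 1], "-")^2 + outer(a[, 2], b[, 2], "-")^2)
}
d <- pairwise_dist(kcent_sketch, topo_sketch[, c("lat", "lng")])

# F: for each cluster (row), the index of the closest topo point (column).
nearest <- unname(apply(d, 1, which.min))

# G: attach the elevation of that closest point to each cluster ...
cluster_elev <- data.frame(
  cluster    = seq_len(nrow(kcent_sketch)),
  topo_index = nearest,
  mindist    = d[cbind(seq_along(nearest), nearest)],
  elev       = topo_sketch$elev[nearest]
)

# ... and join it onto the station table by cluster number.
stations_sketch <- data.frame(station = c("s1", "s2", "s3"),
                              cluster = c(1, 2, 1),
                              stringsAsFactors = FALSE)
stations_sketch <- merge(stations_sketch, cluster_elev, by = "cluster")
print(stations_sketch)
```

Using `which.min()` avoids floating-point equality checks altogether, so the rounding to 8 decimals is not needed in this variant. If two topography points are exactly equidistant from a centroid, `which.min()` returns the first one.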

WEEK 2 – some results from Step 1 of Week 2



10 thoughts on “The pumpkins”

    1. 1

      We decided to base our solution on R, RStudio and R Markdown. Hence, we can provide some “decent” html, pdf or Word output, but we still haven’t found a way to import the nice-looking standalone html file into the platform. That is why we opted for the pdf format as a substitute for the time being. Any help with importing the html file is appreciated!
      P.S. We know that Jupyter Notebooks can handle R, but it’s too much of a hassle compared to just sticking with RStudio.

      1. 0

        There is an easy way to knit() from RStudio to a Jupyter notebook... but even if you do not want to go IPython, at least publish the code here as formatted text. As you can see, nobody gives you feedback as it is.

  1. 0

    Amazing article! It not only addresses the specific case, but also presents a constructive way of approaching exploratory analysis as a whole, and it could be used as an excellent point of reference for other projects as well.
    As a recommendation, adding visualizations after the “basic stats” section would allow easier interpretation of the findings.
    Overall, a very powerful, educational and inspirational article, keep doing your work with passion!
    Great job, Pumpkins, we are looking forward to your next steps in the Monthly Challenge.
    Team Kiwi (:

  2. 0

    My admiration for your good performance! Your solution published in the attached file really impressed me.
    I will follow your article with great interest.
    Good luck :0)
    The Dirty Minds Team

  3. 0

    Hello pumpkins, congratulations on the great work so far. It is nice to see the alternative clustering method through k-means tried, and I particularly appreciated the map provided in the doc file. My concerns are the following: 1/ How did you end up with 417 citizen stations out of 1265? Personally, I get 372 common stations in 2017-2018 out of 383 in 2017 and 1253 in 2018. 2/ Could you please add visualizations, the data set header and/or some explanation around each step of your code, as it is hard to follow your process conveniently? Best
