1. Business Understanding
Particulate matter (PM) is considered the air pollutant of greatest concern for the health of the urban population. Research has shown that exposure to PM can lead to more days lost from work or school, emergency room visits, hospital stays, and deaths. Both short- and long-term exposure to PM can worsen heart and lung disease. It can also cause premature death, particularly among people at higher risk of being affected by particle pollution.
Particulate matter can be produced by burning materials, road dust, construction, and agriculture. One of the most significant sources is residential wood burning. Wood smoke may come from residential sources such as a fireplace or wood stove in a home, from open burning of vegetative matter, or from backyard burning. Other sources include forest fires, specific industries, furnaces, tobacco smoke, and motor vehicles, especially those with diesel engines. The harmful effects of tobacco smoke are well known, and as a result many countries have placed restrictions on smoking in public places.
As real urban air quality gurus, we aim to predict the PM concentration in our capital city, Sofia. We trust that the business can use the forecast for prevention. Local authorities can lower particulate matter pollution by reducing the amount of particulate matter produced through smoke and by reducing vehicle emissions.
Inspired by that, in this study, we focus on refined modeling for predicting daily pollutant concentrations from historical air pollution data.
2. Data Understanding
As a starting point, we focus on understanding the provided datasets which are as follows:
- Official air quality measurements
- Citizen science air quality measurements
- Meteorology data
- Topography data
Considering the specifics of the data and its topological, geometric, and geographic properties, we start our data understanding journey with the following preliminary analysis (see the draft_nb PDF, which contains the script and some graphs).
The outcome of WEEK 1 is a data set of 417 stations situated in Sofia. It was obtained from the total of 1265 stations situated throughout Bulgaria by geographic and time filtering. The resulting data set is named “citySofia”.
Step 1 of WEEK 2 consists of 7 sub-steps, labeled with the Latin letters A-G in the code below. Only sub-steps A through E were implemented and tested. For sub-steps F and G only the scope of the envisaged activities is available; it is explained in the comments within the code below.
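The exact WEEK 1 filtering criteria are not restated here, so the following is only a minimal sketch of what a combined geographic and time filter might look like in base R. The toy table, its column names (`id`, `lat`, `lng`, `last_obs`), the bounding box, and the cut-off date are all hypothetical and are not the values used to build “citySofia”.

```r
# Toy station table; column names and values are illustrative only.
stations <- data.frame(
  id       = c("a", "b", "c", "d"),
  lat      = c(42.70, 42.65, 43.21, 42.69),
  lng      = c(23.32, 23.38, 27.91, 23.30),
  last_obs = as.Date(c("2018-08-01", "2018-07-15", "2018-08-01", "2017-01-10")),
  stringsAsFactors = FALSE
)

# Hypothetical bounding box and cut-off date (assumptions, not the WEEK 1 values).
in_box   <- stations$lat >= 42.6 & stations$lat <= 42.8 &
            stations$lng >= 23.2 & stations$lng <= 23.5
is_fresh <- stations$last_obs >= as.Date("2018-01-01")

# Keep only stations that pass both the geographic and the time filter.
city_toy <- stations[in_box & is_fresh, ]
print(city_toy$id)
```

In this toy case station `c` is dropped by the geographic filter and station `d` by the time filter.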
#Step1: Decide on the final list of geo-units that are subject to predictive analysis====
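#A-Clustering stations by their latitude and longitude
library(factoextra) # provides fviz_cluster()
# km.out is the k-means fit on Sfstcl (the fitting call is not shown in this excerpt)
fviz_cluster(km.out, data = Sfstcl)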
kcent <- data.frame(km.out$centers) # coordinates of the clusters' centroids
#C-add a column to the 417 stations dataset showing which station belongs to which cluster
#D-calculate the distance between the centroids of the clusters and the dta_topo stations
rdist= data.frame(rdist(kcent[,-c(3)], dta_topo))
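#E-min distance between each cluster's centroid and the closest station from dta_topo
mindist <- data.frame(apply(rdist, 1, min)) # assumed reconstruction: this line is missing from the excerpt
colnames(mindist) <- 'mindist'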
#F-identify which point from dta_topo the min distance from the previous step belongs to
is.num <- sapply(df2, is.numeric)
df2[is.num] <- lapply(df2[is.num], round, 8)
#-a loop should go here that compares the value in the first column of each row of df2 with the values in all the remaining columns of the same row
#-where the values are equal, the index of the matching column (among the columns after the first) is written to a newly created vector
#-the same exercise should be repeated for all rows of df2
#-the values in the vector are the index numbers of the observation stations from the dataset dta_topo (see the WEEK 1 code for dta_topo) that are closest to the respective clusters; the cluster index equals the row number in df2
#-the resulting vector is used in the next step of the code
#G-cbind the information from the previous step to the dataset citySofia1
#-the vector from the previous sub-step (F) is combined with the elev information from the dataset dta_topo
#-the data frame created in the line above is added to the dataset citySofia1, which completes Step 1 of WEEK 2
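Sub-steps F and G are only described in comments above, so the following is a minimal sketch of one way they could be completed, shown end to end on toy data. It is not the authors' implementation: it swaps the row-by-row comparison loop for `which.min()`, computes Euclidean distances with base R instead of `rdist()`, and fixes the initial k-means centers so the toy result is reproducible. The data frames `stations_toy` and `topo_toy` and all their values are hypothetical stand-ins for citySofia1 and dta_topo.

```r
# Toy stand-ins for the real data (hypothetical values).
stations_toy <- data.frame(
  lat = c(42.65, 42.66, 42.70, 42.71, 42.75, 42.76),
  lng = c(23.25, 23.26, 23.32, 23.33, 23.40, 23.41)
)
topo_toy <- data.frame(
  lat  = c(42.654, 42.706, 42.754, 42.800),
  lng  = c(23.256, 23.324, 23.406, 23.500),
  elev = c(600, 550, 520, 700)
)

# A-C: cluster the stations and record each station's cluster.
# Initial centers are fixed (rows 1, 3, 5) only to make the toy example deterministic.
xy <- as.matrix(stations_toy[, c("lat", "lng")])
km <- kmeans(xy, centers = xy[c(1, 3, 5), ])
stations_toy$cluster <- km$cluster
centers <- km$centers

# D: Euclidean distances (in degrees) between centroids (rows) and topo points (columns).
dist_mat <- sqrt(outer(centers[, "lat"], topo_toy$lat, "-")^2 +
                 outer(centers[, "lng"], topo_toy$lng, "-")^2)

# E: smallest distance per cluster.
mindist <- apply(dist_mat, 1, min)

# F: index of the topo point that attains the minimum, one entry per cluster.
nearest <- apply(dist_mat, 1, which.min)

# G: look up the elevation of that topo point and attach it to every station.
cluster_elev <- topo_toy$elev[nearest]
stations_toy$elev <- cluster_elev[stations_toy$cluster]
print(stations_toy)
```

Using `which.min()` avoids comparing rounded floating-point values for equality, which is what the rounding to 8 digits in sub-step F appears to guard against. On ties it returns the first matching column.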
WEEK 2 – some results from Step 1 of Week 2