## 0. Data and working environment¶

We were given the following 4 datasets:

**atmosphere_profile_train.csv** - data from the University of Wyoming. It consists data for the temperature at certain height.

**construction_sites.csv** - data for all construction sites in Sofia that are relevant for our time period.

**household_heating.csv** - data from a survey by Green Sofia. It corresponds to what type of heating people use.

**industrial_pollution.csv** - Ddta on emissions from industrial installations collected from emissions permits. Data provided by UCTM Sofia.

**sofia_topo.csv** - Topological data of Sofia based on NASA SRTM digital elevation model.

**stations_data_train.csv** - Official air quality measurements (4 stations in the city) – as per EU guidelines on air quality monitoring.

**weather_lbsf_20161101-20161130_IP_train.csv** - Meteorological measurements. Data from 1 station - Sofia airport (LBSF).

We worked on the provided Amazon's virtual machines. However, to properly do our analysis we needed some additional libraries. Install the following libraries and restart the kernel to work properly.

```
# !pip install geopy --user
# !pip install folium --user
```

```
# import the libraries
import pandas as pd
import numpy as np
import math
import geopy.distance
import scipy
import folium
from folium import plugins
```

```
# read the data into memory
folder = "/workspace/vrategov/00.Data/"
atmo_profile = "atmosphere_profile_train.csv"
construction = "csvConstructData.csv"
industry = "industrial_pollution_latlon.csv"
topo = "sofia_topo.csv"
stations_train = "stations_data_train.csv"
meteo = "weather_lbsf_20161101-20161130_IP_train.csv"
stations = "stations.csv" # this is a csv file with the characteristics of the stations
df_atmo = pd.DataFrame(pd.read_csv(folder+atmo_profile))
df_const = pd.DataFrame(pd.read_csv(folder+construction))
df_ind = pd.DataFrame(pd.read_csv(folder+industry))
df_topo = pd.DataFrame(pd.read_csv(folder+topo))
df_stations_train = pd.DataFrame(pd.read_csv(folder+stations_train))
df_meteo = pd.DataFrame(pd.read_csv(folder+meteo))
df_stations = pd.DataFrame(pd.read_csv(folder+stations,sep=";"))
```

## 1. Business Understanding¶

Nowadays, air pollution is considered one of the most serious problems in the world. It refers to the contamination of the atmosphere by harmful chemicals or biological materials. To solve the problem of air pollution, it's necessary to understand the issues and look for ways to counter it.

Air pollution could be a reason for a lot of health issues, both in short-term and long-term.

Air pollution causes damage to crops, animals, forests, and bodies of water. It also contributes to the depletion of the ozone layer, which protects the Earth from the sun's UV rays.

In Bulgaria, EEA collects air pollution statistics. It's important to study these statistics because they show how polluted the air has become in various places around the country.

Sofia has a long history with air polution problems. The norms were exceeded many times in the past few years. The record of highest measurement of air polution exceeded six times the reccomended value. Major contributors to the polution are believed to be the **households heating fuels, industrial and construction sites** contamination.

This is the main objective of the case - **to estimate the effect of these factors** on the total measured polution and to predict the **air pollution in Sofia for the next 24 hour period**.

```
df_atmo.head()
```

```
df_atmo.describe()
```

#### Construction sites¶

There is data for 389 construction sites and a starting date for each of them. Date ranges from 7 January 2016 to 28 October 2016.

```
df_const.head()
```

```
df_const.describe()
```

There are 4 different types of construction sites:

Construction site type | N |
---|---|

big housing | 144 |

infrastructure | 65 |

non-residential | 96 |

small housing | 84 |

All of these were spread accross 20 different districts.

#### Household heating¶

The main interest for us in this dataset was the data in regards to heating with solid fuels - diesel, coal, and wood. This is because for the other types of heating it is already accounted with the industrial pollution. We made the assumtion to count the number of pollutant based on the number of dwelings , rather than the number of people in the building.

Below is a map of the households data.

#### Industrial polution¶

The dataset consisted characteristics of 71 industriall pollutants. We had the coordinates in DMS format and further transformation was needed to get the cartesian values.

```
df_ind.head()
```

```
df_ind.describe()
```

Below is a map of the industrial polluters data.

#### Sofia topo¶

In the Topological data there is information about the latitude, longitude and elevation for 196 points. No time component.

```
df_topo.head()
```

#### Official stations¶

We were provided with PM10 daily measurements from 4 stations in Sofia from 1 November 2016 to 16 November 2016.

```
df_stations_train.head()
```

```
df_stations_train.describe()
```

## 3. Data Preparation¶

#### Industrial pollution¶

In the begining in our preparation of industrial data, we had to convert the provided coordinates from DMS format to Cartesian coordinates. Then, we had to calculate the temperature gradient from the formula provided in the case description (deltaTemperature/deltaHeight). The used data set was atmosphere_profile_train. We found out that the whole sample had a stabillity of class E. We proceded with converting the variables in the required units. For example, the debit variable of the pollutants was in t/y and we converted it to g/s, wind speed was converted to m/s from km/h.

```
df_ind["PM10"] = df_ind["t/y"] * 1000000/(365*24*60*60) # convert the debit to g/s
df_meteo["wind"] = df_meteo["sfcWindAVG"] * 1000/3600 # convert wind speed in m/s
```

#### Household pollution¶

We used the following data tables to calculate household pollution: "weather_lbsf_20161101-20161130_IP_train.csv" and "household heating.csv".

We used combination of the simple approach and the heating degree days approach to estimate PM10 emissions (grams/household/day) for the examined period. We divided the PM10 emissions (grams/household/year) by 190 to calculate average daily emission since we do not have information for temperature ranges across the whole heating period. After that we have calculated total emissions for the period with available data i.e. 01-11-2016 to 20-11-2016 (20 days). This is total yearly emissions divided by 190, multiplied by 20.

We calculated HDD for each day, following the if – else structure described in the case. Then we calculated proportion of total HDD for each day and multiplied that proportion to total emissions for the period to estimate emissions per day.

From the file household heating.csv we focused on the variable number of dwellings in the building by source of heating - 4,6,7. DIESEL, COAL, WOOD, i.e. heating on solid fuel and assumed 1039 PM10 emissions (grams/household/year). Than we summed dwellings grouped by coordinates and deleted duplicates, leaving 33 409 unique rows.

Then we combined the two datasets and estimated total emissions for each of the dates per address.

#### Building sites pollution¶

We used the procedure, provided by the EMEP/EEA Inventory Guidebook 2016 Tier 1 methodology. We used the parameters of the equation: the emission factor for the pollutant emission, affected area, construction duration and efficiency of emission control measures as suggested by the guidebook and assumed sand soil. We calculated PE Index using the suggested formulae and PRCPAVG and TASAVG data from the weather dataset. The outcome of that calculation is PM10 emissions by type of construction.

We have extracted address coordinates by google maps for some of the constructions and district coordinates for all of the rest observations. For those districts without addresses we measured the center of the district by the given district coordinates. Formula: $$ x = \frac{coordinateX_1 + coordinateX_2}{2}$$ $$ y = \frac{coordinateY_1 + coordinateY_2}{2}$$

## 4. Data Modelling¶

The output of the model suggested in the case description can be described with the following function:

```
def concentration(y,q,u,h,sigma_y,sigma_z,z = 0):
"""Estimate the concentration of PM10 in a point in space.
Keyword arguments:
y -- meters crosswind from the emission plume centerline, assumed to be equal to x in our model
z -- position in the z direction, default set to 0 to equal ground level(where are the people)
q -- stack emissions (g/s)
u -- wind speed (m/s)
h -- pollutant release height
sigma_y -- horizontal standard deviation of the emission distribution, in m
sigma_z -- vertical standard deviation of the emission distribution, in m
"""
c = (q/2*np.pi*u*sigma_y*sigma_z) * (np.exp((-y**2)/(2*sigma_y**2))) * (np.exp((-(z-h)**2)/(2*sigma_z**2))) * (np.exp((-(z+h)**2)/(2*sigma_z**2)))
return(c)
```

#### Industrial Factor¶

We folowed the suggested model in the case description. We decided to calculate the contribution of each industrial pollutant to each point from the Sofia Topo data, essentially creating our map grid from these points.

In general, our approach is as follows:

```
topo_polution_industry = pd.DataFrame({'Date': [], 'Lat': [], 'Lon': [], 'Ind_P10': []})
for i in range(0,df_meteo.shape[0]):
for j in range(0,df_topo.shape[0]):
c = 0
for k in range(0,df_ind.shape[0]):
a = (df_topo["Lat"][j], df_topo["Lon"][j])
b = (df_ind["Lat"][k],df_ind["Lon"][k])
x = geopy.distance.distance(a, b).km #distance in kilometers
if x < 1:
sigma_y = 50.5 * x**(0.894)
sigma_z = 22.8 * x**(0.675-1.3)
else:
sigma_y = 50.5 * x**(0.894)
sigma_z = 55.4 * x**(0.305-34.0)
c += concentration(y = x ,
q = df_ind["PM10"][k],
u = df_meteo["wind"][i],
h = df_ind["m"][k],
sigma_y = sigma_y,
sigma_z = sigma_z)
topo_polution_industry = topo_polution_industry.append({'Date': i, 'Lat': df_topo["Lat"][j], 'Lon': df_topo["Lon"][j], 'Ind_P10': c}, ignore_index=True)
print('Calculations for day {} are ready.'.format(i))
```

The end result is the total pollution from industrial objects at each point from the Topo data for each day in our sample.

```
topo_polution_industry.head()
```

```
topo_polution_industry.describe()
```

In order to make comparable our results with the official measurements, we had to further divide the estimated pollution by the air density. The following functions helps in doing that.

```
# functions taken from https://python.hydrology-amsterdam.nl/moduledoc/index.html
def ea_calc(airtemp= scipy.array([]),\
rh= scipy.array([])):
'''
Function to calculate actual vapour pressure from relative humidity:
.. math::
e_a = \\frac{rh \\cdot e_s}{100}
where es is the saturated vapour pressure at temperature T.
Parameters:
- airtemp: array of measured air temperatures [Celsius].
- rh: Relative humidity [%].
Returns:
- ea: array of actual vapour pressure [Pa].
Examples
--------
>>> ea_calc(25,60)
1900.0946514729308
'''
# Test input array/value
#airtemp,rh = _arraytest(airtemp, rh)
# Calculate saturation vapour pressures
es = es_calc(airtemp)
# Calculate actual vapour pressure
eact = rh / 100.0 * es
return eact # in Pa
def es_calc(airtemp= scipy.array([])):
'''
Function to calculate saturated vapour pressure from temperature.
For T<0 C the saturation vapour pressure equation for ice is used
accoring to Goff and Gratch (1946), whereas for T>=0 C that of
Goff (1957) is used.
Parameters:
- airtemp : (data-type) measured air temperature [Celsius].
Returns:
- es : (data-type) saturated vapour pressure [Pa].
References
----------
- Goff, J.A.,and S. Gratch, Low-pressure properties of water from -160 \
to 212 F. Transactions of the American society of heating and \
ventilating engineers, p. 95-122, presented at the 52nd annual \
meeting of the American society of \
heating and ventilating engineers, New York, 1946.
- Goff, J. A. Saturation pressure of water on the new Kelvin \
temperature scale, Transactions of the American \
society of heating and ventilating engineers, pp 347-354, \
presented at the semi-annual meeting of the American \
society of heating and ventilating engineers, Murray Bay, \
Quebec. Canada, 1957.
Examples
--------
>>> es_calc(30.0)
4242.725994656632
>>> x = [20, 25]
>>> es_calc(x)
array([ 2337.08019792, 3166.82441912])
'''
# Test input array/value
#airtemp = _arraytest(airtemp)
# Determine length of array
n = scipy.size(airtemp)
# Check if we have a single (array) value or an array
if n < 2:
# Calculate saturated vapour pressures, distinguish between water/ice
if airtemp < 0:
# Calculate saturation vapour pressure for ice
log_pi = - 9.09718 * (273.16 / (airtemp + 273.15) - 1.0) \
- 3.56654 * math.log10(273.16 / (airtemp + 273.15)) \
+ 0.876793 * (1.0 - (airtemp + 273.15) / 273.16) \
+ math.log10(6.1071)
es = math.pow(10, log_pi)
else:
# Calculate saturation vapour pressure for water
log_pw = 10.79574 * (1.0 - 273.16 / (airtemp + 273.15)) \
- 5.02800 * math.log10((airtemp + 273.15) / 273.16) \
+ 1.50475E-4 * (1 - math.pow(10, (-8.2969 * ((airtemp +\
273.15) / 273.16 - 1.0)))) + 0.42873E-3 * \
(math.pow(10, (+4.76955 * (1.0 - 273.16\
/ (airtemp + 273.15)))) - 1) + 0.78614
es = math.pow(10, log_pw)
else: # Dealing with an array
# Initiate the output array
es = scipy.zeros(n)
# Calculate saturated vapour pressures, distinguish between water/ice
for i in range(0, n):
if airtemp[i] < 0:
# Saturation vapour pressure equation for ice
log_pi = - 9.09718 * (273.16 / (airtemp[i] + 273.15) - 1.0) \
- 3.56654 * math.log10(273.16 / (airtemp[i] + 273.15)) \
+ 0.876793 * (1.0 - (airtemp[i] + 273.15) / 273.16) \
+ math.log10(6.1071)
es[i] = math.pow(10, log_pi)
else:
# Calculate saturation vapour pressure for water
log_pw = 10.79574 * (1.0 - 273.16 / (airtemp[i] + 273.15)) \
- 5.02800 * math.log10((airtemp[i] + 273.15) / 273.16) \
+ 1.50475E-4 * (1 - math.pow(10, (-8.2969\
* ((airtemp[i] + 273.15) / 273.16 - 1.0)))) + 0.42873E-3\
* (math.pow(10, (+4.76955 * (1.0 - 273.16\
/ (airtemp[i] + 273.15)))) - 1) + 0.78614
es[i] = pow(10, log_pw)
# Convert from hPa to Pa
es = es * 100.0
return es # in Pa
def rho_calc(airtemp= scipy.array([]),\
rh= scipy.array([]),\
airpress= scipy.array([])):
'''
Function to calculate the density of air, rho, from air
temperatures, relative humidity and air pressure.
.. math::
\\rho = 1.201 \\cdot \\frac{290.0 \\cdot (p - 0.378 \\cdot e_a)}{1000 \\cdot (T + 273.15)} / 100
Parameters:
- airtemp: (array of) air temperature data [Celsius].
- rh: (array of) relative humidity data [%].
- airpress: (array of) air pressure data [Pa].
Returns:
- rho: (array of) air density data [kg m-3].
Examples
--------
>>> t = [10, 20, 30]
>>> rh = [10, 20, 30]
>>> airpress = [100000, 101000, 102000]
>>> rho_calc(t,rh,airpress)
array([ 1.22948419, 1.19787662, 1.16635358])
>>> rho_calc(10,50,101300)
1.2431927125520903
'''
# Test input array/value
#airtemp,rh,airpress = _arraytest(airtemp,rh,airpress)
# Calculate actual vapour pressure
eact = ea_calc(airtemp, rh)
# Calculate density of air rho
rho = 1.201 * (290.0 * (airpress - 0.378 * eact)) \
/ (1000.0 * (airtemp + 273.15)) / 100.0
return rho # in kg/m3
```

```
air_density = pd.DataFrame(rho_calc(df_meteo["TASAVG"], df_meteo["RHAVG"], df_meteo["PSLAVG"]*100), columns=["density"])
air_density["Date"] = air_density.index
```

```
topo_polution_industry = topo_polution_industry.join(air_density, on = "Date",lsuffix='_caller', rsuffix='_other')
```

```
topo_polution_industry["Ind_P10"] = topo_polution_industry["Ind_P10"]/topo_polution_industry["density"]
```

```
topo_polution_industry.describe()
```

And, of course, it is imoprtant to visualize the results which can hint if our analysis is correct. The following code is doing that.

```
def make_heatmap(df, timestamp, metric):
"""Create a Heat Map of Sofia for a given timestamp to visualize a given metric.
For example, to visualize PM10 pollution
The map is interactive.
The map also visualizes clusters of the locations.
Keyword arguments:
df -- the data frame with time, longitude, lattitude and the chosen metric
timestamp -- the point in time for which to visualize the heat map
metric -- the metric, used for visualization
"""
points = df[df["Date_caller"] == int(timestamp)]
folium_map = folium.Map(location=sofia_center,
zoom_start=11,
tiles='Stamen Terrain')
marker_cluster = plugins.MarkerCluster().add_to(folium_map)
for i in range(0, len(points)):
point = points.iloc[i]
folium.Marker(
[point['Lat'], point['Lon']],
popup=str(point['Ind_P10'])
).add_to(marker_cluster)
# folium.Circle(
# radius=10,
# location=[point['latitude'], point['longitude']],
# popup=str(point['P1']),
# color='#333333',
# fill=False
# ).add_to(folium_map)
plugins.MarkerCluster().add_to(folium_map)
# plot heatmap
folium_map.add_child(plugins.HeatMap(
points[['Lat', 'Lon', metric]].as_matrix(),
min_opacity=0.2,
max_val=points[metric].max(),
radius=30, blur=17,
max_zoom=1
))
# You can also save the interactive heat map as an HTML file
# folium_map.save("output-map.html")
return folium_map
```

```
sofia_center = [42.697708, 23.321867] # coordinates assumed to be the center of the city
make_heatmap(topo_polution_industry, '0', 'Ind_P10')
```

In the heatmap above you can see where the biggest pollution from the industry was on November 1st 2016. The visualization is in fact relevent with the locations of the biggest industry factories in Sofia.

The next step in our solution was to estimate the industry pollution at the points where the official stations were located.

```
stations_polution_ind = pd.DataFrame({'Date': [], 'Lat': [], 'Lon': [], 'Ind_P10': []})
for i in range(0,df_meteo.shape[0]):
for j in range(0,df_stations.shape[0]):
c = 0
for k in range(0,df_ind.shape[0]):
a = (df_stations["Latitude"][j], df_stations["Longitude"][j])
b = (df_ind["Lat"][k],df_ind["Lon"][k])
x = geopy.distance.distance(a, b).km #distance in kilometers
if x < 1:
sigma_y = 50.5 * x**(0.894)
sigma_z = 22.8 * x**(0.675-1.3)
else:
sigma_y = 50.5 * x**(0.894)
sigma_z = 55.4 * x**(0.305-34.0)
c += concentration(y = x ,
q = df_ind["PM10"][k],
u = df_meteo["wind"][i],
h = df_ind["m"][k],
sigma_y = sigma_y,
sigma_z = sigma_z)
stations_polution_ind = stations_polution_ind.append({'Date': i, 'Lat': df_stations["Latitude"][j], 'Lon': df_stations["Longitude"][j], 'Ind_P10': c}, ignore_index=True)
print('Calculations for day {} are ready.'.format(i))
```

```
stations_polution_ind = stations_polution_ind.join(air_density, on = "Date",lsuffix='_caller', rsuffix='_other')
stations_polution_ind["Ind_P10"] = stations_polution_ind["Ind_P10"]/stations_polution_ind["density"]
```

```
stations_polution_ind.head()
```

```
stations_polution_ind.describe()
```

It is interesting to see that some official stations are not affected by industrial pollution.

#### Construction Sites Factor¶

We followed all steps from the previous step (from the industrial factor). The only think that differs in the analysis is that we assumed arbitrary height of the pollutant (in the industry data set we were provided with the actual hight). Our values are as follows - for infrastructure type we use h equal to 5, for small housing type h is 10, and for the other types it is 15.

```
df_const["PM10_gs"] = df_const["PM10"] * 1/(365*24*60*60) # convert the debit to g/s
```

```
topo_polution_const = pd.DataFrame({'Date': [], 'Lat': [], 'Lon': [], 'Ind_P10': []})
for i in range(0,df_meteo.shape[0]):
for j in range(0,df_topo.shape[0]):
c = 0
for k in range(0,df_const.shape[0]):
a = (df_topo["Lat"][j], df_topo["Lon"][j])
b = (df_const["lat"][k],df_const["lon"][k])
x = geopy.distance.distance(a, b).km #distance in kilometers
if x < 1:
sigma_y = 50.5 * x**(0.894)
sigma_z = 22.8 * x**(0.675-1.3)
else:
sigma_y = 50.5 * x**(0.894)
sigma_z = 55.4 * x**(0.305-34.0)
if df_const["type"][k] == "infrastructure":
h = 5
elif df_const["type"][k] == "small housing":
h = 10
else:
h = 15
c += round(concentration(y = x ,
q = df_const["PM10_gs"][k],
u = df_meteo["wind"][i],
h = h,
sigma_y = sigma_y,
sigma_z = sigma_z))
topo_polution_const = topo_polution_const.append({'Date': i, 'Lat': df_topo["Lat"][j], 'Lon': df_topo["Lon"][j], 'Ind_P10': c}, ignore_index=True)
print('Calculations for day {} are ready.'.format(i))
```

```
topo_polution_const.head()
```

```
topo_polution_const.describe()
```

```
# divide again by the air density
topo_polution_const = topo_polution_const.join(air_density, on = "Date",lsuffix='_caller', rsuffix='_other')
topo_polution_const["Ind_P10"] = topo_polution_const["Ind_P10"]/topo_polution_const["density"]
```

```
topo_polution_const.describe()
```

```
# visualize the data
make_heatmap(topo_polution_const, '0', 'Ind_P10')
```

```
stations_polution_const = pd.DataFrame({'Date': [], 'Lat': [], 'Lon': [], 'Ind_P10': []})
for i in range(0,df_meteo.shape[0]):
for j in range(0,df_stations.shape[0]):
c = 0
for k in range(0,df_const.shape[0]):
a = (df_stations["Latitude"][j], df_stations["Longitude"][j])
b = (df_const["lat"][k],df_const["lon"][k])
x = geopy.distance.distance(a, b).km #distance in kilometers
if x < 1:
sigma_y = 50.5 * x**(0.894)
sigma_z = 22.8 * x**(0.675-1.3)
else:
sigma_y = 50.5 * x**(0.894)
sigma_z = 55.4 * x**(0.305-34.0)
if df_const["type"][k] == "infrastructure":
h = 5
elif df_const["type"][k] == "small housing":
h = 10
else:
h = 15
c += concentration(y = x ,
q = df_const["PM10_gs"][k],
u = df_meteo["wind"][i],
h = h,
sigma_y = sigma_y,
sigma_z = sigma_z)
stations_polution_const = stations_polution_const.append({'Date': i, 'Lat': df_stations["Latitude"][j], 'Lon': df_stations["Longitude"][j], 'Ind_P10': c}, ignore_index=True)
print('Calculations for day {} are ready.'.format(i))
```

```
stations_polution_const = stations_polution_const.join(air_density, on = "Date",lsuffix='_caller', rsuffix='_other')
stations_polution_const["Ind_P10"] = stations_polution_const["Ind_P10"]/stations_polution_const["density"]
```

```
stations_polution_const.describe()
```

## 5. Evaluation¶

We were provided with test data to asses our analysis.

**stations_data_test.csv.csv** - data for the measurements by the official stations from November 21st 2016 to November 25th 2016.

**atmosphere_profile_test.csv** - data to calculate the gradient.

**weather_lbsf_20161101-20161130_IP_test.csv** - weather measurements from November 21st 2016 to November 25th 2016.

#### Industry polution¶

```
meteo = "weather_lbsf_20161101-20161130_IP_test.csv"
df_meteo = pd.DataFrame(pd.read_csv(folder+meteo))
```

```
# transform some variables in correct units
df_meteo["wind"] = df_meteo["sfcWindAVG"] * 1000/3600 # convert wind speed in m/s
```

```
stations_polution_ind_test = pd.DataFrame({'Date': [], 'Lat': [], 'Lon': [], 'Ind_P10': []})
for i in range(0,df_meteo.shape[0]):
for j in range(0,df_stations.shape[0]):
c = 0
for k in range(0,df_ind.shape[0]):
a = (df_stations["Latitude"][j], df_stations["Longitude"][j])
b = (df_ind["Lat"][k],df_ind["Lon"][k])
x = geopy.distance.distance(a, b).km #distance in kilometers
if x < 1:
sigma_y = 50.5 * x**(0.894)
sigma_z = 22.8 * x**(0.675-1.3)
else:
sigma_y = 50.5 * x**(0.894)
sigma_z = 55.4 * x**(0.305-34.0)
c += concentration(y = x ,
q = df_ind["PM10"][k],
u = df_meteo["wind"][i],
h = df_ind["m"][k],
sigma_y = sigma_y,
sigma_z = sigma_z)
stations_polution_ind_test = stations_polution_ind_test.append({'Date': i, 'Lat': df_stations["Latitude"][j], 'Lon': df_stations["Longitude"][j], 'Ind_P10': c}, ignore_index=True)
print('Calculations for day {} are ready.'.format(i))
```

```
stations_polution_ind_test = stations_polution_ind_test.join(air_density, on = "Date",lsuffix='_caller', rsuffix='_other')
stations_polution_ind_test["Ind_P10"] = stations_polution_ind_test["Ind_P10"]/stations_polution_ind_test["density"]
```

```
stations_polution_ind_test
```

#### Construction pollution¶

```
stations_polution_const_test = pd.DataFrame({'Date': [], 'Lat': [], 'Lon': [], 'Ind_P10': []})
for i in range(0,df_meteo.shape[0]):
for j in range(0,df_stations.shape[0]):
c = 0
for k in range(0,df_const.shape[0]):
a = (df_stations["Latitude"][j], df_stations["Longitude"][j])
b = (df_const["lat"][k],df_const["lon"][k])
x = geopy.distance.distance(a, b).km #distance in kilometers
if x < 1:
sigma_y = 50.5 * x**(0.894)
sigma_z = 22.8 * x**(0.675-1.3)
else:
sigma_y = 50.5 * x**(0.894)
sigma_z = 55.4 * x**(0.305-34.0)
if df_const["type"][k] == "infrastructure":
h = 5
elif df_const["type"][k] == "small housing":
h = 10
else:
h = 15
c += concentration(y = x ,
q = df_const["PM10_gs"][k],
u = df_meteo["wind"][i],
h = h,
sigma_y = sigma_y,
sigma_z = sigma_z)
stations_polution_const_test = stations_polution_const_test.append({'Date': i, 'Lat': df_stations["Latitude"][j], 'Lon': df_stations["Longitude"][j], 'Ind_P10': c}, ignore_index=True)
print('Calculations for day {} are ready.'.format(i))
```

```
stations_polution_const_test = stations_polution_const_test.join(air_density, on = "Date",lsuffix='_caller', rsuffix='_other')
stations_polution_const_test["Ind_P10"] = stations_polution_const_test["Ind_P10"]/stations_polution_const_test["density"]
```

```
stations_polution_const_test
```

## 6. Deployment¶

Nowadays there are 6 official stations which measures the air quality in Sofia. However, one of them is in the mountain and it is probably not a great source of information about the quality of the air in the city below. The results in our analysis could be used to place new official stations appropriately. And with the introduction of new stations the measurements of the air we breath would be more close to the true ones.

## 3 thoughts on “Datathon – Sofia Air 2.0 – Solution – Team Chameleons”

Great job. I would suggest you to add a legend to the maps. For the stage 2 (construction sites) you could try to take into account the stage of the construction sites based on its starting and ending date.

I would agree with pepe’s comment for state of construction sites.

Otherwise, using color overlay with “blurring” over map for pollution spreading is quite effective way to present results. I will repeat one of my comments to other team “It would be even better (I know you didn’t had time) to take into account surounding buildings and how they impact pollution spread to get more custom “shape” of spread instead of combination of circles” …. I would use custome scale for coloring within “heat map” instead of default ones because something which is bright red doesn’t necessarily means we have so much larger polution effect than in yellow/green zones. It is just default color scale which spans from minimal to maximum available measured value. Applying custom color scale which coresponds with some official polution scale rather that min/max value combination would tell story even better and visually showed all clues where you need to focus for and “smart-city” initiative.

@pepe

Thanks for the feedbacl. In regards to the state of construction, I think this was taken care in the provided dataset. Every construction type has a default length which they pollute the air for. For example, “small housing” is assumed to affect the air in the next n months after starting. We were told that all examples which exceeded this length in comparison to our sample period were filtered out from the construction sites dataset.

@tomislavk

Thanks for the feedback. I completely aggree in regards to the color scale.