**1. Understanding the data:**

For our experiments we used only a subset of the data, restricted both in space (section 1.1) and in time (section 1.2).

Our data records have 4 fields:

– time, with hour granularity

– longitude

– latitude

– PM10 concentration

**1.1 The distance between official stations and citizen stations**

In our experiments we try to find out whether the measurements of the citizen air stations are correlated with those of the official air stations, and to what extent. We then normalize the citizen station measurements by also taking the official station measurements into account.

The citizen air measurement stations are spread over a wide area of the city. The official air stations, on the other hand, are grouped in a single, smaller region. It is reasonable to expect that a citizen station and an official station that are very far apart are less correlated.

A natural follow-up question is how the correlation between two stations depends on the physical distance between them. We could go even further and take into account other factors such as topography, local conditions and fine-grained meteorological data. Unfortunately, due to time constraints we did not investigate these areas.

In this short study we simply restricted ourselves to a small region that contains a large number of citizen stations and all of the official stations.

The area is bounded by the latitudes 42.62 and 42.74 and the longitudes 23.20 and 23.45:

**42.62 < latitude < 42.74**

**23.20 < longitude < 23.45**

We chose these numbers visually, by inspecting the map.
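The selection can be sketched as a simple bounding-box filter. This is only an illustration: the `Record` type, its field names and the sample values are hypothetical, and we read 42.62 to 42.74 as latitude and 23.20 to 23.45 as longitude, which matches Sofia's position on the map.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One measurement with the four fields described above (names are assumed)."""
    time: str         # hour-granularity timestamp, e.g. "2018-01-01 13:00"
    longitude: float
    latitude: float
    pm10: float       # PM10 concentration


# Bounds of the study area, taken from the text.
LAT_MIN, LAT_MAX = 42.62, 42.74
LON_MIN, LON_MAX = 23.20, 23.45


def in_study_area(record: Record) -> bool:
    """Return True if the record lies strictly inside the bounding box."""
    return (LAT_MIN < record.latitude < LAT_MAX
            and LON_MIN < record.longitude < LON_MAX)


# Made-up sample records, for illustration only.
records = [
    Record("2018-01-01 13:00", 23.32, 42.70, 41.0),  # inside the box
    Record("2018-01-01 13:00", 23.60, 42.70, 35.0),  # too far east
    Record("2018-01-01 13:00", 23.32, 42.50, 28.0),  # too far south
]

selected = [r for r in records if in_study_area(r)]
```

Only the first sample record survives the filter; the other two fall outside the box on one axis each.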

**1.2 Time dimension**

We took into account only the data from 2018, because measurements for that year were plentiful for both the official stations and the citizen stations.

The measurement granularity is hourly.
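A minimal sketch of the time selection is shown below. The year filter follows the text. Averaging several readings that fall into the same hour is our assumption for the sake of the example; the data described here may already contain exactly one value per hour. The function names and sample readings are hypothetical.

```python
from datetime import datetime


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


# Made-up (timestamp, PM10) readings of a single station.
readings = [
    (datetime(2017, 12, 31, 23, 40), 55.0),  # dropped: not in 2018
    (datetime(2018, 1, 1, 0, 10), 60.0),
    (datetime(2018, 1, 1, 0, 50), 40.0),
    (datetime(2018, 1, 1, 1, 5), 30.0),
]

# Group the 2018 readings by hour.
by_hour = {}
for ts, pm10 in readings:
    if ts.year == 2018:
        by_hour.setdefault(hour_bucket(ts), []).append(pm10)

# Assumption: readings sharing an hour are averaged into one hourly value.
hourly_mean = {hour: sum(vals) / len(vals) for hour, vals in by_hour.items()}
```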

**2. Measuring the correlation of data**

We measure the correlation of the measurements of each of the citizen stations with respect to each of the official stations.

Therefore, for each citizen station we obtain 5 correlation coefficients, one for each of the 5 official stations.

Each correlation coefficient is a Pearson correlation coefficient calculated over the whole time interval (from the beginning of 2018).
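As an illustration, the Pearson coefficient between one citizen station and one official station could be computed as follows. This is a sketch, not our exact code: pairing the two series on the hours they share is an assumption (the handling of missing hours is not described here), and the function names and sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def correlate_on_shared_hours(citizen, official):
    """Correlate two {hour: pm10} mappings using only the hours both contain."""
    hours = sorted(citizen.keys() & official.keys())
    return pearson([citizen[h] for h in hours], [official[h] for h in hours])


# Made-up hourly series keyed by an hour index.
citizen = {0: 10.0, 1: 20.0, 2: 30.0, 3: 40.0}
official = {1: 25.0, 2: 35.0, 3: 45.0, 4: 50.0}

r = correlate_on_shared_hours(citizen, official)
```

In this toy example the two stations share hours 1 to 3 and differ there by a constant offset, so the coefficient is 1. Repeating the call for each of the 5 official stations gives the 5 coefficients of one citizen station.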

Finally, in order to normalize the measurements we perform the following:

For each hour we know the citizen station measurement and the 5 official station measurements.

1) We multiply each of the 5 official measurements by its associated Pearson correlation coefficient. Each coefficient is a number between -1 and 1. Zero means no linear correlation, while 1 or -1 represents a perfect positive or negative correlation.

2) We add together the 5 weighted official measurements and the citizen station measurement.

3) We calculate the sum of the weights as 1 plus the sum of all the Pearson coefficients. The 1 is the weight of the citizen station measurement, which we consider to be fully correlated with itself.

4) We divide the weighted sum of the measurements by the sum of the weights in order to obtain a weighted mean of the measurement at that time.
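The four steps above can be sketched for a single hour as follows. The function name and the sample numbers are hypothetical; the arithmetic follows the steps as described.

```python
def normalize(citizen_value, official_values, coefficients):
    """Weighted mean of one citizen reading and the official readings.

    The citizen reading has weight 1; each official reading is weighted
    by its Pearson coefficient with respect to the citizen station.
    """
    if len(official_values) != len(coefficients):
        raise ValueError("one coefficient is required per official station")
    # Steps 1 and 2: weight the official values and add the citizen value.
    weighted_sum = citizen_value + sum(
        r * v for r, v in zip(coefficients, official_values)
    )
    # Step 3: the weights sum to 1 (citizen station) plus all coefficients.
    weight_sum = 1.0 + sum(coefficients)
    if weight_sum == 0:
        raise ValueError("weights sum to zero; the weighted mean is undefined")
    # Step 4: weighted mean.
    return weighted_sum / weight_sum


# Made-up values for one hour: one citizen reading, 5 official readings.
citizen_value = 80.0
official_values = [50.0, 60.0, 40.0, 55.0, 45.0]
coefficients = [0.5, 0.5, 0.5, 0.25, 0.25]

normalized = normalize(citizen_value, official_values, coefficients)
```

Here the weighted sum is 80 + 100 = 180 and the weights sum to 3, so the normalized value is 60. Note that a negative coefficient subtracts its official measurement from the sum, and a weight sum close to zero makes the result unstable; the sketch only guards against an exact zero.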

**3. Results**

In order to demonstrate the normalization effect we provide two heat maps of the PM10 measurements, before and after the normalization.

The big squares are the official stations, while the small dots represent the citizen stations. As you can see, after the normalization the citizen station values are influenced by the official measurements.

Before normalization:

After normalization:

## 5 thoughts on “Datathon Sofia Air Solution – Air station measurement bias correction using Pearson correlation coefficient”

Approach from the article name sounds good, now just waiting to see the write-up 🙂

Hello @paspaldzhiev, I’m glad you like the idea. I submitted the article with a little delay, I’m sorry for that.

Please don’t hesitate to ask me any questions about it. Thank you and I wish you a great day!

@all would be good to see some plots of e.g. the distribution of the changes before/after. Also, would be good to see a comparison of official stations w/ some citizen science neighbours. Some of the differences between stations in space may be due to heterogeneous conditions (e.g. different sources & intensity of pollution)?

The approach is OK, but the prediction part is not covered much. There is a lot of scope for improvement.

Hello all, thank you very much for your suggestions and ideas, I will keep them in mind for the future.