
Datathon Sofia Air Solution – Air station measurement bias correction using Pearson correlation coefficient

This article aims to improve the estimates of PM10 pollution measured in Sofia. The city has several air pollution measurement stations. They measure PM10, that is, airborne particles with a diameter of 10 micrometers or less.

The measurement stations fall into two categories: official stations and citizen stations. The official stations provide reliable measurements because they are better monitored and documented. The downside is that there are only 5 of them, all concentrated in a single region. The citizen stations are devices mounted on people's homes or properties that measure PM10 particles. There is a whole network of such devices, numerous enough to cover the city well. The problem is that their measurements are biased by many local factors. The measurements from the citizen stations are therefore less reliable than those from the official stations, but there are far more of them.

In this article we define a method to reduce the bias of the measurements from the citizen stations.


1. Understanding the data

For our experiments we took into account only certain parts of the data.

Our data records have 4 fields:

– time, with hour granularity
– longitude
– latitude
– PM10 concentration
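
As a minimal sketch, one such record could be represented as follows. The class name, field names, helper function and sample values are illustrative assumptions, not part of the original dataset description.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Measurement:
    """One record; the class and field names are illustrative assumptions."""
    time: datetime     # timestamp truncated to the hour
    longitude: float   # degrees east
    latitude: float    # degrees north
    pm10: float        # PM10 concentration, in the unit used by the dataset


def truncate_to_hour(timestamp: datetime) -> datetime:
    """Drop minutes, seconds and microseconds to get hour granularity."""
    return timestamp.replace(minute=0, second=0, microsecond=0)


# Made-up values, for illustration only.
record = Measurement(
    time=truncate_to_hour(datetime(2018, 1, 15, 13, 42, 7)),
    longitude=23.32,
    latitude=42.69,
    pm10=37.5,
)
```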

1.1 The distance between official stations and citizen stations

In our experiments we try to determine whether, and to what extent, the measurements of the citizen air stations are correlated with those of the official air stations. We then normalize the citizen station measurements by also taking the official station measurements into account.

The citizen air measurement stations are spread over a wide area of the city. The official air stations, on the other hand, are grouped in a single smaller region. It is reasonable to expect that a citizen station and an official station that are very far apart are less correlated.

A good question to investigate would be how the correlation relates to the physical distance between two stations. We could go even further and take into account other factors such as topography, local conditions and fine-grained meteorological data. Unfortunately, due to time constraints, we did not investigate these areas.
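
Although we did not pursue it, the physical distance between two stations is easy to obtain from their coordinates. The sketch below uses the haversine great-circle formula; the function name and the choice of a spherical Earth model are assumptions made for illustration, and nothing like this was part of our experiments.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Such a distance could then be compared against the correlation coefficient of each citizen/official station pair.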

In our small study we simply considered a small region that contains a large number of citizen stations and all of the official stations.
The area is bounded by the latitudes 42.62 and 42.74 and the longitudes 23.20 and 23.45:

42.62 < latitude < 42.74
23.20 < longitude < 23.45

We chose these numbers visually, by inspecting the map.
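
A minimal sketch of this selection is shown below. The bounds are the ones quoted above (Sofia lies near latitude 42.7 and longitude 23.3); the function name and the station coordinates are hypothetical and serve only as an illustration.

```python
# Bounds of the study area, as quoted in the text.
LAT_MIN, LAT_MAX = 42.62, 42.74
LON_MIN, LON_MAX = 23.20, 23.45


def in_study_area(latitude: float, longitude: float) -> bool:
    """Return True if the point lies strictly inside the bounding box."""
    return LAT_MIN < latitude < LAT_MAX and LON_MIN < longitude < LON_MAX


# Hypothetical station coordinates as (latitude, longitude).
stations = {
    "citizen_a": (42.69, 23.32),
    "citizen_b": (42.80, 23.32),  # north of the box
    "citizen_c": (42.65, 23.50),  # east of the box
}

kept = sorted(name for name, (lat, lon) in stations.items()
              if in_study_area(lat, lon))
```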

1.2 Time dimension

We used only the data from 2018, because plenty of it was available for both the official stations and the citizen stations.
The measurements have hourly granularity.
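
The time filter can be sketched as below, assuming each station's readings arrive as (timestamp, PM10) pairs. The function name and the input layout are assumptions; if a station reports more than once within an hour, this sketch simply keeps the last reading.

```python
from datetime import datetime


def hourly_series_for_year(readings, year=2018):
    """Map each hour of `year` to the PM10 reading of one station."""
    series = {}
    for timestamp, pm10 in readings:
        if timestamp.year == year:
            hour = timestamp.replace(minute=0, second=0, microsecond=0)
            series[hour] = pm10  # a later reading in the same hour overwrites
    return series


# Made-up readings, for illustration only.
readings = [
    (datetime(2017, 12, 31, 23, 10), 80.0),  # dropped: not in 2018
    (datetime(2018, 1, 1, 0, 5), 55.0),
    (datetime(2018, 1, 1, 1, 30), 48.0),
]
series = hourly_series_for_year(readings)
```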

2. Measuring the correlation of data

We measure the correlation between the measurements of each citizen station and those of each official station.
Each citizen station therefore ends up with 5 correlation coefficients, one for each of the 5 official stations.
Each of them is a Pearson correlation coefficient calculated over the whole time interval (from the beginning of 2018).
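
The calculation for one citizen/official pair can be sketched as follows. The sketch assumes each station's data is a dictionary from hour to PM10 value and that only hours present for both stations are compared; the function names and that alignment rule are assumptions, not a description of our exact code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def station_correlation(citizen, official):
    """Correlate two hour -> PM10 dictionaries over the hours they share."""
    shared_hours = sorted(citizen.keys() & official.keys())
    return pearson([citizen[h] for h in shared_hours],
                   [official[h] for h in shared_hours])
```

Running `station_correlation` once per official station would give the 5 coefficients of a citizen station.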

Finally, in order to normalize the measurements we perform the following:

For each hour we know the citizen station measurement and the 5 official station measurements.
1) We multiply each of the 5 official measurements by its associated Pearson correlation coefficient. Each coefficient is a number between -1 and 1. Zero means no linear correlation, while 1 or -1 represents a perfect positive or negative correlation.
2) We add together the 5 weighted official measurements and the citizen station measurement.
3) We calculate the sum of the weights as 1 plus the sum of all the Pearson coefficients. The 1 is the weight of the citizen station measurement, which we consider fully correlated with itself.
4) We divide the weighted sum of the measurements by the sum of the weights to obtain a weighted mean of the measurement at that hour.
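
The four steps can be sketched for a single hour as follows. The function name and the sample numbers are made up for illustration. Note that the sketch rejects a weight sum of zero, which could in principle occur if negative coefficients cancel the citizen station's weight of 1.

```python
def normalize(citizen_value, official_values, coefficients):
    """Weighted mean of one citizen reading and the official readings.

    `coefficients[i]` is the Pearson coefficient between the citizen
    station and official station i; the citizen reading has weight 1.
    """
    if len(official_values) != len(coefficients):
        raise ValueError("one coefficient is needed per official station")
    # Steps 1 and 2: weight the official readings, then add the citizen reading.
    weighted_sum = citizen_value + sum(
        r * v for r, v in zip(coefficients, official_values))
    # Step 3: sum of the weights, with 1 for the citizen station itself.
    weight_sum = 1.0 + sum(coefficients)
    if weight_sum == 0:
        raise ValueError("weights sum to zero; weighted mean is undefined")
    # Step 4: weighted mean.
    return weighted_sum / weight_sum


# Made-up readings and coefficients for one hour.
normalized = normalize(
    citizen_value=60.0,
    official_values=[40.0, 50.0, 30.0, 45.0, 35.0],
    coefficients=[0.8, 0.5, 0.2, 0.6, 0.4],
)
```

With these sample numbers the weighted sum is 164 and the weight sum is 3.5, so the citizen reading of 60 is replaced by a value of roughly 46.9.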

3. Results

In order to demonstrate the normalization effect we provide two heat maps of the PM10 measurements, before and after the normalization.
The big squares are the official stations, while the small dots represent the citizen stations. After the normalization, the citizen station values visibly reflect some influence from the official measurements.

Before normalization:


After normalization:





5 thoughts on “Datathon Sofia Air Solution – Air station measurement bias correction using Pearson correlation coefficient”


      Hello @paspaldzhiev, I’m glad you like the idea. I submitted the article a little late, I’m sorry for that.
      Please don’t hesitate to ask me any questions about it. Thank you and I wish you a great day!


    @all would be good to see some plots of e.g. the distribution of the changes before/after. Also, would be good to see comparison of official stations w/ some citizen science neighbours. Some of the differences between stations in space may be due to heterogeneous conditions (different sources & intensity of pollution e.g)?
