1. Understanding the data:
For our experiments we took into account only certain parts of the data.
Our data records have 4 fields:
– time, with hour granularity
– PM10 concentration
1.1 The distance between official stations and citizen stations
In our experiments we try to determine whether the measurements of the citizen air stations are correlated with those of the official air stations, and to what extent. We then normalize the citizen station measurements by also taking the official station measurements into account.
The citizen air measurement stations are spread over a wide area of the city. The official air stations, on the other hand, are grouped in a single smaller region. It is reasonable to expect that a citizen station and an official station that are very far apart are less correlated.
A natural next question would be to estimate the relation between the correlation and the physical distance between two stations. We could go even further and take into account other factors such as topography, local factors and granular meteorology data. Unfortunately, due to time constraints, we did not investigate these areas.
In this small study we simply restricted ourselves to a small region that contains a high number of citizen stations and all of the official stations.
The area is bounded by the latitudes 42.62 and 42.74 and the longitudes 23.20 and 23.45:
42.62 < latitude < 42.74
23.20 < longitude < 23.45
We chose these numbers visually, by inspecting the map.
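The region filter can be sketched in a few lines of Python. This is only an illustration: the function name, the record layout and the sample stations are assumptions made for the example, and only the four bounds come from the text.

```python
# Bounds of the study area, taken from the text.
LAT_MIN, LAT_MAX = 42.62, 42.74
LON_MIN, LON_MAX = 23.20, 23.45


def in_study_area(lat: float, lon: float) -> bool:
    """Return True if the point lies strictly inside the bounding box."""
    return LAT_MIN < lat < LAT_MAX and LON_MIN < lon < LON_MAX


# Hypothetical station records: (station_id, latitude, longitude).
stations = [
    ("citizen_a", 42.70, 23.32),  # inside the box
    ("citizen_b", 42.80, 23.32),  # too far north
    ("citizen_c", 42.65, 23.10),  # too far west
]

# Keep only the stations that fall inside the study area.
selected = [sid for sid, lat, lon in stations if in_study_area(lat, lon)]
print(selected)
```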
1.2 Time dimension
We took into account only the data from 2018, because for that year a large number of measurements was available for both the official stations and the citizen stations.
The measurements have hourly granularity.
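A possible way to restrict raw readings to 2018 and bring them to hourly granularity is sketched below. It assumes the raw readings may arrive more often than once per hour and averages them within each hour. That averaging step, the function names and the sample values are assumptions of this sketch, not something the experiments above state.

```python
from datetime import datetime


def hour_key(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def hourly_means_2018(readings):
    """Average PM10 readings per hour, keeping only timestamps from 2018.

    `readings` is an iterable of (timestamp, pm10) pairs (assumed layout).
    """
    sums = {}
    for ts, pm10 in readings:
        if ts.year != 2018:
            continue  # drop everything outside the chosen year
        key = hour_key(ts)
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + pm10, count + 1)
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up readings for illustration only.
readings = [
    (datetime(2018, 1, 1, 10, 5), 10.0),
    (datetime(2018, 1, 1, 10, 45), 20.0),
    (datetime(2017, 12, 31, 23, 0), 99.0),  # not from 2018, ignored
]
hourly = hourly_means_2018(readings)
print(hourly)
```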
2. Measuring the correlation of data
We measure the correlation between the measurements of each citizen station and those of each official station.
Therefore, for each citizen station we obtain 5 correlation coefficients, one for each of the 5 official stations.
Each correlation coefficient is a Pearson correlation coefficient calculated over the whole time interval (from the beginning of 2018).
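The per-station coefficients could be computed roughly as follows. This is a minimal sketch that uses the textbook Pearson formula. It assumes the two series are already aligned hour by hour and that hours missing from either series have been dropped. The function names and sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def correlations_for(citizen_series, official_series_by_id):
    """One coefficient per official station for a single citizen station."""
    return {
        station_id: pearson(citizen_series, series)
        for station_id, series in official_series_by_id.items()
    }


# Made-up hourly PM10 values, aligned hour by hour.
citizen = [30.0, 45.0, 60.0, 40.0]
official = {
    "official_1": [25.0, 40.0, 55.0, 35.0],  # moves together with the citizen station
    "official_2": [50.0, 35.0, 20.0, 40.0],  # moves in the opposite direction
}
coefficients = correlations_for(citizen, official)
print(coefficients)
```

With real data there would be 5 entries in the dictionary, one per official station.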
Finally, in order to normalize the measurements we perform the following:
For each hour we know the citizen station measurement and the 5 official station measurements.
1) We multiply each of the 5 official measurements by its associated Pearson correlation coefficient. Each coefficient is a number between -1 and 1. Zero means no linear correlation, while 1 or -1 represents a perfect positive or negative correlation.
2) We add together the 5 weighted official measurements and the citizen station measurement.
3) We calculate the sum of the weights as 1 plus the sum of all the Pearson coefficients. The 1 is the weight of the citizen station itself, which we consider fully correlated with itself.
4) We divide the sum of the weighted measurements by the sum of the weights in order to obtain a weighted mean of the measurement at that time.
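The four steps above can be sketched for a single hour as follows. The function name and the sample numbers are made up for illustration. The guard against a zero weight sum is an addition of this sketch, since negative coefficients could in principle cancel the citizen weight of 1.

```python
def normalize(citizen_value, official_values, coefficients):
    """Weighted mean of one citizen reading and the official readings.

    The citizen reading has weight 1. Each official reading is weighted
    by its Pearson coefficient with respect to the citizen station.
    """
    if len(official_values) != len(coefficients):
        raise ValueError("one coefficient is needed per official station")
    # Steps 1 and 2: weight the official readings, then add the citizen reading.
    weighted_sum = citizen_value + sum(
        r * value for r, value in zip(coefficients, official_values)
    )
    # Step 3: 1 for the citizen station plus all the coefficients.
    weight_sum = 1.0 + sum(coefficients)
    if weight_sum == 0:
        raise ZeroDivisionError("the weights sum to zero")
    # Step 4: weighted mean.
    return weighted_sum / weight_sum


# Made-up values for one hour: one citizen reading and 5 official readings.
citizen_pm10 = 60.0
official_pm10 = [40.0, 50.0, 45.0, 55.0, 60.0]
pearson_coefficients = [0.8, 0.6, 0.5, 0.7, 0.4]

normalized = normalize(citizen_pm10, official_pm10, pearson_coefficients)
print(normalized)  # approximately 51.75
```

In this example the weighted sum is 60 + 147 = 207 and the weight sum is 1 + 3 = 4, so the citizen reading of 60 moves to about 51.75.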
In order to demonstrate the normalization effect we provide two heat maps of the PM10 measurements, before and after the normalization.
The big squares are the official stations, while the small dots represent the citizen stations. As you can see, after the normalization the citizen station values are pulled toward the official measurements.