## 14 thoughts on “Monthly Challenge – Sofia Air – Solution – Jeremy Desir Weber”

Looks great. What is the matter with this Ln108 error?

Thank you for reading. I do not have the same Ln numbers, but I think you are referring to the warnings associated with the distribution plots. Still wondering how to fix them ^^

Your assignments to peer review (and give feedback below the corresponding articles) for week 2 of the Monthly challenge are the following teams:

https://www.datasciencesociety.net/sofia-air-week-1/

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-jacob-avila/

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-iseveryonehigh/

Great work. Using clustering to match the stations is pretty cool.

Thanks for sharing, there are some interesting concepts. For the distance algorithm, you might want to check out the Vincenty distance:

https://stackoverflow.com/questions/38248046/is-the-haversine-formula-or-the-vincentys-formula-better-for-calculating-distan
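For readers comparing the two approaches, here is a minimal sketch of the haversine distance, which treats the Earth as a sphere. The function name `haversine_km` and the mean radius of 6371 km are assumptions made for illustration, not taken from the notebook under review. Vincenty's formula works on an ellipsoid instead, so it is more accurate over long distances, but at city scale the difference is usually small.

```python
import math

# Assumed mean Earth radius in kilometres (spherical approximation).
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator.
one_degree_km = haversine_km(0.0, 0.0, 0.0, 1.0)
```

Checking how much the station-to-cluster assignment changes when this function is swapped for an ellipsoidal distance would show how sensitive the clustering is to the choice.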

Thank you very much for your input, Martin! I'll have a look and see how sensitive the clustering is to the distance computation method. Also, I hope you are OK with my using a bit of your code for the world map data viz part. It was well designed and I wanted a clean, quick solution, but I will adapt it ASAP to fit my purposes. Best regards!

Of course, you are more than welcome 🙂 At the end of the day we are here to collaborate and learn! (At least I am 🙂)

Your assignments to peer review (and give feedback below the corresponding articles) for week 3 of the Monthly challenge are the following teams:

https://www.datasciencesociety.net/the-pumpkins/

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-kiwi-team/

https://www.datasciencesociety.net/data-exploration-observations-planning/

Great work! I learned a lot from your Python code.

However, I have some remarks/questions.

Did you filter the citizen data on PM10 measurement quality? There are a lot of observations above the max value from the official stations.

What is the point in averaging the data per hour for several citizen stations? What do we gain with it?

Thanks for your message 🙂 You are right, I could have filtered the PM10 measurements against the official stations, but I was not sure whether these extreme values were outliers to remove or carried critical information that we should keep to properly reflect the situations we most want to forecast, i.e. rare PM concentration peaks. Secondly, averaging the data per hour for the citizen stations in a given geo-unit (or cluster) could synthesize information and smooth the signal. To be honest, I do not know yet how relevant it is, but assuming it makes sense, DBA (DTW Barycenter Averaging) handles the centroid update step in K-Means more appropriately for time series than a Euclidean mean.
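To make the hourly averaging concrete, here is a minimal sketch using only the Python standard library. It assumes each reading is a `(cluster_id, timestamp, pm10)` tuple. That layout, the function name `hourly_cluster_mean`, and the sample values are hypothetical and do not come from the notebook. It shows plain arithmetic averaging per cluster and hour, not DBA.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_cluster_mean(readings):
    """Average PM10 per (cluster, hour) from (cluster_id, timestamp, pm10) tuples."""
    buckets = defaultdict(list)
    for cluster_id, timestamp, pm10 in readings:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[(cluster_id, hour)].append(pm10)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical readings from two stations that fall in the same cluster "A".
sample = [
    ("A", datetime(2018, 1, 1, 10, 5), 20.0),
    ("A", datetime(2018, 1, 1, 10, 40), 40.0),
    ("A", datetime(2018, 1, 1, 11, 10), 30.0),
    ("B", datetime(2018, 1, 1, 10, 15), 55.0),
]
hourly = hourly_cluster_mean(sample)
```

Averaging this way trades detail for a smoother signal: a short peak seen by a single station is diluted by its neighbours, which is the concern raised above about keeping rare concentration peaks.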

One more thing: how did you come up with the limits for the pressure measurements? When I compared both data sets (official and citizen), I found that 75% of the citizen pressure measurements are below the min value from the official data.

Thank you for this relevant question. I also compared the official meteo data to the citizen data and found a similar discrepancy in pressure. After a more careful reading of the meteo documentation, it turned out to be due to the sea-level adjusted pressure they provide. Once you apply this transformation, which takes elevation and temperature into account in the pressure derivation, the adjusted citizen values fall nicely within the extremes of the official data. This was not obvious at all, and I think those measurements are now properly comparable, especially across two different elevation levels (see the updated notebook).
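As an illustration of that kind of adjustment, here is a minimal sketch of one commonly used sea-level reduction formula. This is an assumption: the exact formula used in the notebook or in the meteo documentation is not stated in this thread, and the function name and sample values (a station at about 550 m, roughly Sofia's elevation) are hypothetical.

```python
def sea_level_pressure_hpa(station_pressure_hpa, elevation_m, temperature_c):
    """Reduce a station pressure reading to sea level.

    Uses a common approximation based on a standard temperature lapse
    rate of 0.0065 K/m (an assumed formula, not necessarily the one
    used in the notebook).
    """
    lapse = 0.0065 * elevation_m
    factor = 1.0 - lapse / (temperature_c + lapse + 273.15)
    return station_pressure_hpa * factor ** -5.257


# Hypothetical citizen reading: 950 hPa at 550 m and 15 degrees Celsius.
adjusted = sea_level_pressure_hpa(950.0, 550.0, 15.0)
```

With these sample inputs the adjusted value lands in the low 1010s hPa, which is the typical range for sea-level pressure and explains why raw citizen readings taken at altitude sit below the official minimum.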

Your assignments to peer review (and give feedback below the corresponding articles) for week 4 of the Monthly challenge are the following teams:

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-banana/

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-dirty-minds/

https://www.datasciencesociety.net/monthly-challenge-sofia-air-solution-kung-fu-panda/

I am highly impressed by the progress you have made – method-wise and analysis-wise…. good job!