
Datathon Sofia Air Solution – Telelink Case Solution

5 thoughts on “Datathon Sofia Air Solution – Telelink Case Solution”

  1. Hi all!

    Good focus on data quality so far, and cool use of maps for exploratory analysis. Well spotted on the missing values in the ‘official’ (EEA) dataset!

    For data enrichment, did you have to deal with cases where large chunks of hours or days are missing? E.g. how would you fill in values between 2PM and 8PM when everything in that window is missing? Does your approach aim to handle this?
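
    For illustration, a gap-aware fill might look roughly like the sketch below (minimal pandas; the series name and values are hypothetical). Interpolation bridges only short gaps, while a long stretch such as 2PM to 8PM stays missing and needs another source.

    import pandas as pd

    # Hypothetical hourly PM10 series with a six-hour gap (14:00-19:00 missing).
    idx = pd.date_range("2018-01-01 00:00", periods=24, freq="h")
    pm10 = pd.Series(range(0, 48, 2), index=idx, dtype="float64")
    pm10.loc["2018-01-01 14:00":"2018-01-01 19:00"] = None

    # Label each run of consecutive NaNs and measure its length.
    is_na = pm10.isna()
    gap_len = is_na.groupby((~is_na).cumsum()).transform("sum")

    # Fill only short gaps (up to 3 consecutive hours); longer stretches
    # stay NaN and need another source, e.g. the nearest official station.
    interpolated = pm10.interpolate(method="time")
    filled = pm10.copy()
    short = is_na & (gap_len <= 3)
    filled[short] = interpolated[short]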

    Good notes on future improvements to be made. I especially like the potential use of Google’s traffic data: too often folks focus on reinventing the wheel with primary measurements when, more and more often these days, our benevolent corporate overlords have already done the legwork for us!

  2. Hi guys, good job!

    Business Understanding: the text is relevant and the research objectives are stated clearly.

    Data Understanding: I very much like the use of heatmaps to visualize the air pollution information contained in the citizens’ dataset. You did a good analysis of the issues in the official measurements data. However, note that the citizens’ dataset spans only 2017 to 2018. Taking 2013-2016 as the training set and 2017-2018 as the test set would therefore work only if you were predicting air pollution at the official stations, whereas the objective is to deliver forecasts for the citizen stations.
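
    In other words, any hold-out split has to come from within the citizen data’s own 2017-2018 window. A minimal sketch of such a split in pandas (column names and values are hypothetical):

    import pandas as pd

    # Hypothetical daily citizen-station means; the data covers 2017-2018 only.
    citizen = pd.DataFrame({
        "time": pd.date_range("2017-01-01", "2018-08-31", freq="D"),
        "pm10": 35.0,  # placeholder values
    })

    # Split inside the citizen window, e.g. train on 2017 and validate on 2018,
    # rather than training on 2013-2016 official data and testing on 2017-2018.
    train = citizen[citizen["time"] < "2018-01-01"]
    test = citizen[citizen["time"] >= "2018-01-01"]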

    Including a section on future improvements and a list of references is an advantage.

  3. @all I do not completely understand this part:

    Check if the PM10 from the sensor is no more than 3 times bigger than the official one:

    if it falls within the limit – take the data as valid;
    if not – replace it with the official station’s value.

    What do you mean by “replace”?

    1. Hi, sorry, that part was worded unclearly.

      This is regarding the task of validating whether the citizens’ data are trustworthy. We decoded the geohashes and obtained the locations of the citizens’ stations, then calculated the distances between each station and all official stations. We grouped by date and station and checked whether the mean measurement for a day at a particular station is more than 3 times the official mean measurement. If it is, we assumed a measurement error and replaced the value with the official measurement value.

      The cutoff here is somewhat subjective. We considered comparing against a 3-standard-deviation interval instead, but that approach would not be appropriate since we are not working with a normal distribution (we did not run formal tests, but the values cannot fall below 0, so another distribution, probably a gamma, should fit better). So the constant 3 is just arbitrary.
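
      Put together, the rule could be sketched roughly as follows in pandas (all frame and column names here are hypothetical, and the geohash decoding is assumed to have already produced lat/lon coordinates):

      import numpy as np
      import pandas as pd

      def haversine_km(lat1, lon1, lat2, lon2):
          # Great-circle distance in kilometres, vectorised over arrays.
          lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
          a = (np.sin((lat2 - lat1) / 2) ** 2
               + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
          return 2 * 6371.0 * np.arcsin(np.sqrt(a))

      # citizen: daily means, columns [station, lat, lon, date, pm10]
      # official: daily means, columns [official_id, lat, lon, date, pm10]
      def validate(citizen, official, factor=3.0):
          # Nearest official station for every citizen station.
          locs = citizen[["station", "lat", "lon"]].drop_duplicates()
          offs = official[["official_id", "lat", "lon"]].drop_duplicates()
          pairs = locs.merge(offs, how="cross", suffixes=("", "_off"))
          pairs["dist"] = haversine_km(pairs["lat"], pairs["lon"],
                                       pairs["lat_off"], pairs["lon_off"])
          nearest = pairs.loc[pairs.groupby("station")["dist"].idxmin(),
                              ["station", "official_id"]]

          # Join the official daily mean for the same date; if the citizen
          # mean is more than `factor` times the official one, treat it as
          # a measurement error and substitute the official value.
          df = citizen.merge(nearest, on="station").merge(
              official[["official_id", "date", "pm10"]],
              on=["official_id", "date"], suffixes=("", "_off"))
          bad = df["pm10"] > factor * df["pm10_off"]
          df.loc[bad, "pm10"] = df.loc[bad, "pm10_off"]
          return df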
