I have just started Andrew Ng’s machine learning course on Coursera, so I thought this challenge would be a good test of what I have learned. I apologise for the delay in writing this article. I was not sure whether I should take on this challenge, since the dataset seemed difficult to understand. After reading a few articles, I think I have an idea of what to do for my first step.
Here is what I did in week 1:
I imported the datasets into Jupyter Notebook, and the first thing I did was join the two datasets.
Then I grouped the rows by geohash and split them into a dictionary, with each geohash as the key and a dataframe of that geohash’s columns as the value.
Then I removed the geohash column from each dataframe stored in the dictionary, since the key already carries it.
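The join, group and drop steps above can be sketched roughly like this. This is only a minimal sketch: the tiny dataframes, the column names (`timestamp`, `P1`, `temperature`) and the choice of join keys are assumptions for illustration, not the real challenge schema.

```python
import pandas as pd

# Hypothetical stand-ins for the two challenge datasets; the real column
# names and join keys may differ.
readings = pd.DataFrame({
    "geohash": ["abc", "abc", "xyz"],
    "timestamp": [1, 2, 1],
    "P1": [10.0, 12.0, 30.0],
})
weather = pd.DataFrame({
    "geohash": ["abc", "abc", "xyz"],
    "timestamp": [1, 2, 1],
    "temperature": [21.0, 0.0, 18.0],
})

# Join the two datasets on the assumed shared keys.
merged = pd.merge(readings, weather, on=["geohash", "timestamp"], how="inner")

# Build one dataframe per geohash and drop the now-redundant geohash column,
# because the dictionary key already identifies the location.
by_geohash = {
    geohash: group.drop(columns="geohash").reset_index(drop=True)
    for geohash, group in merged.groupby("geohash")
}
```

Keeping one dataframe per geohash makes it easy to run per-location cleaning later without values leaking between locations.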
After that, I used ‘ffill’ (forward fill) to replace the 0 values in the temperature, pressure and humidity columns. I don’t think I should have replaced the 0 values in the temperature column, since 0 degrees can be a valid reading, but most temperature values seemed to be above 0 degrees at a glance. I will change that in the future (I’m traveling now, so I don’t have access to Jupyter Notebook).
I did this after grouping by geohash because I didn’t want values from different geohashes to get mixed up.
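Here is a rough sketch of the per-geohash forward fill. It assumes the zeros are first turned into missing values and then filled with `ffill`; the helper name `fill_zero_readings` and the sample numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

SENSOR_COLUMNS = ["temperature", "pressure", "humidity"]

def fill_zero_readings(frame, columns=SENSOR_COLUMNS):
    """Treat 0 as a missing reading and carry the last valid value forward."""
    cleaned = frame.copy()
    cleaned[columns] = cleaned[columns].replace(0, np.nan).ffill()
    return cleaned

# Hypothetical per-geohash data; real values come from the challenge dataset.
by_geohash = {
    "abc": pd.DataFrame({
        "temperature": [21.0, 0.0, 23.0],
        "pressure": [0.0, 1012.0, 0.0],
        "humidity": [55.0, 0.0, 0.0],
    }),
}

# Filling each geohash separately keeps one location's readings from
# leaking into another's.
cleaned = {geohash: fill_zero_readings(frame) for geohash, frame in by_geohash.items()}
```

Note that a zero in the very first row has no earlier value to copy, so it stays missing (NaN) after the fill.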
Now comes the hardest part, which took me a lot of time and hair pulling: the visualisation.
I was using Spyder all this time because it has a fantastic feature called ‘variable explorer’, which I’m a huge fan of. I wanted to visualise the data as a heat map, so I started by generating a KML file. That failed miserably, so I moved on to generating GeoJSON. That failed horribly too. Last night, while randomly searching around, I stumbled upon a library called ‘Folium’. Lo and behold, all my problems were solved!
But it required Jupyter Notebook, so I needed some time to make the switch :/
After that, I took the average of all the values in P1, P2, etc. for each geohash and mapped those averages to the geohash.
That’s all for week 1. Next I’ll look into linear regression!
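The averaging step, plus turning each geohash into a point a heat map can use, might look roughly like the sketch below. The geohash decoder is the standard base-32 algorithm written out by hand, the sample values are invented, and `build_heatmap` is a hypothetical helper that assumes Folium is installed. It is not necessarily how my notebook does it.

```python
import pandas as pd

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def decode_geohash(geohash):
    """Return the (lat, lon) centre of a geohash cell."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    is_lon = True  # geohash bits alternate, starting with longitude
    for char in geohash:
        value = BASE32.index(char)
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            bounds = lon_range if is_lon else lat_range
            mid = (bounds[0] + bounds[1]) / 2
            if bit:
                bounds[0] = mid  # keep the upper half
            else:
                bounds[1] = mid  # keep the lower half
            is_lon = not is_lon
    return sum(lat_range) / 2, sum(lon_range) / 2

# Hypothetical per-geohash frames with made-up P1/P2 values.
by_geohash = {
    "ezs42": pd.DataFrame({"P1": [10.0, 20.0], "P2": [4.0, 6.0]}),
    "ezs43": pd.DataFrame({"P1": [30.0, 50.0], "P2": [8.0, 12.0]}),
}

# Average each column per geohash.
averages = {
    geohash: frame[["P1", "P2"]].mean().to_dict()
    for geohash, frame in by_geohash.items()
}

# [lat, lon, weight] rows, weighted here by the average P1.
heat_points = [[*decode_geohash(g), avg["P1"]] for g, avg in averages.items()]

def build_heatmap(points, path="heatmap.html"):
    """Hypothetical helper: draw the points as a Folium heat map."""
    import folium
    from folium.plugins import HeatMap

    centre_lat = sum(p[0] for p in points) / len(points)
    centre_lon = sum(p[1] for p in points) / len(points)
    fmap = folium.Map(location=[centre_lat, centre_lon], zoom_start=12)
    HeatMap(points).add_to(fmap)
    fmap.save(path)
    return fmap
```

Calling `build_heatmap(heat_points)` would write an HTML file that can be opened in a browser or displayed inline in Jupyter Notebook.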