Popular comments by zenpanik
Great – lots of content and interesting charts and numbers.
1. You may want to include evaluation metrics for your models on both the training and test sets.
2. On the data prep part – simply removing rows with missing values is not the best solution, because this is time-series data and dropping rows could seriously bias your next steps.
3. The assumption you made about the “large number of missing values” is probably poorly supported. Do you have any data or metric that backs it up?
4. You may want to include a more detailed explanation of why the data is not continuous (here is a link on discrete and continuous data: https://www.mathsisfun.com/data/data-discrete-continuous.html).
5. How would you rank your model? Which metrics did you use?
6. Would you bet your own money on your predictions? If so, how much?
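To make the points about missing values more concrete, here is a minimal sketch in plain Python. It measures what share of a series is actually missing, then fills the gaps by linear interpolation instead of dropping rows, so the time index stays intact. The `readings` series and the helper names `missing_fraction` and `fill_gaps` are made up for illustration and are not taken from the article. Linear interpolation is only one possible choice and may not suit your data.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing observations."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_fraction(series):
    """Share of observations in the series that are missing."""
    return sum(1 for v in series if is_missing(v)) / len(series)


def fill_gaps(series):
    """Fill gaps without dropping rows, so the series keeps its length.

    Interior gaps are linearly interpolated between the nearest observed
    neighbours; leading and trailing gaps copy the nearest observed value.
    """
    known = [i for i, v in enumerate(series) if not is_missing(v)]
    if not known:
        raise ValueError("series has no observed values")

    filled = list(series)

    # Edge gaps: carry the first / last observed value outwards.
    for i in range(known[0]):
        filled[i] = series[known[0]]
    for i in range(known[-1] + 1, len(series)):
        filled[i] = series[known[-1]]

    # Interior gaps: straight line between consecutive observed points.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)

    return filled


# Hypothetical, evenly spaced readings with three missing values.
readings = [10.0, None, 14.0, None, None, 20.0, 21.0]

print(f"missing share: {missing_fraction(readings):.1%}")
print(fill_gaps(readings))
```

Reporting the missing share first shows whether "a large number of missing values" is really the case, and comparing model results on the dropped-rows and filled versions of the data would show how much the choice matters.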
You have underestimated data understanding, EDA and feature engineering, which are an important part of data science. Some visual representations would be nice. Tables and some numbers would also be welcome in the paper.
Very good article. I would like to see more content inside, though. It would be great if you uploaded the Jupyter notebook with results and plots here. I have several comments and suggestions (apologies if these were already done, but I cannot open the project).
1. Since there are multiple pretrained models for the English language, the first chart can be extended to:
– Multiple histograms by PoS (part-of-speech) tag – for nouns, verbs, etc.
– Histograms for Plural vs Singular distributions
– Stacked bar charts (histograms) with count of Stop words vs Other words
– Names and Special words – Entity recognition models
2. The second chart, of the most common words, can be replicated for:
– different PoS
– Plurals vs Singulars
– Stop words (why don't you analyze stop words?)
Then you can bucketize the sentences based on this exploratory analysis.
3. Random Forest model – why were word2vec embeddings not included as features in the Random Forest model?
4. I do not see any metrics or graphs on the models' performance. What is the best performance you were able to achieve?
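As a rough sketch of the stop-word suggestion, the snippet below counts stop words and other words separately, which gives the numbers behind a stacked bar chart and a per-group "most common words" chart. It uses only the Python standard library. The tiny `STOP_WORDS` set, the `sample` sentences and the helper names are assumptions made for illustration; a real analysis would more likely use the stop-word list and PoS tagger from an established NLP library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on", "it"}


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(texts):
    """Return two Counters: stop-word frequencies and other-word frequencies."""
    stop, other = Counter(), Counter()
    for text in texts:
        for token in tokenize(text):
            if token in STOP_WORDS:
                stop[token] += 1
            else:
                other[token] += 1
    return stop, other


# Made-up sample sentences standing in for the article's corpus.
sample = [
    "The cat sat on the mat.",
    "A dog and a cat play in the garden.",
]

stop, other = word_counts(sample)
print("stop words: ", sum(stop.values()), stop.most_common(3))
print("other words:", sum(other.values()), other.most_common(3))
```

The same two-counter pattern extends to plurals versus singulars or to PoS groups once each token carries a tag.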