Monthly Challenge: https://www.datasciencesociety.net/events/text-mining-data-science-monthly-challenge/
Monthly Challenge Case: https://www.datasciencesociety.net/monthly-challenge-ontotext-case/
Mentors’ Weekly Instructions: https://www.datasciencesociety.net/text-mining-data-science-monthly-challenge/
Data modeling + Investigation of the results + Evaluation
At this stage, participants will use machine learning algorithms to classify the companies into industry categories. Teams are encouraged to try different techniques and then investigate the results. During this investigation, participants should carefully analyze the causes of misclassification.
When choosing a modeling technique, make sure you are fully aware of the limitations of the data and of the specific assumptions that different data mining algorithms make about the data. You should also examine the target distribution and decide how to deal with the imbalanced nature of the data.
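As a minimal sketch of inspecting the target distribution, the snippet below counts classes and derives inverse-frequency weights. The labels are invented for illustration and are not taken from the challenge data. The weight formula, n_samples / (n_classes * class_count), is the same heuristic scikit-learn applies when a classifier is given class_weight="balanced".

```python
from collections import Counter

# Hypothetical industry labels, invented for illustration only.
labels = ["tech"] * 6 + ["finance"] * 3 + ["mining"] * 1

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Share of each class in the target.
distribution = {name: count / n_samples for name, count in counts.items()}

# "Balanced" weights: rare classes get proportionally larger weights.
class_weights = {
    name: n_samples / (n_classes * count) for name, count in counts.items()
}

for name in sorted(counts):
    print(f"{name}: share={distribution[name]:.2f}, weight={class_weights[name]:.2f}")
```

A dictionary like class_weights can be passed as the class_weight argument of many scikit-learn classifiers. Resampling the training data is another option; which one works better here is something to test rather than assume.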
💡 HINTS: Don’t forget that different vectorization schemes, numbers of features, and feature selection approaches can lead to substantially different results during the modeling stage. You can use some form of grid search (the Pipeline class in scikit-learn is helpful here) to find the sequence of transformations and parameter settings that gives the best results.
Example of using Pipeline in Python to choose the best text representation, n-gram model, number of features and classifier hyperparameters – http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html
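The idea can be sketched roughly as follows. This is not the mentor's script: the toy corpus, the choice of TfidfVectorizer, SelectKBest and LogisticRegression, and the parameter grid are all assumptions made for illustration, and the real company descriptions should replace the invented texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only.
texts = [
    "cloud software platform for developers",
    "mobile app and software development studio",
    "enterprise software and data analytics",
    "open source software tools and cloud hosting",
    "retail bank offering loans and savings accounts",
    "investment bank and asset management",
    "consumer loans, credit cards and bank accounts",
    "insurance, loans and investment funds",
]
labels = ["tech"] * 4 + ["finance"] * 4

# Vectorization, feature selection and classifier chained as one estimator.
pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("select", SelectKBest(chi2)),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Step name + "__" + parameter name addresses a parameter inside the pipeline.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "vect__use_idf": [True, False],         # tf-idf vs. plain term frequency
    "select__k": [5, "all"],                # number of features kept
    "clf__C": [0.1, 1.0, 10.0],             # regularization strength
}

# Stratified folds keep the class proportions in every split.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=cv)
search.fit(texts, labels)

best_params = search.best_params_
print(best_params)
print(f"best macro F1 (cross-validated): {search.best_score_:.3f}")
```

Because the vectorizer and the feature selector sit inside the pipeline, they are refit on the training part of each fold, so the held-out fold does not leak into the learned vocabulary or the selected features.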
Think about which evaluation measures are appropriate for assessing the results and for ensuring the robustness of the proposed solution to this classification task. Perform error analysis – carefully investigate the results and the probable reasons for misclassification (Type I or Type II errors, i.e. false positives or false negatives).
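One possible starting point for the error analysis is sketched below, using scikit-learn's metrics on a handful of invented true and predicted labels (not results from the challenge). In a multi-class setting, false positives and false negatives are read per class: the columns of the confusion matrix show what was wrongly predicted as a class, the rows show what a class was mistaken for.

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Hypothetical true and predicted industry labels, invented for illustration.
y_true = ["tech", "tech", "tech", "finance", "finance", "mining"]
y_pred = ["tech", "tech", "finance", "finance", "finance", "tech"]
class_names = ["finance", "mining", "tech"]

# Rows are true classes, columns are predicted classes, in class_names order.
cm = confusion_matrix(y_true, y_pred, labels=class_names)

# Macro averaging weights every class equally, so rare classes are not hidden.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Per-class precision, recall and F1 in one text table.
report = classification_report(
    y_true, y_pred, labels=class_names, zero_division=0
)

# Indices of misclassified examples, to inspect the underlying texts by hand.
errors = [
    (i, true, pred)
    for i, (true, pred) in enumerate(zip(y_true, y_pred))
    if true != pred
]

print(cm)
print(report)
print(errors)
```

Accuracy alone can look good on imbalanced data while the small classes are never predicted, which is why the sketch reports macro F1 and per-class figures instead.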
Summarize all your results. And once again – share your findings and thoughts in creative ways – use funny pictures, emoticons and anything else that helps you express all the hard work you have done and what you have accomplished! 🏆
Finally, try answering the following questions:
- Did you manage to address the business problem at hand? What are the limitations?
- Do you have suggestions for further improving your work?
Mentor’s approach to this week’s task
Check out the Video by the mentor!
Link to GitHub script from the video