Monthly Challenge: https://www.datasciencesociety.net/events/text-mining-data-science-monthly-challenge/
Monthly Challenge Case: https://www.datasciencesociety.net/monthly-challenge-ontotext-case/
Mentors’ Weekly Instructions: https://www.datasciencesociety.net/text-mining-data-science-monthly-challenge/
Text Representation + Feature selection
At the end of this phase, every team should have their datasets in a format suitable for further analysis and for building the classification model. This means converting the text into numbers so that mathematical algorithms can be applied to it. Participants should therefore choose a text representation method. As a next step, feature selection techniques can be used to choose the best subset of explanatory variables.
1. Text Representation
It’s time to turn the text data into numerical form. The most widely used technique is the vector space model, which represents each document as a vector in a high-dimensional space. Familiarize yourself with the model’s assumptions. If it is appropriate for the problem at hand, the next question is which form of text vectorization to use.
💡 HINTS: Read more about binary, count and tf-idf representations. Consider, both theoretically and practically, which form is more suitable for the problem at hand. You can try different schemes. A useful Python library for turning text into numerical format is sklearn (scikit-learn).
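To make the three schemes concrete, here is a minimal sketch using sklearn. The toy documents are made up for illustration and are not part of the challenge dataset, and the default tokenization and weighting settings are assumptions that you will probably want to tune.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus (not the challenge data).
docs = [
    "bank offers loans and bank accounts",
    "software company builds cloud software",
    "bank invests in software",
]

# Binary: 1 if the term occurs in the document, 0 otherwise.
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)

# Count: number of times each term occurs in the document.
count_vec = CountVectorizer()
X_count = count_vec.fit_transform(docs)

# Tf-idf: term frequency down-weighted for terms that appear in many documents.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)

# vocabulary_ maps each term to its column index in the matrix.
bank_col = count_vec.vocabulary_["bank"]
print("count of 'bank' in doc 0: ", X_count[0, bank_col])
print("binary flag for 'bank':   ", X_binary[0, bank_col])
print("tf-idf weight of 'bank':  ", round(X_tfidf[0, bank_col], 3))
```

All three calls return sparse matrices with one row per document and one column per vocabulary term, so they can be swapped in and out of the same classification pipeline when comparing schemes.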
📖 Resources: For more information on the Vector Space Model and text representations usage:
- Miner, G. (2012). Practical text mining and statistical analysis for non-structured text data applications. Academic Press.
- Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1), 1-167.
2. Feature Engineering and Selection
Here comes the interesting part: feature engineering. At this stage, you should choose which features to use for predicting the industry sectors. Be creative and keep changing your perspective on the data. If necessary, step back to the data processing stage and extract more information from the text data. Guiding questions:
- Are you going to use only text data, or are there other features available? Consider which variables may be correlated with the target.
- How about using the n-gram model for text features?
- How many features are you going to use in the classification? Do you need a feature selection technique to keep only the most important features, save execution time and reduce the chance of overfitting?
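As one possible way to explore the last two questions, the sketch below builds unigram and bigram tf-idf features and then keeps the top-scoring ones with a chi-squared test. The documents, the sector labels and the choice of `k=5` are hypothetical, and chi-squared is only one of several selection techniques you could try.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy corpus and sector labels (not the challenge data).
docs = [
    "retail bank offers consumer loans",
    "investment bank manages client assets",
    "software company builds cloud platform",
    "cloud software startup ships mobile apps",
]
labels = ["finance", "finance", "technology", "technology"]

# ngram_range=(1, 2) extracts both single words and two-word phrases.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# Keep the k features with the highest chi-squared score against the labels.
# chi2 requires non-negative feature values, which tf-idf satisfies.
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X, labels)

# Map the selected column indices back to readable terms.
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}
selected_terms = [index_to_term[i] for i in selector.get_support(indices=True)]

print("features before selection:", X.shape[1])
print("features after selection: ", X_selected.shape[1])
print("selected terms:", selected_terms)
```

On real data, the number of features to keep is best chosen by cross-validation, fitting the selector on the training folds only so that the test data does not influence which features survive.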
📖 Resources: For more information on n-grams and feature selection techniques in text mining:
- Miner, G. (2012). Practical text mining and statistical analysis for non-structured text data applications. Academic Press.
- Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1), 1-167.
- Manning, C., Raghavan, P., & Schütze, H. (2010). Introduction to information retrieval. Natural Language Engineering, 16(1), 100-103.
Mentor’s approach to this week’s task
Check out the Video by the mentor!