Monthly Challenge: https://www.datasciencesociety.net/events/text-mining-data-science-monthly-challenge/
Monthly Challenge Case: https://www.datasciencesociety.net/monthly-challenge-ontotext-case/
Mentors’ Weekly Instructions: https://www.datasciencesociety.net/text-mining-data-science-monthly-challenge/
Data Preparation
Textual data is highly unstructured; to extract meaningful insights and apply mathematical algorithms, it must first be converted into a format suitable for analysis. This involves applying a series of transformations that ultimately represent the text in a numerical format.
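As a minimal illustration of what "numerical format" means here, the sketch below (assuming scikit-learn is available; the toy descriptions are invented purely for this example) turns a handful of short texts into a document-term matrix:

```python
# Minimal sketch: turning raw text into a numerical (document-term) matrix.
# The toy descriptions are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer

descriptions = [
    "Cloud platform for data analytics",
    "Mobile payments and banking services",
    "Data-driven marketing analytics platform",
]

vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(descriptions)    # sparse matrix: documents x terms

print(vectorizer.get_feature_names_out())     # the vocabulary (matrix columns)
print(X.toarray())                            # raw term counts per description
```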
1.1. Text Processing
At this step, it is important to apply only the NLP techniques that are consistent with your findings about the data characteristics from the previous stage. Guiding questions:
- Perform some simple text normalization techniques – e.g. converting all terms to lower case (both R and Python are case-sensitive);
- Consider which text features are not directly related to the problem and do not carry any information – for example, are punctuation or other special characters important, or can they be removed? Are digits important? Can you find a way to extract meaning from their presence in company descriptions?
- Do the most common words in the English language (e.g. ‘the’, ‘a’, ‘is’) give you any information about the problem at hand? These are commonly referred to as ‘stop words’ in NLP and have little predictive power. If you decide to remove stop words, carefully check what will be removed from your corpus (and, if necessary, revise the stop-word list with the characteristics and peculiarities of your data in mind).
- Consider whether it is appropriate to apply some techniques for reducing inflectional forms of words – examples of such techniques are stemming and lemmatization.
- Finally, examine carefully how your text data looks after all transformations have been applied. You can run checks similar to those from the data understanding phase. If necessary, revise the list of transformations, change their order or add extra steps (a minimal processing sketch follows the hints below).
💡 HINTS: Useful Python libraries for text processing – re, NLTK.
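A minimal sketch of the steps above, using the re and NLTK libraries from the hint (the sample sentence is invented, and the NLTK resources ‘stopwords’ and ‘wordnet’ must be downloaded once beforehand):

```python
# Minimal sketch of the text-processing steps above with re and NLTK.
# One-time setup: nltk.download('stopwords'); nltk.download('wordnet')
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "The company develops 3 cloud-based platforms for Data Analytics!"

text = text.lower()                           # normalize case
text = re.sub(r"[^a-z\s]", " ", text)         # drop punctuation and digits (revisit if digits carry meaning)
tokens = text.split()                         # simple whitespace tokenization

stop_words = set(stopwords.words("english"))  # inspect and revise this list for your corpus
tokens = [t for t in tokens if t not in stop_words]

lemmatizer = WordNetLemmatizer()              # lemmatization; nltk's PorterStemmer is the stemming alternative
tokens = [lemmatizer.lemmatize(t) for t in tokens]

print(tokens)                                 # inspect the processed tokens
```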
📖 Resources: For more information on stemming and lemmatization:
- Miner, G. (2012). Practical text mining and statistical analysis for non-structured text data applications. Academic Press.
- Weiss, S. M., Indurkhya, N., & Zhang, T. (2015). Fundamentals of predictive text mining. Springer.
1.2. Text Visualization
After the text processing phase is finished and the text is normalized to some extent, it is worth applying some visualization techniques (for example, word clouds) in order to get even more familiar with the data. The following questions can lead you to some useful findings:
- How many unique terms do you find in the corpus of company descriptions (or other text resources that you are using)?
- Which are the most commonly used words? And the least commonly used? Answer these questions for each category separately (one possible approach is sketched below).
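One possible sketch for these checks, using collections.Counter for term frequencies and the third-party wordcloud package for a word cloud (the texts and labels below are placeholders for your processed company descriptions and their categories):

```python
# Sketch: vocabulary size, most/least common terms per category, and a word cloud.
# `texts` and `labels` are placeholders for your processed descriptions and categories.
from collections import Counter
from wordcloud import WordCloud               # third-party package: pip install wordcloud
import matplotlib.pyplot as plt

texts = ["cloud data analytics platform", "mobile payment banking service"]
labels = ["IT", "FinTech"]

all_terms = [t for doc in texts for t in doc.split()]
print("Unique terms in the corpus:", len(set(all_terms)))

for category in sorted(set(labels)):
    counts = Counter(t for doc, lab in zip(texts, labels) if lab == category
                     for t in doc.split())
    print(category, "- most common:", counts.most_common(5))
    print(category, "- least common:", counts.most_common()[-5:])

wc = WordCloud(width=600, height=400).generate(" ".join(texts))  # word cloud for the whole corpus
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```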
1.3. Split your sample
Don’t forget that your results should be validated in some way in the final stage of the analysis. You are not provided with a separate test set, so depending on the sample size (and your approach) you can use cross-validation or a splitting technique to test model performance on unseen data.
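A minimal sketch of both options with scikit-learn (the toy corpus, labels and the Naive Bayes classifier are placeholders for illustration, not the prescribed approach):

```python
# Sketch: hold-out split vs. cross-validation for testing on unseen data.
# The toy corpus and the Naive Bayes model are placeholders for your own data and model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB

texts = ["cloud data analytics platform", "mobile payment banking service",
         "machine learning analytics tools", "online banking payment app"]
labels = ["IT", "FinTech", "IT", "FinTech"]

X = CountVectorizer().fit_transform(texts)

# Option 1: simple hold-out split (stratify keeps the category proportions)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=42)

# Option 2: k-fold cross-validation, useful when the sample is small
scores = cross_val_score(MultinomialNB(), X, labels, cv=2)
print("Accuracy per fold:", scores)
```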
Mentor’s approach to this week’s task
Check out the video by the mentor!