In this article the mentors give some preliminary guidelines, advice and suggestions to the participants for the case. Every mentor should write their name and chat name at the beginning of their text, so that there are no mix-ups with the other mentors.
According to the rules, it is essential to follow the CRISP-DM methodology (http://www.sv-europe.com/crisp-dm-methodology/). The DSS team and the industry mentors have tried to do most of the work on phases “1. Business Understanding” and “2. Data Understanding”, while the teams are expected to focus on phases 3, 4 and 5 (“Data Preparation”, “Modeling” and “Evaluation”). Phase “6. Deployment” mostly stays in the hands of the case-providing companies.
MENTORS’ GUIDELINES
(see the case here: https://www.datasciencesociety.net/the-ontotext-case-data-enriched/)
Starter on the Ontotext Case with Python – by Thomas Roca from Microsoft
Advice from a mentor: Thomas Roca, PhD Data strategist at Microsoft
Need more help? Find Thomas on the Data.Chat – @thoms.
1. Where to start for building a training set
- First step: get the industry classification from ICB.
- To build the training set, one approach is to look at the Forbes Global 2000 ranking of the 2000 biggest global companies. A CSV of it can be found via Google Dataset Search.
- A matching with the ICB classification may be necessary (a pandas sketch of such a match follows below).
- Extra information is available as open data, for example:
- here: http://factforge.net/ or there: http://dbpedia.org/page/Microsoft
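As a rough illustration of the matching step mentioned above, the sketch below assumes you have downloaded the Forbes Global 2000 list and an ICB sector mapping as CSV files (the file and column names here are hypothetical) and joins them on a normalized company name with pandas.

```python
import pandas as pd

# Hypothetical file and column names - adjust to the CSVs you actually download.
forbes = pd.read_csv("forbes_global_2000.csv")   # e.g. columns: Company, Country, Sector
icb = pd.read_csv("icb_classification.csv")      # e.g. columns: company_name, icb_industry

# Normalize company names so that simple exact matching has a chance to work.
forbes["name_key"] = forbes["Company"].str.lower().str.strip()
icb["name_key"] = icb["company_name"].str.lower().str.strip()

# Left join: keep all Forbes companies, attach the ICB industry where a match exists.
training_set = forbes.merge(icb[["name_key", "icb_industry"]], on="name_key", how="left")

print(training_set["icb_industry"].isna().mean())  # share of companies still unmatched
```

Exact name matching will leave gaps, so expect to add some manual or fuzzy matching on top.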
2. Build a text classification model
- Machine learning resources:
- e.g. with scikit-learn (a minimal classification sketch follows below):
- http://scikit-learn.org/stable/modules/svm.html#classification
- blog post with an implementation of it: https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f
- Resources on LinkedIn Learning: https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/
- Deep learning with TensorFlow (e.g. via Keras) is another option; see the LSTM tutorial in the resources below.
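To make the scikit-learn route above concrete, here is a minimal, self-contained sketch of a linear SVM text classifier on toy data; with the real data you would replace the toy lists with company descriptions and their industry labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus - replace with company descriptions and their industry labels.
texts = [
    "cloud software and enterprise services",
    "oil and gas exploration and refining",
    "retail banking and consumer credit",
    "open source database and analytics platform",
]
labels = ["Technology", "Energy", "Financials", "Technology"]

# TF-IDF features followed by a linear SVM, as in the scikit-learn docs linked above.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])
model.fit(texts, labels)

print(model.predict(["software platform for data analytics"]))
```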
Some advice for feature extraction and text cleaning: the machine makes a distinction between each and every character. Reduce noise and stick to the meaning by lowercasing everything and getting rid of punctuation, stop words, and numbers if they are not relevant. Tokenize, and try to stem or lemmatize if needed.
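A minimal sketch of that cleaning recipe with NLTK (lowercasing, stripping punctuation and digits, removing stop words, tokenizing and stemming); the regular expression and the choice of the Porter stemmer are just one possible setup.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # first run only
nltk.download("stopwords")  # first run only

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(text):
    text = text.lower()                                   # lowercase everything
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop punctuation and digits
    tokens = word_tokenize(text)                          # tokenize
    tokens = [t for t in tokens if t not in stop_words]   # remove stop words
    return [stemmer.stem(t) for t in tokens]              # stem to reduce inflectional noise

print(clean("Microsoft develops, licenses and supports software products!"))
```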
Preprocessing tips:
- https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/preprocess-text
- https://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization
Other resources:
- http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
- https://www.datacamp.com/community/tutorials/scikit-learn-fake-news
- https://www.learndatasci.com/tutorials/predicting-reddit-news-sentiment-naive-bayes-text-classifiers/
- https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/
- https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
- https://github.com/kjam/random_hackery/blob/master/Comparing%20Fake%20News%20Classifiers.ipynb
- https://blog.kjamistan.com/comparing-scikit-learn-text-classifiers-on-a-fake-news-dataset/
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
- https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/
Get industry classification from ICB
You can have a look at an example of using BeautifulSoup for scraping web pages in this GitHub notebook by Thomas Roca.
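If you build your own scraper instead, a minimal BeautifulSoup pattern looks like the sketch below; the URL points at the Wikipedia page on the Industry Classification Benchmark, and the assumption that the sector structure sits in the first HTML table of that page may need adjusting.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: the ICB sector structure is in the first HTML table of this page.
url = "https://en.wikipedia.org/wiki/Industry_Classification_Benchmark"
response = requests.get(url, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

table = soup.find("table")  # first table on the page
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

for row in rows[:10]:
    print(row)
```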
More tips
1. Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.
The advice from mentors:
Gloria Hristova (@gloria):
Make sure you have a good understanding of the underlying problem which is going to be tackled in the Ontotext business case. In a nutshell, the main concept is that the inconsistency in industry classification schemes across different data sources leads to loss of information, missing values and inaccuracies – there is no precise and uniform categorization of companies into industries. This leads to the need for a completely automated and standardized method for classifying companies into industry sectors. This method should be capable of incorporating all the available information for a given company (this information can have substantial differences from one company to another since it comes from different data sources) and provide the correct industry categories.
The need for completely automated classification based on various data sources underlies the business problem at hand, and participants will be challenged to use complex graph-based features and overcome the noise in the data. Teams should be oriented towards finding a solution that provides not only accurate results but also robustness and applicability (this should be borne in mind especially during the stages of feature engineering and validation of the proposed methods). Solving this business problem will lead to higher data quality, the ability to carry out various industry analyses, and the application of more diverse business analytics techniques and tools to gain insights into the business environment.
2. Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
The advice from mentors:
Jan-Benedikt Jagusch (@jjagusch):
- With textual data, an exhaustive exploration is critical.
- Before throwing algorithms at the problem, read through some samples in your corpus and check for the following criteria:
- Are all documents written in the same language (English)?
- Are there missing values?
- Do all documents follow the same structure with approximately the same length?
- Are there specific keywords?
- Are there obvious formatting problems that need to be fixed (hyperlinks, tabs, carriage returns)?
- You should have a good understanding of your text database before continuing with your project.
Gloria Hristova (@gloria):
Make sure you provide a comprehensive description of all the provided data in this case. Look at different perspectives and describe all your findings. Dig deeper into the FactForge platform and extract even more complex features. Use all the hints provided by Ontotext and of course – your creativity. 🧙♂️
This stage is crucial for all the following steps because it is essentially where you discover what data you have and what you can do with it. It is advisable, both for you and for the reader, to visualize all your findings at this stage – find ways to describe the data in interesting and at the same time meaningful ways.
💡HINTS: Plotly provides powerful graphical and analytical tools for making beautiful visualizations. It can easily be used in Jupyter Notebooks with both R and Python.
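For instance, a bar chart of the number of companies per industry takes only a few lines in recent Plotly versions (the counts below are made up for illustration):

```python
import plotly.graph_objects as go

# Made-up class counts - in practice compute them from the target variable.
industries = ["Technology", "Financials", "Energy", "Health Care"]
counts = [420, 310, 150, 90]

fig = go.Figure(data=[go.Bar(x=industries, y=counts)])
fig.update_layout(title="Number of companies per industry",
                  xaxis_title="Industry", yaxis_title="Companies")
fig.show()  # renders inline in a Jupyter Notebook
```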
2.1. Explore the target variable 🎯
Try finding answers to the following questions:
- How many classes do we have in our target variable?
- How many instances do we have under each label?
- How many companies have multiple labels (one, two or more industry categories)?
- Are some of the classes under-represented (or the opposite), or is the dataset balanced (with an equal number of observations in each class of the target)?
The observations you make at this stage should influence your subsequent choice of sampling, modeling and validation techniques.
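A pandas sketch for these checks, assuming a DataFrame with an "industries" column that holds a list of labels per company (the column name and toy data are hypothetical):

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical structure: one row per company, labels stored as a list.
df = pd.DataFrame({
    "company": ["Acme", "Globex", "Initech"],
    "industries": [["Technology"], ["Energy", "Utilities"], ["Technology", "Financials"]],
})

label_counts = Counter(chain.from_iterable(df["industries"]))
print("Number of classes:", len(label_counts))              # how many classes
print("Instances per label:", label_counts)                 # how many instances per label
print(df["industries"].apply(len).value_counts())           # companies with 1, 2, ... labels
```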
2.2. Explore the features 🔎
The following questions can lead you to interesting findings:
- How many unique terms (before processing) do you find in the corpus of company descriptions (or other text resources that you are using)?
- Which are the most commonly used words? What about the rarest ones?
- What is the distribution of length (in number of words) of company descriptions? Look at the extremes.
- Find the length of individual terms. Look at the extremes.
Answering these questions will not only help you better understand the data, but will also guide you in finding missing values or unusual observations, which in turn should influence your choice of text processing techniques in the next stage.
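A rough sketch of these corpus statistics with a plain Counter, assuming "descriptions" is a list of raw company descriptions (the toy texts are placeholders):

```python
from collections import Counter

# Toy descriptions - replace with the real company descriptions.
descriptions = [
    "Global provider of cloud software and services",
    "Oil and gas exploration company operating worldwide",
]

tokens = [word for text in descriptions for word in text.lower().split()]
vocabulary = Counter(tokens)

print("Unique terms (before processing):", len(vocabulary))
print("Most common words:", vocabulary.most_common(5))
print("Rarest words:", vocabulary.most_common()[-5:])

lengths = [len(text.split()) for text in descriptions]
print("Description length (words): min", min(lengths), "max", max(lengths))

term_lengths = sorted(set(tokens), key=len)
print("Shortest and longest terms:", term_lengths[0], term_lengths[-1])
```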
3. Data Preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
The advice from mentors:
Jan-Benedikt Jagusch (@jjagusch):
It should not come as a surprise that machine learning algorithms cannot innately work with text. So, we will need to convert the corpus to numbers! Here are some steps that your pipeline should not miss:
- Standardise the formatting: remove all special characters, numbers and hyperlinks, and replace all white spaces with regular spaces. Convert everything to lower case. Also pay special attention to word contractions (“I’m”, “he’ll”, etc.).
- Create word tokens: split your string by spaces.
- Reduce words to their root: possibly apply stemming or lemmatisation. Usually it is necessary to tag the part of speech first.
- Remove words that have no value: these words are also called “stop words”.
- You can consider filtering further and focusing your analysis on nouns and adjectives.
- Also you can consider applying spell checking to your corpus. But be careful, this takes quite some time and might erroneously correct special terms it is not trained on. You can find a spell checking algorithm here (written by the famous Peter Norvig): https://norvig.com/spell-correct.html.
- Next, you need to convert your cleaned corpus into a matrix of numerical features. In increasing order of complexity, possible approaches are Bag-of-Words, TF-IDF, Word2Vec and Sentence2Vec.
- Especially with BoW and TF-IDF, reducing the feature space is key. You could think about using Singular Value Decomposition or Latent Dirichlet Allocation.
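A minimal sketch of the last two bullets, reducing a TF-IDF matrix with truncated SVD; the corpus and the number of components are placeholders (in practice you would keep hundreds of components).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "cloud software and enterprise services",
    "oil and gas exploration and refining",
    "retail banking and consumer credit",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)        # sparse document-term matrix

svd = TruncatedSVD(n_components=2)     # placeholder; use far more components on real data
X_reduced = svd.fit_transform(X)       # dense, low-dimensional "topic" features

print(X.shape, "->", X_reduced.shape)
```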
Gloria Hristova (@gloria):
Textual data is highly unstructured, and in order to extract meaningful insights and apply mathematical algorithms, it should be turned into an appropriate format for analysis. This involves applying a series of transformations to the data which will help you represent the text in numerical form.
3.1. Text Processing 🔣
At this step it is important to apply only NLP techniques which are consistent with your findings about the data characteristics from the previous stage. Guiding questions:
- Perform some simple text normalization techniques – ex. converting all terms to lower-case (both R and Python are case-sensitive);
- Consider which text features are not directly correlated with the problem and don’t give you any information – for example, are punctuation or special characters important, or can we remove them?
- Are digits important? Can you find a way to extract meaning from their presence in a company description?
- Perform tokenization of the text data (below this section you can find resources explaining the technique and its meaning).
- Do the most common words in the English language (ex. ‘the’, ‘a’, ‘is’, etc.) give you any information on the problem at hand? These are commonly referred to as ‘stop words’ in NLP and have little predictive power. If you decide to perform stop-word removal, you should carefully check what is going to be removed from your corpus (and, if necessary, revise the list of stop words while keeping in mind the data characteristics and peculiarities).
- Consider applying some technique for reducing inflectional forms of words – example of such techniques are stemming and lemmatization.
- Finally, examine carefully what your text data looks like after the application of all transformations. You can perform some checks like those performed in the data understanding phase. If necessary, revise the list of transformations, change their order or add some extra steps.
💡HINTS: Useful Python libraries for text processing – re, NLTK.
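If you go for lemmatization rather than stemming, NLTK's WordNet lemmatizer works best when it is given the part of speech; here is a sketch of that idea (the tag mapping covers only the four WordNet categories and is one possible convention):

```python
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for resource in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(resource)  # first run only

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to the four WordNet part-of-speech categories.
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The company was providing better services")
tagged = nltk.pos_tag(tokens)
print([lemmatizer.lemmatize(word, to_wordnet_pos(tag)) for word, tag in tagged])
```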
📖 Resources: For more information on tokenization, stemming and lemmatization:
1. Miner, G. (2012). Practical text mining and statistical analysis for non-structured text data applications. Academic Press.
2. Weiss, S. M., Indurkhya, N., & Zhang, T. (2015). Fundamentals of predictive text mining. Springer.
3.2. Text Representation 📜
It’s time to turn the text data into numerical form. The most widely used technique is the Vector Space Model, which represents text in a high-dimensional space. Familiarize yourself with the model’s assumptions and decide whether it is appropriate for the problem at hand. Another question which will immediately arise is which form of text vectorization to use.
💡 HINTS: Read more about binary, count and tf-idf representations. Consider both theoretically and practically which form is more suitable for the problem at hand. You can try different schemes. Useful Python libraries for turning text into numerical format – sklearn.
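The three schemes mentioned in the hint differ only in how a cell of the document-term matrix is filled; in scikit-learn they map onto the vectorizers below (a sketch on a toy corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["software software services", "banking services"]

binary = CountVectorizer(binary=True)   # 1 if the term occurs in the document, else 0
counts = CountVectorizer()              # raw term frequencies
tfidf = TfidfVectorizer()               # term frequency weighted by inverse document frequency

for name, vectorizer in [("binary", binary), ("count", counts), ("tf-idf", tfidf)]:
    X = vectorizer.fit_transform(corpus)
    print(name, vectorizer.get_feature_names_out(), X.toarray(), sep="\n")
```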
📖 Resources: For more information on the Vector Space Model and text representations usage:
1. Miner, G. (2012). Practical text mining and statistical analysis for non-structured text data applications. Academic Press.
2. Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1), 1-167.
3.3. Feature Engineering 🤓
Here comes the interesting part – feature engineering. At this stage, you should choose which features to use for predicting the industry sectors. Be creative and constantly try to change your perspective on the data. If necessary, step back to the data processing stage and extract more information from the text data. Guiding questions:
- Are you going to use only text data, or are there other features available? Consider which variables can be correlated with the target.
- How about using the n-gram model for text features?
- How many features are you going to use in the classification? Do you need an appropriate feature selection technique to keep only the most important features, save execution time, avoid memory issues and reduce the chances of overfitting the data?
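A sketch combining the last two points: unigram-plus-bigram TF-IDF features followed by a chi-squared filter that keeps only the top-k terms (the toy data and k are placeholders to tune):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = [
    "cloud software and enterprise services",
    "oil and gas exploration and refining",
    "retail banking and consumer credit",
    "open source database platform",
]
labels = ["Technology", "Energy", "Financials", "Technology"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
X = vectorizer.fit_transform(texts)

selector = SelectKBest(chi2, k=10)                 # keep the 10 most informative terms
X_selected = selector.fit_transform(X, labels)

print(X.shape, "->", X_selected.shape)
```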
📖 Resources: For more information on n-grams and feature selection techniques in text mining:
1. Miner, G. (2012). Practical text mining and statistical analysis for non-structured text data applications. Academic Press.
2. Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1), 1-167.
4. Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
The advice from mentors:
Jan-Benedikt Jagusch (@jjagusch):
- Keep in mind that this is a multi-label classification problem. Analyse the distribution of the labels and consider using over- or undersampling (e.g. SMOTE).
- Understand the magnitude of your training data. Eventually, you will have numerous input features for your algorithm. Maybe it is appropriate to find a model that can run in parallel and does not use all input features simultaneously (such as a Random Forest).
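A sketch of the multi-label setup on toy data: the labels are binarized into an indicator matrix and a Random Forest (which handles many features, accepts sparse input and parallelizes well) is trained on TF-IDF features. All names and values are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "cloud software and enterprise services",
    "oil and gas exploration and refining",
    "retail banking, insurance and consumer credit",
    "open source database and analytics platform",
]
labels = [["Technology"], ["Energy"], ["Financials", "Insurance"], ["Technology"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)            # one indicator column per industry label

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Random forests natively support multi-label indicator targets.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X, Y)

new_X = vectorizer.transform(["analytics software platform"])
print(mlb.inverse_transform(clf.predict(new_X)))
```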
Gloria Hristova (@gloria):
When choosing a modeling technique, make sure that you are completely aware of the limitations of the data and of the specific assumptions about the data made by the given data mining algorithm.
A key aspect of the case at hand is that you are dealing with a multi-label classification problem (each company can be assigned to several industry categories). At the data modeling stage, it is very important that participants consider algorithms and techniques that are suitable for such type of data mining problems.
You should also consider target distribution and figure out how to deal with the imbalanced data.
Don’t forget that your results should be validated in some way. Depending on the sample size you can use cross-validation or some splitting technique in order to test model performance on unseen data.
💡 HINTS: Don’t forget that different vectorization schemes, numbers of features and feature selection approaches can lead to substantially different results during the modeling stage. You can use some form of grid search (for example, combining the Pipeline class with GridSearchCV in scikit-learn) in order to find the optimal sequence of transformations that will lead to the best results.
An example of using Pipeline in Python to choose the best text representation, n-gram model, number of features and hyper-parameters of the classifier is sketched below.
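A minimal sketch of that idea, assuming a simple single-label setting for readability (for the multi-label case you would wrap the classifier accordingly): the Pipeline chains the vectorizer, feature selection and classifier, and GridSearchCV searches over their parameters jointly. The parameter values are placeholders to adapt.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2)),
    ("clf", LinearSVC()),
])

# Each key is "<step name>__<parameter>"; the values here are placeholders.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    "select__k": [1000, 5000, "all"],         # number of features to keep
    "clf__C": [0.1, 1, 10],                   # SVM regularisation strength
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
# search.fit(texts, labels)   # texts: list of documents, labels: their industry classes
# print(search.best_params_, search.best_score_)
```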
5. Evaluation
At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
The advice from mentors:
Jan-Benedikt Jagusch (@jjagusch):
- Think about what metric you want to optimise for. In the case of imbalanced learning, accuracy is a very biased metric.
- Do not forget that we are looking for a realistic solution. Maybe you want to balance classification performance with complexity and interpretability. Important note: you cannot interpret your model without understanding your input features. SVD in particular can produce quite transparent features (“topics”) that you can analyse graphically and statistically.
Gloria Hristova (@gloria):
For evaluating the results and ensuring the robustness of the proposed solution to this classification task – think about which measures for validation are appropriate when dealing with multi-label classification problems. Analyze the results and the probable reasons for misclassification (Type I or Type II errors).
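For multi-label outputs, the usual scikit-learn metrics take the binary indicator matrices directly; a sketch with made-up predictions:

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, precision_score, recall_score

# Made-up indicator matrices: rows = companies, columns = industry labels.
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print("Hamming loss:", hamming_loss(y_true, y_pred))            # share of wrong label assignments
print("Micro F1:", f1_score(y_true, y_pred, average="micro"))   # aggregates over all labels
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))   # treats every label equally
# Low precision means many false positives (Type I errors);
# low recall means many false negatives (Type II errors).
print("Precision (micro):", precision_score(y_true, y_pred, average="micro"))
print("Recall (micro):", recall_score(y_true, y_pred, average="micro"))
```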
6. Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.
The advice from mentors:
Jan-Benedikt Jagusch (@jjagusch):
- Data science is all about visualisation: try to convey your ideas graphically.
Gloria Hristova (@gloria):
Summarize all your results. And once again – share your findings and thoughts in creative ways – use funny pictures, emoticons and whatever you decide is useful and can help you in expressing all the hard work you have done and what you have accomplished! 🏆
Finally, try answering the following questions:
- Did you manage to address the business problem at hand? What are the limitations?
- Do you have suggestions for further improvement of your work?