Monthly Challenge Case: https://www.datasciencesociety.net/monthly-challenge-ontotext-case/
Mentors’ Weekly Instructions: https://www.datasciencesociety.net/text-mining-data-science-monthly-challenge/
At this stage, participants will use visualization techniques to get familiar with the available ‘basic’ features (raw data). This phase is crucial because the observations and hypotheses made here will determine which techniques are appropriate when building the classification model. Participants who want to use more complex features will also have time to get familiar with the FactForge platform, explore which explanatory variables can be extracted, and enrich their datasets. By the end of this first phase, all participants should have a dataset ready to be preprocessed during week 2. Beautiful visualizations of each team’s unique dataset will be more than welcome! 😊
For beginner data scientists – it is highly recommended to stick with the base dataset provided in CSV format. You will still be able to provide a strongly competitive solution, and you will have more time to invest in understanding the basic principles behind text visualization, processing, and modeling.
For advanced data scientists – you can try to dig deeper into the FactForge platform and do some feature engineering before the next stage where all the base + extracted features will have to be processed and prepared for modeling.
Mentor’s advice to this week’s task:
Text Mining Monthly Challenge – Initial Boost to Week 1
The following guidelines follow the CRISP-DM methodology. However, each of the general stages in the proposed methodology consists of important intermediate steps which participants should be aware of while working on their solutions. These guidelines aim to draw attention to the most critical points and aspects which should be considered while developing a working solution to the business problem.
1. 💼 Business Understanding
Make sure you have a good understanding of the underlying problem which is going to be tackled in the Ontotext business case. In a nutshell, the main concept is that the inconsistency in industry classification schemes across different data sources leads to loss of information, missing values and inaccuracies – there is no precise and uniform categorization of companies into industries. This leads to the need for a completely automated and standardized method for classifying companies into industry sectors. This method should be capable of incorporating all the available information for a given company (this information can have substantial differences from one company to another since it comes from different data sources) and provide the correct industry categories.
The need for completely automated classification based on various data sources underlies the business problem at hand, and participants will be challenged to use complex graph-based features and overcome the noise in the data. Teams should aim for a solution which provides not only accurate results but also robustness and applicability (this should be borne in mind especially during the stages of feature engineering and validation of the proposed methods). Solving this business problem will lead to higher data quality, the ability to carry out various industry analyses, and the application of more diverse business analytics techniques and tools to gain insights into the business environment.
2. 📊 Data Understanding + Feature Extraction
Make sure you provide a comprehensive description of all the data provided in this case. Look at the data from different perspectives and describe all your findings. For the advanced data scientists – dig deeper into the FactForge platform (http://factforge.net/) and extract even more complex features. Use all the hints provided by Ontotext and, of course, your creativity.
This stage is crucial for all the following steps because this is where you discover what data you have and what you can do with it. It is advisable, both for you and for the reader, to visualize all your findings at this stage – find ways to describe the data that are both interesting and meaningful.
💡 HINTS: Plotly provides powerful graphical and analytical tools for making beautiful visualizations. It integrates easily with Jupyter Notebook and can be used with both R and Python.
❗️ Important Note: For beginner data scientists – it is highly recommended to stick with the base dataset provided in CSV format. You will still be able to provide a strongly competitive solution, and you will have more time to invest in understanding the basic principles behind text visualization, processing, and modeling.
2.1. Explore the target variable 🎯
Try finding answers to the following questions:
- How many classes do we have in our target variable?
- How many instances do we have under each label?
- How many companies have one, two, or more industry category labels?
- Are some of the classes under-represented (or over-represented), or is the dataset balanced (roughly equal numbers of observations in each class of the target)?
- Are there missing values in the target variable? How would you cope with this problem?
The observations you make at this stage should influence your subsequent choice of sampling, modeling and validation techniques.
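The questions above can be answered with a few counters. The sketch below uses only the Python standard library and a handful of toy rows; the column names (`company`, `industries`) and the `;` separator between labels are assumptions for illustration, so adapt them to the actual layout of the CSV file.

```python
from collections import Counter

# Toy rows standing in for the challenge CSV. The column names and the
# ";" label separator are assumptions, not the real dataset layout.
rows = [
    {"company": "A", "industries": "Banking;Insurance"},
    {"company": "B", "industries": "Software"},
    {"company": "C", "industries": ""},
    {"company": "D", "industries": "Software"},
    {"company": "E", "industries": "Software;Banking"},
]


def split_labels(raw, sep=";"):
    """Split a raw label string into a list of clean, non-empty labels."""
    return [part.strip() for part in (raw or "").split(sep) if part.strip()]


def summarize_target(rows, column="industries"):
    """Count classes, instances per class, labels per company and missing targets."""
    label_lists = [split_labels(row.get(column)) for row in rows]
    class_counts = Counter(label for labels in label_lists for label in labels)
    labels_per_company = Counter(len(labels) for labels in label_lists)
    # Ratio between the largest and the smallest class; 1.0 means perfectly balanced.
    if class_counts:
        imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
    else:
        imbalance_ratio = None
    return {
        "n_classes": len(class_counts),
        "class_counts": class_counts,
        "labels_per_company": labels_per_company,
        "n_missing": labels_per_company[0],
        "imbalance_ratio": imbalance_ratio,
    }


summary = summarize_target(rows)
print(summary["n_classes"], "classes")
print(summary["class_counts"].most_common())
print("companies without a label:", summary["n_missing"])
```

A large imbalance ratio or many companies with several labels would be a signal to think about stratified sampling and multi-label evaluation metrics later on.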
2.2. Explore the features 🔎
The following questions can lead you to interesting findings:
- What is the distribution of length (in number of words) of company descriptions? Look at the extremes.
- Is a company’s description length correlated with its industry category?
- Are there companies with no text description of their main business activities? How would you cope with this problem?
Answering these questions will not only help you understand the data better but will also guide you in finding unusual observations, which in turn should influence your choice of text processing techniques in the next stage.
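A first pass over these feature questions could look like the sketch below. It again relies on toy rows and assumed column names (`company`, `industry`, `description`), and it treats a description as missing when it contains no words; comparing mean lengths per industry is only a rough first look at the relationship, not a formal correlation test.

```python
from collections import defaultdict
from statistics import mean, median

# Toy rows for illustration; column names are assumptions, not the real schema.
rows = [
    {"company": "A", "industry": "Banking",
     "description": "Retail bank offering loans and savings accounts"},
    {"company": "B", "industry": "Software",
     "description": "Develops accounting software"},
    {"company": "C", "industry": "Software", "description": ""},
    {"company": "D", "industry": "Banking", "description": "Investment bank"},
]


def word_count(text):
    """Number of whitespace-separated words; None or empty text counts as 0."""
    return len((text or "").split())


def describe_lengths(rows):
    """Summarize description lengths and list companies with no description."""
    missing = []
    lengths = []
    by_industry = defaultdict(list)
    for row in rows:
        n_words = word_count(row.get("description"))
        if n_words == 0:
            # Keep missing descriptions out of the length statistics.
            missing.append(row["company"])
            continue
        lengths.append(n_words)
        by_industry[row["industry"]].append(n_words)
    return {
        "missing": missing,
        "min": min(lengths, default=None),
        "max": max(lengths, default=None),
        "median": median(lengths) if lengths else None,
        "mean_by_industry": {k: mean(v) for k, v in by_industry.items()},
    }


stats = describe_lengths(rows)
print("shortest / longest:", stats["min"], stats["max"])
print("median length:", stats["median"])
print("mean length per industry:", stats["mean_by_industry"])
print("companies without description:", stats["missing"])
```

Looking at the actual texts behind the shortest and longest descriptions is usually worth the time, since extremes often reveal boilerplate, truncated text, or other noise.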
Mentor’s approach to this week’s task: