Why should you join the Data Science Monthly Challenge, and what can you expect?
The Data Science Monthly Challenge gives participants, regardless of their background and previous experience, an exceptional opportunity to work step by step towards a solution to a real data science problem. This gradual approach to advanced business problems gives participants a chance to become familiar, in depth, with each of the important steps to consider when developing an effective, high-quality data science project.
It is important to note that each of the general stages of a project consists of important intermediate steps which participants should be aware of while working on their solutions. During the monthly challenge, teams therefore gain knowledge of key concepts in the analytics field while also getting used to applying benchmark methodologies for carrying out data science projects successfully.
Last but not least, the monthly challenge is an excellent opportunity for data enthusiasts to prepare for working as data scientists, where experience with real data science problems is invaluable. The monthly challenge can also inspire those with a more competitive attitude to gain knowledge and skills and to participate in the international Datathons organized by the Data Science Society.
So, what are you waiting for? 🙂 Start your data science learning journey now and read what the mentors of April’s Data Science Monthly Challenge have prepared for you.
Monthly Challenge: https://www.datasciencesociety.net/events/text-mining-data-science-monthly-challenge/
Case: https://www.datasciencesociety.net/monthly-challenge-ontotext-case/
Structure of the Data Science Monthly Challenge
An initial plan of what will be done each week (participants will receive a new tutorial each week, which will serve as a starting point for further analysis):
Week 1. Data Understanding + Feature Extraction
At this stage, participants will have to use some visualization techniques to get familiar with the available ‘basic’ features (raw data). This phase is crucial because the observations and hypotheses made here will determine which techniques are appropriate when building the classification model. Participants who want to use more complex features will also have time to get familiar with the FactForge platform, explore which explanatory variables can be extracted, and enrich their datasets. At the end of this first phase, all participants should have a dataset ready to be preprocessed during week 2. Beautiful visualizations of each team’s unique dataset will be more than welcome! 😊
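As a starting point for this kind of exploration, here is a minimal sketch that looks at the class balance and the average text length per class. The inline sample data and the column names `description` and `industry` are assumptions made for illustration; the real base dataset may be structured differently.

```python
import csv
import io
from collections import Counter
from statistics import mean

# Hypothetical stand-in for the base CSV file. The rows and the column
# names ("description", "industry") are assumptions for illustration only.
SAMPLE_CSV = """description,industry
"Develops cloud software for banks",Technology
"Operates a chain of retail pharmacies",Healthcare
"Builds mobile apps and web platforms",Technology
"Runs private hospitals and clinics",Healthcare
"Provides online payment software",Technology
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def class_distribution(rows, label_col="industry"):
    """Count how many observations fall into each class."""
    return Counter(row[label_col] for row in rows)


def mean_word_count(rows, text_col="description", label_col="industry"):
    """Average number of whitespace-separated words per class."""
    lengths = {}
    for row in rows:
        lengths.setdefault(row[label_col], []).append(len(row[text_col].split()))
    return {label: mean(counts) for label, counts in lengths.items()}


rows = load_rows(SAMPLE_CSV)
distribution = class_distribution(rows)
word_counts = mean_word_count(rows)

# A plain-text bar chart is enough for a first look at class balance.
for label, count in distribution.most_common():
    print(f"{label:<12} {'#' * count} ({count})")
```

With a real file, `load_rows` could be replaced by opening the CSV with `open(...)` and passing the file object to `csv.DictReader`. A strongly imbalanced class distribution is worth noting early, because it affects both the choice of model and the choice of evaluation metric later on.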
Important Note:
For beginner data scientists: it is highly recommended to stick with the base dataset provided in CSV format. You will still be able to deliver a strongly competitive solution, and you will have more time to invest in understanding the basic principles behind text visualization, processing, and modeling.
For advanced data scientists: you can dig deeper into the FactForge platform and do some feature engineering before the next stage, where all the base and extracted features will have to be processed and prepared for modeling.
Mentor’s advice to this week’s task:
Text Mining Monthly Challenge – Initial Boost to Week 1
The following guidelines follow the CRISP-DM methodology. Each of the general stages in that methodology consists of important intermediate steps which participants should be aware of while working on their solutions. These guidelines aim to draw attention to most of the critical points and aspects to consider when developing a working solution to the business problem.
Mentor’s approach to this week’s task:
Monthly Challenge – Ontotext case – Week 1 – Mentor’s Approach
Week 2. Data processing + WordClouds
At this stage, participants will have to use simple and/or more complex natural language processing techniques to normalize the messy data. Again, visualization techniques can be employed to get even more insight from the processed data. WordCloud visualizations are powerful, beautiful, and ubiquitous in text analytics, so learn how to create them. 😊
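To make the normalization step concrete, here is a minimal sketch using only the Python standard library. The stop-word list and the example sentences are made up for illustration, and the cleaning rules (lowercasing, removing URLs and non-letter characters) are only one possible choice. A word cloud sizes each word by how often it occurs, so the resulting frequency table is the kind of input a word cloud tool needs.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def normalize(text):
    """Lowercase, strip URLs and non-letter characters, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep ASCII letters and whitespace
    return [token for token in text.split() if token not in STOP_WORDS]


def word_frequencies(texts):
    """Count normalized tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(normalize(text))
    return counts


# Made-up example sentences, for illustration only.
docs = [
    "The company develops Software for banks: see https://example.com!",
    "Software and services for the Banking sector (since 1999).",
]
frequencies = word_frequencies(docs)
print(frequencies.most_common(3))
```

Note that this sketch treats "banks" and "banking" as different words. Stemming or lemmatization, which are among the more complex techniques, would merge such variants.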
P.S. Do not forget to split your data. You are not provided with a separate test set, so it is up to you to hold out some of the observations in order to validate your model at the end of the analysis.
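One way to hold out a test set is a stratified split, which keeps roughly the same share of each class in both parts. The sketch below uses only the standard library. The function name `stratified_split`, the 20% test fraction, and the example labels are all illustrative choices, not requirements of the challenge.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), holding out roughly
    test_fraction of the observations from every class."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test observation per class.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Made-up labels, for illustration only.
labels = ["Technology"] * 10 + ["Healthcare"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The test indices should be set aside and left untouched until the final evaluation, so that the reported performance reflects data the model has never seen.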
Mentor’s advice to this week’s task:
Monthly Challenge – Ontotext case – Week 2 – Mentor’s Approach
Week 3. Text Representation + Feature selection
At the end of this phase, every team should have its dataset in an appropriate format for further analysis and for building the classification model. This means that the text should be turned into numbers so that mathematical algorithms can be applied to it. Participants should therefore choose a text representation method. As a next step, feature selection techniques can be employed to choose the best subset of explanatory variables.
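As an illustration of both steps, the sketch below builds a TF-IDF representation and applies a very simple feature selection rule: keep only the terms that occur in at least `min_df` documents. TF-IDF is just one possible text representation, the weighting shown (relative term frequency times log(N / document frequency)) is one common variant among several, and the tokenized documents are made up for the example.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_docs, min_df=2):
    """Keep terms that appear in at least min_df documents
    (a simple document-frequency feature selection)."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)
    return vocab, doc_freq


def tfidf_matrix(tokenized_docs, vocab, doc_freq):
    """One row per document, one column per vocabulary term.
    Weight = relative term frequency * log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    rows = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        rows.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return rows


# Made-up, already normalized documents, for illustration only.
tokenized = [
    ["software", "banks", "software"],
    ["software", "hospitals", "clinics"],
    ["hospitals", "pharmacies", "banks"],
]
vocab, doc_freq = build_vocabulary(tokenized, min_df=2)
matrix = tfidf_matrix(tokenized, vocab, doc_freq)
print(vocab)
for row in matrix:
    print([round(weight, 3) for weight in row])
```

Dropping rare terms shrinks the feature space but uses no information about the classes. Selection methods that score each term against the target labels are a natural next thing to try.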
Mentor’s advice to this week’s task:
Monthly Challenge – Ontotext case – Week 3 – Mentor’s Approach
Week 4. Data modeling + Investigation of the results + Evaluation and summary of all the hard work done.
At this stage, participants will have to use machine learning algorithms to classify the companies into industry categories. Teams are encouraged to try different techniques and then investigate the results. During this investigation, participants should carefully analyze the causes of misclassification.
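A simple way to start investigating misclassifications is to count how often each true class is predicted as each other class, and then look at the most frequent confusions first. The sketch below assumes you already have predictions from some model. The label names and the predictions are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


def top_confusions(counts, k=3):
    """Most frequent misclassifications, i.e. pairs where true != predicted."""
    errors = Counter({pair: n for pair, n in counts.items() if pair[0] != pair[1]})
    return errors.most_common(k)


def accuracy(counts):
    """Share of observations whose predicted label equals the true label."""
    correct = sum(n for (true, pred), n in counts.items() if true == pred)
    return correct / sum(counts.values())


# Hypothetical labels and predictions, for illustration only.
y_true = ["Tech", "Tech", "Health", "Health", "Finance", "Finance", "Finance"]
y_pred = ["Tech", "Finance", "Health", "Health", "Tech", "Tech", "Finance"]

counts = confusion_counts(y_true, y_pred)
print(f"accuracy: {accuracy(counts):.2f}")
for (true, pred), n in top_confusions(counts):
    print(f"{true} predicted as {pred}: {n}")
```

Once the most common confusions are known, reading a few of the affected texts often suggests a cause, such as overlapping vocabulary between two industries or too few training examples for one class.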
Finally, participants will have to propose their best solution to the problem at hand, based on the findings made throughout the analysis. Teams will have time to sum up all that has been done and prepare a presentation of their results. Teams are also encouraged to suggest what could have been done better and to share any ideas for future analysis and development of their current solutions.
Mentor’s advice to this week’s task:
Monthly Challenge – Ontotext case – Week 4 – Mentor’s Approach
Mentor’s approach to this week’s task:
The video tutorial and the code will be uploaded on 23 May!
Expected Output
The monthly challenge ends with the publication of your final solution articles. The result should be a clearly presented model that can realistically be implemented.