Datathon Case Description
This document describes the requirements for the case description for the Datathon organized by Data Science Society. This is a revised version as of March 2019.
1.1. What does Data Science Society provide?
The Datathon is conducted online and offline at the same time, hence there are some new requirements for the case description.
Data Science Society promotes each case online and offline (before and during the event) with equal treatment, using the marketing (and other) information supplied by the partner company. Data Science Society provides equal visibility for all cases among the participants.
Data Science Society may organize additional mentor assistance for any of the cases.
1.2. What does the partner company provide?
Each partner company provides a case, and depending on its overall quality the case might be selected to be solved by teams at the event. The teams are free to choose which cases to work on, so only attractive and interesting cases will be selected.
The partner company provides marketing information about its industry experts (short bio, photos) about two months before the event.
The following should be included with each provided case:
· Written description (see 2.1., 2.2., 2.3. & 2.4. below);
· Data set (see 3. below);
· Short video presentation (see 4. below);
· Mentorship by industry experts (see 5. below).
The data set will be available to the teams non-confidentially, including after the event.
2.1. Business problem formulation
The goal of this first section of the case is to give the participants in our Datathon enough understanding of the problem under investigation from a business perspective.
More precisely, answers to the following questions have to be provided:
· What is the business problem?
· Why does the business need to solve the problem?
· What are the important specifics of the problem (from a business point of view) that have to be accounted for in the solution?
· Are there hypotheses that have to be investigated and possibly incorporated into the solution?
· What business requirements have to be satisfied for the final result to be satisfactory?
Practical examples can shed additional light on the problem description.
2.2. Research problem specification
This section describes the problem from a data science point of view. It is meant to help the teams decide which general directions to take in their solution.
Firstly, this part should include the recommended data science approaches, methods and/or techniques, whenever this is possible or desirable for the case.
Secondly, hints, examples, best practices, relevant papers, etc. could be provided, if possible, to clarify the case and to further direct the teams.
Thirdly, useful insights, cross-sections, distributions, etc. may be provided for particular parts of the case.
Lastly, the acceptable accuracy measures, KPIs, and other technical requirements should be defined.
2.3. Data description
This part describes the dataset provided for the case solution:
· Precise description of each variable, its meaning, and the meaning of each of its values.
· The structure of the data set(s) – indices of independent and/or dependent variables.
· Variable types, ranges, special values, missing values, etc. in the data set.
· If participants have to collect additional data: description of appropriate data sources, structures, etc.
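The items above can be drafted semi-automatically and then completed by hand with the meaning of each variable. The following is a minimal sketch, assuming the data set is a CSV file with a header row; the function name describe_columns, the sample columns (customer_id, age, segment), and the choice of "" and "NA" as missing-value markers are all hypothetical and should be adapted to the actual case.

```python
import csv
import io


def describe_columns(rows, missing_tokens=("", "NA")):
    """Summarise each column: inferred type, range, distinct and missing counts."""
    columns = {}
    for row in rows:
        for name, raw in row.items():
            col = columns.setdefault(name, {"values": [], "missing": 0})
            text = (raw or "").strip()
            if text in missing_tokens:
                col["missing"] += 1
            else:
                col["values"].append(text)

    report = {}
    for name, col in columns.items():
        values = col["values"]
        info = {"type": "unknown", "min": None, "max": None,
                "distinct": len(set(values)), "missing": col["missing"]}
        if values:
            try:
                # A column counts as numeric only if every present value parses.
                numbers = [float(v) for v in values]
                info.update(type="numeric", min=min(numbers), max=max(numbers))
            except ValueError:
                info["type"] = "text"
        report[name] = info
    return report


# Hypothetical three-row sample standing in for the real data set.
SAMPLE = "customer_id,age,segment\n1,34,retail\n2,,corporate\n3,51,NA\n"
report = describe_columns(csv.DictReader(io.StringIO(SAMPLE)))
for name, info in report.items():
    print(name, info)
```

Such a summary covers only types, ranges, and missing values; the meaning of each variable and of each value still has to be written by someone who knows the data.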
2.4. Expected outputs
This part of the text provides information about the results required or desired from the teams working on the given case. These expected outputs may depend on the business problem, on the research specifics, on the data set, or on some other technical requirement.
The expected outputs may include a description of the various items expected to be built while solving the case, such as:
· Algorithms, workflows, etc.;
· Possible types of models, rule sets, etc.;
· Source code, used environments and libraries.
All outputs (including those expected by the case) will be documented in a publicly available paper (DSS template), structured according to CRISP-DM.
3. Data set
3.1. Full data set
The data set will be available to the teams non-confidentially. The data size should be limited to about 20 GB (10 GB is advisable). If the data set is large, it is better to split it into several files.
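One possible way to distribute a large data set among several files is sketched below, assuming the data is a single CSV file with a header row. The function name split_csv, the part file naming, and the rows_per_file parameter are hypothetical; in practice the number of rows per part would be chosen so that each file has a convenient size.

```python
import csv
import tempfile
from pathlib import Path


def split_csv(source, out_dir, rows_per_file):
    """Write `source` as several CSV parts, repeating the header in each part."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with open(source, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty source file: nothing to split
            return parts
        handle, writer = None, None
        for index, row in enumerate(reader):
            if index % rows_per_file == 0:  # time to start a new part
                if handle:
                    handle.close()
                part = out_dir / f"part_{len(parts) + 1:03d}.csv"
                handle = open(part, "w", newline="")
                writer = csv.writer(handle)
                writer.writerow(header)
                parts.append(part)
            writer.writerow(row)
        if handle:
            handle.close()
    return parts


# Demonstration on a tiny temporary file with five data rows.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "full.csv"
    source.write_text("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(5)))
    parts = split_csv(source, Path(tmp) / "parts", rows_per_file=2)
    part_names = [p.name for p in parts]
    last_part_lines = parts[-1].read_text().splitlines()
    print(part_names)
```

Repeating the header in every part keeps each file usable on its own, which helps teams that start by loading only one part.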
3.2. Control data set
Three data subsets have to be provided:
· Working data subset: presented to the participants to build their solution;
· Validation data subset: used for the automated leaderboard throughout the competition;
· Additional control data subset: stays hidden from the participants throughout the competition, but is used to evaluate the results at the end.
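The three subsets can be produced from the full data set with a reproducible random split. The sketch below is only an illustration: the function name three_way_split, the 20% shares for the validation and control subsets, and the seed are assumptions, since this document does not prescribe proportions. A purely random split may also be unsuitable for some cases (for example, time-ordered data), where the partner company should choose a split that matches the problem.

```python
import random


def three_way_split(records, validation_share=0.2, control_share=0.2, seed=2019):
    """Shuffle records reproducibly and cut them into working/validation/control."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed: the split can be regenerated
    n_control = int(len(shuffled) * control_share)
    n_validation = int(len(shuffled) * validation_share)
    control = shuffled[:n_control]
    validation = shuffled[n_control:n_control + n_validation]
    working = shuffled[n_control + n_validation:]
    return working, validation, control


# 100 hypothetical record identifiers stand in for the real rows.
working, validation, control = three_way_split(range(100))
print(len(working), len(validation), len(control))
```

Only the working subset would be handed to the participants; the validation subset feeds the leaderboard and the control subset is kept back for the final evaluation.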
3.3. Sample data set
It is also highly advisable to provide data samples so that the structure is easy to understand. The sample data set is given to the participants before the start of the competition.
4. Video presentation of the case
4.1. Video requirements
The provided video will be the only spoken presentation of the case to the participants in the event. The video should be up to 5 minutes long. It will be uploaded to the DSS YouTube channel, so it should be optimized according to YouTube guidelines (see them here…).
4.2. Video content
The video should contain:
· Spoken explanations of the case, including: Business problem formulation, Research problem specification, Data description, Expected outputs, and a humorous joke;
· Suitable visualizations for the important aspects of the case.
4.3. Additional materials
Aside from the video, the following may be provided:
· Subtitles (advisable);
· Short written description of the video and/or the presenters, including any necessary links.
5. Mentorship requirements for Industry experts
In order to support the teams in their task, for each case there should be business and data science expert(s) from the company, who:
· Provide a short bio, including technical background and fields of expertise, about two months before the event;
· Are available online in Data.Chat one week before the event for about an hour every day;
· Participate in a Q&A session, offline and online, for one hour before the event;
· Are available during the competition – at least within two time slots of 3 hours;
· Present the case in a 5-minute video (provided about one month before the event), which will be uploaded to the DSS YouTube channel;
· Have the obligation to support all teams that ask for their advice.
An example of a well-prepared case: