In this article the mentors give preliminary guidelines, advice and suggestions to the participants in the case. Every mentor should write their name and chat handle at the beginning of their text, so that there are no mix-ups with the other mentors.
According to the rules, it is essential to follow the CRISP-DM methodology (http://www.sv-europe.com/crisp-dm-methodology/). The DSS team and the industry mentors have tried to do most of the work on phases “1. Business Understanding” and “2. Data Understanding”, while the teams are expected to focus on phases 3, 4 and 5 (“Data Preparation”, “Modeling” and “Evaluation”). Phase “6. Deployment” mostly stays in the hands of the case-providing companies.
MENTORS’ GUIDELINES
(see the case here: https://www.datasciencesociety.net/the-telenor-case-flight-failures-kings-landing-to-the-north/)
1. Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.
The advice from mentors:
Metodi Nikolov (@metodinikolov):
Spend a non-trivial amount of time just reading the case description, even before looking at the data. Try to form an understanding and a prior view of what the data could look like. List what the case is asking you to do and brainstorm three ways to accomplish each of the asks. Don’t try to make them complicated (that’s not the idea); just have them written down. If you later get stuck, you can go back to the list and start thinking about another way to do the same thing.
After you have the list of asks, sort them from easiest to hardest so that you can start with the easy ones. You will be accomplishing things (yay!) as well as gaining more understanding of the data and the task, and this way the hard asks become easier.
Tomislav Krizan (@tomislavk):
I cannot stress enough the importance of understanding the problem (and repeating this statement). There is a popular quote, usually attributed to Albert Einstein: “If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions”. Go through the case description and brush up on the info suggested by @metodinikolov. Also, please focus on traffic matrix theory (good examples are https://pdfs.semanticscholar.org/presentation/6606/7c78dccda786de6d6b190a9e1f1ad421f8e8.pdf and https://www.nanog.org/meetings/nanog43/presentations/Blili_trafficmatrix_N43.pdf). What you will not know is the actual network path (and backups), but you have the path name (e.g. raven) and the end-points, which is enough for simulation. Some additional theoretical work on the dropped-call problem can be found at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.301.9178&rep=rep1&type=pdf.
Based on this problem understanding, applying the data and the other comments from the next chapters will yield a successful implementation.
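As a concrete starting point for the traffic-matrix idea, an origin-destination matrix can be approximated from the endpoint pairs alone. A minimal sketch, using made-up records and hypothetical column names (raven, origin, destination) rather than the actual case schema:

```python
import pandas as pd

# Made-up flight records; the real column names and values will differ.
flights = pd.DataFrame({
    "raven":       ["r1", "r2", "r3", "r4", "r5"],
    "origin":      ["Kings Landing", "Kings Landing", "Winterfell", "The Wall", "Winterfell"],
    "destination": ["The Wall", "Winterfell", "The Wall", "Kings Landing", "Kings Landing"],
})

# Origin-destination traffic matrix: flight counts per endpoint pair.
traffic_matrix = pd.crosstab(flights["origin"], flights["destination"])
print(traffic_matrix)
```

Even without the actual network paths, a matrix like this lets you simulate load between endpoints and spot unusually busy or failure-prone routes.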
2. Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
The advice from mentors:
Metodi Nikolov (@metodinikolov):
Your first few hours of work should be spent here. Figure out a way to access the data quickly, then do as much Exploratory Data Analysis as possible. Plot as many different plots as you can and think carefully about what the plots are telling you. Devise data transformations that can provide new ways to graph the data (and repeat the last step with the transformed data!). A very rough list of possible plots (just to get you started; these will not be enough on their own):
- take a raven and plot the fails against time.
- take a family (family member) and do the same
- plot all fails from all ravens for each period
- etc.
Along with plots, think about what tables will be useful; create them and review them just as with the plots. A minimal sketch of both is given below.
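A small EDA sketch, assuming a hypothetical flat file flights.csv with columns raven, family, timestamp and a 0/1 failed flag (the real schema will differ; adapt the names):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical schema: one row per flight, with a 0/1 failure flag.
df = pd.read_csv("flights.csv", parse_dates=["timestamp"])

# Failures against time for a single raven.
one = df[df["raven"] == df["raven"].iloc[0]]
one.set_index("timestamp")["failed"].resample("D").sum().plot(
    title="Daily failures, one raven")
plt.show()

# All failures from all ravens, per period.
df.set_index("timestamp")["failed"].resample("D").sum().plot(
    title="Daily failures, all ravens")
plt.show()

# A table worth reviewing alongside the plots: failure rate per family.
print(df.groupby("family")["failed"].agg(["mean", "sum", "count"]))
```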
Don’t forget to go back to point 1 and the asks of the case; you will be surprised to see how many of the things the case asks you to do have already been done at this step.
3. Data Preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
The advice from mentors:
Metodi Nikolov (@metodinikolov):
Steps 3, 4 and 5 go hand in hand: the modelling you choose to do will have an effect on what transformations you will have to do on the data, and what transformations are possible will govern what models you will be able to apply. Testing (evaluating) the model will show you how the model performs and will hopefully suggest how it can be improved, sometimes by making additional data transformations.
Use step 2 to guide you: out of all the plots and tables you have created, you should have a solid grasp of what the data is and how it could be augmented.
Create and keep an up-to-date record of how the data has been changed so far, preferably in an automated fashion. This will serve two purposes: you will not forget what you did, and you will be able to redo it if necessary (don’t forget to keep a clean master dataset somewhere).
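One lightweight way to keep that record automated is to express every transformation as a named function and replay the list in order from the untouched master file. A sketch with hypothetical placeholder steps (the file name and columns are assumptions, not the case schema):

```python
import pandas as pd

def drop_duplicate_records(df):
    return df.drop_duplicates()

def add_hour_of_day(df):
    out = df.copy()
    out["hour"] = out["timestamp"].dt.hour  # assumes a parsed datetime column
    return out

# The ordered, version-controlled log of everything done to the data so far.
PIPELINE = [drop_duplicate_records, add_hour_of_day]

def prepare(path="flights_master.csv"):
    """Rebuild the modelling dataset from the clean master file."""
    df = pd.read_csv(path, parse_dates=["timestamp"])
    for step in PIPELINE:
        df = step(df)
    return df
```

Rerunning prepare() regenerates the modelling dataset from scratch, so nothing you did is ever lost or half-applied.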
4. Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
The advice from mentors:
Metodi Nikolov (@metodinikolov):
How you choose to model the data is ultimately up to you, and there isn’t a single way that would be “the best”. Remember that you are working under a time constraint, and making the most complex and all-encompassing model will be for naught if that model needs a week to run. Start small and simple, and iterate! Take what you can learn from the first toy model and implement that in the next one. This way you will improve at each step (even if just by knowing what doesn’t work well), and you will always be able to publish a solution with at least the most basic model.
As for particular ways to model the data that the case is about, I would dust off my knowledge of Survival Analysis, Renewal Theory and Time Series analysis in general. Do not be intimidated by the amount of material in the Wiki; not everything from these fields will (or has to) be used here, but some of it might :)
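As one possible entry point into the survival-analysis angle, here is a Kaplan-Meier sketch using the lifelines library, on made-up per-raven durations rather than the case data:

```python
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Made-up data: days each raven was observed, and whether a failure occurred.
data = pd.DataFrame({
    "duration": [5, 12, 3, 20, 8, 15],
    "failed":   [1, 0, 1, 0, 1, 1],  # 0 = censored, i.e. no failure observed yet
})

kmf = KaplanMeierFitter()
kmf.fit(durations=data["duration"], event_observed=data["failed"])

# Estimated probability that a raven survives past t days without failing.
print(kmf.survival_function_)
kmf.plot_survival_function()
plt.show()
```

Handling censoring properly (ravens that have not failed yet) is exactly what survival methods buy you over a naive failure-rate average.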
5. Evaluation
At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
The advice from mentors:
Metodi Nikolov (@metodinikolov):
Take a hint from Test Driven Development: think about how you will test the model first. That is, come up with success criteria for the model (you already made a list of these, remember?) and implement them. These have to work for all models (think as a final user of the output), so you will be writing them once and using them multiple times (on each iteration of the model). If need be, gradually increase the number of things you are testing for. As the main ask of the case is to predict which ravens/families will fail in the next 4 days, come up with your own version of back-testing. You could have it look at not only 4 days, but also 1 hour, 1 day, etc.; this can only improve your modelling.
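A rough sketch of such a back-test: pick several cutoff points, train on everything before each cutoff, and score the predicted failures against what actually happened in the following window. The model_fn hook and the column names (timestamp, failed, raven) are placeholders for whatever you build:

```python
import pandas as pd

def backtest(df, model_fn, horizon="4D", n_splits=5):
    """Rolling-origin back-test: train on everything before each cutoff,
    then score predictions of failures within `horizon` after it.

    model_fn(train) -> set of raven ids predicted to fail; only this
    function changes between model iterations.
    """
    cutoffs = pd.date_range(df["timestamp"].min() + pd.Timedelta(horizon),
                            df["timestamp"].max() - pd.Timedelta(horizon),
                            periods=n_splits)
    scores = []
    for cutoff in cutoffs:
        train = df[df["timestamp"] <= cutoff]
        future = df[(df["timestamp"] > cutoff) &
                    (df["timestamp"] <= cutoff + pd.Timedelta(horizon))]
        actual = set(future.loc[future["failed"] == 1, "raven"])
        predicted = model_fn(train)
        hits = len(actual & predicted)
        scores.append({
            "cutoff": cutoff,
            "precision": hits / len(predicted) if predicted else 0.0,
            "recall": hits / len(actual) if actual else 0.0,
        })
    return pd.DataFrame(scores)
```

Changing horizon to "1H" or "1D" gives you the shorter-window variants mentioned above, with the same test harness reused across every model iteration.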
6. Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.
The advice from mentors: