In this article, the mentors give some preliminary guidelines, advice, and suggestions to the participants for the case. Every mentor should write their name and chat name at the beginning of their texts so that there are no mix-ups with the other mentors.
According to the rules, it is essential to follow the CRISP-DM methodology (http://www.sv-europe.com/crisp-dm-methodology/). The DSS team and the industry mentors have tried to do most of the work on phases “1. Business Understanding” and “2. Data Understanding”. The teams are expected to focus on phases 3, 4 and 5 (“Data Preparation”, “Modeling” and “Evaluation”), while phase “6. Deployment” mostly stays in the hands of the case-providing companies.
MENTORS’ GUIDELINES
(see the case here: https://www.datasciencesociety.net/the-kaufland-case-iot-predictive-maintenance/)
1. Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.
The advice from mentors:
Simon Stiebellehner (sist):
First, make sure that you understand what the goals and deliverables are. This is perhaps the most important step in the entire Data Science process. It’s crucial for the business value of the result and it ensures that you spend the little time you have on what really matters. Therefore, do not hesitate to ask anything that might not be fully clear to the Kaufland expert.
Jan Sauer (jans):
Based on the description of the case, Kaufland seems keen on an exploratory analysis of the data. It is very easy to get sidetracked in this kind of analysis, so plan your course of action carefully and set goals and milestones to work towards. Don’t be afraid to abandon ideas that do not seem to lead anywhere.
Tomislav Krizan (@tomislavk from Atomic Intelligence):
I cannot stress enough the importance of understanding the problem. There is a popular statement, usually attributed to Albert Einstein, which goes: “If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions”. Research on maintenance clearly states that the most expensive kind is repairing something once it is broken, followed by maintenance at fixed intervals, and that predictive maintenance is the cheapest of the three. A single environment, such as a forklift in this case, behaves in a similar, somewhat predictable way. Focus on how the 6 sensors interact and work together as a single entity. Do not look at them as separate elements but rather as segments of a whole environment.
2. Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
The advice from mentors:
Simon Stiebellehner (sist):
Second, the entire team should get a decent understanding of the data. Help each other understand, talk about what you find and write important points down in a structured way. Finally, make a clear plan:
– define what you want to show in the end
– define features required to reach your goal
– find dependencies
– prioritize features considering dependencies and value they add to the final result
– assign tasks, estimate them and agree on deliverables
– have one person be responsible for coordination and consolidation of deliverables into the final result
Jan Sauer (jans):
The machines are essentially identical; they are the same age, according to the company expert, and serve the same purpose. Can this help you identify relationships between sensors? Can this help you find anomalies in the sensor data?
Tomislav Krizan (@tomislavk from Atomic Intelligence):
@jans already stated that the machines are essentially identical, so we need to look at them as being of the same type. We are not aware of Kaufland using machines of different brands or even versions, so our estimate is that all of them are made of the same parts and materials. In other cases we would need to take even different versions of the same machine into consideration. Additionally, for data understanding:
- I would like to stress the need to correlate the data sets from the machines with those from maintenance: if a part is exchanged, you can’t use the measurements taken before the exchange for your modelling, because the machine now has a brand new part with a different lifecycle
- Look into features that can be additionally extracted, such as how the time of day contributes to different measurements and whether there are peaks during the day. If you would like to add more value, you can get temperature forecasts/measurements for the region where the Kaufland warehouses are and examine how that outside environment interacts with our “inside” environment
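The two points above can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout (`timestamp`, `value`), the sample readings and the list of part exchange times are all hypothetical and not taken from the Kaufland data. Each reading gets an hour-of-day feature and a lifecycle index that increases with every part exchange, so that measurements from before and after an exchange are not mixed.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical sensor readings for one machine: (timestamp, value).
readings = [
    (datetime(2018, 3, 1, 6, 0), 41.0),
    (datetime(2018, 3, 1, 14, 0), 47.5),
    (datetime(2018, 3, 2, 6, 0), 40.2),
    (datetime(2018, 3, 2, 14, 0), 39.8),
]
# Hypothetical times at which a part was exchanged on that machine.
part_exchanges = [datetime(2018, 3, 1, 20, 0)]


def add_features(readings, part_exchanges):
    """Attach hour of day and a part lifecycle index to every reading."""
    exchanges = sorted(part_exchanges)
    rows = []
    for ts, value in readings:
        rows.append({
            "timestamp": ts,
            "value": value,
            "hour": ts.hour,
            # Number of exchanges at or before this reading. Readings with
            # different lifecycle values belong to different physical parts.
            "lifecycle": bisect_right(exchanges, ts),
        })
    return rows


features = add_features(readings, part_exchanges)
```

Grouping by `lifecycle` before computing trends or training a model keeps the history of an old part from leaking into the history of its replacement.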
3. Data Preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
The advice from mentors:
Simon Stiebellehner (sist):
Typically, you would start by implementing data cleaning steps. I highly recommend packing any code you write into well-documented functions and classes from the very beginning. Especially when using Jupyter Notebooks, people often get sloppy and forget about the maintainability and modularity of their code. This comes at the cost of collaboration effectiveness and efficiency. Also, reduce the data size as early and as much as possible, since this will dramatically speed up anything you do to the data.
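As a rough illustration of what a small, documented cleaning function could look like, here is a sketch using only the Python standard library. The column names (`machine`, `timestamp`, `value`, `note`), the valid range and the sample rows are assumptions made for the example. The function drops rows with missing or implausible values, removes duplicates and keeps only the columns that are needed, which also reduces the data size early.

```python
from typing import Iterable, Optional


def clean_readings(
    rows: Iterable[dict],
    keep: tuple = ("machine", "timestamp", "value"),
    valid_range: Optional[tuple] = None,
) -> list:
    """Return de-duplicated rows restricted to the columns in `keep`.

    Rows with a missing value, or with a value outside `valid_range`
    (inclusive, if given), are dropped.
    """
    seen = set()
    cleaned = []
    for row in rows:
        value = row.get("value")
        if value is None:
            continue
        if valid_range is not None and not (valid_range[0] <= value <= valid_range[1]):
            continue
        slim = tuple(row.get(col) for col in keep)
        if slim in seen:
            continue  # exact duplicate of a row already kept
        seen.add(slim)
        cleaned.append(dict(zip(keep, slim)))
    return cleaned


# Hypothetical raw rows, including a duplicate, a missing value and a spike.
raw = [
    {"machine": "F1", "timestamp": "2018-03-01T06:00", "value": 41.0, "note": "ok"},
    {"machine": "F1", "timestamp": "2018-03-01T06:00", "value": 41.0, "note": "ok"},
    {"machine": "F1", "timestamp": "2018-03-01T06:05", "value": None, "note": ""},
    {"machine": "F1", "timestamp": "2018-03-01T06:10", "value": 999.0, "note": "spike"},
    {"machine": "F1", "timestamp": "2018-03-01T06:15", "value": 42.5, "note": "ok"},
]
cleaned = clean_readings(raw, valid_range=(0.0, 200.0))
```

Because the function has a docstring and no hidden notebook state, a teammate can import and reuse it without reading the cells around it.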
Jan Sauer (jans):
Measurement times and intervals may vary between sensors and machines. How are these interval lengths distributed? As you only have a very limited amount of time, consider aggregating your data into larger bins to speed up analysis and simplify results. This not only reduces the timestamp problem; it may also be of interest to Kaufland to know how frequent measurements must be to detect anomalies.
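Both questions can be explored with very little code. The sketch below uses only the standard library and made-up readings. `interval_counts` and `bin_mean` are hypothetical helper names, and averaging is just one possible aggregation. With pandas, `resample` offers the same kind of binning.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical irregularly spaced readings of one sensor: (timestamp, value).
readings = [
    (datetime(2018, 3, 1, 6, 0), 10.0),
    (datetime(2018, 3, 1, 6, 1), 12.0),
    (datetime(2018, 3, 1, 6, 3), 14.0),
    (datetime(2018, 3, 1, 6, 10), 20.0),
    (datetime(2018, 3, 1, 6, 11), 22.0),
]


def interval_counts(timestamps):
    """Count how often each gap (in seconds) between consecutive readings occurs."""
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return Counter(gaps)


def bin_mean(readings, bin_size):
    """Average values in fixed-width time bins aligned to the first reading."""
    ordered = sorted(readings)
    start = ordered[0][0]
    bins = defaultdict(list)
    for ts, value in ordered:
        index = (ts - start) // bin_size  # timedelta // timedelta gives an int
        bins[start + index * bin_size].append(value)
    return {bin_start: mean(values) for bin_start, values in sorted(bins.items())}


gaps = interval_counts(ts for ts, _ in readings)
binned = bin_mean(readings, timedelta(minutes=5))
```

Repeating the analysis with several bin sizes gives a first impression of how coarse the measurements can become before anomalies are no longer visible.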
Tomislav Krizan (@tomislavk from Atomic Intelligence):
Besides the comments from the other mentors, please also look into missing data and how it affects your modelling and relates to machine behaviour.
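A quick way to start is to measure how much is missing per sensor and how long the longest stretch of consecutive missing values is. The sketch below assumes rows stored as dictionaries with one key per sensor, where `None` marks a missing measurement. The sensor names and values are invented for the example.

```python
def missing_share(rows, sensors):
    """Fraction of rows in which each sensor has no value."""
    total = len(rows)
    return {s: sum(1 for r in rows if r.get(s) is None) / total for s in sensors}


def longest_missing_run(rows, sensor):
    """Length of the longest run of consecutive rows without a value."""
    longest = current = 0
    for row in rows:
        current = current + 1 if row.get(sensor) is None else 0
        longest = max(longest, current)
    return longest


# Hypothetical time-ordered rows for one machine.
rows = [
    {"temperature": 1.0, "vibration": 0.1},
    {"temperature": None, "vibration": 0.2},
    {"temperature": None, "vibration": 0.1},
    {"temperature": 2.0, "vibration": 0.3},
    {"temperature": None, "vibration": 0.2},
]
shares = missing_share(rows, ["temperature", "vibration"])
run = longest_missing_run(rows, "temperature")
```

Long runs of missing values may themselves be informative, for example if a sensor stops reporting shortly before a repair, so it can be worth keeping a "missing" indicator as a feature instead of only imputing.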
4. Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements in the form of data. Therefore, stepping back to the data preparation phase is often needed.
The advice from mentors:
Simon Stiebellehner (sist):
Next, you would likely dive into the analysis and modeling. The case description provides you with a couple of good suggestions on what to look at. Nevertheless, it is up to you what you want to investigate in detail and what your final solution looks like. Eventually, you should have a model that detects disruptive events based on sensor data before they happen. But that’s usually not everything. Put yourself in the business’ perspective and understand what they would be interested in on top of predicting events. Here are a couple of ideas and hints of interesting aspects to look into:
– Identify anomalies in the sensor values. Do identified anomalies correlate with repair events time-wise? Did strong anomalies result in repairs? Does this let you infer “labels” for anomalies? Could we have predicted these anomalies or repairs (build a model that does so)?
– Can we differentiate between kinds of anomalies (e.g. through clustering, investigating particular sensor values…)?
– Do sensor values behave similarly across devices or does each device have its own profile? This might tell you whether you can use one model for all devices or if you need to build a model for every single or every cluster of devices.
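To make the first idea concrete, here is a deliberately simple sketch: a rolling z-score flags values that deviate strongly from the preceding window, and the flagged positions are then compared with a repair event. The rolling z-score is only one basic technique chosen for illustration, not a recommended final model. The sensor values, the window size, the threshold and the repair position are all made up.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        spread = stdev(history)
        if spread == 0:
            continue  # a flat history gives no scale to compare against
        if abs(values[i] - mean(history)) / spread > threshold:
            anomalies.append(i)
    return anomalies


def anomalies_before_repair(anomalies, repair_index, lookback):
    """Anomalies that occurred within `lookback` steps before a repair."""
    return [i for i in anomalies if repair_index - lookback <= i < repair_index]


# Hypothetical, evenly spaced sensor values with one spike at index 7.
values = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 9.8, 15.0, 10.0, 10.2, 9.9, 10.1]
anomalies = rolling_zscore_anomalies(values, window=5, threshold=3.0)

# Hypothetical repair logged at index 10 of the same series.
preceding = anomalies_before_repair(anomalies, repair_index=10, lookback=5)
```

If anomalies regularly show up in the window before repairs, they can serve as weak “labels”. If they do not, the threshold, the window or the choice of sensor probably needs another look.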
Jan Sauer (jans):
Businesses often want to understand models to know why a certain result is being predicted. ‘Black box’ models that are difficult to interpret are also difficult to troubleshoot. On the other hand, if the anomalies and failures predicted by your model can be analysed further, e.g. by clustering them to identify types of anomalies, your model has additional business value for Kaufland. If you choose to go the deep learning route, make sure you can explain in your final presentation of the model why it behaves the way it does.
5. Evaluation
At this stage in the project, you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
The advice from mentors:
Simon Stiebellehner (sist):
From a methodological/technical perspective, keep in mind that it’s time series data. Therefore, make sure to use appropriate modeling approaches. Furthermore, in case you decide on framing the challenge as an anomaly detection problem, you might want to look into Autoencoders (especially Deep Recurrent/LSTM Autoencoders). However, take your computational resources and the available amount of data into account when choosing a modeling approach (especially when it comes to Deep Learning). No matter what you eventually choose, it is important to show the validity and robustness of your solution. Therefore, it is crucial that you choose the right target metrics and evaluate your approaches following best practices. Similarly, provide evidence or at least solid arguments for every claim you make.
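One practical consequence of working with time series is that validation data should always lie after the training data in time. Shuffled cross-validation would let the model see the future. A minimal sketch of an expanding-window split, written with the standard library only, could look like this. The function name and parameters are hypothetical. scikit-learn provides a comparable splitter, `TimeSeriesSplit`.

```python
def expanding_window_folds(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) for time-ordered samples.

    Every test block directly follows its training data, and the training
    window grows from fold to fold.
    """
    if n_samples - n_folds * test_size <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_end = n_samples - (n_folds - 1 - fold) * test_size
        test_start = test_end - test_size
        yield list(range(test_start)), list(range(test_start, test_end))


# Ten time-ordered samples, three folds with two test samples each.
folds = list(expanding_window_folds(n_samples=10, n_folds=3, test_size=2))
```

Reporting the chosen metric per fold, and not only its average, also gives a simple indication of how robust the approach is over time.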
Jan Sauer (jans):
Remember, your ultimate goal is to predict failures of individual parts to prevent downstream failures of the entire machine and to reduce machine downtime. In this sense, you want to make sure your model produces as few false negatives as possible, i.e. misses as few real failures as possible, as it is cheaper to have someone manually inspect a part and discover nothing wrong with it than it is to repair the entire machine because the model did not identify a failing component. Make sure your evaluation reflects the business need of the case.
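The asymmetry between a missed failure and an unnecessary inspection can be built directly into the evaluation. The sketch below compares two hypothetical models on made-up labels (1 = failure). The cost figures are pure assumptions for illustration and would have to come from Kaufland.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives/negatives for binary labels (1 = failure)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, predicted in zip(y_true, y_pred):
        if truth and predicted:
            counts["tp"] += 1
        elif predicted:
            counts["fp"] += 1
        elif truth:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def recall(counts):
    """Share of real failures that the model caught."""
    actual = counts["tp"] + counts["fn"]
    return counts["tp"] / actual if actual else 0.0


def accuracy(counts):
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


def total_cost(counts, cost_missed_failure, cost_inspection):
    """Weight missed failures and needless inspections by their assumed cost."""
    return counts["fn"] * cost_missed_failure + counts["fp"] * cost_inspection


y_true = [1, 0, 0, 1, 0, 1, 0, 0]
cautious = confusion_counts(y_true, [1, 1, 0, 1, 0, 1, 1, 0])  # flags a lot
strict = confusion_counts(y_true, [1, 0, 0, 0, 0, 1, 0, 0])    # flags little

# Assumed costs: a missed failure is 50 times as expensive as an inspection.
cost_cautious = total_cost(cautious, cost_missed_failure=50, cost_inspection=1)
cost_strict = total_cost(strict, cost_missed_failure=50, cost_inspection=1)
```

In this toy example the strict model has the higher accuracy, yet the cautious model has perfect recall and a far lower total cost, which is why accuracy alone is a poor target metric for this case.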
6. Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process. In many cases, it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.
The advice from mentors:
Simon Stiebellehner (sist):
Once your analysis is complete and you have a working model, you need to create the final deliverable. First, create a thorough report where you show what you have done and why, discuss your findings, show model performance and provide recommendations/next steps. Your report should be valuable to the business, hence you should make it fit-for-purpose and not focus on technical details too much (you can always put them in the appendix). Structure it well and have it follow a story. Then, if you still have time, go the extra mile: create some kind of product prototype. This could be a very simple graphical frontend, built e.g. with tkinter (Python), that allows a lay end user to trigger model predictions.
Jan Sauer (jans):
Your report, regardless of the actual content, must be complete. That means someone with programming and data science knowledge should be able to rebuild your solution from scratch based on your report and obtain the same results. If you include source code, make sure it is well-documented and structured so that there are no ambiguities about what your code does. Right at the beginning of the datathon you should agree within your team on which conventions you will adhere to (e.g. PEP 8 for Python code), not only to make sure that you understand each other’s code but also that your deliverable is easy to understand. Nothing is more frustrating than seeing a good solution ignored because you didn’t have the time to properly package and present it.