The Kaufland Case [Global Datathon 2018] – Guidelines by Simon Stiebellehner


The Kaufland Case poses an interesting Predictive Maintenance challenge.

First, make sure that you understand what the goals and deliverables are. This is perhaps the most important step in the entire Data Science process. It’s crucial for the business value of the result and it ensures that you spend the little time you have on what really matters. Therefore, do not hesitate to ask anything that might not be fully clear to the Kaufland expert.

Second, the entire team should get a decent understanding of the data. Help each other understand, talk about what you find and write important points down in a structured way. Finally, make a clear plan:
– define what you want to show in the end
– define features required to reach your goal
– find dependencies
– prioritize features considering dependencies and value they add to the final result
– assign tasks, estimate them and agree on deliverables
– have one person be responsible for coordination and consolidation of deliverables into the final result

Typically, you would start with implementing data cleaning steps. I highly recommend packing any code you write into well-documented functions and classes from the very beginning. Especially when using Jupyter Notebooks, people tend to get sloppy and forget about maintainability and modularity of code. This comes at the cost of effective and efficient collaboration. Also, reduce the data size as early and as much as possible, since this will dramatically speed up everything you do with the data.
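As a minimal sketch of both points (a small, documented function that also shrinks the data), the following averages raw readings over fixed time windows. The function name, the (timestamp, value) input layout and the 60-second window are assumptions for illustration, not part of the case data. Whether averaging is acceptable depends on the sensor and on how short-lived the events you care about are. The sketch uses only the standard library; a data-frame library would typically do the same job on real data.

```python
from collections import defaultdict
from statistics import fmean


def downsample_mean(readings, window_seconds):
    """Average sensor readings over fixed, non-overlapping time windows.

    readings: iterable of (timestamp_in_seconds, value) pairs, in any order.
    window_seconds: width of each window in seconds (must be positive).
    Returns a list of (window_start, mean_value) pairs sorted by window_start.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    buckets = defaultdict(list)
    for timestamp, value in readings:
        # Integer division maps each timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]


# Hypothetical readings: one value every 30 seconds.
raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 9.0)]
reduced = downsample_mean(raw, window_seconds=60)
```

Because the logic lives in one function with a docstring, teammates can reuse it from their own notebooks instead of copying cells.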

Next, you would likely dive into the analysis and modeling. The case description provides you with a couple of good suggestions on what to look at. Nevertheless, it is up to you what you want to investigate in detail and what your final solution looks like. Eventually, you should have a model that detects disruptive events based on sensor data before they happen. But that’s usually not everything. Put yourself in the business’s position and understand what they would be interested in on top of predicting events. Here are a couple of ideas and hints on interesting aspects to look into:
– Identify anomalies in the sensor values. Do identified anomalies correlate with repair events time-wise? Did strong anomalies result in repairs? Does this let you infer “labels” for anomalies? Could we have predicted these anomalies or repairs (build a model that does so)?
– Can we differentiate between kinds of anomalies (e.g. through clustering, investigating particular sensor values…)?
– Do sensor values behave similarly across devices or does each device have its own profile? This might tell you whether you can use one model for all devices or if you need to build a model for every single or every cluster of devices.
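The first idea above could be prototyped roughly as follows. This is only a sketch under stated assumptions: it uses a simple rolling z-score as the anomaly detector (one of many possible choices, not a method prescribed by the case), it treats time as integer steps, and all names, the window size, the threshold and the sample values are made up for illustration.

```python
from statistics import fmean, pstdev


def rolling_zscore_anomalies(values, window, threshold):
    """Return indices whose value deviates strongly from the preceding window.

    Each point is compared only with the `window` points before it,
    so no future information is used.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = fmean(history)
        std = pstdev(history)
        if std == 0:
            continue  # flat history: the z-score is undefined, so skip it
        if abs(values[i] - mean) / std > threshold:
            anomalies.append(i)
    return anomalies


def anomalies_before_repairs(anomaly_times, repair_times, lead_window):
    """Map each repair time to the anomalies seen at most `lead_window` steps earlier."""
    return {
        repair: [a for a in anomaly_times if 0 < repair - a <= lead_window]
        for repair in repair_times
    }


# Hypothetical sensor trace with one spike at index 6 and a repair at step 8.
temperature = [4.0, 4.1, 3.9, 4.0, 4.2, 4.1, 9.5, 4.0, 4.1, 3.9]
anomalies = rolling_zscore_anomalies(temperature, window=5, threshold=3.0)
matches = anomalies_before_repairs(anomalies, repair_times=[8], lead_window=3)
```

Repairs that are regularly preceded by anomalies are candidates for inferred “labels”; repairs with an empty list are worth a closer manual look.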

From a methodological/technical perspective, keep in mind that it’s time series data. Therefore, make sure to use appropriate modeling approaches. Furthermore, in case you decide on framing the challenge as an anomaly detection problem, you might want to look into Autoencoders (especially Deep Recurrent/LSTM Autoencoders). However, take your computational resources and the available amount of data into account when choosing a modeling approach (especially when it comes to Deep Learning). No matter what you eventually choose, it is important to show the validity and robustness of your solution. Therefore, it is crucial that you choose the right target metrics and evaluate your approaches following best practices. Similarly, provide evidence or at least solid arguments for every claim you make.
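To make the evaluation point concrete, here is a small sketch of two habits that usually matter for time series with rare events: splitting chronologically so the model is never tested on data that precedes its training data, and reporting precision and recall rather than plain accuracy, which can look high even if no event is ever caught. The function names and the toy labels are assumptions for illustration; which metric is the right target depends on what a missed event and a false alarm cost the business.

```python
def chronological_split(samples, train_fraction):
    """Split time-ordered samples so the test set lies after the training set."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = disruptive event)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical, already time-ordered sample identifiers.
samples = list(range(8))
train, test = chronological_split(samples, train_fraction=0.75)

# Hypothetical labels and predictions for a held-out period.
precision, recall = precision_recall(y_true=[0, 1, 1, 0], y_pred=[1, 1, 0, 0])
```

A random shuffle before splitting would leak future information into training, so the split above deliberately keeps the original order.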

Once your analysis is complete and you have a working model, you need to create the final deliverable. First, create a thorough report where you show what you have done and why, discuss your findings, show model performance and provide recommendations/next steps. Your report should be valuable to the business, hence you should make it fit-for-purpose and not focus on technical details too much (you can always put them in the appendix). Structure it well and have it follow a story. Then, if you still have time, go the extra mile: create some kind of product prototype. This could be a very simple frontend, e.g. a small desktop GUI built with tkinter (Python), that allows a lay end user to trigger model predictions.
