Kaufland is a German hypermarket chain, part of the Schwarz Gruppe. It is among the biggest hypermarket chains in Central and Eastern Europe. The chain operates over 1,200 stores in Germany, the Czech Republic, Slovakia, Poland, Romania, Bulgaria and Croatia, and plans to open stores in Australia and Moldova. The Kaufland team is devoted to improving customers’ satisfaction with the products and services offered by its stores and to keeping up with the competition.
On 22.01.2018 Amazon opened Amazon Go, a physical store without cashiers and checkout lines: customers just grab the products from the shelves and go. AI algorithms detect which products each customer has taken.
Kaufland offers the unique opportunity to work with its internal data on a similar problem: developing a computer vision algorithm that detects which fruit or vegetable is being scanned.
It is nice to offer your customers more than sixty types of vegetables. It is not nice if the customer is forced to search for these products at the scale, in a menu with more than sixty items. Making this procedure more comfortable will enhance customer satisfaction with Kaufland. Your task is therefore to build an image recognition model that does this automatically and reliably recognizes which type of vegetable the customer has selected.
A typical user scenario is as follows. The customer weighs one type of fruit or vegetable at a time, which may (cherries, plums) or may not (watermelon) be wrapped in a plastic bag. The algorithm embedded in the scale automatically recognizes the type of fruit or vegetable from an image taken by a camera located above the scale. The scale’s monitor shows the fruits or vegetables that are most likely on the scale and asks for the customer’s confirmation. Ideally, the final result shown on the screen will be a single fruit or vegetable, but as many of them share similar properties it is also acceptable to show the customer several items. The lighting, the angle from which the pictures are taken and the size of the pictures will be the same for every image.
The model will be assessed by its accuracy on the test dataset. The accuracy is calculated as the number of correctly predicted objects divided by the number of all objects in the test dataset. An object is considered correctly predicted if its true class is among the top 3 most probable classes in the model output.
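The metric described above can be sketched in a few lines of Python. This is only an illustration: the function names, the dictionary format of the model output and the example scores are assumptions made for the sketch, not part of the official evaluation code.

```python
from typing import Dict, List, Sequence


def top_k_labels(scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k labels with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def top_k_accuracy(y_true: Sequence[str],
                   y_scores: Sequence[Dict[str, float]],
                   k: int = 3) -> float:
    """Share of objects whose true label is among the k most probable labels."""
    if len(y_true) != len(y_scores):
        raise ValueError("y_true and y_scores must have the same length")
    if not y_true:
        raise ValueError("the test set is empty")
    hits = sum(label in top_k_labels(scores, k)
               for label, scores in zip(y_true, y_scores))
    return hits / len(y_true)


# Hypothetical model outputs for three test images (made-up numbers).
y_true = ["apple", "plum", "leek"]
y_scores = [
    {"apple": 0.6, "pear": 0.2, "plum": 0.1, "leek": 0.1},    # true label ranked 1st: hit
    {"apple": 0.4, "pear": 0.3, "plum": 0.2, "leek": 0.1},    # true label ranked 3rd: hit
    {"apple": 0.5, "pear": 0.3, "plum": 0.15, "leek": 0.05},  # true label ranked 4th: miss
]
accuracy = top_k_accuracy(y_true, y_scores, k=3)
print(f"top-3 accuracy: {accuracy:.3f}")
```

With this definition a prediction counts as correct even when the true class is only the third most probable, which matches the scale showing several candidates for the customer to confirm.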
Solutions and approach
The first step is to understand the given data: what format it is in and how exactly it is divided.
In this case the data is in the form of images. They can be converted to grayscale or to a Histogram of Oriented Gradients (HOG) feature descriptor. HOG is a handcrafted feature descriptor: its features are computed by a fixed rule rather than learned, so the descriptor itself cannot benefit from more data.
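To make the idea concrete, here is a deliberately simplified sketch of both conversions in plain Python. It assumes the image is a nested list of RGB tuples and treats the whole image as a single cell with 9 orientation bins. A full HOG descriptor also splits the image into many cells and normalizes over blocks, which is omitted here.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def to_grayscale(image: Sequence[Sequence[Pixel]]) -> List[List[float]]:
    """Convert RGB pixels to luminance using the ITU-R BT.601 weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in image]


def orientation_histogram(gray: Sequence[Sequence[float]], bins: int = 9) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientations (one cell)."""
    hist = [0.0] * bins
    height, width = len(gray), len(gray[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # centered differences in the horizontal and vertical direction
            gx = gray[y][x + 1] - gray[y][x - 1]
            gy = gray[y + 1][x] - gray[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            # unsigned orientation in [0, 180) degrees
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle / (180.0 / bins)), bins - 1)] += magnitude
    # L2-normalize so the descriptor does not depend on overall contrast
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist


# Toy 4x4 image with a vertical edge: two black columns, then two white columns.
black, white = (0, 0, 0), (255, 255, 255)
image = [[black, black, white, white] for _ in range(4)]
descriptor = orientation_histogram(to_grayscale(image))
print(descriptor)
```

Nothing in these functions is fitted to data, which is what "handcrafted" means here: only the classifier trained on top of such features can learn from additional images.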
Since all of the data are images, recognition can be performed on the whole photo using some RGB mapping.
However, the number of images varies from product to product, as does the camera angle, and some images are blurry while others are not, which creates a lot of potential problems.
The data should be randomly split into a training set and a validation set. For product classes with only a few images, this split is usually 50/50.
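One possible way to implement such a split is to do it class by class, so that rare products appear on both sides. The sketch below is an assumption-laden illustration: the 20% validation share for large classes and the threshold of 10 images for a "small" class are made-up defaults, and the function name is hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str],
                     val_fraction: float = 0.2,         # assumed share for large classes
                     small_class_size: int = 10,        # assumed threshold for "few images"
                     small_class_fraction: float = 0.5,  # the 50/50 split for small classes
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split sample indices into training and validation sets, class by class."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train: List[int] = []
    val: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        small = len(indices) <= small_class_size
        fraction = small_class_fraction if small else val_fraction
        n_val = round(len(indices) * fraction)
        # keep at least one image on each side when the class has two or more images
        if len(indices) >= 2:
            n_val = min(max(n_val, 1), len(indices) - 1)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return train, val


# Hypothetical label list: one frequent product and one rare product.
labels = ["tomato"] * 100 + ["ginger"] * 6
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

Fixing the seed keeps the split reproducible, so that different models are compared on the same validation images.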
Several quite different approaches can be evaluated:
- transfer-learning approach
- training a dedicated CNN from scratch
- CapsuleNet from scratch
CapsuleNet is a recent approach to image classification problems, introduced by G. Hinton and co-authors in “Dynamic Routing Between Capsules”. It shows promising results on datasets such as MNIST and CIFAR-10, because it forces the network to attend to the patterns it has learned.
The transfer learning approach relies on the idea that a neural-network classifier trained on a similar dataset will perform adequately in a new setting. There are two main, slightly different variations of transfer learning.
The first variation uses a pretrained network as a fixed feature extractor. A simpler classifier (a Support Vector Machine) is fitted with the features extracted from a given layer as predictors, and comparing the prediction accuracy of the SVM across different layers identifies the maximally informative layer. Once this layer is known, a neural network with a much smaller number of parameters (a top net) can be fitted after it in place of the SVM for maximum performance.
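The layer-selection step can be sketched as follows. To keep the example dependency-free, a nearest-centroid classifier stands in for the SVM, which is a plain substitution and not the method described above. The layer names and feature values are hypothetical, and in practice the features would come from the pretrained network.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def fit_centroids(features: Sequence[Vector], labels: Sequence[str]) -> Dict[str, List[float]]:
    """Average the feature vectors of each class (stand-in for fitting an SVM)."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vector, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids: Dict[str, List[float]], vector: Vector) -> str:
    """Return the class whose centroid is closest to the feature vector."""
    def squared_distance(centroid: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(centroid, vector))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def layer_accuracy(train_x, train_y, val_x, val_y) -> float:
    """Validation accuracy of the simple classifier on one layer's features."""
    centroids = fit_centroids(train_x, train_y)
    hits = sum(predict(centroids, x) == y for x, y in zip(val_x, val_y))
    return hits / len(val_y)


def most_informative_layer(train_features, train_y, val_features, val_y) -> str:
    """Pick the layer whose features give the best validation accuracy."""
    scores = {layer: layer_accuracy(train_features[layer], train_y,
                                    val_features[layer], val_y)
              for layer in train_features}
    return max(scores, key=scores.get)


# Hypothetical features from two layers of a pretrained network (made-up numbers).
train_y = ["apple", "apple", "pear", "pear"]
train_features = {
    "conv3": [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],  # classes overlap
    "conv5": [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]],  # classes separate
}
val_y = ["apple", "pear"]
val_features = {
    "conv3": [[0.0, 1.0], [0.0, 1.0]],
    "conv5": [[0.05, 0.05], [0.95, 0.95]],
}
best_layer = most_informative_layer(train_features, train_y, val_features, val_y)
print(best_layer)
```

The same loop works with any classifier that can be fitted quickly, which is the point of using a simple model for this comparison before investing in a top net.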
The other variation is to take a pretrained neural network and train it further on the new dataset. The network can then be used either as a feature-extraction mechanism, as described above, or directly for predictions.
The best model during the Datathon 2018 came from Team Cherry. They achieved 99.46% accuracy, with a processing time during training of about 0.006 s per image on a single Titan X GPU (200 s per epoch with 37,000 images).
This approach is currently being tested in Germany.
Solutions during the Datathon 2018