
Datathon – Kaufland Airmap – Solution – Nishki


1. Business Understanding

A Kaufland store is a big operation. It has a sales floor of up to 12,000 square meters and carries more than 30,000 products. Many events on our shelves are likely to be overlooked: items sell out, other items get placed in the wrong position, or they are wrongly labeled.

Consequently, customers miss the products they need or relevant information about them. This hurts customer satisfaction and ultimately leads to missed revenue. But how can this situation be avoided, and all the relevant information be collected from the shelves in a very short time?

How about using a drone, flying through the store while capturing various relevant information within seconds? We can take care of the flying operations, so you can focus on the computer vision component.

Checking shelf compliance means processing different types of information and comparing the results. First, we check whether a product is present. If one is, we identify which item it is. The third step is to compare this information with the item that should be at this place, so this reference must be provided by another source, for example a database or a barcode located on the shelf.

2. Data Understanding


The dataset consists of approximately 350 images, containing around 60 different items and their store labels. Each image is accompanied by an XML file with the same name, containing the annotations of the objects (items and store labels) present in the image. We have split the dataset into 3 parts and will refer to the biggest one as “the ground truth”. In this set, all the articles are in place and have the correct store label below them. The <name> of each item in the annotations is a unique eight-digit number (example 00060444), and the <name> tag of the corresponding store label is preceded by “label_” (example “label_00060444”).

The “working” part of the dataset provides images in which some of the items are missing (i.e. the space above the corresponding label is empty). In this case, the annotations of the label and the item are extended with a trailing “_m” (“00060444_m” and “label_00060444_m”). Other articles have wrong labels, or vice versa. If you choose to take on the challenge of detecting the nightmare of every store manager – the misplaced item/label – keep in mind that these combinations are marked with a trailing “_w” in the <name> tags (example “00060444_w” and “label_00060440_w”).

The “evaluation” dataset is very similar to the “working” one, but is held out by the DSS to evaluate the designed approach to the problem.

Among the annotations, some of the objects are not completely visible, or other circumstances impair a good view of them – in this case, the <difficult> tag is set to 1.
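The naming convention above can be captured with a small parser. The following is a minimal sketch (`parse_name` is a hypothetical helper, not part of the original solution):

```python
import re

# A <name> tag is: optional "label_" prefix, an 8-digit item number,
# and an optional "_m" (missing) or "_w" (wrong/misplaced) suffix.
NAME_RE = re.compile(r"^(label_)?(\d{8})(_m|_w)?$")

def parse_name(name):
    """Split an annotation <name> into (kind, item number, status)."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"unexpected <name> value: {name}")
    prefix, item, suffix = m.groups()
    kind = "label" if prefix else "item"
    status = {None: "ok", "_m": "missing", "_w": "wrong"}[suffix]
    return kind, item, status
```

For example, `parse_name("label_00060444_m")` yields `("label", "00060444", "missing")`.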


XML data sample

3. Data Preparation

We converted the data to a format that is required by our scripts that are used to train the machine learning models.

We combined all XML files from “ground_truth_xml” and “working_xml” and wrote a C# console application to process them. We used a regex to remove the missing items (all names ending in “_m”), renamed all store labels to just “label”, removed all objects that are difficult to recognize (<difficult>1</difficult>), and renamed all misplaced and wrong items to plain “{item_number}”.

Finally, we shrank the images by approximately a factor of 10 and adjusted all coordinates in the XML files to match the resized images. Here is the code for the XML processing.
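The original processing was a C# console application, which is not reproduced here. A minimal Python sketch of the same steps (drop difficult objects and missing items, collapse store labels to one class, strip the “_w” suffix, rescale coordinates) for a Pascal-VOC-style annotation might look like this; the `scale` value and tag names are assumptions based on the description above:

```python
import xml.etree.ElementTree as ET

def clean_annotation(xml_text, scale=0.1):
    """Apply the preprocessing described above to one annotation file."""
    root = ET.fromstring(xml_text)
    for obj in list(root.findall("object")):
        name = obj.findtext("name")
        # Drop hard-to-see objects and missing items ("..._m").
        if obj.findtext("difficult") == "1" or \
                (name.endswith("_m") and not name.startswith("label")):
            root.remove(obj)
            continue
        if name.startswith("label"):
            # Every store label becomes the single class "label".
            obj.find("name").text = "label"
        else:
            # Misplaced/wrong items keep only their item number.
            obj.find("name").text = name.removesuffix("_w")
        # Rescale the bounding box to match the shrunken images.
        for tag in ("xmin", "ymin", "xmax", "ymax"):
            node = obj.find(f"bndbox/{tag}")
            node.text = str(round(int(node.text) * scale))
    return ET.tostring(root, encoding="unicode")
```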


4. Modeling

We used the YOLO (“You Only Look Once”) algorithm for real-time object detection and classification, via its implementation in DarkFlow.

How does YOLO work?

Prior detection systems repurpose classifiers or localizers to perform detection. They apply the model to an image at multiple locations and scales. High scoring regions of the image are considered detections.

We use a totally different approach. We apply a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.

YOLO Documentation


Limitations of YOLO

YOLO imposes strong spatial constraints on the bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds. Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image. Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.


Loss Function in YOLO

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall. During training, we optimize the following, multi-part loss function:
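The equation image is not reproduced here; the multi-part loss from the original YOLO paper is:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
    + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2
 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
```

Here the image is divided into an $S \times S$ grid with $B$ boxes per cell; $\mathbb{1}_{ij}^{\text{obj}}$ selects the predictor responsible for an object, $C_i$ is the confidence, $p_i(c)$ the class probability, and the paper sets $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$ to balance localization against the many empty cells.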

When the model is given an image, it finds and recognizes the objects and returns a .json file listing all objects found in the image with their coordinates and class. Then a C# routine uses the coordinates from the .json file to cut the image down to just the object.

Example image:

Example .json output of one of the pineapples cans:

The output cut image of the pineapple can:

Example .json  output of one of the labels:

The output cut image of the label:
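The example images and JSON are not reproduced here. DarkFlow’s `--json` output is a list of detections with a label, a confidence, and the two corners of the bounding box; the cropping step (done in C# in the original, sketched here with Pillow and a made-up detection) just slices the image at the predicted box:

```python
from PIL import Image

# Hypothetical DarkFlow detection for one pineapple can
# (the values here are illustrative, not from the real output).
detection = {
    "label": "00060444",
    "confidence": 0.87,
    "topleft": {"x": 120, "y": 40},
    "bottomright": {"x": 260, "y": 210},
}

def crop_detection(image, det):
    """Cut the detected object out of the full shelf image."""
    box = (det["topleft"]["x"], det["topleft"]["y"],
           det["bottomright"]["x"], det["bottomright"]["y"])
    return image.crop(box)
```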


How to get results from our solution?

Here is the result when we run an image through our network:


Examples from running an image, with the JSON from our network, through the C# crop code.

And here are the results when we used our OCR Service in a console app on the image above.


We pass the images through the trained DarkFlow model, which returns a .json file for each image with the coordinates and classification data we need.

How we got the model up and running:

1. Log in to the AWS cloud instance, open a terminal and clone the DarkFlow repo:
   git clone https://github.com/thtrieu/darkflow.git
2. Run the DarkFlow setup file:
   python3 setup.py build_ext --inplace
3. Download the yolov2.weights file (the pre-trained weights from the official YOLO website) and place it in the /bin folder.
4. Download the yolov2.cfg config file from the official YOLO website and make a copy of it (my_yolov2.cfg). In the copy, change the last layer’s config according to the number of labels (71 in our case) and the number of filters (380 in our case, since filters = (classes + 5) × 5 anchors = (71 + 5) × 5).
5. Download our modified XMLs and all the images (“./bulk” – training images, “./bulk_xml” – XML annotations).
6. Run the following command to start training:
   python3 flow --train --load bin/yolov2.weights --model cfg/my_yolov2.cfg --trainer adam --dataset "./bulk" --annotation "./bulk_xml" --save 80
7. Use the following command to produce the JSONs with coordinate data from the test image dataset:
   python3 flow --pbLoad path_to_build_graph/my_yolo.pb --metaLoad path_to_build_graph/my_yolo.meta --imgdir path_to_test_images/ --json

Once we have the .json files with the coordinates for an image, we load them with our C# code.

We call the GetData method of our JsonParser and pass as arguments the path of the image we are currently classifying and the path of the JSON containing the coordinates of the objects. Once the JSON file is loaded, the method goes through the labels, calls the CropImage method, passes each cropped image to the GetTextAsync method of the TextExtractor class, gets the product number from the label, and updates the label name from “label” to “label_{productnumber}”.

The TextExtractor class uses the Computer Vision API from Azure Cognitive Services.
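The Azure call itself needs a subscription key and an endpoint, so it is omitted here. The local part of that flow – extracting the eight-digit product number from the OCR text and renaming the generic “label” detection – can be sketched as follows (`rename_label` is a hypothetical helper, not the original C# code):

```python
import re

def extract_product_number(ocr_text):
    """Find the first eight-digit product number in the OCR output."""
    m = re.search(r"\b\d{8}\b", ocr_text)
    return m.group(0) if m else None

def rename_label(detection, ocr_text):
    """Turn a generic "label" detection into "label_{productnumber}"."""
    number = extract_product_number(ocr_text)
    if detection["label"] == "label" and number is not None:
        detection = {**detection, "label": f"label_{number}"}
    return detection
```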


We pass the returned data from JsonParser.GetData to our IsEverythingFine method.

First we group the products (with their labels) by rows. We sort the objects by their X coordinate and traverse them from smallest to largest. If the objects (with their labels) form their own UNIQUE object groups, then there are no swapped objects.

Example: (03001210 03001210 label_03001210) (03001211 03001211 label_03001211) – objects are properly separated into their own UNIQUE subgroups.

Example: (03001210 03001210 label_03001210) (03001211 03001211 label_03001211) (03001210) – objects do not form UNIQUE subgroups (there are two groups of “03001210”).

This takes care of the misplaced items/labels issue. For the missing items problem, we check that an item exists for each label. Before that, we find the product with the biggest Y value and discard all labels with a higher Y value than it. We do that so as not to flag as problematic images of perfectly arranged shelves that caught the labels of the upper shelf at the top of the image, but not the products themselves.
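The grouping and missing-item checks described above can be sketched as follows: sort the detections left to right, collapse consecutive runs of the same product number into groups, and flag the shelf if any product appears in more than one group or any label has no matching item. The `(name, x)` input format is an assumption for illustration:

```python
from itertools import groupby

def contiguous_groups(names):
    """Collapse a left-to-right run of detections into product groups."""
    return [key for key, _ in groupby(names)]

def is_everything_fine(detections):
    """detections: list of (name, x) pairs for one shelf row, with names
    like "00060444" for items and "label_00060444" for store labels."""
    ordered = [name for name, _ in sorted(detections, key=lambda d: d[1])]
    items = [n for n in ordered if not n.startswith("label_")]
    # Misplaced check: each product must form exactly one contiguous group.
    groups = contiguous_groups(items)
    if len(groups) != len(set(groups)):
        return False
    # Missing check: every store label needs at least one matching item.
    labels = {n.removeprefix("label_") for n in ordered if n.startswith("label_")}
    return labels <= set(items)
```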


Combining everything our solution achieved a score of 0.677215 on the leaderboard!
