
Turtles – Case Kaufland

In this article we present our solution for making customers' shopping easier by identifying products from images. We describe our idea and discuss the results of our computer vision (CV) experiment.


Business Understanding

Having to search for a single item in a menu of more than sixty entries is not always convenient for customers. We therefore aim to improve customer satisfaction at Kaufland by making this step more comfortable, faster and automatic.

The main objective of this project is to build a system that classifies images and ranks the candidate fruit and vegetable types by probability, using a computer vision algorithm that detects which fruit or vegetable the customer has selected. The system will save the customer time.

Data Understanding

Dataset 1: the Kaufland dataset, X images of size 640 x 480, with 20-6525 images per class (real data quality). We use it as the test dataset.

In this dataset the distribution of instances per class is very imbalanced overall. One category was particularly problematic: “6667_Pflaumen gelb”, for which there were only two images. There are 68 object classes and 37837 images in total (all classes merged together).
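
A quick way to spot such under-represented classes is to count the images per class folder. The sketch below assumes a hypothetical layout with one sub-folder per class (for example `root/6667_Pflaumen gelb/`) and an assumed set of image file extensions; neither is confirmed by the dataset description above.

```python
from collections import Counter
from pathlib import Path

# Assumption: images are stored as JPEG or PNG files.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def images_per_class(root):
    """Count image files in each immediate sub-folder of `root`."""
    counts = Counter()
    for class_dir in Path(root).iterdir():
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def rare_classes(counts, min_images):
    """Return class names with fewer than `min_images` images, rarest first."""
    return sorted(
        (name for name, n in counts.items() if n < min_images),
        key=counts.get,
    )
```

Calling `rare_classes(images_per_class(root), min_images=20)` would then list the categories that need extra images or special handling; the threshold of 20 is only an illustrative choice.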

Dataset 2: a scientific dataset collected by us, with images of size 512 x 480 (good quality). We use it as the training dataset.

The test dataset has 68 classes:

  • 2 types of apples
  • apricot
  • 2 types of bananas
  • beans
  • beet
  • 7 types of cabbage
  • carrots
  • celery
  • cherry
  • 2 types of cucumbers
  • fennel
  • 2 types of eggplant
  • garlic
  • ginger
  • 3 types of grapes
  • hazelnuts
  • horseradish
  • mandarins
  • 4 types of mushrooms
  • nectarines
  • 4 types of onions
  • oranges
  • 2 types of peaches
  • peanuts
  • 2 types of pears
  • 5 types of peppers
  • 3 types of plums
  • 2 types of potatoes
  • 3 types of pumpkins
  • strawberries
  • sweet potato
  • 3 types of tomatoes
  • walnut
  • 2 types of watermelon
  • 2 types of zucchini

Some categories in the test dataset are hard to tell apart, since most vegetables and fruits are photographed in very different scenarios and conditions. This has a direct impact on the variability of the patterns within each category, both in the dataset at hand and in the real data for which the prediction model needs to be optimised. Given the large number of parameters in a deep neural network (DNN), it is easy to find a combination of parameters that perfectly fits the variability of the training dataset, that is, to overfit.

To make the model generalise well, the variance of the following factors needs to be artificially increased during training:

  • brightness of the scene
  • scale of the objects set (affine transformation -> scaling)
  • angle of the objects set (affine transformation -> rotation)
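
The scaling and rotation listed above can be combined into a single affine matrix applied around the image centre. The following is a minimal sketch, not the code used in the experiment; the function names and the default parameter ranges are illustrative assumptions.

```python
import math
import random


def affine_about_centre(width, height, angle_deg, scale):
    """Return a 2x3 matrix that rotates by `angle_deg` and scales by `scale`
    around the image centre: x' = a*x + b*y + c, y' = d*x + e*y + f."""
    cx, cy = width / 2.0, height / 2.0
    cos_a = scale * math.cos(math.radians(angle_deg))
    sin_a = scale * math.sin(math.radians(angle_deg))
    # The translation terms keep the centre pixel where it is.
    c = cx - cos_a * cx + sin_a * cy
    f = cy - sin_a * cx - cos_a * cy
    return [[cos_a, -sin_a, c], [sin_a, cos_a, f]]


def apply_affine(matrix, x, y):
    """Map one point (x, y) through the 2x3 affine matrix."""
    (a, b, c), (d, e, f) = matrix
    return a * x + b * y + c, d * x + e * y + f


def sample_augmentation(rng, max_angle=30.0, scale_range=(0.8, 1.2),
                        max_brightness=0.05):
    """Draw one random set of augmentation parameters (ranges are assumptions)."""
    return {
        "angle_deg": rng.uniform(-max_angle, max_angle),
        "scale": rng.uniform(*scale_range),
        "brightness": 1.0 + rng.uniform(-max_brightness, max_brightness),
    }


params = sample_augmentation(random.Random(0))
matrix = affine_about_centre(640, 480, params["angle_deg"], params["scale"])
```

The resulting matrix can be handed to any image library that accepts affine coefficients, and the brightness factor can be multiplied into the pixel values before clipping them to the valid range.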

In addition to the step above, clearly defined and common artefact patterns (e.g. barcodes, human hands, parts of plastic bags with no objects inside) can be removed in an image preprocessing step. This step is also crucial to ensure such patterns are not encoded into the DNN as relevant discriminating features.

Alternatively, we could create a parent class for each category (e.g. one for apples, one for tomatoes, etc.) and, once an object has been classified into a parent class, perform a second, fine-grained classification within that class.
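
As a rough illustration of this two-stage idea, the sketch below picks the most likely parent class and then the most likely child within it. The probability dictionaries stand in for the outputs of two hypothetical classifiers; the class names and numbers are made up for the example.

```python
def best(probabilities):
    """Return the (label, probability) pair with the highest probability."""
    return max(probabilities.items(), key=lambda item: item[1])


def classify_two_stage(parent_probs, child_probs_by_parent):
    """Pick the most likely parent class, then the most likely child in it.

    The returned score is P(parent) * P(child | parent).
    """
    parent, p_parent = best(parent_probs)
    child, p_child = best(child_probs_by_parent[parent])
    return parent, child, p_parent * p_child


# Hypothetical outputs of a coarse and a fine classifier.
parent_probs = {"apple": 0.7, "tomato": 0.3}
child_probs_by_parent = {
    "apple": {"apple_red": 0.6, "apple_green": 0.4},
    "tomato": {"tomato_cherry": 0.5, "tomato_vine": 0.5},
}
parent, child, score = classify_two_stage(parent_probs, child_probs_by_parent)
```

One trade-off of this greedy design is that a mistake at the parent level cannot be corrected by the fine classifier, so the parent model needs to be reliable.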

It would be good for the categories to have a similar number of items (i.e. to be well balanced). To achieve this, we must identify which categories have low-quality images and remove those images (e.g. “6667_Pflaumen gelb”). After that, we add more images from the other dataset, which we collected ourselves.

Since those images (the training dataset) are of good quality, in the data enhancement steps above we could also

  • add noise,
  • rotate and blur them

in order to augment the dataset and prevent the DNN from overfitting.


To clean the images and get more of them, we apply the following steps:

1. Remove low-quality pictures

[Figure: examples]

2. Crop the images to their central, most relevant part

The default input size for Inception and Xception is 299×299.

For the rest of the pre-trained models in Keras it is 224×224.

From an initial data exploration we see that we could remove around 60 pixels from each image border quite safely without touching the object part of the scene. This would also avoid several artefacts located at the image borders (e.g. human hands, plastic bags, etc.).

3. Augment images (rotate them by up to 30 degrees, change brightness by +/- 5% on the hue channel)
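
For a 640 x 480 image, trimming 60 pixels from every side leaves a 520 x 360 region. The sketch below computes that crop box and applies it to an image stored as a plain list of rows; the function names are illustrative, and the `(left, upper, right, lower)` order was chosen because it matches the box convention of Pillow's `Image.crop`.

```python
def border_crop_box(width, height, border):
    """Box (left, upper, right, lower) that trims `border` pixels per side."""
    if 2 * border >= min(width, height):
        raise ValueError("border is too large for this image size")
    return border, border, width - border, height - border


def crop_rows(pixels, box):
    """Crop an image stored as a list of rows (each row a list of pixels)."""
    left, upper, right, lower = box
    return [row[left:right] for row in pixels[upper:lower]]


box = border_crop_box(640, 480, 60)
```

The cropped region still has to be resized to the network input size (299×299 or 224×224) before it is fed to the model.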


We used the latest version of GoogLeNet (2015) pretrained on the ImageNet dataset. The general idea is to retrain the network on the dataset at hand, but to start from the pretrained GoogLeNet rather than from a random set of parameters. The advantage is that most of the low-level feature parameters have already been learned on the ImageNet dataset.
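
This kind of transfer learning can be sketched with the Keras API that ships with TensorFlow 2.x. This is not the script used in the experiment: it assumes that Inception v3 is the GoogLeNet variant in question, that the pretrained layers are frozen, and that only a new 68-class softmax layer is trained.

```python
NUM_CLASSES = 68               # number of classes reported in this article
INPUT_SHAPE = (299, 299, 3)    # default Inception v3 input size


def build_transfer_model(num_classes=NUM_CLASSES, learning_rate=0.01):
    """Inception v3 with ImageNet weights and a new softmax head."""
    # Imported lazily so the sketch can be loaded without TensorFlow installed.
    import tensorflow as tf

    base = tf.keras.applications.InceptionV3(
        weights="imagenet", include_top=False, input_shape=INPUT_SHAPE
    )
    # Assumption: keep the pretrained feature extractor fixed.
    base.trainable = False

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Input images would need to be scaled with `tf.keras.applications.inception_v3.preprocess_input` before training. Unfreezing some of the upper layers of the base model later is a common refinement, at the cost of a higher overfitting risk on a small dataset.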

We split the data into training (60%), testing (20%) and validation (20%) sets. The training and testing sets are used to evolve the network using gradient descent (step = 0.01), and the validation set is used to select the network that offers the best performance.
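
A 60/20/20 split of this kind can be sketched as follows. The shuffling, the fixed seed and the function name are assumptions made for the example, and the sketch does not stratify by class, which may matter for very small categories.

```python
import random


def split_dataset(items, train=0.6, test=0.2, seed=0):
    """Shuffle `items` and split them into train, test and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # the remainder is the validation set
    )


train_set, test_set, validation_set = split_dataset(range(100))
```

Splitting file paths rather than loaded images keeps memory use low, and reusing the same seed makes the three subsets stable between runs.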

We used TensorFlow to build the DNNs and TensorBoard to evaluate performance.


At cycle 39340 we achieve 89.9% accuracy on the training dataset (in orange above) and 82.9% accuracy on the validation dataset (in cyan above).


2 thoughts on “Turtles – Case Kaufland”

  1.

    You show good understanding of the data set and the challenges that it represents.
    You’ve managed to find an interesting workaround for some of the issues, mainly orientation and lighting.
    However, there are some areas where you can improve. First is the business understanding. You need to show that you have understood the problem domain and who the final users of the system will be. What are their expectations? In the model section you make a very good argument as to why you’ve decided to use GoogLeNet; however, you could go into more detail about the advantages and disadvantages of using this technology as opposed to other solutions. In the evaluation section you need to provide an analysis of the results showing that you understand the inner workings of your system, not just raw results.
    Also, you should never provide the results from the training and validation sets, as those are considered training data and the results on them are misleadingly high, because the system has seen them before. You should always have a separate set of data for testing purposes.
