
Fruit Ninjas: Kaufland Case



Tech: Microsoft Azure:

2 vouchers:




Business understanding

Kaufland is among the biggest hypermarket chains in Central and Eastern Europe. The Kaufland team is devoted to enhancing customers’ satisfaction with the products and services offered by its stores and to keeping up with the competition.

The aim of the current case is to build an image-recognition algorithm that reliably recognizes the type of fruit or vegetable the customer has put on a weighing scale. The customer weighs one type of fruit or vegetable at a time, which may or may not be wrapped in a plastic bag. The algorithm embedded in the scale should automatically recognize the product from an image taken by a camera located above the scale. The fruits/vegetables should be reliably recognized despite strong reflections on the bags, which may occur depending on the lighting of the store, and even if the products are badly crushed or of an unusual shape. The scale’s monitor should show the several fruits/vegetables that are most likely on the scale and ask for the customer’s confirmation.
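As an illustration of the last requirement, the sketch below shows one way the few most likely products could be picked from a classifier's output. It assumes the model returns one probability per category; the function name `top_k_labels`, the labels and the probabilities are made up for the example.

```python
def top_k_labels(probabilities, labels, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    ranked = sorted(zip(labels, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical classifier output for one photo (values are made up).
labels = ["Orange", "Zitrone", "Apfel", "Banane"]
probabilities = [0.55, 0.25, 0.15, 0.05]
suggestions = top_k_labels(probabilities, labels, k=3)
```

The customer would then confirm one of the entries in `suggestions`, so the model only needs the right product among its top few guesses rather than as its single best guess.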

Data understanding introduction

The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

For the purpose of the current case we have been provided with a dataset of images in JPG format. The dataset is divided into 68 categories in 68 sub-folders, each corresponding to a different fruit or vegetable. The folders are named with a unique number followed by the fruit/vegetable name in German. Each sub-folder contains images of the product it corresponds to, and the number of images per sub-folder differs significantly. The size of the images is 640×480. These are real images captured in a Kaufland store and show a product on a weighing scale, in or out of a plastic bag.

When working with an image dataset, the best way to understand the data is to look through it. While exploring the provided dataset, we discovered that many of the images capture not only the product and the scale tray but also other objects, such as hands and shadows, which could be an obstacle for our model and should be taken into account during the data preparation process. Furthermore, in most of the pictures the product is contained in a plastic bag, and many of those pictures are unclear even to the human eye.

Another issue that we encountered, as mentioned above, is that the number of pictures per category differs significantly, varying from 2 to 2000; for most of the categories, however, the number is about 200. This was flagged as a problem because the model would have more information to train on for some categories and less for others. Further steps were taken during the data preparation process to resolve the issue.

In some of the categories the pictures differ in color range, which suggests that they were taken in different stores or under different lighting conditions. This should not be disregarded when working on color-recognition approaches for our model.

Data understanding in our approach

The dataset consists of 37 thousand images, all sized 640×480, divided into 68 categories. The smallest category, yellow plums, has only 2 pictures, a total of 12 categories have fewer than 100, while 9 categories have more than 1000 images – with red pepper being the winner, having 6500 images.
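Figures like these can be gathered with a short script. The following is a minimal sketch, assuming the layout described above (one sub-folder per category, JPG files inside); the function names and the `low`/`high` thresholds are illustrative and not part of the team's code.

```python
from pathlib import Path


def count_images(dataset_dir):
    """Map each category sub-folder name to its number of JPG files."""
    counts = {}
    for sub in sorted(Path(dataset_dir).iterdir()):
        if sub.is_dir():
            counts[sub.name] = sum(
                1 for f in sub.iterdir()
                if f.suffix.lower() in (".jpg", ".jpeg")
            )
    return counts


def summarize(counts, low=100, high=1000):
    """Report the smallest and largest categories and how many categories
    have fewer than `low` or more than `high` images."""
    smallest = min(counts, key=counts.get)
    largest = max(counts, key=counts.get)
    return {
        "total": sum(counts.values()),
        "smallest": (smallest, counts[smallest]),
        "largest": (largest, counts[largest]),
        "below_low": sum(1 for n in counts.values() if n < low),
        "above_high": sum(1 for n in counts.values() if n > high),
    }
```

Calling `summarize(count_images(dataset_dir))` on the dataset root would return the total, the extremes and the number of under- and over-represented categories in one dictionary.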

The given dataset contains photos of almost identical setups: a tray that does not change position, some shadows, a little bit of background, the fruit or vegetable and reflections/distortions from the plastic bag. Occasionally a hand appears, guarding the fruits from falling off.

The photos are not well lit: they are prone to dynamically appearing shadows, which make some dark-colored fruits and vegetables appear colorless.

There are 68 different categories, ranging from round bright-colored objects (oranges), to large white-ish ones (cabbage), or dark-green, almost gray/black ones (cucumber, green pepper) and some virtually black ones, like the red cabbage – had it been red on the outside as well, recognition would have been much easier.

At least half of the photos feature a plastic bag – it can heavily blur the object and sometimes even obstruct it completely. That is an important point – not all photos can be classified even by a human, simply because the photo contains no usable information. Therefore, for a real test it would be best to filter the dataset and think of ways to photograph the tray better.

Given the same background, which although dull in colors is actually rich in information, it would be wasteful to attempt image recognition on the whole photo. A good practice would be to separate the object from the background. Data preparation is needed, but since it must be applied to every photo, a universal algorithm needs to be developed that can clean the photo before it knows what it is. In order to achieve that, some understanding of the data is needed.

As a first experiment, a photo of an orange from the internet is analyzed:

Even a basic selection tool from available graphics software manages to distinguish between the background, the plastic bag and the fruit. The analysis is based on neighboring pixels with close color data, with a given tolerance.
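The selection described here behaves like a flood fill with a color tolerance. The sketch below is an assumption about how such a tool works, not the code of the graphics software that was used; the image is a plain list of rows of (r, g, b) tuples, and `select_similar` is a hypothetical name.

```python
from collections import deque


def select_similar(image, seed, tolerance):
    """Return the set of (row, col) pixels connected to `seed` whose color
    stays within `tolerance` (per channel) of the seed pixel's color."""
    rows, cols = len(image), len(image[0])
    seed_color = image[seed[0]][seed[1]]

    def close(color):
        return all(abs(a - b) <= tolerance for a, b in zip(color, seed_color))

    selected = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbours of the current pixel.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in selected:
                if close(image[nr][nc]):
                    selected.add((nr, nc))
                    queue.append((nr, nc))
    return selected


GREY, ORANGE, ORANGE_DIM = (120, 120, 120), (240, 150, 20), (230, 140, 30)
image = [
    [GREY, GREY, GREY],
    [GREY, ORANGE, ORANGE_DIM],
    [GREY, GREY, ORANGE],
]
fruit = select_similar(image, (1, 1), tolerance=20)
background = select_similar(image, (0, 0), tolerance=20)
```

A larger tolerance lets the selection leak into similarly colored regions, which is the failure described below for the banana's colored shadow.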

The banana in the next picture is a bit different, because it casts a colored shadow – and the software recognizes part of the tiles as “fruity”. By increasing the saturation of the whole image (making colored pixels even more colored), the bright banana stands out, and repeating the same operations achieves a similar result.
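Raising the saturation can be sketched per pixel with Python's standard `colorsys` module. This is an illustration only: the factor is arbitrary, `boost_saturation` is a hypothetical name, and the graphics software used for the experiment may apply a different curve.

```python
import colorsys


def boost_saturation(pixel, factor):
    """Scale the HSV saturation of one (r, g, b) pixel (0-255 integers),
    clipping the result at full saturation."""
    r, g, b = (channel / 255.0 for channel in pixel)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    s = min(1.0, s * factor)
    r, g, b = colorsys.hsv_to_rgb(h, s, v)
    return tuple(round(channel * 255) for channel in (r, g, b))


grey_tile = boost_saturation((120, 120, 120), 2.0)   # no color to amplify
banana = boost_saturation((200, 180, 100), 1.5)      # becomes more vivid
```

Grey pixels have zero saturation and are left unchanged, while already colored pixels move further away from grey, which is why the fruit separates more easily from a dull background.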

A similar attempt with red cabbage proved unsuccessful – the shadow of the tray exactly matches the color of the cabbage outer layer.

In such a situation, a different approach might be useful – removing the bright parts of the picture, such as the tray sides, and working with the remainder – some noise is better than all the noise.

Another problem is the hands in the pictures: nearly 10% of the pictures, especially those of big fruits or vegetables or of single items placed on the tray, feature at least a finger. To gain a better understanding, the same method is applied to a watermelon, held steadily.

With saturation (and some hue) adjustment, the 3 segments of the photo appear clearly distinguishable – the background (removed), the hand and the melon. This gives an opportunity for some clustering – or for an algorithm that adjusts saturation until proper clustering is achieved.

It is worth noting that the melon and the hand have completely different color properties. What could be done if similarly colored objects are present, like an orange, held dearly?

With some additional RGB remapping (non-linear channel-by-channel conversion, amplifying some segments of the color spectrum while reducing others, as shown in the figure on the right) the desired result is achieved – as long as a fruit is present, it can be distinguished through color analysis.
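The exact curves are only given in the figure, so the sketch below uses a simple power (gamma) curve per channel as a stand-in. The gamma values, `build_curve` and `remap` are assumptions chosen to show the mechanism of a channel-by-channel lookup table, not the remapping that was actually applied.

```python
def build_curve(gamma):
    """256-entry lookup table for out = 255 * (in / 255) ** gamma.
    gamma < 1 amplifies a channel, gamma > 1 reduces it."""
    return [round(255 * (level / 255) ** gamma) for level in range(256)]


def remap(pixel, curves):
    """Apply one lookup table per channel to an (r, g, b) pixel."""
    return tuple(curve[channel] for curve, channel in zip(curves, pixel))


# Hypothetical setting: amplify red, keep green, reduce blue.
curves = (build_curve(0.5), build_curve(1.0), build_curve(2.0))
remapped = remap((64, 64, 64), curves)
```

Because every channel gets its own table, two objects that look alike in plain RGB can be pushed apart as long as they differ in at least one channel.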

One last challenge for the basic analysis is this piece of ginger, which is so densely covered by the plastic bag that it appears almost identical to the tray in color (not that its natural color helps).

Hue and saturation adjustments as well as some level editing (boosting each RGB channel to force dim pixels to become either green or red based on their initial values) outline the ginger piece like an X-ray machine at the airport – meaning the task is achievable even under harsh photo conditions.

Data Preparation

Now that some understanding has been gained of how objects can be identified, an algorithm is needed to clean all the photos. For that, some photos are once more viewed as spectra of their color spaces. An attempt at RGB spectrum analysis proved unsuccessful, since almost all colors present are a mix of the three channels. The HSV (hue, saturation, value) color space is designed to be closer to human perception, and its channels were already used extensively in the photo examinations above.

Using a photo of some pears, the saturation is mapped in different locations on the photo:

The background has full hue spectrum but is low in saturation (because of its dull colors) and also takes most of the low-light spectrum (value parameter). The selected pear, in comparison, takes only the orange-red spectrum, is very high in saturation and in lightness. The hue analysis proved tricky, because the noise from the tray masks the fruit hue location. Of course, with more careful time-consuming examination this too could be a criterion. The value parameter is also suitable for filtering – although not as clear in practice as the saturation.

The final result is achieved only through saturation filtering – and is satisfactory.

Given a subset of 15 fruits and vegetables, an algorithm using the just-discovered connection, keeping the last 25% of the saturation spectrum, gave 100% successful results with oranges, apples and peaches (brightly colored), had a 70% success rate with white grapes and peanuts due to their more complex structure (some grapes and peanuts were lost), and 50% with 3 different kinds of cabbage because of their size – they simply take up more than 25% of the pixels in the photo.

Red grapes (which are actually a rather dark blue), cucumbers and their small relatives, plums and green pepper have a 0% success rate because these products occupy the lower part of the saturation spectrum (the exact reverse).

To develop a universal method, 250 pictures from a category are analyzed and their spectra mapped. The 3 graphs present the histograms (spectra) of the hue, saturation and value, each from 0 to 1, where the hue is technically 0 to 360 degrees on the hue color wheel used before.
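Such spectra can be computed as plain histograms over the three HSV channels. The following is a minimal sketch using the standard `colorsys` module, which also scales hue to the 0 to 1 range; the bin count and the name `hsv_histograms` are assumptions, and the pixels are given as a flat list of (r, g, b) tuples.

```python
import colorsys


def hsv_histograms(pixels, bins=20):
    """Return three lists of pixel counts (hue, saturation, value),
    each covering the range 0 to 1 with `bins` equal-width bins."""
    hists = [[0] * bins for _ in range(3)]
    for r, g, b in pixels:
        hsv = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        for hist, x in zip(hists, hsv):
            # A channel value of exactly 1.0 falls into the last bin.
            hist[min(int(x * bins), bins - 1)] += 1
    return hists


# Three pure red pixels and two mid-grey pixels.
hue, sat, val = hsv_histograms([(255, 0, 0)] * 3 + [(128, 128, 128)] * 2)
```

Summing the histograms of many photos from one category bin by bin gives a per-category spectrum of the kind discussed here.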

(The spikes in the middle spectrum are from different photos – each has one spike only)

The brightly colored oranges are located in the thick red mark – after the spike from the tray, which is why the last-25% method works so well. It is worth noting that the pictures were probably taken in 2 or 3 different locations, judging by the bi/tri-modal distribution of the spectrum.

The hue gives us almost no information, as oranges occupy the part of the spectrum that is also occupied by the tray, and the value parameter provides no useful information at first glance.

In comparison, a fruit such as the green pepper gives a very different picture:

The saturation spectrum here drops almost straight to zero – meaning that no fruit is hiding there, because the dark green peppers are as dull as the tray. The hue spectrum is much richer, but hardly any universal filtering can be applied there without thorough research (in fact, exactly for the green peppers, hue filtering provides excellent results – but not for many other fruits, which have the same saturation problem).

With some additional examination, the value parameter proves useful – the tray occupies the 0.5 – 1 spectrum, whereas the fruit is located in the darker part. Filtering at 50% of the value spectrum (keeping the lower part) achieves the desired accuracy, with just a few partial shadows remaining in a small number of photos.

Looking again at the subset of 15 fruits and vegetables, the 5 fruits mentioned above work well with saturation filtering and exhibit the same gradual curve – meaning it is a good criterion. 8 other categories show the steep descent described above and work well with value filtering, whereas the remaining categories needed a more careful approach, which was not implemented in the limited timeframe.

This data shapes the preparation algorithm:

  • If a photo has a saturation spectrum with a gradual descent, apply saturation filtering, keeping the last 25%.
  • If a photo does not fulfil this criterion, apply value filtering.
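The two rules can be sketched in Python as follows. The descent test mirrors the one condition that is still visible in the MATLAB listing further down (the mean of the bins 2 to 4 steps past the peak, divided by the peak height, must exceed 0.04). The bin count of 100, the fixed value midpoint of 0.5 and all function names are assumptions, and pixels are a flat list of (r, g, b) tuples.

```python
import colorsys

BINS = 100  # assumed histogram resolution; the original bin count is not stated


def to_hsv(pixels):
    return [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            for r, g, b in pixels]


def histogram(values, bins=BINS):
    hist = [0] * bins
    for x in values:
        hist[min(int(x * bins), bins - 1)] += 1
    return hist


def has_gradual_descent(sat_hist, ratio=0.04):
    """True if the bins 2..4 steps past the peak still hold, on average,
    more than `ratio` times the peak count."""
    peak = max(range(len(sat_hist)), key=sat_hist.__getitem__)
    tail = sat_hist[peak + 2:peak + 5]
    return bool(tail) and sum(tail) / len(tail) / sat_hist[peak] > ratio


def saturation_limit(saturations, keep_fraction=0.25):
    """Saturation at or above which the most saturated `keep_fraction`
    of the pixels lie."""
    ordered = sorted(saturations)
    return ordered[int(len(ordered) * (1.0 - keep_fraction))]


def clean(pixels):
    """Return (rule, cleaned_pixels); rejected pixels are set to black,
    which corresponds to an HSV value of zero."""
    hsv = to_hsv(pixels)
    sats = [s for _, s, _ in hsv]
    if has_gradual_descent(histogram(sats)):
        limit = saturation_limit(sats)
        keep = [s >= limit for s in sats]
        rule = "saturation"
    else:
        keep = [v <= 0.5 for _, _, v in hsv]  # assumed midpoint of the value range
        rule = "value"
    return rule, [p if k else (0, 0, 0) for p, k in zip(pixels, keep)]
```

Both thresholds (the 25% share and the value midpoint) are fixed here for simplicity; the failure cases reported below suggest they would need tuning per lighting condition.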

Applied to the whole dataset, the algorithm produced filtered images with a 70% success rate across nearly all categories – meaning it is indeed universal. The bad results are cases where the fruit or vegetable got cropped out instead of the background in bright lighting conditions, or the reverse – both stem from false assumptions of the algorithm, meaning that with some fine-tuning its performance could improve drastically.

The filters presented here are a great way to reduce the data, but we did not have the time to put them to use. The original idea was to compare the results on the filtered and unfiltered datasets.


Train / Test sample split: The dataset consists of 37 795 color images in *.JPG format with a resolution of 640×480. They belong to 68 distinct categories, each organized in a separate directory. An R routine was developed and used to split the dataset randomly into training and test samples.
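The idea of the split can be sketched in Python as well. This is not the R routine itself (that is listed further down) but a minimal illustration: every file is assigned to the training side independently with a fixed probability, here 0.5 as in the R code. The function name, the seed and the file names are made up.

```python
import random


def split_files(files_by_category, train_fraction=0.5, seed=0):
    """Assign each file to 'train' or 'test' independently, with
    probability `train_fraction` of landing in 'train'."""
    rng = random.Random(seed)
    split = {}
    for category, files in files_by_category.items():
        train, test = [], []
        for name in files:
            (train if rng.random() < train_fraction else test).append(name)
        split[category] = {"train": train, "test": test}
    return split


files = {
    "01_Apfel": ["a%d.jpg" % i for i in range(50)],
    "02_Birne": ["b0.jpg", "b1.jpg"],
}
split = split_files(files, seed=42)
```

With only 2 images in the smallest category, a purely random split can leave one side empty, which is why the R routine special-cases the situation where every file of a category lands in the training folder.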

An attempt was made to force GPU computation. Due to incompatibilities between CUDA, the TensorFlow (1.5) backend and Windows 10, all computations were carried out on the CPU.

Network architecture:

We used a form of transfer learning for this problem, that is, inductive transfer aimed at storing knowledge gained while solving one problem and applying it to a different but related problem.

For instance, a model trained to recognize vehicles can be reused for the recognition of trucks. One of the approaches we used was to remove the top 3 layers of a pretrained convolutional neural network called VGG16, with weights from ImageNet (an image database). VGG16 is a 16-layer network used by the VGG team in the ILSVRC-2014 competition. Details about the architecture can be found in the following arXiv paper: “Very Deep Convolutional Networks for Large-Scale Image Recognition”, K. Simonyan, A. Zisserman, arXiv:1409.1556.

On top of it we built one hidden layer with 1024 neurons and Dropout for half of them, optimized with the Adam optimizer.

All of this was implemented in the Keras library in Python. Unfortunately, the algorithm was not fully trained because of technical issues.


Microsoft Azure Storage Explorer was used to store the pictures with which we were provided.

We needed Microsoft Azure in order to be able to use all the data.

We used Microsoft Azure Machine Learning Studio to try to run the code we already had in Python. There we created:

  • Experiment, but we had some issues with connecting the data
  • Notebook – Jupyter for R
  • Notebook – Jupyter for Python

After we managed to install all the needed libraries, it worked, but it was too slow, so we decided to run a virtual machine.

The virtual machine was created in Microsoft Azure. Unfortunately, we did not know that there is a special virtual machine for data science.

After deploying all the needed software and data to the virtual machine, our Python code started to work. More than 6 hours later it was ready, but our vouchers had expired; all the data we created, along with the trained model, are still in the virtual machine, and we no longer have access to it.

Overall, Microsoft Azure and the supporting tools helped us, but nothing from them made it into the final version of our work.


One major attempt was made to train the network from scratch, with the following result: an accuracy of 80/74 on the train/test sets. Due to the long computation time, no further training was carried out. However, the network still has room to improve.


Used Libraries and Technologies for image processing

MATLAB – MATLAB Image Processing Toolbox

Python for main logic (open source)

R for sample split (open source)

Microsoft Azure

Keras – functional API running on TensorFlow

VGG16 with weights from ImageNet



Dataset Preparation Code in MATLAB (also needed preparation for the test data):

% Read the image as a matrix
% Convert it to HSV colorspace
% (again x3, the 3 being Hue, Sat, Val)
% Calculate the saturation spectrum
% Prepare an indexing matrix
% Find the peak
% Solution variable
if mean(sat(maxpos+2:maxpos+4)) / sat(maxpos) > 0.04, solution = 1; end
% If we find a good gradual descent, apply the 1st solution:
% the limit point for the last 25% of cumulative pixels
% Hue and val spectra
% If the solution is not 1, calculate the middle point of the
% value spectrum to slice
switch solution
    case 1
        % Set every bad pixel's value to zero
        % where its saturation is under the limit
    case 2
        % Set every bad pixel's value to zero
        % where its value is over the limit
end
% Save the image if needed,
% after converting to RGB

Routine for splitting the data set of images into Training and Validation samples in R:

# Image Shift

base_dir <- "C:/Users/zhivk/Desktop/Data_Datathon_Kaufland"
subdirs <- strsplit(list.dirs(base_dir, recursive = FALSE), "/")

## set file path to working dir

setup_dir <- "C:/Users/zhivk/Desktop/KFL_Dataset_Final"

## create dirs (one sub-directory per category, named after the source folder)

categories <- sapply(subdirs, function(x) x[length(x)])

dir.create(file.path(setup_dir, "Development"))
lapply(categories, function(x) dir.create(file.path(setup_dir, "Development", x)))

dir.create(file.path(setup_dir, "Validation"))
lapply(categories, function(x) dir.create(file.path(setup_dir, "Validation", x)))

## Shuffle files

setup_dev <- list.dirs(file.path(setup_dir, "Development"), recursive = FALSE)
setup_val <- list.dirs(file.path(setup_dir, "Validation"), recursive = FALSE)

d1 <- list.dirs(base_dir, recursive = FALSE)

for (i in seq_along(d1)) {

  fp <- list.files(d1[i])

  # Each file goes to Development with probability 0.5
  ix1 <- rbinom(length(fp), 1, prob = 0.5) == 1

  file.copy(from = file.path(d1[i], fp[ix1]), to = file.path(setup_dev[i], fp[ix1]))
  if (all(ix1)) {
    # Nothing was left for Validation: copy every file there as well
    file.copy(from = file.path(d1[i], fp), to = file.path(setup_val[i], fp))
  } else {
    file.copy(from = file.path(d1[i], fp[!ix1]), to = file.path(setup_val[i], fp[!ix1]))
  }
}

x <- list.files("C:/Users/zhivk/Desktop/KFL_Dataset_Prototype/Validation", recursive = TRUE)


Reproducible routine for training the CNN in Python:

# Convolutional Neural Network

# Importing the Keras libraries and packages
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers import Dropout
from keras.backend import set_session
from keras.utils import plot_model

from tensorflow.python.client import device_lib

# TensorFlow 1.x session configuration: run on the CPU only
config = tf.ConfigProto(device_count = {'GPU': 0, 'CPU': 5})
config.gpu_options.per_process_gpu_memory_fraction = 0.5 # allocate at most 50% of GPU memory
config.gpu_options.allow_growth = True # allocate dynamically
sess = tf.Session(config=config)
set_session(sess) # make Keras use the configured session

# Initialising the CNN
classifier = Sequential()

# Step 1 – Convolution
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 48, 3), activation = 'relu'))

# Step 2 – Pooling
classifier.add(MaxPooling2D(pool_size = (2, 2)))

# Adding a second convolutional layer
classifier.add(Conv2D(32, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))

# Step 3 – Flattening
classifier.add(Flatten())

# Step 4 – Full connection
classifier.add(Dense(units = 128, activation = 'relu'))
# One output unit per category folder (68 in this dataset)
classifier.add(Dense(units = 68, activation = 'softmax'))

# Compiling the CNN
classifier.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

# Part 2 – Fitting the CNN to the images

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)

test_datagen = ImageDataGenerator(rescale = 1./255)

training_set = train_datagen.flow_from_directory('C:/Users/zhivk/Desktop/KFL_Dataset_Final/Development',
                                                 target_size = (64, 48),
                                                 batch_size = 32,
                                                 class_mode = 'categorical')

test_set = test_datagen.flow_from_directory('C:/Users/zhivk/Desktop/KFL_Dataset_Final/Validation',
                                            target_size = (64, 48),
                                            batch_size = 32,
                                            class_mode = 'categorical')

#plot_model(classifier, to_file='C:/Users/zhivk/Desktop/KFL_Dataset_Final/model.png')

classifier.fit_generator(training_set,
                         steps_per_epoch = 2932,
                         epochs = 4,
                         validation_data = test_set,
                         validation_steps = 1210)

# Save the trained model
classifier.save(filepath = 'C:/Users/zhivk/Desktop/KFL_Dataset_Final/model.h5')


Implementation for pretrained network in Python:

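from keras import models
from keras import layers
from keras.applications import VGG16
from keras.callbacks import ModelCheckpoint, EarlyStopping

# Load the VGG model
vgg_conv = VGG16(weights='imagenet', include_top=False, input_shape=(48, 64, 3))

# Freeze the layers except the last 4 layers
for layer in vgg_conv.layers[:-4]:
    layer.trainable = False

# Check the trainable status of the individual layers
for layer in vgg_conv.layers:
    print(layer, layer.trainable)

# Create the model
model = models.Sequential()

# Add the vgg convolutional base model
model.add(vgg_conv)

# Add new layers (the convolutional output is flattened before the dense layers)
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(68, activation='softmax'))

# Show a summary of the model. Check the number of trainable parameters
model.summary()

# Compile the model (Adam as stated in the text; loss and metric are assumed)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

checkpoint = ModelCheckpoint('checkpoint', monitor = 'val_acc', save_best_only = True)
early_stop = EarlyStopping(monitor = 'val_loss', patience = 8)

# Train the model (X_train, y_train, X_test, y_test are the image arrays and
# one-hot labels, prepared beforehand)
history = model.fit(x=X_train, y=y_train, validation_data = (X_test, y_test), batch_size=32, epochs = 50, verbose=1,
                    callbacks = [checkpoint, early_stop])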




2 thoughts on “Fruit Ninjas: Kaufland Case”

  1. Overall I am not impressed by the achieved accuracy.

    One of the reasons is the expiration of the Azure vouchers and the troubles with the VM.
    Another possible reason is that the team used a relatively old CNN architecture.
