Team Cherry. The Kaufland Case. Fast and Accurate Image Classification Architecture for Recognizing Produce in a Real-Life Groceries’ Setting

Our best model (derived from VGG) achieved 99.46% top3 accuracy (90.18% top1) with processing time during training of 0.006 s per image on a single GPU Titan X (200s / epoch with 37 000 images).

The team's vision is for the team members to see where they stand compared to others in terms of ideas and approaches to computer vision and to learn new ideas and approaches from the other team-mates and the mentors.

Therefore the team is pursuing a pure computer vision approach to solving the Kaufland and/or the ReceiptBank cases.


Team Cherry

@valan, @hvrigazov, @cecopld, @vsenderov

Datathon 2018. 9 – 11 Feb 2018.

This is a CRISP-DM paper describing the creation of a neural-net-based image classification pipeline for the recognition of product images in a grocery store. The team behind the pipeline, Team Cherry, consists of students and researchers from Savantic AB in Stockholm, Sofia University, and the Bulgarian Academy of Sciences. We were mentored by Antoan Milkov, whom we thank for valuable feedback and directions. Our vision is for the team members to see where they stand compared to others in terms of ideas and approaches to computer vision and to learn new ideas and approaches from the other team-mates and the mentors. Therefore, when constructing the pipeline we tried several different approaches with varying degrees of complexity.

Our best model (derived from VGG) achieved 99.46% top3 accuracy (90.18% top1) with processing time of 0.006 s per image on a single GPU Titan X.

Miroslav Valan is a Ph.D. candidate at the Naturhistoriska Riksmuseet and at Stockholm University in the field of computer vision in bioinformatics. He works for Savantic AB, a machine learning consultancy, where he pursues deep learning. He has been interested in computer vision for the last two years. Miroslav’s contribution was to develop and run the deep learning network used to solve the problem.

Hristo Vrigazov is a computer science student at Sofia University. He is interested in computer vision and reads the PyimageSearch blog on a daily basis. He experimented with CapsuleNet and DenseNet as alternative approaches and helped with writing the report.

Ceco is starting out with computer vision. He learned a lot from the more senior members and contributed many useful ideas.

Viktor Senderov is a colleague of Miroslav’s in the bioinformatics Ph.D. network BIG4. His previous interests included classical AI – semantic technologies, and knowledge graphs – and this year he is moving on to machine learning and probabilistic programming languages. Viktor assembled the team, wrote the majority of the report, and ran the feature extraction.

Business understanding

On 22.01.2018 Amazon opened Amazon Go, a physical store without cashiers and checkout lines – customers just grab the products from the shelves and go. AI algorithms detect what product you have grabbed.

Kaufland offers the unique opportunity to work with their internal data on a similar problem: developing a computer vision algorithm that detects which fruit or vegetable is being scanned.

It is nice to offer your customers more than sixty types of vegetables. It is not nice if the customer is forced to search for these products on the scale in a menu with more than sixty items. The task is to automate this process. Making this procedure more comfortable will enhance customer satisfaction with Kaufland. The task is therefore to build an image recognition system that reliably recognizes which type of vegetable the customer has selected.

A typical user scenario is as follows. The customer weighs one type of vegetable or fruit at a time, which may (in the case of cherries, plums) or may not (e.g. watermelons) be wrapped in a plastic bag. The algorithm embedded in the scale automatically recognizes the type of fruit or vegetable from an image taken by a camera located above the scale. The scale’s monitor shows the several fruits or vegetables that are most likely on the scale and asks for the customer’s confirmation. Ideally, the final result shown on the screen will be a single fruit or vegetable, but as many of them share similar properties, it is also acceptable to show several items to the customer. The lighting, the angle from which the pictures are taken, and the size of the pictures will be the same for every image.

The goal of the research task is to design an image recognition system which recognizes and ranks the type of fruit or vegetable with a certain probability. The input of the system is an image taken from a camera located above the weighing scale in a real store environment.

The challenge is to recognize 3D objects that are naturally grown. For example, different types of apples can look quite different even if they are the same type of fruit. Furthermore, vegetables appear differently if you rotate them, so your model has to deal with this too. Last but not least, vegetables are already wrapped into bags when being weighed. Accordingly, they have to be reliably recognized in spite of strong reflections on the bags that may occur depending on the lighting of the store. Ideally your model works without being newly trained even if the store gets a new lighting system, the bags around the vegetables are badly crunched, and the vegetables are of an extraordinary shape. Pre-processing of the image such as filtering, background removal, and edge detection may increase the accuracy.

The model will be assessed by its final accuracy on the test dataset. The accuracy will be calculated as the number of correctly predicted objects over all objects in the test dataset. An object is considered correctly predicted if it is among the top 3 most probable objects of the model output.
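The top-3 criterion described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the organizers' scoring script; the function name `top_k_accuracy` and the toy scores are assumptions made for the example.

```python
import numpy as np

def top_k_accuracy(probs, labels, k=3):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    # Indices of the k largest scores per row (order inside the top k is irrelevant).
    top_k = np.argpartition(-probs, k - 1, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 images, 4 classes, made-up scores.
probs = np.array([
    [0.10, 0.20, 0.30, 0.40],  # true class 1: third-highest score, a top-3 hit
    [0.70, 0.10, 0.15, 0.05],  # true class 3: lowest score, a miss
    [0.25, 0.40, 0.20, 0.15],  # true class 1: highest score, a top-1 hit
])
labels = np.array([1, 3, 1])

top3 = top_k_accuracy(probs, labels, k=3)
top1 = top_k_accuracy(probs, labels, k=1)
print(top1, top3)
```

A prediction that is only the third most likely class still counts as correct under this metric, which is why top-3 scores are much higher than top-1 scores.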

Data understanding

The input data are stored in JPEG image format and divided into 68 sub-folders (categories). Every such sub-folder corresponds to one fruit or vegetable type. Every category sub-folder has a unique name that starts with a number. All images are 640×480 pixels. It is important to note that the product names, and therefore the directory names, are in German. The total decompressed size of the dataset is 5.5 GB. The total number of images is 37727.
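Given this folder layout, the number of images per category can be tallied with the standard library. The sketch below assumes one sub-folder per category directly under a root directory; the function name and the root path in the usage comment are hypothetical.

```python
from collections import Counter
from pathlib import Path

def count_images_per_category(root):
    """Map each category sub-folder name to the number of JPEG files it holds."""
    counts = Counter()
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            counts[folder.name] = sum(
                1 for f in folder.iterdir()
                if f.suffix.lower() in {".jpg", ".jpeg"}
            )
    return counts

# Usage (the path is a placeholder, not the real dataset location):
# counts = count_images_per_category("path/to/dataset")
# print(counts.most_common(3))   # largest classes
# print(min(counts.values()))    # size of the smallest class
```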

An example of an image of bananas converted to grayscale and its Histogram of Oriented Gradients (HOG) feature descriptor is given below. HOG is a handcrafted feature descriptor and as such can’t benefit from more data; we were, however, interested in whether it keeps the relevant information from the image. As the example below shows, this does not seem to be the case for such unstructured data with lots of variety in the position of the item.

The number of images per class was highly variable, with a maximum of 6525 (red peppers) and a minimum of 2 (yellow plums). A graph and a table of the class counts, which show how unevenly the data are distributed, are available below.

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    2.0   124.8   237.0   555.8   575.2  6525.0



Fig. 1 Number of Images Per Product

We noticed that the images are taken in the same conditions (camera above the scale). Multiple scales were used, each having its own shadow pattern. Luckily, the images of each category were appropriately distributed across scales.

We also noticed that the images are blurry, without sharp edges, so we decided to test a smaller image size, which would enable real-time application.

Data Preparation

We split the data randomly in a 9:1 ratio into a training set with 33916 images and a validation set with 3743 images (see Appendix, Data Understanding). In valan's run, where Keras counts the images automatically, the figures were 33991 and 3804. This split was not possible for the class of yellow plums, which has only two images. In that case we copied one image to the training set and one image to the validation set.
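A 9:1 split that also handles a two-image class can be sketched as follows. This is an illustration under an assumption: it splits within each class, whereas the text only says the split was random, so it is not necessarily the procedure the team ran. The function name and the file names are hypothetical.

```python
import random

def split_per_class(paths_by_class, val_fraction=0.1, seed=0):
    """Split each class roughly 9:1, keeping at least one image on each side."""
    rng = random.Random(seed)
    train, val = [], []
    for label, paths in sorted(paths_by_class.items()):
        paths = list(paths)
        rng.shuffle(paths)
        if len(paths) > 1:
            # At least one validation image, but never the whole class.
            n_val = min(max(1, round(len(paths) * val_fraction)), len(paths) - 1)
        else:
            n_val = 0  # a single image can only go to training
        val += [(p, label) for p in paths[:n_val]]
        train += [(p, label) for p in paths[n_val:]]
    return train, val

# Made-up file names: one large class and one class with only two images.
data = {
    "red_pepper": [f"rp_{i}.jpg" for i in range(100)],
    "yellow_plum": ["yp_0.jpg", "yp_1.jpg"],
}
train, val = split_per_class(data)
print(len(train), len(val))
```

With these inputs the two yellow plum images end up on opposite sides of the split, which matches the special case described above.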


To get a feel for the task, three different approaches were evaluated: a transfer-learning approach, training a dedicated CNN from scratch, and training a CapsuleNet from scratch. All our approaches used data augmentation for additional data generation.


CapsuleNet is a new approach to image classification problems, introduced by G. Hinton in “Dynamic Routing Between Capsules”. It shows promising results on many datasets like MNIST and CIFAR-10, because it enforces attention on the pattern learned by the neural network. We used an implementation of the original paper applied to the MNIST dataset, available here. All images were resized to (64, 64). After 1 epoch, which took 4 hours, CapsuleNet did not show promising results compared to the other approaches, so we abandoned it. As it turned out, CapsuleNet requires more than 500 epochs on average to start producing good results.

Transfer Learning Approach

The transfer learning approach relies on the idea that a neural-network classifier trained on a similar dataset will perform adequately in a new setting. There are two main and slightly different variations of transfer learning. In fine-tuning (Yosinski et al. 2014), we adjust some or all the weights of a pre-trained model to fit a new task similar to the one on which the model was trained. In feature extraction (Azizpour et al. 2016), we use a pre-trained model to obtain a feature vector for each input image. We tried DenseNet121 for fine-tuning and VGG16 for feature extraction. Let’s now have a look at both of them in detail. We will start with VGG16 for feature extraction.

Feature extraction

When using VGG16 for feature extraction, the feature vector can be obtained from different layers of the model with just one feed-forward pass. One can then apply a simpler classifier, e.g. a Support Vector Machine (SVM), using the extracted features as predictors, and, by comparing the prediction accuracy of the SVM across layers, find the maximally informative layer. Finally, once this layer is known, a neural network with a much smaller number of parameters (a top net) can be fit after the maximally informative layer in place of the SVM for maximum performance. In this way, we use as much of the information in the pretrained model as possible and minimize the training time, as we are only training a shallow network from scratch.

We start out by creating a VGG16 model (Simonyan and Zisserman 2015) pre-trained on ImageNet, and specify 5 output layers after every MaxPool layer:

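from keras.applications.vgg16 import VGG16
from keras.layers import GlobalAveragePooling2D
from keras.models import Model

# build the VGG16 network and extract features after every MaxPool layer
model = VGG16(weights='imagenet', include_top=False)
# negative indices -16, -13, -9, -5, -1 select block1_pool ... block5_pool
c1 = GlobalAveragePooling2D()(model.layers[-16].output)
c2 = GlobalAveragePooling2D()(model.layers[-13].output)
c3 = GlobalAveragePooling2D()(model.layers[-9].output)
c4 = GlobalAveragePooling2D()(model.layers[-5].output)
c5 = GlobalAveragePooling2D()(model.layers[-1].output)
# one forward pass now returns five pooled feature vectors per image
model = Model(inputs=model.input, outputs=[c1, c2, c3, c4, c5])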

We save all features of each output layer, fit a LinearSVM on the features from the training set, and test on the validation set. We tested different pooling strategies and input sizes.

After resizing the images to 128×128 we observe that the maximally informative layer is layer 3, with an accuracy score of 70.03%. Please note the following: 1) this is the top1 suggestion, so we can expect a much higher result for top3, and 2) concatenating the features from all layers further improves the accuracy (78.04%). This approach is off-the-shelf and a good starting point, especially when dealing with small datasets.
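The per-layer comparison can be sketched with scikit-learn's `LinearSVC`. The features below are synthetic stand-ins (random class centers plus noise) rather than real VGG16 outputs, only three layer widths are used for brevity, and the helper names are hypothetical; the sketch only shows the mechanics of scoring each layer and the concatenation.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

def score_layers(train_feats, y_train, val_feats, y_val):
    """Fit one LinearSVC per layer, plus one on all layers concatenated.

    Returns validation accuracies in percent; the last entry is the concatenation.
    """
    candidates = list(zip(train_feats, val_feats))
    candidates.append((np.hstack(train_feats), np.hstack(val_feats)))
    scores = []
    for x_train, x_val in candidates:
        clf = LinearSVC(max_iter=5000)
        clf.fit(x_train, y_train)
        scores.append(100.0 * accuracy_score(y_val, clf.predict(x_val)))
    return scores

# Synthetic stand-in for pooled features (NOT real VGG16 activations).
rng = np.random.default_rng(0)
y_train = rng.integers(0, 3, size=120)
y_val = rng.integers(0, 3, size=40)

def fake_features(y, centers):
    # Each sample is its class center plus Gaussian noise.
    return centers[y] + rng.normal(size=(len(y), centers.shape[1]))

centers = [rng.normal(scale=2.0, size=(3, d)) for d in (64, 128, 256)]
train_feats = [fake_features(y_train, c) for c in centers]
val_feats = [fake_features(y_val, c) for c in centers]

scores = score_layers(train_feats, y_train, val_feats, y_val)
for s in scores:
    print("Accuracy score is", s)
```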

(128, 128)
('Accuracy score is ', 48.09761217528208)
('Accuracy score is ', 37.522959853056939)
('Accuracy score is ', 70.034111781684601)
('Accuracy score is ', 68.66964051430071)
('Accuracy score is ', 62.083442665966935)
('Accuracy score is ', 78.037260561532406) - concatenated
(224, 224)
('Accuracy score is ', 44.896352663342952)
('Accuracy score is ', 63.316714773025453)
('Accuracy score is ', 32.406192600367355)
('Accuracy score is ', 67.672526895827872)
('Accuracy score is ', 61.847284177381269)
('Accuracy score is ', 71.792180530044604) - concatenated

Another conclusion that we draw from this result is that the smaller image size (128×128) performed better than the bigger one (224×224) with this off-the-shelf approach. This finding informed the image size choice for our “training-from-scratch” method explained later. Since the network trained from scratch performed excellently (99.46% top 3 accuracy), we didn’t pursue the “top net” refinement of this method mentioned earlier.


Another approach would be to take a pretrained neural network and additionally train it on the new dataset. The network can then be used either as a feature-extraction mechanism as described above or directly for predictions. For that, we experimented with DenseNet121. Again, the images were rescaled to (64, 64). A textual summary can be seen in the Appendix DenseNet. A graphical representation of the model can be seen here. After training for 15 epochs, DenseNet121 achieved 83% validation accuracy, which was not as good as our from-scratch approach.

Dedicated Convolutional Neural Network

We trained a dedicated architecture which is both fast and accurate. After experimenting with the type and number of layers, input sizes, augmentation, and the Batchnorm and Dropout normalizers, we decided to go with a rather simple architecture made of four consecutive blocks, each consisting of Batchnorm, a standard 3×3 Convolution with ReLU activation, and a 2×2 Maxpooling layer. The output of the last block was flattened and fed through one Dense layer before the output.

Layer (type)                 Output Shape              Param #   
batch_normalization_162 (Bat (None, 128, 128, 3)       12        
conv2d_109 (Conv2D)          (None, 126, 126, 32)      896       
max_pooling2d_109 (MaxPoolin (None, 63, 63, 32)        0         
batch_normalization_163 (Bat (None, 63, 63, 32)        128       
conv2d_110 (Conv2D)          (None, 61, 61, 64)        18496     
max_pooling2d_110 (MaxPoolin (None, 30, 30, 64)        0         
batch_normalization_164 (Bat (None, 30, 30, 64)        256       
conv2d_111 (Conv2D)          (None, 28, 28, 128)       73856     
max_pooling2d_111 (MaxPoolin (None, 14, 14, 128)       0         
batch_normalization_165 (Bat (None, 14, 14, 128)       512       
conv2d_112 (Conv2D)          (None, 12, 12, 128)       147584    
max_pooling2d_112 (MaxPoolin (None, 6, 6, 128)         0         
flatten_28 (Flatten)         (None, 4608)              0         
batch_normalization_166 (Bat (None, 4608)              18432     
dropout_47 (Dropout)         (None, 4608)              0         
dense_53 (Dense)             (None, 512)               2359808   
batch_normalization_167 (Bat (None, 512)               2048      
dropout_48 (Dropout)         (None, 512)               0         
dense_54 (Dense)             (None, 68)                34884     
Total params: 2,656,912
Trainable params: 2,646,218
Non-trainable params: 10,694

The number of parameters is approximately 2.66M, the majority of which are in the fully connected part (very fast to train). The number of filters in the four convolutional layers was 32, 64, 128, and 128, as the summary above shows. In 200 s we can process the whole dataset (37 000 images) on a single Titan X (less than 0.006 s per image). The size of the model is 10 MB.
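The parameter counts in the summary can be reproduced by hand. The sketch below uses the filter counts and layer order from the model summary above and assumes 3×3 convolutions without padding, 2×2 max pooling, and 4-byte float weights; the helper names are made up for the example.

```python
def conv_params(kernel, c_in, c_out):
    return kernel * kernel * c_in * c_out + c_out  # weights + biases

def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases

def batchnorm_params(channels):
    # gamma and beta are trainable; moving mean and variance are not.
    return 4 * channels

size, channels = 128, 3
total = non_trainable = 0

# Four blocks: BatchNorm -> 3x3 Conv -> 2x2 MaxPool
for filters in (32, 64, 128, 128):
    total += batchnorm_params(channels) + conv_params(3, channels, filters)
    non_trainable += 2 * channels
    size = (size - 2) // 2  # unpadded 3x3 conv shrinks by 2, pooling halves
    channels = filters

flat = size * size * channels  # flattened feature vector length

# Two heads: BatchNorm -> Dropout -> Dense (Dropout has no parameters)
for n_in, n_out in ((flat, 512), (512, 68)):
    total += batchnorm_params(n_in) + dense_params(n_in, n_out)
    non_trainable += 2 * n_in

trainable = total - non_trainable
size_mb = total * 4 / 1e6  # 4 bytes per float32 weight

print(flat, total, trainable, non_trainable, round(size_mb, 1))
```

The first Dense layer alone accounts for 2,359,808 of the parameters, which is why most of the model sits in the fully connected part.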


Update: test set data just in. Over 99% performance. Ask us for the notebook.

Performance was evaluated on the validation set (see previous paragraph) and is currently being evaluated on the test set (numbers will be updated as they become available). The model with BatchNorm and Dropout set to 0.8 and scheduled to decrease after 10 epochs without progress had the following accuracy:

val loss top1 score top3 score
[0.38221502454028755, 0.88825473329905602, 0.99158907201634616]

The next result is informed by Smith et al. (2017). A model with BatchNorm and Dropout 0.8, a jiggling learning rate, and a gradually increased batch_size:

val loss              top1 score              top3 score 
[0.34097324399913742, 0.90297386500526744, 0.99339141737200831]

The same model tuned for a few more epochs with Dropout set to 0:

 val loss top1 score top3 score 
[0.33503147153551421, 0.90177229756741739, 0.99459298094244963]

The per-category error rate is given in Fig. 2 and the confusion matrix is given in Fig. 3. Both of these figures are available on the following Google Drive link. In the Google Drive, please also find an interactive plotly plot where you can explore the prediction accuracy for individual features.
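For readers who want to recompute such figures from their own predictions, a confusion matrix and per-category error rate can be derived as sketched below. The labels are toy values, top-1 predictions are assumed, and the function names are hypothetical; this is not the code that produced Fig. 2 and Fig. 3.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts images of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def per_category_error(cm):
    """Share of misclassified images per true class (NaN for empty classes)."""
    totals = cm.sum(axis=1)
    correct = np.diag(cm)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(totals > 0, 1.0 - correct / totals, np.nan)

# Toy labels for 3 classes.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2, 2])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
errors = per_category_error(cm)
print(cm)
print(errors)
```

Off-diagonal entries of the matrix show which pairs of products are confused with each other, which is the information that matters when deciding how many options to show on the scale.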

Fig. 2 Per category error rate.

Fig. 3 Confusion matrix.



The obtained top-3 accuracy (over 99 percent) indicates that, should this model be deployed in a real-life setting, only about 1 out of 200 customers (roughly 5 out of 1000) will need to click next to get the second screen of options. Also, the 6 ms of processing time is suitable for the purpose.
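As a back-of-the-envelope check, the two headline figures reported in this paper (99.46% top-3 accuracy on the validation set and 200 s per epoch of 37 000 images) translate into a miss rate and a per-image time as follows. The variable names are chosen for the example.

```python
top3_accuracy = 0.9946              # best validation top-3 score reported above
miss_rate = 1.0 - top3_accuracy     # share of images whose product is not in the top 3
customers_per_miss = 1.0 / miss_rate

epoch_seconds, n_images = 200, 37000
seconds_per_image = epoch_seconds / n_images

print(f"miss rate: {miss_rate:.2%}, i.e. about 1 in {customers_per_miss:.0f}")
print(f"time per image: {seconds_per_image * 1000:.1f} ms")
```

These numbers hold only if the images seen in the store resemble the validation images; the validation set is a random split of the same data.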

Technical notes.

Savantic AB provided GPUs for valan

Our channel is #team_cherry.







