Team solutions

Using Convolutional Neural Networks for Real-Time Product Recognition in Smart Scales – Imagga’s Solution to the Kaufland Case

Many big retailers offer in-store a rapidly growing variety of fresh produce from local and global sources, such as fruit and vegetables, that need to be weighed quickly and effortlessly to mark their quantity and respective price. Smart scales that use image recognition to optimise the user experience and enable additional features, such as cooking recipes, can provide a new solution to this problem. The solution we provide to the Kaufland case involves training a Convolutional Neural Network (CNN) with the GoogLeNet architecture on the original Kaufland data set and fine-tuning it with a Custom training set we created, achieving the following results (Kaufland Case Model #13): training accuracy: Top-1: 91%, Top-5: 100%; validation accuracy: Top-1: 86.1%, Top-5: 99%; and TEST data set accuracy: Top-1: 86.1%, Top-5: 99.2%. We also created another model (Kaufland Case Model #14) by combining similar categories, achieving: training accuracy: Top-1: 96%, Top-5: 100%; validation accuracy: Top-1: 92.5%, Top-5: 100%; and TEST data set accuracy: Top-1: 91.3%, Top-5: 100%. All trainings were done on our NVIDIA DGX Station training machine using BVLC Caffe and the NVIDIA DIGITS framework. In our article we show visualisations of our numerous trainings and provide an online demo with the best classifiers, which can be tested further. During the final DSS Datathon event we plan to show a live food recognition demo with one of our best models running on a mobile phone. Demo URL: http://norris.imagga.com/demos/kaufland-case/


Authors

Georgi Kostadinov ([email protected])

Stavri Nikolov ([email protected])

Mentors:

Antoan Milkov

 

Business Understanding

Many big retailers offer in-store a rapidly growing variety of fresh produce from local and global sources, such as fruit and vegetables, that need to be weighed quickly and effortlessly to mark their quantity and respective price. Smart scales that use image recognition to optimise the user experience and enable additional features, such as cooking recipes, can provide a new solution to this problem. Food recognition from photos (both food-ingredient recognition and cooked-food recognition) has a variety of applications, such as healthy-lifestyle support (food consumption and dietary-requirements monitoring, lifelogging), cooking inspiration and recommendations, food quality control, food provisioning and planning for large-scale food facilities, etc. Such real-time food recognition solutions are being embedded in mobile phones (apps such as CalorieMama and DeepFoodCam), smart diaries (Spoonacular), restaurants and large-scale cooking or food-serving spaces, smart refrigerators, and smart scales, as in the Kaufland case for the DSS Datathon 2018.

 

Data Understanding

The data was provided in the form of a single 5.6GB archive file (called Original in the technical notes and results below). The data set has 68 categories of fruit and vegetables and contains 32129 training photos in total. Since only one training data set was provided by Kaufland, 5666 validation images, i.e. 15% of the total data set, were kept aside and used in the evaluation of all proposed and tested approaches. Kaufland also provided a TEST data set consisting of 16248 images. We report the TEST results for models Kaufland Case #13 and Kaufland Case #14, and also provide visualisations of the TEST results for both models.
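A per-category hold-out like the one described above can be sketched as follows. This is our own minimal illustration, not code from the actual pipeline; the function name and signature are assumptions:

```python
import random


def stratified_split(images_by_category, val_fraction=0.15, seed=42):
    """Hold out ~val_fraction of each category for validation, so the
    validation set mirrors the class distribution of the training data.

    images_by_category: dict mapping category name -> list of image paths.
    Returns (train, val) lists of (path, category) pairs.
    """
    rng = random.Random(seed)
    train, val = [], []
    for category in sorted(images_by_category):
        images = list(images_by_category[category])
        rng.shuffle(images)  # shuffle before splitting to avoid ordering bias
        n_val = int(len(images) * val_fraction)
        val += [(img, category) for img in images[:n_val]]
        train += [(img, category) for img in images[n_val:]]
    return train, val
```

Splitting per category rather than globally ensures that even small categories contribute images to the validation set.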

The photos for each category were visually inspected. A list with all 68 categories of fruit and vegetables in the Original data set was created, together with the number of photos for each category. The names of the different fruit and vegetables were translated from German into English with the aim of using the English names for mapping to other additional data that were used in our studies (see details below).

The Original Kaufland data set is very imbalanced, with some categories of fruit and vegetables having over 1000 photos while others have only a few tens of photos (see below).

While this is something to consider, there are many classical approaches to combat imbalanced classes in machine-learning data sets. There is no fundamental reason why a training data set of the kind provided by Kaufland cannot be balanced: over time, a large enough number of photos of each product bought by real clients can be captured by the camera connected to the smart-scales prototype, which will guarantee a diverse and balanced data set. Hence, while some effort was made to balance the data set by adding photos from other sources for categories with only a few images, this was not the main focus of our work.
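One of the classical approaches mentioned above is oversampling: repeating images of minority classes until every class roughly matches the largest one. A minimal sketch (our own illustration; the helper name and rounding strategy are assumptions, not the method actually used):

```python
from collections import Counter


def oversampling_factors(labels, target=None):
    """Return, per class, how many times its images should be repeated
    so that every class reaches roughly the size of the largest class
    (or an explicit target count)."""
    counts = Counter(labels)
    target = target or max(counts.values())
    # Never drop below 1x; round to the nearest whole repetition factor.
    return {cls: max(1, round(target / n)) for cls, n in counts.items()}
```

The resulting factors can be used to duplicate entries in the training list before shuffling, which is a simple alternative to class-weighted losses.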

 

Data Preparation

Apart from the Original data set provided by Kaufland, several other data sets were prepared and used in our experiments. These include:

Food101:

  • The public Food101 dataset with 101 categories of food images
  • 101 labels
  • 85850 training images
  • 15150 (15%) validation images

FoodNet:

  • Food101 dataset extended with vegetable/fruit images and categories from ImageNet
  • 150 labels
  • 98994 training images
  • 11006 (15%) validation images

Original:

  • The data from Kaufland, no changes to the data
  • 68 labels
  • 32129 training images
  • 5666 (15%) validation images

Custom:

  • Same labels as the Original dataset, but with images downloaded from Bing
  • Downloaded, manually curated and combined:
    • from German Bing search results using the German labels – this returned more images for the specific German labels than the English version did
    • from English Bing search results using the translated English labels
  • 68 labels
  • 3780 training images
  • 668 (15%) validation images

Combined1:

  • Combined datasets – Original + images from German Bing search results
  • 68 labels
  • 34148 training images
  • 6028 (15%) validation images

Combined2:

  • Combined datasets – Original + Custom
  • 68 labels
  • 35905 training images
  • 6335 (15%) validation images

Combined3:

  • Combined2 dataset but with merged similar categories
  • 46 labels
  • 35905 training images
  • 6335 (15%) validation images

In-Bag:

  • 10 picked categories from the original 68, curated to contain only images with products in a bag
  • 10 labels
  • 1117 training images
  • 197 (15%) validation images

No-Bag:

  • 10 picked categories from the original 68, curated to contain only images with products without a bag
  • 10 labels
  • 398 training images
  • 70 (15%) validation images

In-Bag + No-Bag:

  • Combined datasets In-Bag and No-Bag
  • 10 labels
  • 1514 training images
  • 268 validation images

Test:

  • The test data from Kaufland, no changes to the data
  • 16248 test images

 

Modeling

The main classification approach we applied is based on Convolutional Neural Networks (CNNs) and uses BVLC Caffe and the NVIDIA DIGITS framework. All trainings were done on our NVIDIA DGX Station training machine. The main training details are given below:

Training Details:

  • Chosen architecture: GoogLeNet
  • Training epochs: 30
  • Learning rate:
    • Base:
      • 0.01 – when training vanilla models
      • 0.001 – when finetuning models
    • Policy: Polynomial decay
    • Power: 0.5
  • Solver: Stochastic Gradient Descent
  • The main approach used is parameter transfer learning (fine-tuning), resulting in faster training and better results.
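The settings above map onto a Caffe solver definition roughly as follows. This is a sketch, not the exact file used: the net filename, `max_iter` and `momentum` values are placeholders (DIGITS derives `max_iter` from the 30 epochs, the data set size and the batch size):

```protobuf
# Sketch of a Caffe solver.prototxt matching the settings listed above.
net: "train_val.prototxt"   # GoogLeNet definition (placeholder filename)
type: "SGD"                 # Stochastic Gradient Descent
base_lr: 0.001              # 0.01 for vanilla training, 0.001 when fine-tuning
lr_policy: "poly"           # polynomial decay
power: 0.5
max_iter: 30000             # placeholder; chosen to cover ~30 epochs
momentum: 0.9               # common default, not specified in the text
solver_mode: GPU
```

For fine-tuning, the pretrained weights are passed separately (e.g. via Caffe's `-weights` flag or DIGITS's pretrained-model option), so only the solver hyperparameters change between vanilla training and fine-tuning.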


Fig 01: Learning rate visualisation
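The polynomial decay policy visualised in Fig 01 follows Caffe's formula lr = base_lr * (1 - iter/max_iter)^power, which can be written out as a small helper (our own sketch for illustration):

```python
def poly_lr(base_lr, iteration, max_iter, power=0.5):
    """Caffe's 'poly' learning-rate policy:
    lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - float(iteration) / max_iter) ** power
```

With power 0.5 the rate decays slowly at first and drops off sharply near the end of training.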

 

Evaluation

The results we obtained from the different trainings are given below. For all evaluations, the 15% hold-out (validation) set we created from the Original data set was used. The best models and results we achieved, which we continue to improve, are marked in bold. Visualisations of the results can be explored with Noris, Imagga's internal custom classification visualisation tool, using the following credentials for all visualisations.

Visualisation Credentials:

After training with the Original data set (Results A), we also created the FoodNet and Custom data sets and fine-tuned the model using models trained on them (Results B). In one of the custom data sets, Combined3, we merged similar categories into one, preventing ambiguity created by the feature similarity of those categories.
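Merging categories as done for Combined3 amounts to relabelling the training and validation lists through a mapping table. A minimal sketch; the example entries below are hypothetical, since the actual 68-to-46 grouping is not reproduced here:

```python
# Hypothetical merge map for illustration only; the real Combined3
# grouping of the 68 Kaufland categories into 46 is not listed here.
MERGE_MAP = {
    "apple_braeburn": "apple",
    "apple_golden_delicious": "apple",
    "tomato_cherry": "tomato",
    "tomato_vine": "tomato",
}


def merge_label(label, merge_map=MERGE_MAP):
    """Map a fine-grained label to its merged category; labels without
    an entry in the map are left unchanged."""
    return merge_map.get(label, label)
```

Applying such a map to every (image, label) pair before building the training database yields the reduced label set while keeping all images.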

We also started running experiments to check whether separating photos of fruit and vegetables in bags from those that are not, and training two separate classifiers on the In-Bag and No-Bag training subsets, may lead to better results than training on mixed photos (in bags and not in bags). The results (Results C) are still inconclusive and more experiments are needed to verify this approach, but initial results indicate that for the majority of the categories such separation may not be necessary, as the CNNs trained with mixed photos perform no worse than those trained with separate photos.
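The Top-1 and Top-5 figures reported below follow the standard definition: the prediction counts as correct if the true label appears among the k highest-scoring classes. A small sketch of that computation (our own illustration, independent of the Caffe/DIGITS tooling):

```python
def topk_accuracy(ranked_predictions, true_labels, k=1):
    """ranked_predictions: per image, a list of labels sorted by
    descending classifier score. true_labels: the ground-truth label
    per image. Returns the fraction of images whose true label is
    among the top k predictions."""
    hits = sum(
        1
        for preds, truth in zip(ranked_predictions, true_labels)
        if truth in preds[:k]
    )
    return hits / len(true_labels)
```

Top-5 accuracy is always at least as high as Top-1, which is why the Top-5 numbers in the tables below approach 100% even when Top-1 is lower.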

Results:

Base:

Food101:

  • Dataset: Food101
  • Training accuracy:
    • Top-1: 75.5%
    • Top-5: 91.8%

FoodNet:

  • Dataset: FoodNet
  • Training accuracy:
    • Top-1: 53%
    • Top-5: 80%

 

(Results A)

Original Dataset Base Models:

Kaufland Case #1:

Kaufland Case #2:

Kaufland Case #3:

 

(Results B)

Custom Dataset Base Models:

Kaufland Case #4:

Kaufland Case #5:

Kaufland Case #6:

Original Dataset Finetuned Model:

Kaufland Case #7:

Combined1 Dataset Models:

Kaufland Case #8:

Kaufland Case #9:

Combined2 Dataset Models:

Kaufland Case #10:

Kaufland Case #11:

Kaufland Case #12:

Kaufland Case #13:

Combined3 Dataset Models:

Kaufland Case #14:

 

(Results C)

In-bag/No-bag Experiment:

Kaufland Case #15:

Kaufland Case #16:

Kaufland Case #17:

 

Deployment (Optional)

  • Visualisations (see above)
  • Demos
  • Real-time recognition demo during final presentation

Web demo:

Description: A web demo page showcasing models Kaufland Case #13 and Kaufland Case #14.
URL: http://norris.imagga.com/demos/kaufland-case/

iOS Application:

During the final DSS Datathon event we plan to show a live food recognition demo with one of our best models running in real time at 4K and 60 FPS on a mobile phone.

demo link: https://slack-files.com/T02FU9S2N-F979T37J6-66a2f1ca3e
k1: Kaufland Case #1
k2: Kaufland Case #13
k3: Kaufland Case #14

Source Code

Contains the two top models (#13 and #14), the source code of the iOS app, and the web demo app.
URL: http://norris.imagga.com/demos/kaufland-case/source.zip

References

To be added.

 


18 thoughts on “Using Convolutional Neural Networks for Real-Time Product Recognition in Smart Scales – Imagga’s Solution to the Kaufland Case”

  1. Sometimes you want to eat a sandwich, but you prepared an excessive gourmet meal. I do not believe the point of the datathon is to show off what your company is capable of; we know what money can get.

    1. Thanks for your feedback. We are not showing off, just trying to present the many experiments we conducted and the ones we started, in search of the best solution we could achieve for this specific challenge.

    2. Bro, don’t embarrass yourself. They went all out on the case and gave it their best just for the fun of it. No client cares about hackathons.

    1. Thanks for your comments. FoodNet in our article refers to a food data set we put together from parts of two public research data sets (Food101 + ImageNet), not to the paper and method you mentioned. This was not clear, so thanks for pointing it out. We shall add references to the article in due course. We have not tested the model’s performance on a Raspberry Pi, but based on similar models we have tested on such hardware in the past, we expect around 2-3 seconds per image.

  2. + Nicely presented results
     + Good theoretical description of the models
     – One of the Datathon requirements was to present the code of the solution, but none is found.

  3. Nice work! I can clearly see that you have a lot of professional experience in this area; the results and the separation of the different cases are excellent. However, I will not give you the maximum score in the voting. The reason is that I think sharing knowledge is one of the most important things when you participate in such an event. If you can’t share your code (for privacy reasons), you could describe some of the more complicated steps and give similar examples in open-source software, so that people who see your work in the future can learn how to do similar things.

    1. Thanks for the feedback! Yes, sharing is caring! We were in such a rush at the end that we completely forgot about the requirement. We use entirely open-source projects, such as Caffe and DIGITS, so anyone can re-evaluate our methodology. Here is the source code: http://norris.imagga.com/demos/kaufland-case/source.zip

      It contains two of our best models (#13 and #14) with their mean files, protos defining the architecture, and the caffemodel files, which anyone can use for fine-tuning purposes.

  4. (+) Well-formed article; shows understanding of the problem and the task at hand
     (+) You’ve handled the insufficient-data problem well and presented a good description of the model
     (-) Providing raw results is not very helpful; you should add some analysis as to why you think you got the results that you did.
     (~) It would also be nice to know more about the actual performance of the models. The most accurate model is not always the best choice, especially if it requires expensive hardware to run and you don’t have the budget for it.

    1. Thanks for your comments.
       We will add analysis of the results – we simply didn’t have the time to complete it before the deadline.
       As the hardware such a solution would run on was not specified in the challenge, and we didn’t have time to test it on different hardware, we decided not to include performance figures at this stage. Our estimates are 2-3 seconds per image on a Raspberry Pi and about 30 FPS on an NVIDIA Jetson TX1/TX2, and on an iOS mobile phone we achieved 60 FPS @ 4K using Core ML.
