Our approach to the problem is as follows:
- Data Augmentation – in the data augmentation phase we create synthetic images based on the supplied dataset. This gives us a larger dataset for model training and validation.
- Object detection – there are several challenges with the object detection. First, the sample is imbalanced and some categories have just a few examples. Second, most of the labels look similar, so they are very unlikely to be categorized correctly. Therefore, we use a two-step approach: in the first step we detect the borders of labels and products, and thereafter we solve two independent tasks:
- Categorize the labels
- Categorize the products
- Verification of misplaced and missing items
Our code can be found in the following GitHub repo https://github.com/d-vasilev/Datathon2019 or in the Data Science Society repo https://github.com/datasciencesociety/Phoenix. The final score of the model as supplied to the leaderboard is 0.677215.
The data augmentation phase aims to improve the pool of supplied data by increasing the number of available images and adding noise/unexpected behavior. As a result of this phase, 1’000 synthetic images were generated, which, in addition to the 141 original images, increases the number of images available for training from 141 to 1’141.
Two groups of images were built on top of the provided ones:
- Synthetic images with objects removed from the original image.
- Synthetic images with objects previously not present in the original image.
The first type of synthetic images is generated by applying the steps described below:
- First, for each image in the “ground_truth” dataset a background pattern is extracted. The background pattern is defined as the most common pattern next to the objects, which includes neither the objects nor the plane defined by the height of the labels.
- Next, the algorithm randomly chooses an image and a non-label object within that image and replaces it with the background pattern. If the object overlaps with another object, both of them are replaced with the background pattern. This is necessary because many objects are placed behind other objects, and the rectangular bounding box does not allow exact differentiation between the pixels belonging to each object.
- The process above is repeated 500 times with slight adjustments:
- For 300 of the randomly selected images, the background application process is done only once
- For 150 of the randomly selected images, the background application process is repeated two times
- For 50 of the randomly selected images, the background application process is repeated three times
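The removal step can be sketched as follows. This is our illustration, not the exact notebook code: the function name and the strategy of tiling the extracted background patch over the object's bounding box are assumptions.

```python
import numpy as np

def remove_object(image, box, bg_patch):
    """Replace the region given by box = (x1, y1, x2, y2) with a tiled
    copy of the extracted background patch, effectively deleting the
    object from the synthetic image."""
    x1, y1, x2, y2 = box
    h, w = y2 - y1, x2 - x1
    # tile the patch so it covers the whole box, then crop to size
    reps_y = -(-h // bg_patch.shape[0])
    reps_x = -(-w // bg_patch.shape[1])
    tiled = np.tile(bg_patch, (reps_y, reps_x, 1))[:h, :w]
    out = image.copy()
    out[y1:y2, x1:x2] = tiled
    return out
```

When two objects overlap, the same call is simply applied to the union of their boxes, as described above.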
The second type of synthetic images is generated by utilizing the images obtained in the previous step:
1. First, we randomly select one of the 500 images generated in the previous step.
2. Then we extract the height coordinates of all labels present in the picture.
3. We then randomly select one of the initially supplied images from the “ground_truth” dataset and randomly choose one of the non-label objects within it.
4. Then we randomly select one of the non-label objects within the 500-image dataset and place the object extracted in step 3 next to it. The distance between the two objects is chosen at random, up to a maximum of 200 pixels; this determines the horizontal position of the placement. The vertical coordinates, on the other hand, are determined from the height coordinates extracted in step 2, with the object placed up to 30 pixels above these coordinates. This ensures that the extracted product is placed on the shelf (and not in the air).
5. Finally, if there is no space within this constrained sample of possible positions, the algorithm extracts a new object or a new image (and then a new object). The process is repeated until an acceptable combination of image and object is found.

Steps 1-5 are repeated 500 times to generate a sample of 500 synthetic images.
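The placement constraints above (random horizontal offset up to 200 pixels, vertical position up to 30 pixels above the label height) can be sketched like this; `place_object` and its signature are our own simplification:

```python
import random

def place_object(anchor_box, label_top, obj_w, obj_h,
                 max_dx=200, max_lift=30):
    """Pick a paste position for an extracted object: horizontally a
    random distance (up to max_dx px) to the right of an existing
    non-label object; vertically so its bottom edge sits at most
    max_lift px above the height of the labels, i.e. on the shelf.
    Coordinates are (x1, y1, x2, y2) with y growing downward."""
    x1 = anchor_box[2] + random.randint(0, max_dx)
    y2 = label_top - random.randint(0, max_lift)
    return (x1, y2 - obj_h, x1 + obj_w, y2)
```

The retry logic (new object or new image when no valid position exists) would wrap this helper in a loop.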
The combined number of synthetic images is 1’000. A few examples are included below:
|Original Image|Processed Image|
Please refer to the file “01. Data Augmentation Full.ipynb” to see the code and all rules used for generating the images. The observed improvement due to the addition of these images is a >40% decrease in the value of the loss function.
We use the YOLOv3 model for object detection. Our work is based on the following implementation: https://github.com/qqwweee/keras-yolo3
The preprocessing is as follows:
- The object detection model is supplied with all 1’141 images produced as a result of the previous section.
- Each image is rescaled to 400 x 300 pixels (a factor of 11.52). This significantly reduces the memory and CPU requirements.
- The YOLO anchor boxes are based on the shapes of the existing objects. We use 10 anchor boxes, derived by running k-means on the shapes (widths and heights) of all objects.
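A minimal version of that anchor derivation is sketched below: k-means over (width, height) pairs with an IoU-based assignment, similar in spirit to the kmeans script shipped with keras-yolo3. The details (median update, IoU of corner-aligned boxes) are our sketch, not the exact code we ran.

```python
import numpy as np

def kmeans_anchors(wh, k=10, iters=50, seed=0):
    """Derive k anchor boxes by k-means over (width, height) pairs of
    all annotated objects; points are assigned to the anchor with the
    highest IoU (boxes compared as if sharing a corner)."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0])
                 * np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = wh[:, None].prod(-1) + anchors[None].prod(-1) - inter
        assign = (inter / union).argmax(1)       # best-IoU anchor
        for j in range(k):
            if (assign == j).any():
                anchors[j] = np.median(wh[assign == j], axis=0)
    return anchors[np.argsort(anchors.prod(1))]  # sorted by area
```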
Model training leverages a pre-trained YOLOv3 model and is done in two phases:
- In phase 1, only the last 3 layers are trained, which allows us to reuse as much of the pre-trained model as possible while adapting it to our case. Phase 1 should be run for 50 epochs, but due to time limitations we ran it for only 10 epochs, which already brought a significant improvement of the loss function. The loss function value at the end of phase 1 is around 85, compared to 92 at the previous epoch.
- In phase 2, all of the layers are unfrozen and should be trained for 50 more epochs. However, one epoch of training on the cloud servers supplied by Data Science Society takes around 30 minutes, so we were able to run only 3 epochs. The loss function improved to 44, from 47 at the previous epoch.
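The freeze/unfreeze logic behind the two phases can be expressed generically. The helper below is our sketch (not the exact keras-yolo3 code) and works on any Keras-style layer objects exposing a `trainable` attribute:

```python
def set_trainable(layers, n_tail):
    """Phase 1: freeze everything except the last n_tail layers (we
    used n_tail = 3), so the pre-trained weights are reused as-is.
    Phase 2: pass n_tail = len(layers) to unfreeze the whole network
    for fine-tuning. The model must be recompiled after each change."""
    for i, layer in enumerate(layers):
        layer.trainable = i >= len(layers) - n_tail
```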
We have adapted the keras-yolo3 code to process the pictures in batch mode and produce an output file with the following information (one detection per row):
- file name
- category (label/product)
- confidence level
- box coordinates
This output is used for cropping images from the original (unscaled) pictures, which are fed to the next phases. Please refer to the file “BoxPlot_Data.zip”, available in the project git repository, to see all images. Below you can find a few examples.
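Mapping a detection made on the 400 x 300 rescaled image back to the full-resolution picture is a simple coordinate scaling; the helper below is our illustration of that step:

```python
import numpy as np

def crop_from_original(image, box, scale=11.52):
    """Scale a box detected on the downsized image back to the
    original (unscaled) picture and crop it. `scale` is the rescaling
    factor mentioned above; image is an H x W (x C) array."""
    x1, y1, x2, y2 = (int(round(c * scale)) for c in box)
    return image[y1:y2, x1:x2]
```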
Automating the reading of labels (i.e. matching an image of a label to a product code) is one of the tasks on the project.
Expectations on accuracy were not very high to begin with — many labels are difficult to read even for a human.
“Predicting” the product code based on an image of the label was broken down into the following steps:
- Prep step — this needs to be done only once for any given set of product codes
- Create a curated set of (cropped + resized) labels that (based on eyeball test) are suitable for OCR — in particular, the product code is clearly visible and not blurred. (In real use, this could be generated by the process that prints the labels — it will be a “perfect” version of the label).
- Individual prediction from a crop of a label image follows this process:
- Start with an image containing a label and its bounding box (i.e. a single image may contain several labels)
- Crop the label only (using the supplied bounding box)
- Approach 1: OCR
- Crop the lower/left region, roughly 1/4 of the total image. This is where the product code is located for all types of labels present in the dataset.
- The crop is resized to a standard size, and some sharpening is applied.
- Run OCR on the label crop — looking for the Kaufland product code (the string of 8 digits above the barcode).
- Approach 2:
- Compare the similarity of the current label to the curated list from above. “Predict” the product code of the current label as the product code of the closest of the curated images (these are known).
- Prediction generation — if the OCR step produced a result (strings of length 6 to 9 are deemed acceptable) and the confidence is above 50%, use this as the final prediction; otherwise use the product code of the closest of the curated images.
All the work above was done in R, mostly using libraries tesseract (OCR) and magick (image manipulation).
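The per-label logic (crop the code region, OCR, fall back to the nearest curated label) can be sketched in Python for illustration; the actual implementation is in R with tesseract and magick. Here `ocr_fn` stands in for any OCR callable returning text and a confidence, and the pixel-difference similarity is our simplification:

```python
import re
import numpy as np

def predict_code(label_img, curated, ocr_fn, min_conf=0.5):
    """Predict the product code for one cropped label image.
    1. Crop the lower/left quarter, where the product code sits.
    2. Run OCR; accept a 6-9 digit result with confidence > 50%.
    3. Otherwise fall back to the code of the most similar curated
       label (mean absolute pixel difference, same-size images)."""
    h, w = label_img.shape[:2]
    region = label_img[h // 2:, : w // 2]      # lower/left quarter
    text, conf = ocr_fn(region)
    m = re.search(r"\d{6,9}", text or "")
    if m and conf > min_conf:
        return m.group(0)
    # fallback: nearest curated label by pixel distance
    best, best_d = None, float("inf")
    for code, ref in curated.items():
        d = np.abs(label_img.astype(float) - ref.astype(float)).mean()
        if d < best_d:
            best, best_d = code, d
    return best
```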
- Recognize barcodes and/or the producer’s product code (the product code under the barcode)
- Use product descriptions — but in our case the tesseract OCR library presented challenges. It would not work well with mixed-alphabet text, and the Bulgarian-only engine did not do well recognizing digits.
- Use a proper camera, not a phone 🙂 In particular, choose lenses that produce as little bokeh as possible (i.e. the largest possible portion of the picture is in focus rather than fashionably blurred).
- The layout of the labels is entirely under Kaufland’s control, so steps can be taken to improve the performance of automated product code reading (in particular, a larger font size, or a font tailored to OCR but still human-readable). The small size of the text (digits) is currently a major obstacle.
- Some extra image processing on the labels could be of use — esp. deskewing. We did not have sufficient time to experiment.
Due to time limitations, we were unable to produce a product categorization tool.
Verification of Items
The final verification of the items should be done on two levels:
- By comparing the extracted name from the label and the product image.
- By looking at the space above the labels.
However, due to the limited time we were unable to complete the product categorization model, so in the verification phase we rely solely on the second rule. We use the object detection model to find all objects and then check two things:
- First, we check whether all labels are correctly highlighted by the model. The accuracy on the labels is 100%. Due to the lack of a categorization model, all labels were flagged with flag = 0.
- Then we check whether the space above each label is filled with objects. We do that by searching for frames detected by the model that overlap with the space above the label. If the proportion of overlap is > 20%, we consider the space filled with products and assign flag = 0; if not, we consider the space empty and assign flag = 1.
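The overlap check can be sketched as follows. This is a simplified illustration: the "space above the label" is taken here as the full column from the top of the image down to the label (in practice it would be bounded by the shelf above), and coordinates use the usual y-grows-downward convention:

```python
def shelf_flag(label_box, detections, min_overlap=0.2):
    """Flag = 1 (missing product) when no detected object covers more
    than min_overlap of the space above the label; flag = 0 otherwise."""
    x1, y1, x2, _ = label_box
    sx1, sy1, sx2, sy2 = x1, 0, x2, y1     # space above the label
    area = (sx2 - sx1) * (sy2 - sy1)
    for dx1, dy1, dx2, dy2 in detections:
        ix = max(0, min(sx2, dx2) - max(sx1, dx1))
        iy = max(0, min(sy2, dy2) - max(sy1, dy1))
        if area and ix * iy / area > min_overlap:
            return 0                        # space filled with products
    return 1                                # space empty: missing item
```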
Please refer to the file “05. Calculate FInal Prediction.ipynb” to see how the accuracy flags are defined. The final score of the model as supplied to the leaderboard is 0.677215.
Hi all, I’ve reuploaded the Data Augmentation script, as an old version of the script was available on GitHub until now.
New Link: https://github.com/d-vasilev/Datathon2019/blob/master/!Clean%20Project%20Files/01.%20Data%20Augmentation%20Full.ipynb
That is a job well done in a very short time frame. I wish you had had time to handle the misplaced items as well. Thank you for choosing this case once again, and a really good “Other comments” section you have there 😉 One quick question to keep the ideas floating around:
Do you think this approach can be used in reality, where the number of items is ~20 000 and continuously changing?
A bit of a late reply, but still – here are a few thoughts on the approach in general. To be clear, I am referring to the underlying problem in the stores rather than the Datathon case. My feeling is that this is a bit of an XY problem (https://en.wikipedia.org/wiki/XY_problem), and you have it a bit backwards 😊
If I were to formulate the business issue, I’d go like this:
1. Make a “map” of the store, with all shelves and expected product placements. Not really sure how much this would cost but considering many autonomous driving approaches involve building centimeter-level-accurate maps of whole cities, it should be feasible. See below for a feasible simplification.
2. Then the problem would be to match: [This is a place where somebody would mention Bayesian priors, I guess?]
2a. Whether the product seen in the image matches the expected product (which I think is an easier task than identifying the product from scratch). And you can get extra info — is the product placed properly (front side facing the front and fully visible, etc.). This is lower priority but you might still want it fixed.
2b. Whether the label is what it is expected to be.
3. Currently, a lot of the sample photos seem to be adjacent to each other, with substantial (in some cases) overlap. Go further – combine the images into a ‘panoramic’ image that would ideally capture a whole shelf’s length. That would help greatly in (1) above – you do not need full coordinates in space of all shelves/products, a “map” would be something like “On shelf A, products are expected to be in the following order – X, Y, Z…”
4. Lastly, labels:
4a. Yes, labels can be made to be “machine-readable” with minimal effort. Lots of options to choose from – proper fonts (there are lots of questions on this on Stack Overflow/Quora/etc.); barcodes, QR codes…
4b. Another option – so that a bigger barcode or QR code, or uglier font do not stick out in a bad way… Make the QR codes “open”, there are many ways this would provide value to the customer. E.g. I can have my phone remember what I buy typically (and maybe next time notify me on which shelf to find it; provide extra information on the product – nutrition values for food, etc.; get notified of promotional prices; …)
Lol! It seems you’ve done a lot with this data. What are your evaluation criteria (how do you count a correctly detected and localized object)? I don’t see any stats for the performance of your approach (ROC curves, maybe).
Thanks for the feedback! Yes, I believe that the approach can be applied in reality, given that similar images and XML files are supplied for all products. It seems that the data augmentation code works well enough, so even with a smaller set of images the neural network can still be trained to differentiate between objects and background. However, an important thing to remember is that all images are taken from a similar perspective and the background of the shelves is relatively consistent. This would also have to be the case in any future implementation.
I don’t think that the perspective is a huge issue, as there are freely available tools that can help generate images simulating such variation, and the products are not always placed in the same way even in the current sample. However, I think that any change in the background should be paired with fine-tuning of the model before implementation.
Thanks for the feedback!
1) You can refer to the notebook “05. Calculate FInal Prediction.ipynb” (https://github.com/datasciencesociety/Phoenix/blob/master/!Clean%20Project%20Files/05.%20Calculate%20FInal%20Prediction.ipynb) to see the final assignment of the flags supplied to the leaderboard. We simply search the space outlined in the XML files for a product. I don’t think that the evaluation approach used in the leaderboard is very good, as the coordinates in the XML files are not standardized: in cases of missing products they cover the full available space, while in cases of available products they cover only the space of the specified product. 🙂 However, as seen in the BoxPlot_Data.zip file, we correctly identify almost all of the objects and 100% of the labels. – https://github.com/datasciencesociety/Phoenix/blob/master/!Clean%20Project%20Files/BoxPlot_Data.zip
2) For performance we used only the leaderboard, as in computer vision the right metric really depends on what you are interested in.