
ShopUp Datathon2020 – Article recommender case

ShopUp is working on the Article recommender case as part of Datathon2020. Check out some of the other research they are doing at https://shopup.me/blog/



Team intro:

We are a team of experts from ShopUp (AI in Retail and eCommerce): Sergi Sergiev and Desislava Nikolova. Our goal was to experiment, play with the data and share our experience. We hope you are going to enjoy the reading, and please vote +.

 

Introduction

The main objective of the Article recommender case is to optimize the article suggestions shown to online readers. The final goal is to engage each user with the topics closest to their interests.

Case Summary

The main idea of the case is to predict the next best article for the visitor.

The model will be evaluated on the same users using data from the next time period. The training dataset covers 30 days. There are almost 60,000 articles and over 2,300,000 visitors, for a total of 2,350,470,700 data points. The evaluation dataset covers the following day.

Figure: The chart of the tasks

Net Info has provided data with historical visits of articles per user. The data consists of user ID, URL, timestamp, article title and article views.
Features:
  • User id
  • Time
  • URL
  • Page Title
  • Page Views
Output format: VisitorID, first_best_article, second_best_article for the next 1 day/hour/minute 
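Assuming predictions have already been collected per visitor, the expected output format above can be written as a CSV with the standard library (the visitor and article IDs here are dummy placeholders):

```python
import csv

# Illustrative predictions (dummy IDs): the two best next articles per visitor.
predictions = [
    ("v001", "a123", "a456"),
    ("v002", "a789", "a123"),
]

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["VisitorID", "first_best_article", "second_best_article"])
    writer.writerows(predictions)
```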

Research

We looked at different models and made a short list of them:

 

Solution and approach

We decided to create a hybrid recommender focusing on content and user preferences.

For the content side, we wanted to use deep learning to produce unsupervised embeddings based on the text of each article.

 

The first step is to gather some additional data, so we decided to scrape information from the provided articles. The scraping is done with Selenium and produces an additional file in which we store the important parts of each article: website_link, title, subtitle, text, date_of_posting and hashtags. It is not yet clear which of these describes an article best; we can compare them and choose one.
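The team's scraper uses Selenium; as a minimal, dependency-free sketch of just the extraction step, the same fields can be pulled from a fetched page with the standard-library HTML parser. The tag and class selectors below are assumptions and would need to match the actual site markup:

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collects title, subtitle and body text from a simple article page.
    The tag/class names are hypothetical; real pages will differ."""
    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "subtitle": "", "text": ""}
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._current = "title"
        elif tag == "h2" and attrs.get("class") == "subtitle":
            self._current = "subtitle"
        elif tag == "p":
            self._current = "text"

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data.strip() + " "

# Toy page standing in for a fetched article.
sample = "<h1>Title here</h1><h2 class='subtitle'>Sub</h2><p>Body text.</p>"
parser = ArticleParser()
parser.feed(sample)
record = {k: v.strip() for k, v in parser.fields.items()}
```

In practice each `record` would be appended to the additional file together with the URL, posting date and hashtags.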

 

Models

  • For the BERT model we use a Russian model because of time limitations

BERT is designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish context. https://searchenterpriseai.techtarget.com/definition/BERT-language-model

BERT chart
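Once each article has an embedding (from BERT or any other encoder), content-based recommendation reduces to ranking candidate articles by cosine similarity to the user's recent reads. A minimal sketch with toy low-dimensional vectors (real BERT embeddings have 768 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "article embeddings".
article_a = [0.9, 0.1, 0.0, 0.2]
article_b = [0.8, 0.2, 0.1, 0.3]
article_c = [0.0, 0.9, 0.8, 0.1]

# A content-based recommender ranks candidates by similarity to the
# user's last-read article (here article_a): b is close, c is not.
sim_ab = cosine_similarity(article_a, article_b)
sim_ac = cosine_similarity(article_a, article_c)
```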

  • Recurrent neural network (RNN)

A recurrent neural network (RNN) is a type of artificial neural network commonly used in natural language processing (NLP). RNNs are designed to recognize the sequential characteristics of data and use those patterns to predict the next likely item. We can use this method to predict the next cluster given the history we have for each user, and in that way predict a person's future cluster of interest. We reused code from an article we wrote during a 2019 summer school: https://shopup.me/blog/recommenders_systems/

Figure: Basic RNN
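The idea of feeding a user's cluster history through a recurrent cell to score the next cluster can be sketched with a hand-rolled Elman RNN step. The weights below are fixed toy values for illustration; a trained model (e.g. the fastai language model the team reused) would learn them from the click sequences:

```python
import math

def rnn_step(x, h, W_xh, W_hh, b_h):
    """One Elman RNN step: h' = tanh(W_xh . x + W_hh . h + b)."""
    return [
        math.tanh(
            sum(W_xh[i][j] * x[j] for j in range(len(x)))
            + sum(W_hh[i][j] * h[j] for j in range(len(h)))
            + b_h[i]
        )
        for i in range(len(h))
    ]

def one_hot(idx, size):
    v = [0.0] * size
    v[idx] = 1.0
    return v

n_clusters, hidden = 3, 4
# Fixed toy weights; a real model learns these from the click sequences.
W_xh = [[0.1 * (i + j) for j in range(n_clusters)] for i in range(hidden)]
W_hh = [[0.05 if i == j else 0.0 for j in range(hidden)] for i in range(hidden)]
b_h = [0.0] * hidden
W_hy = [[0.2 * (i - j) for j in range(hidden)] for i in range(n_clusters)]

# A user's history of visited topic clusters, e.g. politics -> sport -> sport.
history = [0, 2, 2]
h = [0.0] * hidden
for c in history:
    h = rnn_step(one_hot(c, n_clusters), h, W_xh, W_hh, b_h)

# Output scores over clusters; the argmax is the predicted next cluster.
scores = [sum(W_hy[i][j] * h[j] for j in range(hidden)) for i in range(n_clusters)]
predicted = scores.index(max(scores))
```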

 

 

 

The clusters are representations of the different topics and of the users' interests. For the final decision we use the idea that people with similar interests are likely to enjoy the same articles. With these clusters it becomes possible to group people, use some of them for verification, and then improve the suggestions for the others.
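Grouping users by their dominant cluster, the basis for the "similar people like similar articles" step, might look like the sketch below. The article-to-cluster mapping and the reading histories are illustrative:

```python
from collections import Counter, defaultdict

# Hypothetical mapping from article ID to topic cluster.
article_cluster = {"a1": 0, "a2": 0, "a3": 1, "a4": 1, "a5": 2}

# Each user's reading history, in time order.
user_history = {
    "u1": ["a1", "a2", "a3"],
    "u2": ["a3", "a4"],
    "u3": ["a5", "a1", "a5"],
}

# Assign each user to their most-read cluster; users sharing a cluster form
# a group whose popular articles can be recommended to each other.
groups = defaultdict(list)
for user, reads in user_history.items():
    dominant = Counter(article_cluster[a] for a in reads).most_common(1)[0][0]
    groups[dominant].append(user)
```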

 

Evaluation

To evaluate model performance, and thus be able to choose the best-performing model, a testing functionality is available.

The function uses the split into training and test data. We suggest converting the data into a sequence of chosen articles per user, taking the first 20 as input and predicting the next 10. We decided to use three metrics relevant for measuring ranking quality:

  •  Mean Reciprocal Rank (MRR) 

  • mean Average Precision (MAP) – a classic, 'go-to' metric for measuring ranking order

  • Distribution coefficient, which we define as the share of correct clusters selected. For example, 6 correct out of 10 = 60%

The three functions in the code are mean_av_pres(), distribution_coef() and compute_mrr().
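The actual implementations live in the team's notebook; minimal sketches consistent with the definitions above could look as follows. The function names mean_av_pres, distribution_coef and compute_mrr match the names in the code, while the inputs (per-user ranked predictions and sets of relevant items) are assumptions about its interface:

```python
def compute_mrr(ranked_lists, relevant):
    """Mean Reciprocal Rank: average over users of 1/rank of the
    first relevant item in that user's ranked prediction list."""
    total = 0.0
    for user, ranked in ranked_lists.items():
        rr = 0.0
        for rank, item in enumerate(ranked, start=1):
            if item in relevant[user]:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)

def mean_av_pres(ranked_lists, relevant):
    """Mean Average Precision over users: precision is accumulated at
    every rank where a relevant item appears."""
    total = 0.0
    for user, ranked in ranked_lists.items():
        hits, ap = 0, 0.0
        for rank, item in enumerate(ranked, start=1):
            if item in relevant[user]:
                hits += 1
                ap += hits / rank
        total += ap / max(len(relevant[user]), 1)
    return total / len(ranked_lists)

def distribution_coef(predicted_clusters, true_clusters):
    """Share of the true clusters that appear among the predicted ones,
    e.g. 6 correct out of 10 -> 0.6."""
    correct = len(set(predicted_clusters) & set(true_clusters))
    return correct / len(true_clusters)
```

The suggested 20/10 split then feeds these metrics: each user's first 20 articles are the model input and the following 10 are the relevant set to rank against.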

 

Final solution:

The following diagram shows what our model looks like. The blue parts are ready; the red ones are not yet implemented.

Figure: Block scheme of the final solution

The code can be viewed and examined here:

Conclusion

This case is a great example of what our future thinking should be: the best personalized options for people who want to learn more about the things they care about.

In the process we faced some problems: the 48-hour time interval is motivating but not quite enough; finding people to work with is also a great challenge; the dataset is big, yet still possible to process on a personal computer; and collaborating with teammates remotely was also something we had to cope with. However, it was all worth it and a good learning experience.

There were many problems along the way, but overall the chosen solution gives good results. The whole training process does not take long, although that depends on the given time interval.

 

If you like what we did, please follow our ShopUp blog for more details and explanations with video and code, or our YouTube channel.


9 thoughts on “ShopUp Datathon2020 – Article recommender case”

  1. This is a nice article. I like that there is a focus on recent BERT-like transformer models. The approach makes sense, but it is unclear to me what exactly it is. It feels like content-based recommendation. Why not also collaborative? E.g. (shameless self-promotion): http://proceedings.mlr.press/v28/georgiev13.html

    1. Why use Russian for BERT?

    There is multi-lingual BERT that has Bulgarian:
    https://github.com/google-research/bert/blob/master/multilingual.md

    There are also other transformers that support Bulgarian:
    https://github.com/facebookresearch/XLM

    There is also Slavic BERT (not so suitable, as it is for NER):
    https://github.com/deepmipt/Slavic-BERT-NER

    2. I do not understand the last evaluation measure…

    3. Why is there a need to create clusters? Why not compare user-user or page-page directly?

    4. Why use RNN for clustering?

    5. I like it that there is visualization, but I could not read much… It is too crowded and unreadable…

    6. What are the evaluation results?

    7. BTW, are the evaluation measures really “metrics” in the mathematical sense? Do they satisfy the 4 axioms required to be a “metric”? https://en.wikipedia.org/wiki/Metric_(mathematics)

    8. Is time information used for anything?

    1. Preslav, as usual such deep and detailed feedback. Thank you so much 🙂
      Directly to your points.
      1) Good to know that there is BG. Will use it.
      2) Mean Reciprocal Rank – I would prefer MAP. MRR is used for ranked lists and measures the position of the first correct item, so it is normally used when you want to predict only one item (the next one) and use the ranking to see where it falls in the list; whether it is 2nd or 3rd is where the magic of the formula happens. I will update the article.
      3) I need clusters because I want some representation of user interests, which is easier with clusters, especially for sites with more than 50K articles.
      4) RNN to predict the next cluster of interest, not the next article (articles have a 2-3 day lifespan), and because it is sequence data. Our patterns as humans change, so for me clusters are “topics of interest”.
      5) If I upload the updated file, you will probably see more.
      6) Only for the RNN – it is 58%. You will see it in the new file, but the whole logic with the three metrics mentioned above is not yet implemented. I will need about 2-3 more hours 🙂
      7) Two of them are for sure – MRR and MAP; the other one is made up by me 🙂 Here is some info: https://stats.stackexchange.com/questions/127041/mean-average-precision-vs-mean-reciprocal-rank
      8) It is supposed to be. It is mentioned in the diagram, but not in the code.

  2. Excellent work, guys. I like how you formed a whole solution to the problem, and even went the extra mile to include scraping of the original article text.

    Bulgarian is indeed missing from most of the pretrained language models, and retraining one takes a while. Transfer learning from an existing similar language (such as Russian) would also have taken time.

    Nice idea to use the fastai language model for the RNN, and thus create recommendations based on the previously visited articles. Training it would also take a long time – unless you have access to strong GPUs. So, just for the record: a newer, 4-16x faster version of it, called MultiFiT, is available with fastai v2, but the linked notebook still uses the older version (https://nlp.fast.ai/classification/2019/09/10/multifit.html).

    Also, very nice and clear writing and the illustrations helped a lot.

  3. Hi guys, I wasn’t able to find the output.csv prediction file – do you know where I can find it? I tried to understand the code (I don’t have much experience with Python), and based on my limited understanding the predictions are in the array called “vector”. But that sometimes seems to hold multiple predictions. I am likely not understanding something.

  4. Hi, Sergi & Desi,
    The thing I like most in your work is that you enriched the data with the content of the articles. Data enrichment is a very important step which, when done at the beginning of a project, gives a real advantage.
    Including time in the consideration – the sequence of readings – is also interesting to me. The hypothesis here is that there is a pattern in reading: usually an article raises a particular interest, and the reader would be expected to continue in a particular (reading) direction. With the RNN you account for this hypothesis. Here I would be happy to see more of the investigation – whether accounting for the reading sequence really improves the recommendations…
