ShopUp is working on the Article recommender as part of Datathon2020. Check out some of their other research at
Team intro:
We are a team of experts from ShopUp (AI in Retail and eCommerce): Sergi Sergiev and Desislava Nikolova. Our goal was to experiment, play with the data and share our experience. We hope you enjoy the read, and please vote +.
The main objective of the Article recommender case is to optimize the article suggestions shown to online readers. The final goal is to engage each user with the topics closest to their interests.
Case Summary
The main idea of the case is to predict the next best article for the visitor.
The model will be evaluated on the same users, using data from the next time period. The training dataset covers 30 days and contains almost 60 000 articles and over 2 300 000 visitors. Total datapoints: 2 350,470,700. The evaluation dataset will cover the following 1 day.
Figure: The chart of the tasks
Net Info has provided data with historical visits of articles per user. The data consists of user ID, URL, timestamp, article title and article views.
- User id
- Time
- Page Title
- Page Views
Output format: VisitorID, first_best_article, second_best_article for the next 1 day/hour/minute
We looked at different models and made a short list of them:
- Sentence transformer with BERT
  - it works at the sentence level and can be trained for other languages
  - there is a Russian model
  - it looks simple, but training may be a challenge for us, and it is limited to sentence-length input
- Doc2vec on gensim
- Doc2Vec & Logistic Regression with code
- Top2Vec – doc2vec, UMAP, HDBSCAN, centroid computation, n-closest word vectors
- Document Clustering – TFIDF, pretrained word embeddings and text hashing.
- Various word embeddings
- Feed Forward Neural Network Text Classifier
- DocMap
- List of recommenders
- Others
Solution and approach
We decided to create a hybrid recommender focusing on content and user preferences.
For the content part, we wanted to use Deep Learning to produce unsupervised embeddings based on the text of each article.
The first step is to gather some additional data, so we decided to scrape some information from the provided articles. The scraping is done with Selenium and produces an additional file in which we store the key fields of each article: website_link, title, subtitle, text, date_of_posting and hashtags. It is not yet clear which of these fields describes an article best; we can compare them and choose one.
- BERT: we use a pretrained Russian model because of the time limitations
BERT is designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish context.
- Recurrent neural network (RNN)
A recurrent neural network (RNN) is a type of artificial neural network commonly used in natural language processing (NLP). RNNs are designed to recognize the sequential characteristics of data and use those patterns to predict the next likely scenario. We can use this method to predict the next cluster from the history we have for each user, and in that way predict the future cluster of interest for that person. We reused code from an article we wrote in 2019 at a summer school.
Figure: Basic RNN
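To make the idea of predicting the next cluster more concrete, here is a minimal sketch of the forward pass of a basic (Elman) RNN in plain Python. It is not the code from our 2019 article: the function names, the hidden size and the number of clusters are assumptions made for illustration, and the weights are random and untrained, so the output only shows the mechanics, not a meaningful prediction.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_next_cluster_probs(history, w_xh, w_hh, w_hy):
    """Run a basic RNN over a sequence of cluster ids.

    Returns a probability for each cluster being the next one visited.
    """
    n_hidden = len(w_hh)
    n_clusters = len(w_hy)
    hidden = [0.0] * n_hidden
    for cluster_id in history:
        one_hot = [1.0 if i == cluster_id else 0.0 for i in range(n_clusters)]
        # New hidden state combines the current input and the previous state.
        pre_activation = [
            a + b for a, b in zip(matvec(w_xh, one_hot), matvec(w_hh, hidden))
        ]
        hidden = [math.tanh(p) for p in pre_activation]
    return softmax(matvec(w_hy, hidden))


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights (untrained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N_CLUSTERS, N_HIDDEN = 5, 8  # assumed sizes, for illustration only
w_xh = random_matrix(N_HIDDEN, N_CLUSTERS, rng)
w_hh = random_matrix(N_HIDDEN, N_HIDDEN, rng)
w_hy = random_matrix(N_CLUSTERS, N_HIDDEN, rng)

# A user's visit history expressed as cluster ids.
probs = rnn_next_cluster_probs([3, 1, 1, 4], w_xh, w_hh, w_hy)
```

In a real setting the three weight matrices would be learned from the users' cluster sequences, and the cluster with the highest probability would be taken as the predicted next cluster.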
The clusters represent the different topics and the interests of the users. For the final decisions, we use the idea that people with similar interests might like the articles that others in the same group like. With these clusters it will be possible to group people, use some of them for verification and then improve the suggestions for the others.
To evaluate the performance of a model, and thus be able to choose the best performing one, a testing functionality is available.
The function relies on a split into training data and test data. We suggest converting the data into a sequence of chosen articles per user, taking the first 20 and predicting the next 10. For the ranking methods we decided to use three metrics that are relevant for measuring ranking quality:
- Mean Reciprocal Rank (MRR)
- Mean Average Precision (MAP), considered a classic, go-to metric for evaluating ranking order
- Distribution coefficient, which we define as the share of correctly selected clusters, for example 6 out of 10 = 60%
The three metrics are implemented in the code as mean_av_pres(), distribution_coef() and compute_mrr().
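The split and the three metrics can be sketched as follows. The function names mean_av_pres(), distribution_coef() and compute_mrr() come from our code, but the bodies below are a simplified reconstruction rather than the original implementation; split_history() and average_precision() are hypothetical helpers. In particular, distribution_coef() assumes that a predicted cluster counts as correct when it appears anywhere among the user's actual next clusters.

```python
def split_history(sequence, n_train=20, n_test=10):
    """Use the first n_train items as input and the next n_test as target."""
    return sequence[:n_train], sequence[n_train:n_train + n_test]


def compute_mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant item."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


def average_precision(ranked, relevant):
    """Average precision for one user (assumes no duplicates in ranked)."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0


def mean_av_pres(ranked_lists, relevant_sets):
    """Mean Average Precision over all users."""
    scores = [average_precision(r, s) for r, s in zip(ranked_lists, relevant_sets)]
    return sum(scores) / len(scores)


def distribution_coef(predicted_clusters, actual_clusters):
    """Share of predicted clusters that occur among the actual clusters."""
    actual = set(actual_clusters)
    correct = sum(1 for cluster in predicted_clusters if cluster in actual)
    return correct / len(predicted_clusters)


# Toy example with two users.
train, test = split_history(list(range(35)))
ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"x", "z"}]
mrr = compute_mrr(ranked, relevant)
map_score = mean_av_pres(ranked, relevant)
coef = distribution_coef(
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, 2, 3, 4, 5, 6, 11, 12, 13, 14],
)
```

On the toy data the first user has the relevant item at rank 2 and the second user at rank 1, so the MRR is 0.75, and 6 of the 10 predicted clusters are correct, which gives a distribution coefficient of 0.6.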
Final solution:
The following diagram shows what our model looks like. The blue parts are ready and the red ones are not yet implemented.
Figure: Block scheme of the final solution
The code can be viewed and examined here:
1. Load data and packages
Output (abridged): sentence-transformers and its dependencies were already installed under Python 3.6 (transformers 2.9.1, torch 1.5.0+cu101, scikit-learn 0.22.2.post1, scipy 1.4.1, numpy 1.18.4, nltk 3.2.5, tqdm 4.41.1), a wheel for sentence-transformers was built from a git source, and Google Drive was mounted at /content/gdrive.
2.1 Loading scraped data
We scraped the data from the provided links and captured the title, subtitle, date of creation, image links, hashtags and text of each article.
| | index | link | title | subtitle | date | image | hashtags | text |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | | Тежка катастрофа в София, кола се обърна по таван | За щастие пътният инцидент се е разминал без с... | 2020-05-16 | | ['катастрофа', 'София'] | ['Тежка катастрофа между два леки автомобила е... |
| 1 | 1 | | Катастрофата, разминаване между думите на Пенч... | Инцидентът е станал при заход към приземяване,... | 2020-05-16 | | ['катастрофа'] | ['Наземният контрол е дал указание на Йвайло П... |
| 2 | 2 | | Васил Иванов с нови разкрития за партия "Възра... | След изборите, от панелния си апартамент в раб... | 2020-05-16 | | [] | ['Партия "Възраждане" не само е получила парти... |
| 3 | 3 | | Тайният космически самолет X-37B се готви за н... | Той ще се завърне в ниска орбита на 16-ти май | 2020-05-16 | | ['самолет', 'космос', 'военен проект', 'военни... | ['Тайният космически самолет X-37B на американ... |
| 4 | 4 | | DW: Доказателства за зверствата на Асад и Руси... | Очевидци разказват за жестокостите | 2020-05-16 | | ['сирия', 'асад', 'русия'] | ['Сирийските правителствени войски и техните р... |
2.2 Loading content data provided by Netinfo (optional)
2.3 Sentence Embedding with a Pretrained Model
First, we download a pretrained model for the Russian language, which is supposed to be the closest to Bulgarian.
Training our own model was not considered, given the short time available.
| | index | link | title | subtitle | date | image | hashtags | text |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | | Тежка катастрофа в София, кола се обърна по таван | За щастие пътният инцидент се е разминал без с... | 2020-05-16 | | ['катастрофа', 'София'] | ['Тежка катастрофа между два леки автомобила е... |
| 1 | 1 | | Катастрофата, разминаване между думите на Пенч... | Инцидентът е станал при заход към приземяване,... | 2020-05-16 | | ['катастрофа'] | ['Наземният контрол е дал указание на Йвайло П... |
| 2 | 2 | | Васил Иванов с нови разкрития за партия "Възра... | След изборите, от панелния си апартамент в раб... | 2020-05-16 | | [] | ['Партия "Възраждане" не само е получила парти... |
[ 0.019099 0.000158 0.03033 -0.012973 ... 0.021687 0.051281 0.015552 0.041067]
And that's it already. We now have a list of numpy arrays with the embeddings.
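Once every article has an embedding, articles can be compared by the angle between their vectors. The sketch below is an illustration and not part of the notebook: cosine_similarity() and most_similar() are hypothetical helpers, and the three-dimensional vectors are toy values standing in for the real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, embeddings, top_n=2):
    """Indices of the top_n articles closest to the query article."""
    scores = [
        (cosine_similarity(embeddings[query_index], emb), i)
        for i, emb in enumerate(embeddings)
        if i != query_index
    ]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scores[:top_n]]


# Toy embeddings: articles 0 and 1 point in almost the same direction.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.2, 1.0],
]
closest_to_first = most_similar(0, toy_embeddings, top_n=1)
closest_to_third = most_similar(2, toy_embeddings, top_n=1)
```

With real embeddings the same comparison could be used to suggest articles whose content is close to what a user has already read.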
2.4 Add clusters plus visualization
| | index | link | title | subtitle | date | image | hashtags | text | clusters |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | | Тежка катастрофа в София, кола се обърна по таван | За щастие пътният инцидент се е разминал без с... | 2020-05-16 | | ['катастрофа', 'София'] | ['Тежка катастрофа между два леки автомобила е... | 3 |
| 1 | 1 | | Катастрофата, разминаване между думите на Пенч... | Инцидентът е станал при заход към приземяване,... | 2020-05-16 | | ['катастрофа'] | ['Наземният контрол е дал указание на Йвайло П... | 17 |
| 2 | 2 | | Васил Иванов с нови разкрития за партия "Възра... | След изборите, от панелния си апартамент в раб... | 2020-05-16 | | [] | ['Партия "Възраждане" не само е получила парти... | 16 |
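The text above does not say which clustering algorithm produced the clusters column, so the following is only an illustration of how embeddings could be turned into cluster labels. It uses plain k-means (Lloyd's algorithm) as an assumed stand-in, with a deterministic initialization from the first k points and two-dimensional toy vectors instead of the real embeddings.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, n_iter=20):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy "embeddings": two well-separated groups of points.
toy_points = [
    [0.0, 0.0], [0.1, 0.2], [5.0, 5.0],
    [5.1, 4.9], [0.2, 0.1], [4.9, 5.2],
]
labels, centroids = kmeans(toy_points, k=2)
```

The resulting labels play the same role as the clusters column: each article gets one integer id, and articles with nearby embeddings share an id.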