
ShopUp Datathon2020 – Article recommender case

ShopUp is working on the Article recommender as part of Datathon2020. See more of their research at https://shopup.me/blog/

Team intro:

We are a team of experts from ShopUp (AI in Retail and eCommerce): Sergi Sergiev and Desislava Nikolova. Our goal was to experiment, play with the data and share our experience. We hope you enjoy the read, and please vote.

 

Introduction

The main objective of the Article recommender case is to optimize the article suggestions shown to online readers. The end goal is to engage each user with the topics closest to their interests.

Case Summary

The main idea of the case is to predict the next best article for the visitor.

The model is evaluated on the same users, using data from the following time period. The training dataset covers 30 days, with almost 60 000 articles and over 2 300 000 visitors. Total datapoints: 2 350,470,700. The evaluation dataset covers the next 1 day.

Figure: The chart of the tasks

Net Info has provided data with historical visits of articles per user. The data consists of user ID, URL, timestamp, article title and article views.
Features:
  • User id
  • Time
  • URL
  • Page Title
  • Page Views
Output format: VisitorID, first_best_article, second_best_article for the next 1 day/hour/minute 
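
As a minimal sketch of the output format above, the snippet below picks the two highest-scoring articles per visitor and writes them as CSV rows. The score dictionaries, the helper names top_two() and write_predictions(), and the CSV layout are illustrative assumptions, not part of the case definition; it also assumes every visitor has at least two scored articles.

```python
import csv
import io


def top_two(scores):
    """Return the two highest-scoring article ids from a {article_id: score} dict."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:2]


def write_predictions(predictions, handle):
    """Write one row per visitor: VisitorID, first_best_article, second_best_article."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["VisitorID", "first_best_article", "second_best_article"])
    for visitor_id, scores in predictions.items():
        first, second = top_two(scores)
        writer.writerow([visitor_id, first, second])


# Hypothetical scores for two visitors (illustrative values only).
predictions = {
    "visitor_1": {"article_a": 0.9, "article_b": 0.4, "article_c": 0.7},
    "visitor_2": {"article_b": 0.2, "article_c": 0.8, "article_a": 0.1},
}
buffer = io.StringIO()
write_predictions(predictions, buffer)
output = buffer.getvalue()
```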

Research

We looked at different models and made a short list of them:

 

Solution and approach

We decided to create a hybrid recommender focusing on content and user preferences.

For the content part, we wanted to use Deep Learning to produce unsupervised embeddings based on the text of each article.

 

The first step is to gather some additional data, so we decided to scrape information from the provided articles. The scraping is done with Selenium and produces an additional file in which we store the key fields of each article: website_link, title, subtitle, text, date_of_posting and hashtags. It is not yet clear which of these describes an article best, so we can compare them and choose one.

 

Models

  • For the BERT model we use a pretrained model covering Russian, because of time limitations

BERT is designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish context. https://searchenterpriseai.techtarget.com/definition/BERT-language-model

Figure: BERT chart

  • Recurrent neural network (RNN)

A recurrent neural network (RNN) is a type of artificial neural network commonly used in natural language processing (NLP). RNNs are designed to recognize the sequential characteristics of data and use patterns to predict the next likely scenario. We can use this method to predict the next cluster from the history we have for each user, and so predict that person's future cluster of interest. We reused code from an article we wrote at a summer school in 2019: https://shopup.me/blog/recommenders_systems/
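
To make the "predict the next cluster from the history" idea concrete, here is a small sketch of how training pairs for a sequence model could be built from one user's cluster history. The function name make_training_pairs() and the window size are assumptions for illustration; the RNN itself is not shown.

```python
def make_training_pairs(cluster_history, window=3):
    """Split one user's cluster sequence into (previous clusters, next cluster) pairs."""
    pairs = []
    for end in range(window, len(cluster_history)):
        context = cluster_history[end - window:end]  # the last `window` clusters seen
        target = cluster_history[end]                # the cluster visited next
        pairs.append((context, target))
    return pairs


# Hypothetical cluster ids of the articles one user read, in time order.
history = [3, 17, 17, 16, 3]
pairs = make_training_pairs(history, window=3)
```

Each pair gives the model a fixed-length context and the cluster that followed it; users with fewer visits than the window produce no pairs.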

Figure: Basic RNN

 

 

 

The clusters represent the different topics and the interests of the users. For the final decisions, we rely on the idea that people with similar interests might like the articles that others in their group like. With those clusters it is possible to group people, use some of them for verification, and then improve the suggestions for the others.
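
One possible reading of this grouping idea, sketched with toy data: assign each user to the cluster they visit most often, then suggest unread articles that are popular among users in the same group. The helper names and data structures below are hypothetical and only illustrate the principle.

```python
from collections import Counter


def dominant_cluster(cluster_history):
    """Return the cluster a user visited most often."""
    return Counter(cluster_history).most_common(1)[0][0]


def recommend_from_peers(user_id, visits, article_cluster, top_n=2):
    """Suggest unread articles that are popular among users sharing the same dominant cluster.

    visits: {user_id: [article_id, ...]}, article_cluster: {article_id: cluster_id}
    """
    user_group = {
        uid: dominant_cluster([article_cluster[a] for a in articles])
        for uid, articles in visits.items()
    }
    already_read = set(visits[user_id])
    popularity = Counter()
    for uid, articles in visits.items():
        if uid != user_id and user_group[uid] == user_group[user_id]:
            popularity.update(a for a in articles if a not in already_read)
    return [article for article, _ in popularity.most_common(top_n)]


# Toy data: articles a* belong to cluster 0, articles b* to cluster 1.
article_cluster = {"a1": 0, "a2": 0, "a3": 0, "b1": 1, "b2": 1}
visits = {
    "u1": ["a1", "a2"],
    "u2": ["a1", "a3", "b1"],
    "u3": ["a3", "a2"],
    "u4": ["b1", "b2"],
}
suggestions = recommend_from_peers("u1", visits, article_cluster)
```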

 

Evaluation

To evaluate the performance of a model, and so be able to choose the best-performing one, a testing function is available.

The function relies on a split into training data and test data. We suggest converting the data into a sequence of chosen articles per user, taking the first 20 and predicting the next 10. To measure ranking quality we decided to use three metrics:

  • Mean Reciprocal Rank (MRR)

  • Mean Average Precision (MAP), a classic 'go-to' metric for measuring ranking order

  • Distribution coefficient, which we define as the share of correctly selected clusters. For example, 6 correct out of 10 gives 60%

The three functions in the code are mean_av_pres(), distribution_coef() and compute_mrr().
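
The bodies of these functions are not reproduced in this write-up, so the following is only a sketch of the split and the three metrics under common definitions. The function names below are illustrative and differ from the ones in the code on purpose. The distribution coefficient is implemented under the assumption that it counts how many predicted items appear anywhere in the true continuation.

```python
def split_history(sequence, n_train=20, n_test=10):
    """Use the first n_train items as history and the following n_test items as ground truth."""
    return sequence[:n_train], sequence[n_train:n_train + n_test]


def reciprocal_rank(predicted, actual):
    """1 / rank of the first predicted item that appears in actual, or 0.0 if none does."""
    relevant = set(actual)
    for rank, item in enumerate(predicted, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(predicted, actual):
    """Sum of precision@k at each rank k where a new relevant item appears,
    divided by the number of distinct relevant items."""
    relevant = set(actual)
    if not relevant:
        return 0.0
    seen, hits, precision_sum = set(), 0, 0.0
    for rank, item in enumerate(predicted, start=1):
        if item in relevant and item not in seen:
            seen.add(item)
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def distribution_coefficient(predicted, actual):
    """Share of predicted items found in actual (assumed definition)."""
    if not predicted:
        return 0.0
    relevant = set(actual)
    return sum(item in relevant for item in predicted) / len(predicted)


def mean_over_users(metric, predictions, truths):
    """Average a per-user metric over all users (gives MRR or MAP)."""
    return sum(metric(p, t) for p, t in zip(predictions, truths)) / len(predictions)


# Toy example: predicted clusters versus the clusters the user actually visited.
predicted, actual = [4, 7, 2, 9], [7, 9, 5]
rr = reciprocal_rank(predicted, actual)
ap = average_precision(predicted, actual)
dc = distribution_coefficient(predicted, actual)
```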

 

Final solution:

The following diagram shows what our model looks like. The blue parts are ready and the red ones are not yet.

Figure: Block scheme of the final solution

The code can be viewed and examined here:

1. Load data and packages

In [0]:
!pip install -U sentence-transformers
!pip install  git+https://github.com/UKPLab/sentence-transformers#pretrained-models
Requirement already up-to-date: sentence-transformers in /usr/local/lib/python3.6/dist-packages (0.2.6.1)
Successfully built sentence-transformers
In [0]:
from fastai.collab import *
from fastai.tabular import *
from fastai.text import *

import numpy as np
import os
import pandas as pd
from spacy.lang.bg.examples import sentences

from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/Colab Notebooks/datathon2020/"
base_dir = root_dir + 'data/'
Mounted at /content/gdrive
2. ShopUp article embedding

2.1 Loading scraped data

We scrape data from the provided links and capture the title, subtitle, date of creation, image links, hashtags and text of each article.

In [0]:
# collect every scraped_pages* CSV from the data folder into one DataFrame
frames = []
for filename in os.listdir(base_dir):
    if filename.startswith("scraped_pages"):
        path = os.path.join(base_dir, filename)
        frames.append(pd.read_csv(path))
# pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
df = pd.concat(frames).reset_index()
# keep only the calendar date of each article
df["date"] = pd.DatetimeIndex(df["date"]).date
df.head()
Out[0]:
  | index | link | title | subtitle | date | image | hashtags | text
0 | 0 | www.vesti.bg/bulgaria/valeri-simeonov-bozhkov-... | Тежка катастрофа в София, кола се обърна по таван | За щастие пътният инцидент се е разминал без с... | 2020-05-16 | https://m.netinfo.bg/media/images/43102/431022... | ['катастрофа', 'София'] | ['Тежка катастрофа между два леки автомобила е...
1 | 1 | www.vesti.bg/bulgaria/katastrofata-razminavane... | Катастрофата, разминаване между думите на Пенч... | Инцидентът е станал при заход към приземяване,... | 2020-05-16 | https://www.vbox7.com/play:09b91178e5 | ['катастрофа'] | ['Наземният контрол е дал указание на Йвайло П...
2 | 2 | www.vesti.bg/bulgaria/vasil-ivanov-s-novi-razk... | Васил Иванов с нови разкрития за партия "Възра... | След изборите, от панелния си апартамент в раб... | 2020-05-16 | https://www.vbox7.com/play:757ff3d1cb | [] | ['Партия "Възраждане" не само е получила парти...
3 | 3 | www.vesti.bg/tehnologii/tajniiat-kosmicheski-s... | Тайният космически самолет X-37B се готви за н... | Той ще се завърне в ниска орбита на 16-ти май | 2020-05-16 | https://m4.netinfo.bg/media/images/40971/40971... | ['самолет', 'космос', 'военен проект', 'военни... | ['Тайният космически самолет X-37B на американ...
4 | 4 | www.vesti.bg/sviat/dw-dokazatelstva-za-zverstv... | DW: Доказателства за зверствата на Асад и Руси... | Очевидци разказват за жестокостите | 2020-05-16 | https://m4.netinfo.bg/media/images/43047/43047... | ['сирия', 'асад', 'русия'] | ['Сирийските правителствени войски и техните р...
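
The preview suggests that the hashtags and text columns are stored as string representations of Python lists. Assuming that is the case, a small helper like the one below could turn them back into real lists and join the paragraphs into plain text before embedding. The helper names are hypothetical and this step is not part of the original notebook.

```python
import ast


def parse_list_cell(cell):
    """Turn a string such as "['a', 'b']" into a Python list; return [] if it cannot be parsed."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return []
    return value if isinstance(value, list) else []


def paragraphs_to_text(cell):
    """Join the parsed paragraphs of an article into one plain-text string."""
    return " ".join(str(paragraph).strip() for paragraph in parse_list_cell(cell))


hashtags = parse_list_cell("['катастрофа', 'София']")
text = paragraphs_to_text("['First paragraph. ', 'Second paragraph.']")
```

With a DataFrame like the one above, these helpers could be applied per column, for example df["hashtags"].apply(parse_list_cell).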

2.2 Load content data provided by Netinfo

In [0]:
path = os.path.join(base_dir, 'datathon2020vesti.csv')
df1 = pd.read_csv(path, sep=";")
2.3. Sentences Embedding with a Pretrained Model

This example shows how to use an already trained Sentence Transformer model to embed sentences for another task using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch.

First, download a pretrained multilingual model that covers Russian, which is supposed to be the closest language to Bulgarian. We did not consider training our own model, given the short time available.

In [0]:
from sentence_transformers import SentenceTransformer
#model = SentenceTransformer('bert-base-nli-mean-tokens')
model = SentenceTransformer('distiluse-base-multilingual-cased')
In [0]:
df[0:3]
Out[0]:
  | index | link | title | subtitle | date | image | hashtags | text
0 | 0 | www.vesti.bg/bulgaria/valeri-simeonov-bozhkov-... | Тежка катастрофа в София, кола се обърна по таван | За щастие пътният инцидент се е разминал без с... | 2020-05-16 | https://m.netinfo.bg/media/images/43102/431022... | ['катастрофа', 'София'] | ['Тежка катастрофа между два леки автомобила е...
1 | 1 | www.vesti.bg/bulgaria/katastrofata-razminavane... | Катастрофата, разминаване между думите на Пенч... | Инцидентът е станал при заход към приземяване,... | 2020-05-16 | https://www.vbox7.com/play:09b91178e5 | ['катастрофа'] | ['Наземният контрол е дал указание на Йвайло П...
2 | 2 | www.vesti.bg/bulgaria/vasil-ivanov-s-novi-razk... | Васил Иванов с нови разкрития за партия "Възра... | След изборите, от панелния си апартамент в раб... | 2020-05-16 | https://www.vbox7.com/play:757ff3d1cb | [] | ['Партия "Възраждане" не само е получила парти...
In [0]:
len(df)
Out[0]:
1662
In [0]:
# run the model and generate embeddings (encode expects a list of strings)
sentences = df.text.tolist()
sentence_embeddings = model.encode(sentences)
print(sentence_embeddings[3])
[ 0.019099  0.000158  0.03033  -0.012973 ...  0.021687  0.051281  0.015552  0.041067]

And that's it already. We now have a list of numpy arrays with the embeddings.
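
Once every article has an embedding, a content-based suggestion can be as simple as looking up the nearest articles by cosine similarity. The sketch below uses plain Python and toy 3-dimensional vectors so it runs without any dependencies; the function names are illustrative, and with the real embeddings a vectorized NumPy version would be the more practical choice.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, embeddings, top_n=2):
    """Indices of the top_n embeddings closest to embeddings[query_index], excluding itself."""
    query = embeddings[query_index]
    scored = [
        (cosine_similarity(query, emb), idx)
        for idx, emb in enumerate(embeddings)
        if idx != query_index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_n]]


# Toy embeddings standing in for sentence_embeddings (illustrative values only).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
neighbours = most_similar(0, toy_embeddings, top_n=2)
```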

2.4 Add clusters plus visualization

In [0]:
from sklearn.cluster import KMeans

num_clusters = 30
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(sentence_embeddings)
cluster_assignment = clustering_model.labels_
In [0]:
df["clusters"] = cluster_assignment
df[0:3]
Out[0]:
  | index | link | title | subtitle | date | image | hashtags | text | clusters
0 | 0 | www.vesti.bg/bulgaria/valeri-simeonov-bozhkov-... | Тежка катастрофа в София, кола се обърна по таван | За щастие пътният инцидент се е разминал без с... | 2020-05-16 | https://m.netinfo.bg/media/images/43102/431022... | ['катастрофа', 'София'] | ['Тежка катастрофа между два леки автомобила е... | 3
1 | 1 | www.vesti.bg/bulgaria/katastrofata-razminavane... | Катастрофата, разминаване между думите на Пенч... | Инцидентът е станал при заход към приземяване,... | 2020-05-16 | https://www.vbox7.com/play:09b91178e5 | ['катастрофа'] | ['Наземният контрол е дал указание на Йвайло П... | 17
2 | 2 | www.vesti.bg/bulgaria/vasil-ivanov-s-novi-razk... | Васил Иванов с нови разкрития за партия "Възра... | След изборите, от панелния си апартамент в раб... | 2020-05-16 | https://www.vbox7.com/play:757ff3d1cb | [] | ['Партия "Възраждане" не само е получила парти... | 16
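
A quick way to check whether the clusters correspond to recognizable topics is to list the most frequent hashtags in each one. The sketch below assumes the hashtags are already parsed into Python lists and uses toy data; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def top_hashtags_per_cluster(hashtags_per_article, cluster_ids, top_n=3):
    """Map each cluster id to its top_n most frequent hashtags."""
    counts = defaultdict(Counter)
    for tags, cluster_id in zip(hashtags_per_article, cluster_ids):
        counts[cluster_id].update(tags)
    return {
        cluster_id: [tag for tag, _ in counter.most_common(top_n)]
        for cluster_id, counter in counts.items()
    }


# Toy hashtags and cluster assignments (illustrative values only).
toy_hashtags = [["crash", "sofia"], ["crash"], ["space", "aircraft"], ["space"]]
toy_clusters = [0, 0, 1, 1]
summary = top_hashtags_per_cluster(toy_hashtags, toy_clusters, top_n=1)
```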
In [0]:
# fit a 2d PCA model to the vectors
from sklearn.decomposition import PCA
#X = model[sentence]
pca = PCA(n_components=2)
result = pca.fit_transform(sentence_embeddings)
In [0]:
# Plot helpers
import matplotlib.pyplot as plt

# scatter plot of the 2D PCA projection, colored by cluster
fig, ax = plt.subplots(figsize=(24, 18))
ax.scatter(result[:, 0], result[:, 1], c=df.clusters, s=40)
ax.set_title('Article embeddings (PCA projection)')
# label each point with the article's hashtags (use df.title to show titles instead)
words = df.hashtags
for i, word in enumerate(words):
    ax.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()