Article recommendation
Team Fire
Initial Analysis:
The task of recommending and predicting the next best article is modeled as a function of the user's POI (point of interest), the article's content, and contextual information about it such as its popularity. A new user session can be decomposed into the following two parts:
- Initially selecting an article – This is probably influenced mostly by how the specific system promotes new articles (recency) and how it handles and surfaces popular articles
- Chaining down the recommended articles – This depends entirely on the recommender currently in place. Trying to predict the recommended article amounts to trying to predict the behaviour of that recommender, so the articles a user has picked so far are biased by the current system. Doing this would not be useful in a real system outside of a competition.
The focus here will be on modelling each article in a way that captures both its content and its popularity.
Why is popularity important? Say we have the following two articles, with views over the past 5 days:
- [100,500,1000,10000,5000]
- [0,200,300,1000,4000]
Which one should we recommend? The first one is clearly losing popularity rapidly, while the second one is gaining it. The decline may be because most users have already read the first article, so a drop is normal. The rise may be because some important event, covered by the second article, has just happened (hence its growing viewership), in which case we would normally want it to be recommended.
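As a rough illustration (not part of the final model), this trend can be quantified with something as simple as a day-over-day growth ratio of the views; the helper below is only a sketch:
# A minimal sketch, not part of the final model: quantify the popularity trend
# of an article with a simple day-over-day growth ratio of its views.
def day_over_day_growth(daily_views):
    previous = daily_views[-2] or 1  # guard against division by zero
    return daily_views[-1] / previous

print(day_over_day_growth([100, 500, 1000, 10000, 5000]))  # 0.5 -> losing popularity
print(day_over_day_growth([0, 200, 300, 1000, 4000]))      # 4.0 -> gaining popularity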
What about the content of the article? There is no question that the article's title (and subtitle, if any) and its cover image (if any) are important. Imagine a user who only cares about flowers, even though flower articles may not be very popular overall.
Initial Solution
Traditional matrix-factorization-based algorithms have seen huge success, but lately they are more and more often outperformed by deep neural networks. Beyond the scalability and sparsity issues (neural networks can learn small latent spaces), another reason may be that hybrid recommendation systems (collaborative filtering + content-based filtering) have a hard time incorporating temporal information – i.e. it is hard to build a contextualized article representation (one that, in a sense, keeps moving with time). If a new user views articles similar to those of other users, and all of those users have viewed a year-old article, we would still like to recommend a similar but newer one (in the context of a news recommendation system).
In this solution the following ideas will be tried; they are not final and could be modified:
- Recurrent neural network, which predicts article popularity for the next day. Each sample consists of the article's views over the past N days (let's say N = 3), so from the dataset one sample can be extracted for each article and each moving 3-day interval. A minimal sketch of this idea is given right after this list.
- Neural network, which maps each article into a latent space, producing embeddings, using the article's title or tags. As the news is in Bulgarian, it may be hard to find a good pretrained language representation model.
- Recurrent neural network, which takes the user's past N viewed articles, represented with the two learned embeddings above, as input and predicts the next best article in the contextualized space.
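A minimal sketch of the first idea, assuming N = 3 and arbitrary layer sizes (this is not the model trained later in the Code section, and the toy data is made up):
# A minimal sketch of the popularity-prediction RNN (idea 1), assuming N = 3 past days
# of views per sample; the layer sizes and the toy data are arbitrary assumptions.
import numpy as np
import tensorflow as tf

N = 3
popularity_model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(N, 1)),  # N past days, one view count per day
    tf.keras.layers.Dense(1, activation='relu'),   # predicted views for the next day
])
popularity_model.compile(optimizer='adam', loss='mse')

# Toy example: sliding windows of daily views -> next-day views
views = np.array([100, 500, 1000, 10000, 5000], dtype=np.float32)
x = np.stack([views[i:i + N] for i in range(len(views) - N)])[..., None]
y = views[N:]
popularity_model.fit(x, y, epochs=1, verbose=0)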
The given dataset consists of:
- 23504707 rows
- 2361105 unique visitors
- 52458 unique page titles
- 58654 unique page URLs (there are more URLs than titles, most likely because different articles share the same page title)
It was noticed that there are many occurrences of the same page URL and visitor within a short period of time. As this is most likely due to page reloads, such rows (about 30% of the data) are removed to clean up the data. For a more precise clean-up, clustering the visit timestamps into “sessions” with an unsupervised algorithm such as agglomerative clustering (“An Adaptive Method of Numerical Attribute Merging for Quantitative Association Rule Mining” – Jiuyong Li, Hong Shen, and Rodney Topor) could be useful: it does not require explicitly providing the number of clusters and takes a bottom-up hierarchical approach. A rough sketch of this idea is shown below.
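The sketch below clusters one visitor's hits on the same URL into sessions; the 30-minute cut-off is an assumption, not a value from the cited paper:
# A sketch of the more precise clean-up: group one visitor's hits on the same URL into
# "sessions" with bottom-up (agglomerative) hierarchical clustering of the timestamps.
# The 30-minute cut-off is an assumption, not a value from the cited paper.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def session_labels(timestamps_seconds, gap_seconds=30 * 60):
    points = np.asarray(timestamps_seconds, dtype=float).reshape(-1, 1)
    if len(points) == 1:
        return np.array([1])
    links = linkage(points, method='single')  # bottom-up merging of the closest visits
    return fcluster(links, t=gap_seconds, criterion='distance')

print(session_labels([0, 5, 40, 7200, 7205]))  # two sessions, e.g. [1 1 1 2 2]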
Idea
The code in the Code section below implements the following idea:
Each sample consists of:
- Input: past articles represented by a combination of the [CLS] token from Multilingual BERT language model, which includes Bulgarian, and the popularity of the article at the time of sampling (its total viewers in the past N days)
- Labels in the form of:
[
positive_embeddings,
negative_embeddings
]
These embeddings include the popularity information!
The past_embeddings are computed for every non-overlapping interval of DAYS_STEP days.
positive_embeddings and negative_embeddings are computed for the next day.
Negative samples are drawn randomly from all articles read that day by other users.
A possible strategy here would be to use the most popular articles of that day that the user himself has not read.
Negative embeddings are included in the gold labels as a workaround so the loss function can be computed. The loss minimises the cosine distance between the positive embeddings and the predicted embedding and maximises the cosine distance between the negative ones and the predicted one:
(cos(neg, pred) + cos(pos, pred)) / (cos(pos, pred) + eps)
where cos(neg, pred) is averaged over all negative samples and eps is a small constant.
The implementation used here is TF's, which returns the negated cosine similarity; for these embeddings the result lies in [-1, 0]: the closer it is to -1, the smaller the angle, and the closer it is to 0, the larger the angle.
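To make the sign convention concrete, here is what TF's cosine similarity returns for two simple vectors:
# The sign convention of tf.keras.losses.cosine_similarity used by the custom loss below:
# identical directions give -1, orthogonal vectors give 0.
import tensorflow as tf

a = tf.constant([[1.0, 0.0]])
print(tf.keras.losses.cosine_similarity(a, tf.constant([[2.0, 0.0]])))  # [-1.]
print(tf.keras.losses.cosine_similarity(a, tf.constant([[0.0, 3.0]])))  # [0.]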
Instead of using BERT, it would have been better to learn article embeddings directly, using the articles' tags or categories as target attributes: they would be more reliable than a language representation of the title, since the title is not always descriptive enough. However, tags were not part of the dataset and the time constraints were tight, so this was not done.
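Purely as an illustration of what that could look like (tags are not in the dataset, so all sizes, names and data below are hypothetical), article embeddings could be learned by training an article-id to multi-hot-tag classifier and keeping its embedding layer:
# Purely illustrative - tags were not part of the dataset, so all sizes and data here
# are hypothetical: learn article embeddings by predicting an article's tags from its id.
import numpy as np
import tensorflow as tf

NUM_ARTICLES, NUM_TAGS, EMB_DIM = 1000, 50, 64

article_id = tf.keras.Input(shape=(), dtype=tf.int32)
embedding = tf.keras.layers.Embedding(NUM_ARTICLES, EMB_DIM)(article_id)
tag_probs = tf.keras.layers.Dense(NUM_TAGS, activation='sigmoid')(embedding)
tag_model = tf.keras.Model(article_id, tag_probs)
tag_model.compile(optimizer='adam', loss='binary_crossentropy')

# Toy multi-hot tag targets for random article ids
ids = np.random.randint(0, NUM_ARTICLES, size=256)
tags = (np.random.rand(256, NUM_TAGS) > 0.9).astype(np.float32)
tag_model.fit(ids, tags, epochs=1, verbose=0)

# The learned per-article embeddings (these could replace the BERT [CLS] vectors):
article_embeddings = tag_model.layers[1].get_weights()[0]  # (NUM_ARTICLES, EMB_DIM)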
Query parameters were removed from the article URLs before the deduplication, as they do not change the article.
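For reference, this is done with urlparse; the URL below is only an example:
# The same call as in get_articles(); the URL here is only an example.
from urllib.parse import urlparse
print(urlparse('/article/12345?utm_source=home&ref=top').path)  # /article/12345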
The following packages need to be installed with pip:
- pandas
- numpy
- scipy
- feather-format
- tensorflow
- pydot
- graphviz
- tqdm
- bert-serving-client
To run BERT [CLS]-token pooling, you need a separate environment with TensorFlow 1.x (<= 1.15) and the Multilingual BERT model.
After creating and activating that environment, you can do:
pip3 install -U --no-cache-dir pip setuptools bert-serving-server tensorflow-gpu==1.15
Then:
cd <absolute_path_to_downloaded_model> && bert-serving-start -pooling_strategy CLS_TOKEN -max_seq_len NONE -model_dir <path_to_downloaded_model_config_folder>
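Once the server is up, the client in the main environment can be sanity-checked with a couple of lines (the title below is only an example):
# Quick sanity check that the bert-serving server is reachable; the title is only an example.
from bert_serving.client import BertClient

bert_client = BertClient()
vectors = bert_client.encode(['Примерно заглавие на статия'])  # "An example article title"
print(vectors.shape)  # (1, 768) for the Multilingual BERT base model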
Code
import tensorflow as tf
import numpy as np
import os
RANDOM_SEED = 1234
DATA_DIR = 'data'
LOGS_DIR = 'logs'
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, 'custom')
ARTICLES_DATA_DIR = os.path.join(DATA_DIR, 'articles')
ARTICLES_DATA_ALL = os.path.join(PROCESSED_DATA_DIR, 'articles_new.feather')
ARTICLES_VIEWS = os.path.join(PROCESSED_DATA_DIR, 'articles_views.json')
DATASET = os.path.join(PROCESSED_DATA_DIR, 'dataset.npy')
tf.random.set_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
# Some settings to make sure we are using the GPU and float16 precision for the hidden layers!
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)
tf.keras.mixed_precision.experimental.set_policy(tf.keras.mixed_precision.experimental.Policy('mixed_float16'))
import json
import pandas as pd
import datetime as dt
import errno
import re
from urllib.parse import urlparse  # used in get_articles() to strip query strings from page URLs
from datetime import datetime
from tqdm import tqdm
from bert_serving.client import BertClient
from tensorflow.keras.layers import TimeDistributed, Bidirectional, Concatenate, Conv2D, Conv3D, ConvLSTM2D, Dense, UpSampling2D
from tensorflow.keras.layers import Dropout, Input, Flatten, LSTM, InputLayer, BatchNormalization, Activation, MaxPooling2D
from tensorflow.keras.layers import GlobalMaxPool1D, GlobalMaxPool2D, GlobalMaxPool3D
from tensorflow.keras.layers import GlobalAveragePooling1D, GlobalAveragePooling2D, GlobalAveragePooling3D
from tensorflow.keras.layers import Reshape
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.initializers import GlorotNormal, he_normal
DAYS_STEP = 5
BATCH_SIZE = 256
ARTICLE_EMBEDDING_SIZE = 768
MAX_PAST_ARTICLES = 10
NEGATIVE_ARTICLES = 10
EPOCHS = 100
VISITORS_PER_TIME_INTERVAL_TRAIN = 100
df = get_articles()
article_data = get_article_data(df)
dataset = get_dataset(df, article_data)
model, run_path = get_model()
history = model.fit(
    dataset,
    epochs=EPOCHS,
    steps_per_epoch=int(VISITORS_PER_TIME_INTERVAL_TRAIN/BATCH_SIZE),
    callbacks=callbacks(run_path, model),
)
# Reads the raw input files and creates a dataframe to be used later.
# The result is persisted on disk as a single feather (binary) file. A number of cleaning steps are applied:
# 1) Removes the query parameters from all page URLs, as URLs with the same base path point to the same article,
#    even if the query strings differ
# 2) Drops duplicate (page URL, visitor) rows
# 3) Maps each base page URL to an integer
# 4) Sorts the records by timestamp
def get_articles():
    if os.path.isfile(ARTICLES_DATA_ALL):
        return pd.read_feather(ARTICLES_DATA_ALL)
    dfs = []
    tqdm_parts = tqdm(os.listdir(ARTICLES_DATA_DIR))
    for part in tqdm_parts:
        tqdm_parts.set_description('Reading %s' % part)
        dfs.append(pd.read_csv(os.path.join(ARTICLES_DATA_DIR, part)))
    df = pd.concat(dfs)
    del dfs
    df['pagePath'] = df['pagePath'].apply(lambda x: urlparse(x).path)
    df = df.drop_duplicates(subset=['visitor', 'pagePath'], keep='first')
    df['page_id'] = df.pagePath.astype('category').cat.codes
    df.rename({'pagePath': 'page_path', 'pageTitle': 'page_title'}, axis=1, inplace=True)
    df['time'] = pd.to_datetime(df.time)
    df = df.sort_values(by='time', ascending=True)
    df.reset_index(inplace=True, drop=True)
    df.to_feather(ARTICLES_DATA_ALL)
    return df
# Returns, and persists to a JSON file, the article embeddings and their total views per day
def get_article_data(df):
    if os.path.isfile(ARTICLES_VIEWS):
        with open(ARTICLES_VIEWS, 'r') as f:
            article_data = json.load(f)
        return article_data
    article_data = {}
    bert_client = BertClient()
    dmin = df.time.min().date()
    dmax = df.time.max().date()
    count = 0
    for article_id, views_per_article in df.groupby('page_id'):
        count += 1
        article_data[article_id] = {}
        article_views = [0] * (views_per_article.time.min().date() - dmin).days
        article_views += [views.shape[0] for time, views in views_per_article.groupby(pd.Grouper(key='time', freq='1D'))]
        article_views += [0] * (dmax - views_per_article.time.max().date()).days
        article_data[article_id]['views'] = article_views
        # The column was renamed to 'page_title' in get_articles()
        article_data[article_id]['embedding'] = bert_client.encode([views_per_article['page_title'].iloc[0]])[0].tolist()
    with open(ARTICLES_VIEWS, 'w') as f:
        json.dump(article_data, f)
    return article_data
# This computes the data that is going to be fed through the model
# Every sample is in the form of:
# (
# past_embeddings,
# [
# positive_embeddings,
# negative_embeddings
# ]
# )
# These embeddings include the popularity information
# The past_embeddings are computed for every non-overlapping interval of DAYS_STEP days
# positive_embeddings and negative_embeddings are computed for the next day
# Negative samples are sampled from all articles read that day by other users randomly
# A possible strategy here would be using the most popular ones, that the user himself has not read, for that day
def get_dataset(df, article_data):
    if os.path.isfile(DATASET):
        return np.load(DATASET, allow_pickle=True)
    data = []
    for day_offset in range(0, 26, DAYS_STEP):
        start_day = df.time.min().date() + pd.DateOffset(days=day_offset)
        target_day = df.time.min().date() + pd.DateOffset(days=day_offset + DAYS_STEP)
        x_data = df[(df['time'].dt.date >= start_day) & (df['time'].dt.date < target_day)]
        y_data = df[(df['time'].dt.date == target_day)]
        for visitor, group in x_data.groupby('visitor'):
            test_visits = y_data[y_data.visitor == visitor]
            if test_visits.shape[0] == 0:
                continue
            positive_embedding = get_sample_embedding(article_data, test_visits.iloc[0]['page_id'], day_offset, DAYS_STEP)
            negative_embeddings = []
            negative_samples = y_data[y_data['visitor'] != visitor].sample(n=NEGATIVE_ARTICLES)
            for index, row in negative_samples.iterrows():
                negative_embedding = get_sample_embedding(article_data, row['page_id'], day_offset, DAYS_STEP)
                negative_embeddings.append(negative_embedding)
            missing_count = MAX_PAST_ARTICLES - max(min(MAX_PAST_ARTICLES, group.shape[0]), 0)
            past_embeddings = np.zeros((missing_count, ARTICLE_EMBEDDING_SIZE + 1)).tolist()
            for index, row in group.tail(MAX_PAST_ARTICLES).iterrows():
                past_embedding = get_sample_embedding(article_data, row['page_id'], day_offset, DAYS_STEP)
                past_embeddings.append(past_embedding)
            data.append((
                past_embeddings,
                [
                    # The positive is repeated so it has the same shape as the negatives
                    [positive_embedding] * NEGATIVE_ARTICLES,
                    negative_embeddings
                ]
            ))
    np.save(DATASET, data, allow_pickle=True)
    return data
# A function returning a TF Dataset built from the raw numpy data; it acts as a source of samples for training
def tf_dataset(dataset):
    past_embeddings, pooled_embeddings = np.vstack(dataset).T
    print(past_embeddings.shape)
    print(pooled_embeddings.shape)
    past_embeddings = tf.convert_to_tensor(past_embeddings.astype(np.float32), dtype=tf.float32)
    pooled_embeddings = tf.convert_to_tensor(pooled_embeddings.astype(np.float32), dtype=tf.float32)
    def map_record(past_embeddings, pooled_embeddings):
        return past_embeddings, pooled_embeddings
    return tf.data.Dataset.from_tensor_slices((past_embeddings, pooled_embeddings)).map(map_record)
# Combines an article embedding with its popularity (total views) in the interval [start, start + offset)
def get_sample_embedding(article_data, page_id, start, offset):
    sample = article_data[str(page_id)]
    embedding = sample['embedding']
    views = sample['views']
    return embedding + [sum(views[start:start + offset])]
# Returns the TF Model with custom loss function
def get_model():
    run_path = next_available_filepath(LOGS_DIR, 'RecommendationModel')
    inputs = Input(batch_shape=(BATCH_SIZE, MAX_PAST_ARTICLES, ARTICLE_EMBEDDING_SIZE + 1))
    x = LSTM(256, return_sequences=False, dropout=0.2)(inputs)
    x = Dropout(0.2)(x)
    x = Dense(2048, kernel_initializer=he_normal(RANDOM_SEED), activation='relu')(x)
    x = Dropout(0.2)(x)
    x = Dense(1024, kernel_initializer=he_normal(RANDOM_SEED), activation='relu')(x)
    x = Dropout(0.2)(x)
    outputs = Dense(ARTICLE_EMBEDDING_SIZE + 1)(x)
    model = Model(inputs=inputs, outputs=outputs)
    # y_true contains the positive AND negative samples
    # The loss computes:
    # (cos(neg, pred) + cos(pos, pred)) / (cos(pos, pred) + eps)
    # where cos(neg, pred) is averaged over all negative samples
    # and eps is a small constant
    def similarity_loss():
        def loss(y_true, y_pred):
            y_pred = tf.expand_dims(y_pred, axis=1)
            y_pred = tf.expand_dims(y_pred, axis=2)
            batched_pos, batched_neg = tf.split(y_true, 2, axis=1)
            pos_similarity = tf.reduce_mean(tf.keras.losses.cosine_similarity(y_pred, batched_pos, axis=3), axis=2)
            neg_similarity = tf.reduce_mean(tf.keras.losses.cosine_similarity(y_pred, batched_neg, axis=3), axis=2)
            # eps matches the similarity dtype to avoid a dtype mismatch
            eps = tf.constant(-1e-5, dtype=pos_similarity.dtype)
            return tf.reduce_sum((neg_similarity + pos_similarity) / (pos_similarity + eps), axis=1)
        return loss
    model.compile(optimizer=tf.keras.optimizers.Adam(), loss=similarity_loss())
    return model, run_path
# All of the callbacks used during training and evaluation of the model.
# The most important ones are early_stopping and reduce_lr_on_plateau,
# which directly control when training stops and how the learning rate is decayed.
def callbacks(run_path, model):
    setup = tf.keras.callbacks.LambdaCallback(on_train_begin=lambda logs: setup_callback(run_path, model))
    # The current delta may not be good
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor='loss',
        min_delta=0.0005,
        patience=3,
        mode='min',
        restore_best_weights=True
    )
    reduce_lr_on_plateau = tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_binary_similarity_loss',
        factor=0.8,
        patience=1,
        verbose=1,
        mode='min',
        min_delta=0.0005,
        cooldown=0,
        min_lr=0.0005,
    )
    terminate_on_nan = tf.keras.callbacks.TerminateOnNaN()
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        filepath='%s/%s/weights_.{epoch:02d}-{val_binary_similarity_loss:.3f}.h5' % (run_path, 'checkpoints'),
        monitor='val_binary_similarity_loss',
        verbose=1,
        save_best_only=False,
        save_weights_only=False,
        mode='min',
        save_freq='epoch',
    )
    tensorboard = tf.keras.callbacks.TensorBoard(
        log_dir=run_path,
        histogram_freq=0,
        write_graph=False,
        write_images=False,
        update_freq=10,
        profile_batch=0,
        embeddings_freq=0,
        embeddings_metadata=None,
    )
    return [
        early_stopping,
        reduce_lr_on_plateau,
        terminate_on_nan,
        checkpoint,
        tensorboard,
        setup,
    ]
# On every run, this callback will create a diagram of the architecture and will print a summary of the model
def setup_callback(run_path, model):
    create_directory_if_not_exist(run_path)
    create_directory_if_not_exist(os.path.join(run_path, 'checkpoints'))
    create_directory_if_not_exist(os.path.join(run_path, 'plots'))
    model.summary(line_length=100)
    filepath = os.path.join(run_path, 'architecture.png')
    tf.keras.utils.plot_model(model, filepath, show_shapes=True, show_layer_names=True, expand_nested=True, dpi=96*2)
# Returns the next available path string in a directory, based on a specific template.
# It is used so that every run of the code persists its logs - a simple alternative to third-party solutions
def next_available_filepath(directory, filename, extension=None):
    create_directory_if_not_exist(directory)
    extension = '.' + extension if extension else ''
    path_matches = [re.match(r'^%s_\d+-\d+_(?P<number>\d+)%s$' % (filename, extension), path) for path in os.listdir(directory)]
    ranks = [0] + [int(x.group('number')) for x in path_matches if x]
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    return os.path.join(directory, '%s_%s_%d%s' % (filename, timestamp, max(ranks) + 1, extension))
# Create an empty directory if it does not already exist
def create_directory_if_not_exist(directory):
    try:
        os.makedirs(directory)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
Comments
I do not see any solution here …
Is this page final? It feels like a start. It starts very nicely, with great analysis, but then it feels like it was cut in the middle. Was anything actually implemented? Are there any actual results?
Predictions are not yet done, as they were taking a lot of time.
Very nice ideas.
It’s a pity you haven’t elaborated on it even further, laying out how you would approach it, and maybe how you would overcome the problems you’ve listed.
For example
– Although Bulgarian does not have enough resources or language models, one can use transfer learning to it from another, richer language. Check out the work of Sebastian Ruder on this topic.
– Or for the RNN – what features would you use? what new features would you engineer?
– How would you combine your different approaches?
It seems you are in a great direction, and having the correct state of mind. Please consider communicating more broadly and elaborately in the future.
Hi, I’ve updated the article with the code and the idea explanation, although a bit late. Plots of the model architecture, its training, and how the article popularity behaves over time can be uploaded too if needed.
@preslav – I was late and then struggled with uploading the notebook (I was also not using notebook for the development).
@liad –
1) I think that with so much data, a feasible solution would be using the tags (because they represent the topic better than the title) and learning embeddings from scratch on the current dataset. I also think the transfer learning idea is possible.
2) The RNN in the code is an LSTM, with contextualized article embeddings as inputs (that is, for the same article the embedding differs over time, because it depends on the article’s current popularity). The sequence is the last N articles read by the user. It can also be viewed as a user-preference “session”, as these evolve continually over time.
3) The added section tries to explain it. Basically, we combine popularity with content-based information to form contextualized article embeddings. Then the model tries to predict the embeddings, minimising the cosine similarity with the actually read news and maximising it with other popular news from the same day. So essentially we are learning a user-specific embedding space.