
NewsCo: rapid non-parametric recommender algorithm for NetInfo news articles


Business understanding

Online news reading has become very popular as the web provides access to news articles around the world. A key challenge of news websites is to help users find the articles that are interesting to read.

The purpose of a recommender system is to suggest relevant items to users. Recommender systems can generate a huge amount of income when they are efficient.

Data understanding

We were given a dataset which contains the following fields:

  • pagePath - the url of the relevant article
  • pageTitle - the title of the article in Bulgarian language
  • time - a timestamp when the article was opened
  • visitor - a unique ID of the user

Timeframe of the given dataset: 12 April 2020 to 12 May 2020

We realized that the articles are structured by topic and subtopic (it is better to use the subtopic, as it gives a finer level of granularity). We could quickly get the topic either from the article title - Кой може да наследи Ким Чен Ун - Свят | Vesti.bg ("Who could succeed Kim Jong-un" - World | Vesti.bg) - or from the url - www.vesti.bg/temi-v-razvitie/tema-koronavirus/nosim-loshivesti-izviniavame-se-6108881. The better approach was to use the url, as it already contains the subtopic (in the example, tema-koronavirus). However, some of the paths did not provide a topic - www.vesti.bg/nov-sindrom-poraziava-deca-svyrzan-li-e-s-covid-19-6108867. A possible solution to this caveat was to replace the missing value with the most frequent subtopic for that particular user, or simply to drop such rows in order not to introduce bias.

Moreover, there could be a topic that in theory has subtopics but in a particular case does not. A solution to this was creating customized subtopics, e.g. "technologii_others". We also found that some articles appear under different topics. A condensed sketch of the url-to-subtopic parsing is given below.
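The sketch below is a simplified version of the parsing described above; the full implementation, including the special cases for koronavirus, razvlechenia and testove, is the get_subtopic function in the Appendix. The helper name url_to_subtopic is ours.

def url_to_subtopic(url):
    """Simplified sketch: take the second path segment as the subtopic,
    falling back to '<topic>_others' or 'no_topic' when it is missing."""
    path = url.replace('https://', '').replace('http://', '').replace('www.vesti.bg', '')
    parts = [p for p in path.split('/') if p]
    if len(parts) <= 1:
        return 'no_topic'                                    # no subtopic in the path
    subtopic = parts[1] if len(parts) > 2 else parts[0] + '_others'
    return subtopic.replace('tema-', '', 1)                  # tema-koronavirus -> koronavirus

url_to_subtopic('www.vesti.bg/temi-v-razvitie/tema-koronavirus/nosim-loshivesti-izviniavame-se-6108881')
# -> 'koronavirus'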

Several articles were opened multiple times consecutively by a single user. However, we noticed a pattern: the last visit to such a url was followed by a quick switch to another article - the median time difference is about 6 seconds. We assume the article was reloaded automatically after a browser restart and that the user then quickly continued with their activity. Our solution therefore takes into account only the first visit to an article by a user as relevant for the analysis.
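A minimal pandas sketch of this deduplication, assuming a frame with visitor, pageTitle and time columns (the Appendix implements the same idea as take_first_click):

import pandas as pd

def keep_first_click(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the earliest click of each (visitor, pageTitle) pair."""
    return (df.sort_values(['visitor', 'pageTitle', 'time'])
              .groupby(['visitor', 'pageTitle'], as_index=False)
              .first())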

Also, in the titles of the articles we encountered two possible issues: there were unnecessary items such as the HTML entity &quot; (easily replaced by the actual character ") and in some cases languages other than Bulgarian appeared in the title ("Tesla Model S стана по-бърз от Bugatti Chiron" - "Tesla Model S became faster than the Bugatti Chiron"), which can cause problems with eventual text processing.
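A small sketch of the kind of title cleaning this implies, assuming the raw titles may contain HTML entities; the clean_title helper is ours, and the Appendix uses a similar preprocess_text function:

import html
import re

def clean_title(title: str) -> str:
    """Decode HTML entities and keep only Latin/Cyrillic letters and digits."""
    title = html.unescape(title)                        # e.g. '&quot;' -> '"'
    title = re.sub(r'[^a-zA-Zа-яА-Я0-9]+', ' ', title)  # drop punctuation, keep both alphabets
    return re.sub(r' +', ' ', title).strip().lower()

clean_title('&quot;Tesla Model S стана по-бърз от Bugatti Chiron&quot;')
# -> 'tesla model s стана по бърз от bugatti chiron'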

Data exploration

We started the data exploration part of the analysis by checking which topics were most popular in the given time frame. Not surprisingly, the topic with the highest visit rate was related to the Coronavirus outbreak. However, it was valuable to see that there was high interest in some seasonal topics - Velikden (Easter) and Kato dve kapki voda (a TV show). Other topics with high interest were the regular ones. We visualize the most popular topics with a word cloud:

(Figure: word cloud of the most popular topics)

During the given timeframe the popularity of the Coronavirus subtopic behaves in an interesting way. We can see that before Easter (April 19) the activity was decreasing, perhaps because people were more interested in holiday articles. Government intentions to impose restrictions for the holidays could explain the increased activity before the next holiday season in Bulgaria at the beginning of May, and the subsequent drop in visits.

(Figure: daily visits to the Coronavirus subtopic)

The seasonality of Easter-related news is shown in the following graph:

(Figure: daily visits to the Easter subtopic)

Obviously, after Easter passed, the popularity of this topic fell dramatically.

Kato dve kapki voda is a popular Bulgarian TV show which airs on Monday evenings. The following graph shows that this topic gathers the most interest on the Tuesday following a new episode (days of the week are placed on the x-axis, starting from Monday = 0).

(Figure: visits to the Kato dve kapki voda subtopic by day of the week)

We also investigated whether there is a difference in article visits between weekdays and weekends. The next two graphs show that on weekdays users tend to start reading news earlier and show a high activity level in the first part of the day, whereas at the weekend people read news less actively but their interest rises later in the evening (the first graph is for weekdays, the second for weekends).

(Figures: hourly visit activity on weekdays and at weekends)

More interestingly, at the weekend visitors prefer to step away from the Coronavirus subtopic and "study" other spheres of life as well (comparing the proportions between subtopics to those on weekdays).

(Figure: subtopic proportions on weekdays vs weekends)

Data Preparation

In this part of the analysis we did feature creation and transformation. The features created and used in the modelling were:

  • freshness of article - the time between the first opening of the article and 00:00 of the day of prediction (for this purpose we assumed that the publishing time is equal to the first click from a user, as we did not have the actual publication time),
  • popularity of article - the number of times the article was viewed by all other users by 00:00 of the day of prediction,
  • time of day,
  • weekend/weekday,
  • vector representations of article titles - we used a word2vec model pretrained on Bulgarian to convert each title to its vector representation (dim = 100) and used this feature in our further analysis (see the sketch below).
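A minimal sketch of the title-to-vector step, assuming a pretrained Bulgarian word2vec model loaded with gensim; the model path mirrors the one used in the Appendix, and the averaging of word vectors follows the create_vector helper there:

import numpy as np
import gensim

# Pretrained Bulgarian word2vec vectors (same local path as in the Appendix)
model = gensim.models.KeyedVectors.load_word2vec_format('./word2vec_bg/model.bin', binary=True)

def title_vector(title: str) -> np.ndarray:
    """Average the word2vec vectors of the title's words (zeros for out-of-vocabulary words)."""
    words = title.lower().split()
    vectors = [model[w] if w in model else np.zeros(model.vector_size) for w in words]
    return np.mean(vectors, axis=0)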

User-user interaction-based clustering

The most obvious approach to news article prediction is to find some pattern of user-user or item-item interaction in the form of clustering, and then to recommend the best subtopic, followed by matching the best article inside that subtopic (the method for finding the best article within a given subtopic is described later, in the section "Personalised subtopic distribution-based approach").

However, due to limited resources and the rather big dataset (about 2M visitors and 56K articles), we were able to apply the methods in this section only to a subset of the data; they nonetheless provide additional insights and visualizations.

Collaborative filtering

One of the standard approaches to making recommendations is collaborative filtering (it can be used both for building clusters and for making direct recommendations of articles based on the matrix of interactions). In the Appendix you can find the code for collaborative filtering applied to a sample data set (between 28 April and 29 April) and suggesting an article on 30 April; a condensed sketch of the core steps is also given below.
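An illustrative, simplified sketch of the user-item matrix and cosine-similarity steps; the visitor_label and page_label column names follow the encoding used in the Appendix, and the scoring of unseen articles is reduced here to a plain neighbour count:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend_for(user: int, data: pd.DataFrame, n_neighbours: int = 10) -> int:
    """Return the page_label most read by the user's nearest neighbours
    that the user has not opened yet (simplified collaborative filtering)."""
    # user x article interaction matrix (click counts)
    mat = data.groupby(['visitor_label', 'page_label']).size().unstack(fill_value=0)
    cosine = cosine_similarity(mat)
    np.fill_diagonal(cosine, 0)
    sim = pd.DataFrame(cosine, index=mat.index, columns=mat.index)
    neighbours = sim.loc[user].nlargest(n_neighbours).index
    # articles read by the neighbours but not by the user
    scores = mat.loc[neighbours].sum(axis=0)
    scores = scores[mat.loc[user] == 0]
    return scores.idxmax()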

Word2Vec + t-SNE

This approach combines embedding representations of the articles' titles in Bulgarian (using pre-trained vectors) with subsequent spatial clustering, and has the following pipeline:

  • convert the titles of the articles each user has read into vector representations and calculate the centroid for each user (using the idea of a medoid on both steps)
  • use the t-SNE algorithm to get the final clusters

Applying this method to a subset of users gives the 2D representation of visitors below:

(Figure: t-SNE clusters of visitors)

The two clusters above were separated visually, using a vertical line at around x = 18 to make the split. Below is the distribution of the articles' subtopics by cluster:

(Figure: distribution of article subtopics by t-SNE cluster)

We can see that both clusters share similar top news subtopics. However, it is easily noticed that one of the clusters focuses mostly on news related to the Coronavirus outbreak - almost two thirds of the articles read by those visitors fall into this category. Meanwhile, the other group of users reads this type of news less than half of the time.

Graph based community detection

In graph theory there is a well-known problem of community detection, which stands for finding clusters of nodes inside a graph. As we can treat users as nodes and commonly read articles as edges (with the number of common articles as the edge weight), we are able to apply a community detection algorithm to this data structure.

For example, the Girvan–Newman algorithm could be used: it detects communities by progressively removing edges from the original network, and the connected components of the remaining network are the communities. Instead of trying to construct a measure that tells us which edges are the most central to communities, the Girvan–Newman algorithm focuses on edges that are most likely "between" communities. A minimal sketch of this construction is given below.
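A minimal sketch of building the user-user graph and running Girvan–Newman with networkx, assuming a frame with visitor and pageTitle columns (the full version, which removes the most central edge by weighted betweenness, is in the Appendix):

import networkx as nx
from itertools import combinations
from networkx.algorithms.community import girvan_newman

def user_communities(data):
    """Users are nodes; an edge weight counts the articles both users read."""
    G = nx.Graph()
    G.add_nodes_from(data.visitor.unique())
    for _, group in data.groupby('pageTitle'):
        for u, v in combinations(group.visitor.unique(), 2):
            w = G[u][v]['weight'] + 1 if G.has_edge(u, v) else 1
            G.add_edge(u, v, weight=w)
    # first level of the Girvan-Newman hierarchy
    return [set(c) for c in next(girvan_newman(G))]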

It can be seen in the figure below (produced on a random subsample of users) that there is a dense connected component, which we assume to be a cluster, and other weakly connected or isolated nodes (e.g. one-time visitors):

(Figure: user-user graph of a random subsample of visitors)

As the task was to predict the "next best article for the next day", we reformulate it as predicting the "first article of the next day". This task is more direct; however, our method is easy to generalize to predicting just the "next best article" by taking the mode of the subtopics of articles visited at a particular time of day / day of the week, instead of the articles that are first each day.

Personalised subtopic distribution-based approach

The proposed approach follows the idea of personally recommending (as a first step) the best matching subtopic, based on evaluating the distribution of previously visited articles. As we are predicting the first article to be opened each day, we analyze the corresponding distribution. The figure below shows the frequencies of the subtopics of the first daily visited articles for the most active user (visitor t5KxyQy8pKqIWmC0zH4cGw==) during the given time period.

(Figure: subtopic frequencies of the first daily article for the most active user)

In the second step, after obtaining the mode subtopic (i.e. the one represented by the first bar on the distribution plot), we need to proceed from recommending a subtopic to recommending a particular article. For that reason we estimate a rating for each article as a combination of its freshness and popularity (these two metrics were described above):

\begin{equation*} \text{rating}_{i} = \frac{\log_{2} (1 + \text{popularity}_{i})}{\log_{10} (1 + \text{freshness}_{i})} \end{equation*}

1) Adding 1 is required in order to avoid calculating the logarithm at 0.

2) Taking logarithms with different bases relates to the scales of popularity and freshness. We want to emphasize that the possibility of choosing the bases of the logarithms does not break the concept of a non-parametric approach: the bases relate to the nature of the freshness and popularity variables and do not depend on a particular subset of data, so any attempt to adjust them to the data would lead to overfitting.

The best article for a particular user is then simply the one with the highest rating from the suggested subtopic. A minimal numeric sketch of the rating is given below.
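A minimal sketch of the rating exactly as written in the formula above (log base 2 for popularity, log base 10 for freshness); the helper name is ours, and note that the Appendix code is an earlier version that uses natural logarithms for both, as discussed in the comments:

import numpy as np

def article_rating(popularity: float, freshness_hours: float) -> float:
    """rating_i = log2(1 + popularity_i) / log10(1 + freshness_i)."""
    return np.log2(1 + popularity) / np.log10(1 + freshness_hours)

# e.g. an article clicked 500 times and first seen 12 hours ago
article_rating(500, 12)  # ~ 8.05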

Advantages of the approach:

  • interpretability (in contrast with black-box ML methods, our approach is humanly understandable)
  • personalisation (recommendations take into account the interests of a particular user, unlike the output of a model trained on a bunch of users that "aggregates" their behaviour)
  • speed of calculation (the prediction for 10,000 users is done in 3.07 seconds)
  • cheapness of implementation (deep learning methods would require expensive GPUs for training NLP models)
  • novel methodology (no proprietary or open-source recommender code has been used)

Evaluation

You can find our predictions here: https://yadi.sk/d/fCi9L799v0uutg

Appendix (code)

Imports

In [ ]:
import pandas as pd
import datetime
import os
import scipy
import re
import time
import gensim
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from itertools import combinations
from sklearn.preprocessing import LabelEncoder
from networkx import edge_betweenness_centrality as betweenness
from networkx.algorithms.community import girvan_newman
from sklearn.metrics.pairwise import cosine_similarity
from wordcloud import WordCloud
import warnings
import seaborn as sns
sns.set(style="darkgrid")
warnings.filterwarnings('ignore')

Utils

In [ ]:
def calc_popularity(data):
    df = data.copy()
    counts = df['pageTitle'].value_counts()
    df['pageVisits'] = df['pageTitle'].apply(lambda x: counts[x])
    return df

def calc_publication_time(data):
    df = data.copy()
    df.sort_values(by=['pageTitle', 'time'], inplace=True)  # sort the copy, not the caller's frame
    first_time = df.groupby('pageTitle').first()['time']    # earliest click per article ~ publication time
    df['pub_time'] = df['pageTitle'].apply(lambda x: first_time[x])
    df.reset_index(inplace=True, drop=True)
    return df

def calc_freshness(data):
    df = data.copy()
    # .astimezone(utc)
    df['freshness'] = df['pub_time'].apply(lambda x: (datetime.datetime(2020, 5, 13, 0, 0)-x).total_seconds()/3600) # in hours
    return df

def take_first_click(data):
    df = data.copy()
    df.sort_values(by=['visitor', 'pageTitle', 'time'], inplace=True)
    df.reset_index(drop=True, inplace=True)
    df = df.groupby(['visitor', 'pageTitle']).first().reset_index()
    return df

def most_central_edge(G):
    centrality = betweenness(G, weight='weight')
    return max(centrality, key=centrality.get)

def preprocess_title(data):
    df = data.copy()
    df.pageTitle = df.pageTitle.apply(lambda x: x.split('|')[0])
    return df

def get_subtopic(url):
    url = url.replace('https://', '', 1)
    url = url.replace('http://', '', 1)  # also strip a plain http prefix
    url = url.replace('www.vesti.bg', '', 1)
    url = url.split('/')
    url = [string for string in url if string != ""]
    joined_url = ''.join(url)
    if 'razvlechenia' in joined_url:
        topic = 'razvlechenia'
    elif 'testove' in joined_url:
        topic = 'testove'
    elif 'koronavirus' in joined_url or 'covid' in joined_url:
        topic = 'tema-koronavirus'
    elif not joined_url or len(url) == 1 or '?' in joined_url:
        topic = 'no_topic'
    elif len(url) == 2 or url[0] == url[1]:
        topic = url[0] + '_others'
    else:
        topic = url[1]
    
    topic = topic.replace('tema-', '', 1)
    topic = re.sub('[^a-zA-Z]+', ' ', topic)
    topic = ' '.join(topic.split(' '))
    return topic

def preprocess_text(text):
    text = text.lower()
    text = re.sub('[^a-zA-Zа-яА-Я0-9]+', ' ', text)  # keep Latin/Cyrillic letters and digits
    text = re.sub(' +', ' ', text)
    text = text.replace('"', '')
    return text.strip()

def create_vector(text):
    vector_list = []
    for word in preprocess_text(text).split():  # iterate over words, not characters
        try:
            vector_list.append(model[word])
        except KeyError:
            vector_list.append(np.zeros_like(template_word))
    return np.mean(vector_list, axis=0)

def cos_sim(mat):
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np
    import pandas as pd
    cosine = cosine_similarity(mat)
    np.fill_diagonal(cosine, 0 )
    similarity_users =pd.DataFrame(cosine,index=mat.index)
    similarity_users.columns=mat.index
    return similarity_users

def find_n_neighbours(similarity,n):
    import numpy as np
    import pandas as pd
    order = np.argsort(similarity.values, axis=1)[:, :n]
    similarity = similarity.apply(lambda x: pd.Series(x.sort_values(ascending=False).iloc[:n].index, 
          index=['top{}'.format(i) for i in range(1, n+1)]), axis=1)
    return similarity

def User_item_score(user, df, mat, neighbours,similarity):
    import pandas as pd
    articles_read_by_user = mat.columns[mat.iloc[user,:]>0].tolist()
    a = neighbours[neighbours.index==user].values.squeeze().tolist()
    neighbours_articles = df[df.visitor_label.isin(a)]['page_label'].unique()
    aritcles_for_consideration = list(set(neighbours_articles)-set(articles_read_by_user))
    score = []
    for item in aritcles_for_consideration:
        c = mat.loc[:,item]
        d = c[c.index.isin(a)]
        f = d[d.notnull()]
        index = f.index.values.squeeze().tolist()
        corr = similarity.loc[user,index]
        fin = pd.concat([f, corr], axis=1)
        fin.columns = ['adj_score','correlation']
        fin['score']=fin.apply(lambda x:x['adj_score'] * x['correlation'],axis=1)
        nume = fin['score'].sum()
        deno = fin['correlation'].sum()
        final_score = (nume/deno)
        score.append(final_score)

    # Rank the candidate articles by their weighted-neighbour score
    data = pd.DataFrame({'articles': aritcles_for_consideration, 'score': score})
    data = data.sort_values(by='score', ascending=False)
    data = data.reset_index(drop=True)
    return data

def calc_weekday(df):
    days = list(range(12, 31)) + list(range(1, 13))
    wd_days = [days[0]]
    days = days[1:]
    i = 0
    for day in days:
        if i == 5:
            i += 1
            wd_days.append(day)
            continue
        if i == 6:
            i = 0
            wd_days.append(day)
            continue
        i += 1
    n = 0
    month = '04'
    dates = []
    for day in wd_days:
        dates.append(f'2020-{month}-{day}')
        if day == 26:
            month = '05'
#     print(dates)
    day_df = df[df.time.dt.normalize().isin(dates)]
    open_count = day_df.time.count()
    n = len(dates)
    average_open = open_count / n
    days_open = day_df.time.dt.weekday.value_counts().sort_index()
    days_open = days_open / average_open
    return days_open

def cluster_topics(cluster_data,cluster,data):
    c0 = pd.DataFrame(cluster_data[cluster_data.cluster==cluster]['visitor'].reset_index(drop=True))
    c0.columns = ['visitor']
    c0 = pd.merge(c0,data[['visitor','subtopic']], how='left', on='visitor')
    c0_top_topics = pd.DataFrame(c0.subtopic.value_counts()).reset_index()
    c0_top_topics.columns = ['subtopic','n']
    c0_top_topics['prob'] = c0_top_topics.n/sum(c0_top_topics.n)
    c0_top_topics['cluster'] = cluster
    return c0_top_topics

Data loading

In [ ]:
start_time = time.time()
folder_path = './vesti/'
data = pd.DataFrame()
for filename in os.listdir(folder_path):
# data = pd.read_csv('./vesti_sample.csv')
    df = pd.read_csv(folder_path+filename)
    data = pd.concat([data, df])
# Loading word2vec vocab
model = gensim.models.KeyedVectors.load_word2vec_format('./word2vec_bg/model.bin', binary=True)
template_word = model['на'] # needed for making zeros_like array
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Initial data preprocessing

In [ ]:
start_time = time.time()
# Converting to datetime
data.time = pd.to_datetime(data.time.str.slice(stop=19), format='%Y-%m-%d %H:%M:%S')
data['date'] = data.time.apply(lambda x: x.date())
data['weekday'] = data.time.apply(lambda x: x.weekday() + 1)
# data = preprocess_title(data)
data['subtopic'] = data['pagePath'].apply(lambda x: get_subtopic(x))
data = calc_popularity(data)
data = calc_publication_time(data)
data = calc_freshness(data)
data = take_first_click(data)
data = data[data.subtopic != 'velikden'] # removing the 'velikden' topic in order to mitigate bias
data.reset_index(drop=True, inplace=True)
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Methods based on user-user interaction

1) Colaborative filtering

In [ ]:
# Encode the titles/urls
le_title = LabelEncoder()
le_title.fit(data.pagePath)
data['page_label'] = le_title.transform(data.pagePath)

# Encode the users
le_user = LabelEncoder()
le_user.fit(data.visitor)
data['visitor_label'] = le_user.transform(data.visitor)

user_summary = data.groupby('visitor_label').agg({'freshness':'mean', 'pageVisits':'mean'}).reset_index()  # 'freshness' is created by calc_freshness above

mat = data.groupby(['visitor_label', 'page_label']).size().unstack(fill_value=0)
sim = cos_sim(mat)  # helpers are defined above in this notebook
n = find_n_neighbours(sim, 10)

visitor=200
output = User_item_score(visitor, data, mat, n, sim)
output = (pd.merge(output, data[['page_label', 'pub_time', 'pageVisits']],
                           how='left', left_on='articles', right_on='page_label')).drop_duplicates().reset_index(drop=True)
output = calc_freshness(output)  # freshness recomputed from the merged pub_time column
output['pop_flag'] = output.pageVisits > user_summary[user_summary.visitor_label == visitor]['pageVisits'].values[0]
output['fresh_flag'] = output.freshness < user_summary[user_summary.visitor_label == visitor]['freshness'].values[0]
recommend = output[(output.pop_flag == True) | (output.fresh_flag == True)]['articles'].iloc[0]
recommend_article = le_title.inverse_transform(np.array(recommend).reshape(-1))[0]

test_user = le_user.inverse_transform(np.array(visitor).reshape(-1))[0]
print(f"Recommended article for visitor {test_user}: {recommend_article}")

2) Word2Vec-based t-SNE clustering

In [ ]:
top_users = pd.DataFrame(data.visitor.value_counts()).reset_index(drop=False).iloc[:2000,0].to_list()

titles = data[['pageTitle', 'freshness', 'pageVisits']].drop_duplicates().reset_index(drop=True) # unique articles

users_list = []
centroids_list = []
for user in top_users:
    users_list.append(user)
    centroid = [create_vector(title) for title in data[data.visitor == user].pageTitle]
    centroids_list.append(np.mean(centroid, axis=0))
    
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=80, learning_rate=400, n_iter=50000, min_grad_norm=1e-20, verbose=1)
user_clustered = pd.DataFrame(tsne.fit_transform(centroids_list))

user_clustered.columns = ['x','y']
user_clustered['cluster'] = np.where(user_clustered.x>18,1,0)
user_clustered['visitor'] = top_users
ax = sns.scatterplot(x="x", y="y", hue="cluster", data=user_clustered)

#topics by clusters
c0_top_topics = cluster_topics(user_clustered,0,data)
c1_top_topics = cluster_topics(user_clustered,1,data)

top_topics = pd.concat([c0_top_topics, c1_top_topics])  # DataFrame.append is deprecated
# get top 5 subtopics by cluster
top_topics = top_topics[((top_topics.prob>0.012) & (top_topics.cluster==0)) |((top_topics.prob>0.02) & (top_topics.cluster==1))]

# Draw a nested barplot of the top subtopic shares by cluster
g = sns.catplot(x="prob", y="subtopic", hue="cluster", data=top_topics, height=6, kind="bar", palette="muted")
g.despine(left=True)
g.set_xlabels("topic probability")

3) Graph-based user-user communities

In [ ]:
# Creating graph
G = nx.Graph()
G.add_nodes_from(data.visitor.unique())
start_time = time.time()
for title in data.pageTitle.unique():
    for node1, node2 in combinations(data[data.pageTitle == title].visitor.unique(), 2):
        if G.has_edge(node1, node2):
            # we added this one before, just increase the weight by one
            G[node1][node2]['weight'] += 1
        else:
            # new edge. add with weight=1
            G.add_edge(node1, node2, weight=1)
            
# Communities detection
comp = girvan_newman(G, most_valuable_edge=most_central_edge)
c = tuple(sorted(c) for c in next(comp))
# c = list(greedy_modularity_communities(G))

plt.figure(figsize=(20, 15))
pos = nx.spring_layout(G)
# nx.draw(G, pos, node_size=5)
Gcc = sorted(nx.connected_components(G), key=len, reverse=True)
G0 = G.subgraph(Gcc[0])
# nx.draw(G0, pos, node_size=20)
elarge = [(u, v) for (u, v, d) in G0.edges(data=True) if d['weight'] > 1]
esmall = [(u, v) for (u, v, d) in G0.edges(data=True) if d['weight'] <= 1]
pos = nx.spring_layout(G0)  # positions for all nodes
# nodes
nx.draw_networkx_nodes(G0, pos, node_size=20, node_color='b')
# edges
nx.draw_networkx_edges(G0, pos, edgelist=elarge, width=2, edge_color='r')
nx.draw_networkx_edges(G0, pos, edgelist=esmall, width=1, alpha=0.5, style='dashed')

plt.axis('off')
plt.show()

Personalized subtopic distribution-based recommender system
In [ ]:
# Creating best articles list
# For each subtopic we find the article with the best rating
titles = data[['pageTitle', 'freshness', 'pageVisits', 'subtopic']].drop_duplicates().reset_index(drop=True)
# NB: the formula in the text uses log2 for popularity and log10(1 + freshness);
# this first version of the code uses natural logarithms for both (see the discussion in the comments)
titles['rating'] = np.log(1 + titles['pageVisits'])/np.log(titles['freshness'])
titles.sort_values(by=['subtopic', 'rating'], ascending=False, inplace=True)
titles.reset_index(drop=True, inplace=True)
best_titles = titles.groupby(['subtopic']).first().reset_index()

# For each user we find the most popular subtopic among the first articles opened each day
# And then just recommend him the best article from the table created before
# (sorting by time ensures .first() picks the earliest click of the day)
results = data.sort_values('time').groupby(['visitor', 'date']).first().groupby(['visitor', 'subtopic'])['subtopic'].agg('count')
results = results.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values(ascending=False).head(1)).to_frame()
results.rename(columns={'subtopic': 'counts'}, inplace=True)
results.reset_index(inplace=True)
results = pd.merge(results, best_titles, on=['subtopic'])[['visitor', 'pagePath']]
results.rename(columns={'visitor': 'VisitorID', 'pagePath': 'first_best_article'}, inplace=True)
# Write it to file
results.to_csv('results.csv', index=False)

Other plottings

In [ ]:
### word cloud #########
#########################
def get_topics_count():
    result = []
    keys = {}
    for i in range(26):
        if i < 10:
            i = '0' + str(i)
        else:
            i = str(i)
        df = pd.read_csv(f"data/vest0000000000{i}")
        for url in df.pagePath:
            url = url.replace('https://', '', 1)
            url = url.replace('http://', '', 1)  # also strip a plain http prefix
            url = url.replace('www.vesti.bg', '', 1)
            url = url.split('/')
            url = [string for string in url if string != ""]
            joined_url = ''.join(url)
            if 'razvlechenia' in joined_url:
                topic = 'razvlechenia'
            elif 'testove' in joined_url:
                topic = 'testove'
            elif 'koronavirus' in joined_url or 'covid' in joined_url:
                topic = 'tema-koronavirus'
            elif not joined_url or len(url) == 1 or '?' in joined_url:
                topic = 'no_topic'
            elif len(url) == 2 or url[0] == url[1]:
                topic = url[0] + '_others'
            else:
                topic = url[1]

            topic = topic.replace('tema-', '', 1)
            topic = re.sub('[^a-zA-Z]+', ' ', topic)
            topic = ' '.join(topic.split(' '))
            idx = keys.get(topic)
            if idx is None:
                idx = len(result)
                keys[topic] = idx
                row = {'topic': topic, 'count': 0}
                result.append(row)
            else:
                row = result[idx]
            row['count'] = row['count'] + 1
    return result
            
r = get_topics_count()
count_df = pd.DataFrame(r)
count_df = count_df.set_index('topic')
count_df = count_df.sort_values('count', ascending=False)
count_dict = {i.replace(' others', ''): row['count'] for i, row in count_df.iterrows()}

wc = WordCloud(background_color="white", contour_color='steelblue', width=800, height=400)

# generate word cloud
wc.fit_words(count_dict)

# store to file
wc.to_file("wordcloud.png")

### average activity per hour ###
#################################

def calc_work_day_hours(df):
    days = list(range(12, 31)) + list(range(1, 13))
    days = days[1:]
    work_days = []
    i = 0
    for day in days:
        if i == 5:
            i += 1
            continue
        if i == 6:
            i = 0
            continue
        work_days.append(day)
        i += 1
    n = 0
    month = '04'
    dates = []
    for day in work_days:
        dates.append(f'2020-{month}-{day}')
        if day == 30:
           month = '05'
    day_df = df[df.time.dt.normalize().isin(dates)]
    open_count = day_df.time.count()
    n = len(dates)
    average_open = open_count / n
    hours_open = day_df.time.dt.hour.value_counts().sort_index()
    hours_open = hours_open / average_open
    return hours_open

hours_open_work = calc_work_day_hours(data)

def calc_weekend_day_hours(df):
    days = list(range(12, 31)) + list(range(1, 13))
    wd_days = [days[0]]
    days = days[1:]
    i = 0
    for day in days:
        if i == 5:
            i += 1
            wd_days.append(day)
            continue
        if i == 6:
            i = 0
            wd_days.append(day)
            continue
        i += 1
    n = 0
    month = '04'
    dates = []
    for day in wd_days:
        dates.append(f'2020-{month}-{day}')
        if day == 26:
           month = '05'
    day_df = df[df.time.dt.normalize().isin(dates)]
    open_count = day_df.time.count()
    n = len(dates)
    average_open = open_count / n
    hours_open = day_df.time.dt.hour.value_counts().sort_index()
    hours_open = hours_open / average_open
    return hours_open

hours_open_wd = calc_weekend_day_hours(data)

plt.figure(figsize=(20,10))
ax = sns.barplot(y=hours_open_work * 100, x=[i for i in range(24)], palette="Blues_d")
ax.set(xlabel='hours', ylabel='number of visits relative to average')

plt.figure(figsize=(20,10))
ax = sns.barplot(y=hours_open_wd * 100, x=[i for i in range(24)], palette="Blues_d")
ax.set(xlabel='hours', ylabel='number of visits relative to average')


### average activity per hour for first article in a day ###
############################################################


first_article = data.groupby(['visitor', 'date']).first().reset_index()
hours_open_work = calc_work_day_hours(first_article)
hours_open_wd = calc_weekend_day_hours(first_article)

plt.figure(figsize=(20,10))
ax = sns.barplot(y=hours_open_work * 100, x=[i for i in range(24)], palette="Blues_d")
ax.set(xlabel='hours', ylabel='first visit in a day')

plt.figure(figsize=(20,10))
ax = sns.barplot(y=hours_open_wd * 100, x=[i for i in range(24)], palette="Blues_d")
ax.set(xlabel='hours', ylabel='first visit in a day')



### change of activity for topic in time ###
############################################

corona = data[data.subtopic == 'koronavirus']
days = list(range(12, 31)) + list(range(1, 13))
month = '04'
dates = []
for day in days:
    dates.append(f'2020-{month}-{day if day > 9 else "0"+str(day)}')
    if day == 30:
       month = '05'
dates = dates[1:]
corona_popular = pd.Series([], dtype='int64')
for date in dates:
    day = corona[corona['time'].dt.normalize() == date]
    corona_popular.at[date] = len(day)
corona_popular.index = corona_popular.index.str.replace('2020-', '')

plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
ax = sns.lineplot(data=corona_popular, marker='o')
plt.xticks(rotation=65)
ax.set(xlabel='day', ylabel='popularity')
plt.show()


velikden = data[data.subtopic == 'velikden']
days = list(range(12, 31)) + list(range(1, 13))
month = '04'
dates = []
for day in days:
    dates.append(f'2020-{month}-{day if day > 9 else "0"+str(day)}')
    if day == 30:
       month = '05'
dates = dates[1:]
velikden_popular = pd.Series([], dtype='int64')
for date in dates:
    day = velikden[velikden['time'].dt.normalize() == date]
    velikden_popular.at[date] = len(day)
velikden_popular.index = velikden_popular.index.str.replace('2020-', '')

plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
ax = sns.lineplot(data=velikden_popular, marker='o')
plt.xticks(rotation=65)
ax.set(xlabel='day', ylabel='popularity')
plt.show()


6 thoughts on “NewsCo: rapid non-parametric recommender algorithm for NetInfo news articles”

  1.

    Using a graph to visualize the mutual interest of the users in an article is actually quite similar to the collaborative filtering matrix. The ‘huge black hole’ community corresponds to the common interest of most of the users, probably in the COVID-19 articles.

    There are other usages of graphs, specifically using graph convolutional neural networks (GCN), which could bring you closer to the task of article recommendation. Specifically, if both the articles and the users are nodes, and edges signify the reading of an article by a user, using algorithms for “link prediction” can quickly predict which articles are going to be the next “best thing”. For instance, by summing the predicted edges between users and a newly published article.

    Overall very nicely written, and a great, clear, video as well. Great job!

    1.

      Hi,

      Thanks for the feedback!

      We agree that the collaborative filtering matrix and the graph adjacency matrix are essentially the same data structure.
      We also share the idea of formulating this problem as a link prediction problem in graph theory, as I have previously done research in this field, especially on co-authorship networks (https://dl.acm.org/doi/10.1145/3197026.3203911, https://link.springer.com/chapter/10.1007%2F978-3-030-11027-7_4, https://link.springer.com/chapter/10.1007%2F978-3-030-11027-7_3, https://peerj.com/articles/cs-172/). However, again due to limited time and resources, we decided to apply graph theory only for subset visualization, not for solving the target problem.

      And thanks for positive evaluation of video!

  2.

    Well written article and a good video – nice work!
    I like the fact that you have devised and implemented your own entire take on the issue. Here are some questions/observations:
    * It seems to me that you are making some assumptions about the data when defining your statistics (publication time, etc.) that have the potential to greatly affect the result.
    * do you guard against recommending an article that someone has already read?
    * your rating definition stresses the use of different bases – but that is a multiplicative constant for all ratings (hence little bearing on comparing different values) vs the same-base formula (and your code seems to use the same base?) – could you expand on this point further?

    Best regards,

    1.

      Hi,

      Thanks for the feedback!

      Answering to your questions:

      1) All the assumptions made are plausible and we do not see a potential effect on the result; for instance, the publication time is assumed to be the first click on an article. This point could be weak for some small tabloids, but NetInfo is a big enough organization that the time between publication and the first click should be negligible.
      2) Currently the approach allows recommending an article that the user has already read. We agree that it is a fair point to disable such a possibility, and in fact it could easily be done by restricting the pool of ranked articles inside the suggested subtopic to those having no intersection with already viewed articles – by the way, we had already written code for that in the collaborative filtering approach, but in a hurry we forgot to apply it to the final solution.
      3) In this article I have unfortunately copied the first version of the code for this particular step, where we had equal bases. However, the idea of using logarithms with different bases is better scaling and linearization: equal bases just introduce a scaling property, while suitable bases guarantee a linear influence of both freshness and popularity, so that the two metrics contribute equally to the final rating – we consider this assumption fair with respect to visitors obtaining both fresh and “on-hype” recommendations.

      Thanks for positive evaluation of article and video quality!

  3.

    Hi, Vrategov, Pscience 🙂
    I agree with Liad – your article & video are very informative, so it was easy for me to understand your work. And just to add to the other comments: what is missing for me is the actual content of the articles – this is in fact what readers are interested in… Keeping in mind the case formulation, I would focus more on providing the whole text of the articles and on the NLP part of the workflow. Nevertheless, I like the analysis, which you made in the beginning, the techniques you used, the recommendation on individual level, etc.

    1.

      Hello, thanks for your positive feedback.

      I found myself often clicking on news because of the title, and we see a lot of news agencies trying to catch visitors' attention through it. The title is meant to give the main point of the article, and we believe it gives us enough information about the visitor's interests.

      Moreover, we argue that using the article's text as an input is the wrong strategy for model training. Looking at a tabloid's pages, the maximal information available about an article apart from its title is its subtitle. This means that the article's text is a hidden state, and the only reasonable approach to using it is a multi-layer model with hidden variables standing for the article's text and for the probability of a user being interested in that text. So, taking the article's title as an input, the joint distribution over the hidden layers of text and the target variable could be fitted - but without pretending that a hidden variable is a visible one.
