Datathon cases

Datathon2020 – Article recommender case – provided by NetInfo

For the past two decades we have been witnessing never-before-seen access to information, while at the same time the volume of information has been growing exponentially. The rule of thumb that 90% of all data has been created in the past two years still stands. This has led to information overload and to the rise of recommendation systems. Guiding the user through this pool of data has proven critical for business success, as YouTube, Amazon, Netflix and many others show. Net Info has prepared another challenge: the next best article.


Business Overview

In an ever faster-paced world we want everything to happen quickly and easily. Our attention span and our frustration fuse are both getting shorter. Nowadays we expect web services to know us intimately and to anticipate our desires without us telling them. This is where recommendation systems kick in: they save us time and make us feel like a web service has exactly what we need. At a time when users have access to millions of news sources, it is vital to steer each user toward the ones most interesting to them.

The task is to create a model that recommends the next best article for a user. Besides taking into account the user's own reading history, one can also use the visit history of each article, thus learning from the behaviour of people with similar interests.
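One reading of "learning from people with similar interests" is a simple item-to-item collaborative-filtering baseline. A minimal sketch, assuming reading histories have already been grouped per user (all names and data here are illustrative, not part of the provided dataset):

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(histories):
    """Count how often two articles appear in the same user's history."""
    counts = defaultdict(int)
    for articles in histories.values():
        for a, b in combinations(sorted(set(articles)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend_next(user_articles, counts, k=2):
    """Score unseen articles by co-occurrence with the user's history."""
    seen = set(user_articles)
    scores = defaultdict(int)
    for (a, b), c in counts.items():
        if a in seen and b not in seen:
            scores[b] += c
    return [art for art, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

Counting co-occurrences is cheap enough to recompute frequently, which matters given the organisers' note (in the FAQ) that the model should be fast and refreshable.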

Expected output

At first glance, the task participants should try to solve is easy: produce an algorithm that, based on the articles a user has already visited on vesti.bg, automatically recommends, in real time, the next most interesting article on the site.
For this purpose, each team will receive very easy-to-read data (for example URL, ID and timestamp), but in a very large volume: two to three million rows.

Each model will be benchmarked against real data from the following day for recommending the next best article. In addition, bonus points will be given to teams using TensorFlow.

 

Data

Net Info has provided historical article visits per user. Each record consists of user ID, URL, timestamp, article title and article views.

Features:

  • User (anonymised)
  • Time
  • URL
  • Page Title
  • Page Views
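A natural first step is to group the raw rows into per-user reading histories ordered by time. A minimal pandas sketch, assuming comma-separated files with the columns above (the column names, file layout and sample values are assumptions, not the official schema):

```python
import io
import pandas as pd

# A tiny inline sample standing in for one of the provided files;
# the schema below is an assumption to be adjusted to the real files.
raw = io.StringIO(
    "u1,2020-05-01 10:00:00,https://www.vesti.bg/a1,Title A,100\n"
    "u1,2020-05-01 09:00:00,https://www.vesti.bg/a2,Title B,50\n"
    "u2,2020-05-01 11:00:00,https://www.vesti.bg/a1,Title A,100\n"
)
cols = ["user_id", "timestamp", "url", "page_title", "page_views"]
df = pd.read_csv(raw, names=cols, parse_dates=["timestamp"])

# Reconstruct each user's reading history in chronological order.
histories = (df.sort_values("timestamp")
               .groupby("user_id")["url"]
               .apply(list)
               .to_dict())
```

With two to three million rows this fits comfortably in memory, and the resulting per-user histories are the input most recommender baselines start from.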

You can access the sample dataset here…

You can access the full dataset here:

  • file 1
  • file 2
  • file 3
  • file 4
  • file 5
  • file 6
  • file 7
  • file 8
  • file 9
  • file 10
  • file 11
  • file 12
  • file 13
  • file 14
  • file 15
  • file 16
  • file 17
  • file 18
  • file 19
  • file 20
  • file 21
  • file 22
  • file 23
  • file 24
  • file 25
  • file 26

 

FAQ

1. Do we need to recommend an article or just a topic?

  • The main idea of the case is to predict the next best article (not topic) for the visitor. The article should come from the training dataset. We do not recommend parsing the text of all articles: such a model would be too resource-consuming. We need a fast-working model that can be refreshed, for example, on an hourly basis.

2. Will the evaluation be on the same users as in the training set, with their next article read on the website, or on a different set of users?

  • The evaluation of the model will be on the same users, using data for the next time period. The training dataset covers 30 days; the evaluation dataset covers the following day.

3. Will you provide a template for the output format?

  • We do not have a specific format for the output; it would be too limiting for all the possible solutions. We expect something like VisitorID, first_best_article, second_best_article for the next day/hour/minute.
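Following the shape suggested in the answer above, one way to emit such an output is a plain CSV. A minimal sketch (the column names come from the FAQ answer; the writer function and sample values are illustrative):

```python
import csv
import io

def write_recommendations(recs, out):
    """Write one row per visitor: VisitorID, first_best_article, second_best_article."""
    writer = csv.writer(out)
    writer.writerow(["VisitorID", "first_best_article", "second_best_article"])
    for visitor_id, (first, second) in recs.items():
        writer.writerow([visitor_id, first, second])

# Example: one visitor with two illustrative article IDs.
buf = io.StringIO()
write_recommendations({"u1": ("art_42", "art_7")}, buf)
```

Any tabular format with the same columns should work, since the organisers explicitly leave the format open.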

4. If someone is using Google Colab, this link to the dataset may be useful: colab

