
Datathon2020 – NetInfo Article Recommender – Solution – nextClick

Giving users recommendations that catch their eye and match their preferences is the essential task of a recommendation system. The amount of data is growing significantly, and the idea is to extract knowledge from it. Taking advantage of similarities between users or between news articles provides useful information for predicting which article a user will find interesting.


  1. Introduction
    Users generate huge amounts of data, which is a driving force behind the use and improvement of Machine Learning (ML). The basic idea of many ML algorithms is to extract knowledge from unstructured data. Many online platforms want to meet users’ preferences and give recommendations similar to their previous choices. This is where recommendation systems come into play.
  2. Data
    The goal of this project is to recommend which article a user should visit next. The data is provided by NetInfo and covers articles from the website vesti.bg, gathered over the past 30 days. It consists of 23,504,707 rows with the following features: visitor, time, pageTitle and pagePath.
  3. Preprocessing
    Each article is viewed by many users, and the same user may view it many times. Recommendation systems usually rely on a rating that a user gives to an item, but here the only information is that a user visited an article. As a measure of how much a user prefers an article, we assumed that ‘a user has a higher preference for an article that he has visited many times’. We therefore generated a weight (rating score): normalized visits of an article = number of times article (A) has been visited by user (U) / total number of visits of user (U).
    Based on this assumption, the data was transformed into tuples (visitor, pageTitle, weight) for further modeling.
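The weighting described in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `normalized_visits` and the toy log are hypothetical, and the authors' actual preprocessing code is not shown in the article.

```python
from collections import Counter

def normalized_visits(log):
    """Turn (visitor, pageTitle) click records into (visitor, pageTitle, weight) tuples."""
    # Number of times article A has been visited by user U.
    pair_counts = Counter(log)
    # Total number of visits of user U, over all articles.
    user_totals = Counter(visitor for visitor, _ in log)
    return [(u, a, n / user_totals[u]) for (u, a), n in pair_counts.items()]

# Toy click log with made-up values; the real data has 23,504,707 rows.
log = [("u1", "A"), ("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "C")]
ratings = normalized_visits(log)
```

With this definition the weights of each user sum to 1, so a user with a single visit gets weight 1 for that article regardless of how much they actually liked it.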
  4. Model
    The task is implemented using collaborative filtering with matrix factorization. This method works in a latent space to discover a user’s unknown preferences for an item. The implementation takes predefined parameters: alpha (learning rate), beta (regularization parameter), k (dimension of the latent space) and the number of iterations. The test set is given as (user, article) pairs, and the prediction for each pair is the weight, that is, the user’s rating of the article. To find the best article for a user, the output is then grouped by user and the maximum value is taken as the user’s highest rating; the article with the highest rating is the user’s best next article.
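A minimal sketch of matrix factorization trained with stochastic gradient descent is given below, using the parameter names from the text (alpha, beta, k, iterations). Everything else is an assumption made for illustration: the function names `factorize`, `predict` and `recommend`, the random initialization, the exact update rule and the toy ratings are not taken from the authors' implementation.

```python
import random

def factorize(ratings, k=2, alpha=0.05, beta=0.02, iterations=500, seed=0):
    """Fit k-dimensional latent vectors to (user, article, weight) tuples with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({a for _, a, _ in ratings})
    P = {u: [rng.random() for _ in range(k)] for u in users}  # user factors
    Q = {a: [rng.random() for _ in range(k)] for a in items}  # article factors
    for _ in range(iterations):
        for u, a, r in ratings:
            err = r - sum(p * q for p, q in zip(P[u], Q[a]))
            for f in range(k):
                p, q = P[u][f], Q[a][f]
                # Gradient step on the squared error with L2 regularization.
                P[u][f] += alpha * (err * q - beta * p)
                Q[a][f] += alpha * (err * p - beta * q)
    return P, Q

def predict(P, Q, user, article):
    """Predicted weight: dot product of the user and article latent vectors."""
    return sum(p * q for p, q in zip(P[user], Q[article]))

def recommend(P, Q, pairs):
    """Score (user, article) pairs, group by user, keep the highest-scoring article."""
    best = {}
    for user, article in pairs:
        score = predict(P, Q, user, article)
        if user not in best or score > best[user][1]:
            best[user] = (article, score)
    return {user: article for user, (article, _) in best.items()}

# Toy (visitor, pageTitle, weight) tuples with made-up values.
ratings = [("u1", "A", 0.8), ("u1", "B", 0.2), ("u2", "A", 0.5),
           ("u2", "C", 0.5), ("u3", "B", 0.3), ("u3", "C", 0.7)]
P, Q = factorize(ratings)
top = recommend(P, Q, [("u1", "A"), ("u1", "B"), ("u3", "B"), ("u3", "C")])
```

A pure-Python loop like this is only practical for toy inputs; it is meant to show the mechanics, not to handle a dataset of this size.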
  5. Results
    To calculate results, the dataset is split into a train set and a test set. The train set consists of records up to 12.05, while the test set contains one day of records from 12.05.
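A time-based split of this kind might look like the sketch below. It rests on several assumptions that the article does not confirm: the `time` field is taken to be an ISO-formatted timestamp, 12.05 is read as 12 May 2020, the train set is taken to hold only records from before that day, and `split_by_day` is a hypothetical helper name.

```python
from datetime import date, datetime

def split_by_day(records, test_day):
    """Records before test_day go to train; records on test_day go to test."""
    train, test = [], []
    for visitor, time, page_title in records:
        day = datetime.fromisoformat(time).date()  # assumes ISO timestamps
        if day < test_day:
            train.append((visitor, time, page_title))
        elif day == test_day:
            test.append((visitor, time, page_title))
    return train, test

# Toy records with made-up values.
records = [
    ("u1", "2020-05-10T09:15:00", "A"),
    ("u1", "2020-05-11T18:40:00", "B"),
    ("u2", "2020-05-12T07:05:00", "A"),
    ("u1", "2020-05-12T21:30:00", "C"),
]
train, test = split_by_day(records, date(2020, 5, 12))
```

Splitting by time rather than at random keeps the evaluation honest: the model is asked about visits that happened after everything it was trained on.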
  6. Discussion
    The provided model is slow for such a large dataset. It should use a library implementation of matrix factorization, or possibly be implemented with TensorFlow.


3 thoughts on “Datathon2020 – NetInfo Article Recommender – Solution – nextClick”

  1.

    1. I would suggest removing duplicates over article and user – it does not make sense to me to count how many times a user read an article
    2. I would suggest using NLP to derive topics or categories from the titles
    3. What is “maunfinishedtrix”?
    4. What is PQ decomposition?
    5. Why don’t you use an out-of-the-box matrix factorization implementation?

    I do not see results from your model.

  2.

    The article goes in the right direction: discusses content-based vs. collaborative filtering. It feels a bit short though…

    At the end, the authors are in favor of collaborative filtering. Why is that? How about combining both? Shameless self-promotion: http://proceedings.mlr.press/v28/georgiev13.html

    Having a neural model is nice these days, but it is unclear to me what model exactly. What is “maunfinishedtrix factorization”?

    Were there any experiments actually performed? I see no results.

    Any ideas? Any insights from the data? Any graphs? What is next?

  3.

    Hi guys, very nice work, and indeed in the right direction.

    I understand that “maunfinishedtrix factorization” is a copy-paste mistake. It should have been “unfinished matrix factorization” 😉
    I join my fellow reviewers in suggesting that you use an out-of-the-box implementation of such things in the future. Note that matrix factorization at scale can be costly. There are approaches such as “word2vec” that have been shown to perform implicit matrix factorization (https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf).

    You are correct – there is no clear measurement of whether the user liked the article or not: no time spent on the URL, nor a rating. Using repeat visits is a clever idea (!). But please consider that a user may also return to a previously read article because the information in it is timeless.
