Users generate huge amounts of data, which drives both the use and the improvement of Machine Learning (ML). The basic idea of many ML algorithms is to extract knowledge from unstructured data. Many online platforms want to match users' preferences and recommend items similar to their previous choices. This is where recommendation systems come into play.
The goal of this project is to recommend which article a user should visit next. The data is provided by NetInfo and covers articles from the website vesti.bg, gathered over the past 30 days. It consists of 23,504,707 rows with the following features: visitor, time, pageTitle and pagePath.
Each article is viewed by many users, and the same user may view it many times. Most recommendation systems rely on an explicit rating given by a user to an item, but in this case the only information is that a user visited an article. As a measure of how much a user prefers an article, we made the following assumption: ‘a user has a higher preference for an article that they have visited many times’. We therefore generated a weight (rating score) as: normalized visits of an article = number of times article (A) has been visited by user (U) / total number of visits of user (U).
Based on this assumption, the data was transformed into tuples (visitor, pageTitle, weight) for further modeling.
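The weighting described above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the project's actual preprocessing code: the function name `visit_weights` and the toy visitor ids and titles are made up for the example.

```python
from collections import Counter

def visit_weights(visits):
    """Turn a list of (visitor, pageTitle) visit records into
    (visitor, pageTitle, weight) tuples, where
    weight = visits of the article by the user / total visits of the user."""
    pair_counts = Counter(visits)                        # visits per (visitor, pageTitle)
    user_totals = Counter(visitor for visitor, _ in visits)  # visits per visitor
    return [
        (visitor, title, count / user_totals[visitor])
        for (visitor, title), count in pair_counts.items()
    ]

# Toy records; the ids and titles are hypothetical.
visits = [
    ("u1", "Article A"), ("u1", "Article A"), ("u1", "Article B"),
    ("u2", "Article B"),
]
weights = visit_weights(visits)
for row in weights:
    print(row)
```

With this definition the weights of each user sum to 1, so a user who visited a single article once gets the same weight for it as a user who visited a single article a hundred times.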
The task is implemented using collaborative filtering with matrix factorization. This method works in a latent space to discover a user's unknown preferences for an item. The implementation takes predefined parameters: alpha (learning rate), beta (regularization parameter), k (dimension of the latent space) and the number of iterations. The test set is given as (user, article) pairs, and the prediction for each pair is the weight, i.e. the rating from the user to the article. To find the best article for a user, the output is then grouped by user and the maximum value is taken as that user's highest rating; the article with the highest rating is the user's best next article.
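A minimal sketch of this approach is given below, assuming the factorization is trained with stochastic gradient descent on the observed (user, article, weight) tuples; the source does not state the exact update rule, so this is one common variant rather than the project's code. The parameter names alpha, beta, k and iterations follow the text, while the function names, default values and toy data are hypothetical. Users or articles that never appear in the training data are not handled.

```python
import random

def predict(P, Q, user, item):
    """Predicted weight: dot product of the user and item latent vectors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))

def train_mf(ratings, k=2, alpha=0.05, beta=0.02, iterations=200, seed=0):
    """Factorize sparse (user, item, weight) ratings with SGD.

    alpha: learning rate, beta: L2 regularization, k: latent dimension.
    Returns two dicts mapping each user / item to its latent vector.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.uniform(0.0, 0.5) for _ in range(k)] for u in users}
    Q = {i: [rng.uniform(0.0, 0.5) for _ in range(k)] for i in items}
    for _ in range(iterations):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error with L2 penalty.
                P[u][f] += alpha * (err * qi - beta * pu)
                Q[i][f] += alpha * (err * pu - beta * qi)
    return P, Q

def best_article(P, Q, pairs):
    """Group (user, article) pairs by user and keep the highest prediction."""
    best = {}
    for user, item in pairs:
        score = predict(P, Q, user, item)
        if user not in best or score > best[user][1]:
            best[user] = (item, score)
    return {user: item for user, (item, _) in best.items()}

# Toy (visitor, pageTitle, weight) tuples; all values are made up.
ratings = [
    ("u1", "Article A", 0.6), ("u1", "Article B", 0.4),
    ("u2", "Article B", 0.7), ("u2", "Article C", 0.3),
    ("u3", "Article A", 0.5), ("u3", "Article C", 0.5),
]
P, Q = train_mf(ratings)
test_pairs = [("u1", "Article C"), ("u1", "Article B"), ("u2", "Article A")]
best = best_article(P, Q, test_pairs)
print(best)
```

Each pass over the data costs time proportional to the number of observed tuples times k, and the Python-level loops make that constant large, which is consistent with the speed concern for a dataset of this size.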
To evaluate the results, the dataset is split into a train set and a test set. The train set consists of the records up to 12.05, while the test set contains the one-day records from 12.05.
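A time-based split of this kind might look as follows. Several details are assumptions made only for the illustration: that the time column holds ISO-formatted timestamps, that "12.05" means 12 May, that the train set contains only records strictly before that day, and the year itself, which the source does not give. The function name `split_by_day` and the sample records are hypothetical.

```python
from datetime import date, datetime

# Placeholder year: the source only says "12.05".
TEST_DAY = date(2018, 5, 12)

def split_by_day(records, test_day):
    """Split (visitor, time, pageTitle) records by calendar day.

    Records strictly before test_day go to the train set,
    records on test_day go to the test set, later records are dropped.
    """
    train, test = [], []
    for visitor, time, title in records:
        day = datetime.fromisoformat(time).date()
        if day < test_day:
            train.append((visitor, time, title))
        elif day == test_day:
            test.append((visitor, time, title))
    return train, test

# Toy records with made-up values.
records = [
    ("u1", "2018-05-10 09:15:00", "Article A"),
    ("u1", "2018-05-11 23:59:59", "Article B"),
    ("u2", "2018-05-12 00:00:01", "Article A"),
]
train, test = split_by_day(records, TEST_DAY)
print(len(train), len(test))
```

Splitting by time rather than at random keeps the evaluation close to the real task, since the model only sees visits that happened before the day it is asked to predict.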
The provided model is slow for such a large dataset. A better option would be to use a library implementation of matrix factorization, or possibly to reimplement the model with TensorFlow.