Datathon 2020 – NetInfo Article Recommender – Newbies

The project aims to build a recommender system for the website Vesti.bg. The company wants to serve its huge customer base (as is clear from the given data!) in their best interests. To do that, it wants to recommend to its users the articles they should read next, based on mimicking their reading patterns. This is expected to save users a lot of the time they spend deciding what to read. Better and faster recommendations also increase people’s interest, which results in the company’s growth. Because the company has such a huge customer base, providing users with what they might like can really help it make good money.


10 thoughts on “Datathon 2020 – NetInfo Article Recommender – Newbies”

1. 0

This is a very well written article, with a clear description of the model and its motivation, as well as a clear description of the steps taken. The evaluation results are very high; yet, it is hard for me to make sense of them, as there is no comparison to alternative models, or at least to some simple baseline model.

Overall, this is an instance of content-based recommendation: we recommend to the user articles that are similar to articles s/he has seen in the past. How about collaborative filtering: 1. finding users with similar interests, and then 2. recommending articles such users have seen? And, obviously, also combination of both? E.g., as here (shameless self-reference): http://proceedings.mlr.press/v28/georgiev13.html
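The two collaborative-filtering steps named above can be sketched in a few lines. This is only a minimal illustration, not the method from the linked paper: the `histories` data, the `jaccard` and `recommend` helpers, and the choice of Jaccard similarity are all assumptions made for the example.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two sets of article ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, histories, k=2, n=3):
    """Recommend up to n unseen articles taken from the k most similar users."""
    seen = histories[target]
    # Step 1: find the users whose reading history is most similar.
    neighbours = sorted(
        (u for u in histories if u != target),
        key=lambda u: jaccard(seen, histories[u]),
        reverse=True,
    )[:k]
    # Step 2: score articles the target has not seen by the summed
    # similarity of the neighbours who read them.
    scores = Counter()
    for u in neighbours:
        sim = jaccard(seen, histories[u])
        for article in histories[u] - seen:
            scores[article] += sim
    return [a for a, s in scores.most_common(n) if s > 0]


# Hypothetical toy data: user id -> set of article paths read.
histories = {
    "u1": {"/a", "/b", "/c"},
    "u2": {"/a", "/b", "/d"},
    "u3": {"/a", "/c", "/e"},
    "u4": {"/x", "/y"},
}
print(recommend("u1", histories))
```

A hybrid could, for instance, average these scores with the content-based ones, though how to weight the two is a separate design question.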

1. 0

Thank you so much, Sir, for your suggestions.
The evaluation results are high because they were computed on the training data itself.
We will check out this example and try to combine content-based and collaborative filtering.

1. 0

Hmmm, this is unrealistically high. I was suspicious… OK, did you try to reserve part of the data for testing and evaluate on it?
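One simple way to reserve data for testing in a next-article setting is to hold out each user's most recent read and score predictions against it. The sketch below assumes chronologically ordered histories; `split_last_read`, `hit_rate`, and the "repeat the last article" baseline are hypothetical names and choices, not taken from the team's code.

```python
def split_last_read(histories):
    """Hold out each user's most recent article as the test target."""
    train, test = {}, {}
    for user, articles in histories.items():
        if len(articles) < 2:
            # Too short to split: keep it for training only.
            train[user] = list(articles)
            continue
        train[user] = list(articles[:-1])
        test[user] = articles[-1]
    return train, test


def hit_rate(predict, train, test):
    """Fraction of held-out articles that the predictor gets exactly right."""
    if not test:
        return 0.0
    hits = sum(predict(train[u]) == target for u, target in test.items())
    return hits / len(test)


def repeat_last(history):
    """Trivial baseline: predict that the user re-reads the last article."""
    return history[-1]


# Hypothetical, chronologically ordered reading histories.
histories = {"u1": ["/a", "/b", "/c"], "u2": ["/a", "/b"], "u3": ["/z"]}
train, test = split_last_read(histories)
print(hit_rate(repeat_last, train, test))
```

Running the real model and a trivial baseline through the same `hit_rate` call would also give the baseline comparison asked for earlier in this thread.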

1. 0

No, we did not.
Because we were short on time, we went straight to predictions.
We have checked it now: it was 77.8% on unseen data.

2. 1

1. Data Prep
1.1. What is the result of the describe method in Pandas?

1.2. Pandas is based on dictionaries and NumPy arrays (really fast and efficient). I am not sure why the team did not use Pandas’ industry-standard methods and instead spent time writing code to work with dictionaries.
Would you mind elaborating on this, please?

For example:
a = train['pagePath']
a = a.to_list()
a = set(a)
a = list(a)

This is just list(train['pagePath'].unique())
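The equivalence can be checked on a small frame. The `train` data below is a hypothetical stand-in for the real dataset; the only difference between the two versions is ordering, since `set()` gives an arbitrary order while `unique()` keeps the order of first appearance.

```python
import pandas as pd

# Hypothetical stand-in for the real `train` frame.
train = pd.DataFrame({"pagePath": ["/a", "/b", "/a", "/c", "/b"]})

# Step-by-step version from the article, condensed: order is arbitrary.
via_set = list(set(train["pagePath"].to_list()))

# One-liner: unique() preserves the order of first appearance.
via_unique = list(train["pagePath"].unique())

print(sorted(via_set) == sorted(via_unique))
print(via_unique)
```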

1.2. The dictionary could be formed by applying a function to the URL. You used the column “PagePath”, but said that the columns are:
[“User ID”, “Time Stamp”, “URL”, “Page Title” and “Page Views”]

I do not see how you derived this column list

1.3. I do not see sorting by timestamp in your code.

1.4. I do not see sorting by visitor in the code, so all the loops in the MODEL step are incorrect, since:

while(visitor[j] == visitor[j+1])

depends on the rows being ordered by visitor (probably User ID).
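A sketch of the sorting that points 1.3 and 1.4 ask for, under the assumption that the frame has visitor, timestamp, and page path columns. The column names `visitor` and `time` and the toy rows are hypothetical. Once rows are sorted, `groupby` yields each visitor's chronological history directly, so no index-comparison loop is needed.

```python
import pandas as pd

# Hypothetical rows, deliberately out of order.
df = pd.DataFrame({
    "visitor": ["u2", "u1", "u2", "u1", "u1"],
    "time": [3, 2, 1, 1, 3],
    "pagePath": ["/d", "/b", "/c", "/a", "/e"],
})

# Sort by visitor first, then by timestamp, so each visitor's rows
# are contiguous and in reading order.
df = df.sort_values(["visitor", "time"]).reset_index(drop=True)

# Collect each visitor's ordered history; groupby keeps row order
# within each group.
histories = df.groupby("visitor")["pagePath"].agg(list).to_dict()
print(histories)
```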

1.5. KNN usually stands for K-Nearest Neighbours … What is it in your case?

1.6. Please show a snippet of your data … this code is puzzling:

What are knn_data[1] and knn_data[0]?

2. MODEL

What are you predicting exactly? Why do you use a regression model?

1. 0

knn_data[0] holds the features.
knn_data[1] holds the labels.
The name is just a representation and has nothing to do with K-Nearest Neighbours.
While preparing the dictionary we forgot to use pandas; it can be used too.
We predict per user: we take all the articles read by the user as his/her history and predict the next one.
We used the window concept, as we do with time-series data.
The loop there helps us differentiate one user from another.
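The window idea described above might look roughly like the following. This is a reconstruction from the description, not the team's actual code: the helper names, the window size of 2, and the integer article ids are all assumptions.

```python
def make_windows(history, size=2):
    """Turn one user's chronological reads into (features, label) pairs."""
    return [
        (history[i:i + size], history[i + size])
        for i in range(len(history) - size)
    ]


def build_dataset(histories, size=2):
    """Build windows per user so no window spans two different users."""
    features, labels = [], []
    for history in histories.values():
        for window, next_article in make_windows(history, size):
            features.append(window)
            labels.append(next_article)
    return features, labels


# Hypothetical histories with articles encoded as integer ids.
histories = {"u1": [1, 2, 3, 4], "u2": [5, 6]}
knn_data = build_dataset(histories)
print(knn_data[0])  # features
print(knn_data[1])  # labels
```

Note that a user with fewer reads than the window size plus one contributes no training pairs in this sketch.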

3. 0

I suggest using NLP to derive topics or categories from the titles (even if they are in Bulgarian).
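As a much simpler stand-in for real topic extraction, titles can be compared with TF-IDF weighted word overlap. The sketch below is not topic modelling: it only measures title similarity, and the sample titles and helper names are hypothetical. The tokenizer relies on `\w` being Unicode-aware in Python 3, so Cyrillic titles would be split into words the same way.

```python
import math
import re
from collections import Counter


def tokenize(title):
    """Lowercase and split into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", title.lower())


def tfidf_vectors(titles):
    """Return one sparse {term: weight} dict per title."""
    docs = [Counter(tokenize(t)) for t in titles]
    n = len(docs)
    # Document frequency: number of titles containing each term.
    df = Counter(term for doc in docs for term in doc)
    return [
        {term: count * math.log(n / df[term]) for term, count in doc.items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical titles, in English for readability.
titles = [
    "Football team wins cup final",
    "Cup final ends in football upset",
    "Parliament votes on new budget",
]
vecs = tfidf_vectors(titles)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```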

4. 0

Very nice work, overall. Here are some notes:

First, I second Zenpanik’s pandas recommendations. Also, if you can’t load the CSV into memory with pandas, consider reading columns with repetitive data (such as the article URL) as categorical when calling pd.read_csv.
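A small sketch of the categorical suggestion, using the `dtype` argument of `pd.read_csv`. The inline CSV text and the column names are hypothetical; with the real file you would pass its path instead of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "visitor,pagePath\nu1,/a\nu2,/a\nu1,/b\nu3,/a\n"

# Read the repetitive column as categorical: each distinct URL is
# stored once, and rows hold small integer codes instead of strings.
df = pd.read_csv(io.StringIO(csv_text), dtype={"pagePath": "category"})

print(df["pagePath"].dtype)
print(df.memory_usage(deep=True))
```

The saving only becomes noticeable when a column has many rows but relatively few distinct values.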

I found it a bit hard to understand what you were doing.
If I understand correctly, you’re creating an embedding of the articles, using a sliding window over the user’s past reads, and then using a Random Forest Regressor to classify it? If so, why do you recommend the same article again to a user with only one past read? Are you creating a separate Random Forest Regressor for every user?

If that was indeed your approach, it is equivalent to performing singular value decomposition (SVD) on the history matrix.
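For reference, SVD on a history matrix can be sketched with NumPy as below. The toy user-by-article matrix, the rank `k = 2`, and the way a recommendation is read off the reconstruction are all assumptions for illustration; this does not demonstrate the claimed equivalence, only what the SVD side of it looks like.

```python
import numpy as np

# Hypothetical history matrix: rows are users, columns are articles,
# 1.0 means the user read the article.
history = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Thin SVD, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(history, full_matrices=False)
k = 2
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

# For one user, rank the unseen articles by their reconstructed score.
user = 0
unseen = np.flatnonzero(history[user] == 0)
best = unseen[np.argmax(approx[user, unseen])]
print(best)
```

Here user 0 is steered towards the article that the most similar user (row 1) also read, which is the low-rank structure doing the work of the neighbour search.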

1. 0