
Datathon 2020 – NetInfo Article Recommender – Newbies

The project aims to build a recommender system for the website Vesti.bg. The company wants to serve its large user base (evident from the provided data) in its users' best interests. To do that, it wants to recommend the articles each user should read next, based on that user's reading pattern. This is expected to save users time deciding what to read. Better and faster recommendations also increase readers' interest, which in turn supports the company's growth and revenue.


10 thoughts on “Datathon 2020 – NetInfo Article Recommender – Newbies”

  1. 0
    votes

    This is a very well-written article, with a clear description of the model and its motivation, as well as of the steps taken. The evaluation results are very high; yet it is hard for me to make sense of them, as there is no comparison to alternative models, or at least to a simple baseline model.

    Overall, this is an instance of content-based recommendation: we recommend to the user articles that are similar to articles s/he has seen in the past. How about collaborative filtering: (1) finding users with similar interests, and then (2) recommending articles that such users have seen? And, obviously, also a combination of both? E.g., as here (shameless self-reference): http://proceedings.mlr.press/v28/georgiev13.html
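
    The two collaborative-filtering steps mentioned above can be illustrated with a minimal sketch. It assumes a small, made-up user-by-article read matrix and cosine similarity between users; the `history` data and the `recommend` helper are hypothetical and are not taken from the team's code or from the linked paper.

```python
import numpy as np

# Hypothetical toy data: rows are users, columns are articles,
# 1 means the user has read the article.
history = np.array([
    [1, 1, 0, 0],   # user 0
    [1, 1, 1, 0],   # user 1
    [0, 0, 1, 1],   # user 2
], dtype=float)

def recommend(history, user, k=1):
    """Recommend up to k unread articles for `user`, scored by similar users' reads."""
    norms = np.linalg.norm(history, axis=1)
    norms[norms == 0] = 1.0              # avoid division by zero for empty histories
    # Step 1: cosine similarity between the target user and every user.
    sims = history @ history[user] / (norms * norms[user])
    sims[user] = 0.0                     # ignore self-similarity
    # Step 2: score each article by the similarity-weighted reads of other users.
    scores = sims @ history
    scores[history[user] > 0] = -np.inf  # never recommend what was already read
    ranked = np.argsort(-scores)[:k]
    return [int(a) for a in ranked if np.isfinite(scores[a])]

print(recommend(history, user=0))  # [2]
```

    In this toy example user 0 is most similar to user 1, so the only article user 1 has read that user 0 has not (article 2) is recommended. A hybrid approach could combine these scores with content-based similarity scores.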

    1. 0
      votes

      Thank you so much, Sir, for your suggestions.
      The evaluation results are high because they were computed on the training data itself.
      Sir, we will check out this example and try to combine content-based and collaborative filtering.

  2. 1
    votes

    1. Data Prep
    1.1. What is the result of the describe method in Pandas?

    1.2. Pandas is built on dictionaries and NumPy arrays (really fast and efficient), so I am not sure why the team did not use the industry-standard Pandas methods
    and instead spent time writing code to work with dictionaries.
    Would you mind elaborating on this, please?

    For example:
    ```
    a = train['pagePath']
    a = a.to_list()
    a = set(a)
    a = list(a)
    ```

    This is just list(train['pagePath'].unique())

    1.2. The dictionary could be built by applying a function to the URL. You used the column “PagePath” but said that the columns are:
    [“User ID”, “Time Stamp”, “URL”, “Page Title” and “Page Views”]

    I do not see how you derived this column list.

    1.3. I do not see sorting by timestamp in your code.

    1.4. I do not see sorting by visitor in the code, so all the loops you are doing in the MODEL step are not correct, since:

    while(visitor[j] == visitor[j+1])

    depends on visitor (probably User ID)

    1.5. KNN usually stands for K-Nearest Neighbours … What is it in your case?

    1.6. Please show a snippet of your data … this code is puzzling:

    What are knn_data[1] and knn_data[0]?

    2. MODEL

    What are you predicting exactly? Why do you use a regression model?

    3. Please include charts and samples from your data.
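
    Points 1.3 and 1.4 above (sorting by timestamp and by visitor) could be addressed in pandas roughly as follows. This is only a sketch on made-up data: the column names "visitor" and "timestamp" are assumptions, and only "pagePath" appears in the article's code.

```python
import pandas as pd

# Hypothetical sample; "visitor" and "timestamp" are assumed column names.
train = pd.DataFrame({
    "visitor":   ["u2", "u1", "u2", "u1", "u1"],
    "timestamp": [30, 10, 5, 40, 20],
    "pagePath":  ["/c", "/a", "/b", "/d", "/b"],
})

# Sort by visitor, then by time, so each visitor's rows are contiguous
# and in reading order.
train = train.sort_values(["visitor", "timestamp"]).reset_index(drop=True)

# One chronologically ordered reading history per visitor, without manual loops.
histories = train.groupby("visitor")["pagePath"].agg(list).to_dict()
print(histories)

# Unique articles, as suggested in point 1.2.
articles = list(train["pagePath"].unique())
print(articles)
```

    After the sort, a check such as visitor[j] == visitor[j+1] compares neighbouring rows of the same visitor, which is what the loop appears to rely on.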

    1. 0
      votes

      knn_data[0] holds the features.
      knn_data[1] holds the labels.
      It is just a name and has nothing to do with K-Nearest Neighbours.
      While preparing the dictionary we forgot to use pandas; it can be used too.
      We predict per user: we take all the articles read by the user as his/her history and predict the next one.
      We used the window concept, as is done with time-series data.
      The loop there helps us separate one user from another.
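
      The window idea described in this reply might look like the following sketch. The article IDs, the window size of 2, and the `make_windows` helper are hypothetical; the team's actual code may differ.

```python
def make_windows(history, window=2):
    """Turn one user's ordered history into (features, label) pairs."""
    features, labels = [], []
    for i in range(len(history) - window):
        features.append(history[i:i + window])   # the `window` articles read in a row
        labels.append(history[i + window])       # the article read right after them
    return features, labels

# Hypothetical article IDs for two users, already in reading order.
histories = {"u1": [3, 7, 1, 4], "u2": [7, 1]}

# knn_data[0] = features, knn_data[1] = labels, following the naming in the reply.
knn_data = [[], []]
for user, history in histories.items():
    X, y = make_windows(history)   # windows never cross from one user to another
    knn_data[0].extend(X)
    knn_data[1].extend(y)

print(knn_data)  # [[[3, 7], [7, 1]], [1, 4]]
```

      Note that a user with fewer reads than the window size plus one (u2 here) contributes no training pairs in this sketch.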

  3. 0
    votes

    Very nice work, overall. Here are some notes:

    First, I second Zenpanik's pandas recommendations. Also, if you can’t load the CSV into memory with pandas, consider setting columns with repetitive data (such as the article URL) as categorical when calling pd.read_csv.
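
    A minimal sketch of the categorical suggestion, assuming hypothetical column names and a tiny in-memory CSV in place of the real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV; the column names are assumptions.
csv_text = "visitor,pagePath\nu1,/a\nu2,/a\nu1,/b\nu3,/a\n"

# Repetitive string columns are stored once as categories plus small integer codes.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"visitor": "category", "pagePath": "category"},
)

print(df.dtypes)
print(df.memory_usage(deep=True))
```

    On a large file with many repeated URLs, the categorical columns typically take far less memory than plain string columns.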

    For me, it was a bit hard to understand what you were doing.
    If I understand correctly, you’re creating an embedding of the articles, using a sliding window over the user's past reads, and then using a Random Forest Regressor to classify it? If so, why do you recommend the same article again to a user with only one past read? Are you creating a random forest regressor for every user separately?

    If that was indeed your approach, it is equivalent to performing singular value decomposition (SVD) on the history matrix.
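
    For reference, a truncated SVD of a user-by-article history matrix can be sketched as below. The matrix is made up and the rank k = 2 is an arbitrary choice; this only shows what the SVD alternative looks like and does not demonstrate the claimed equivalence.

```python
import numpy as np

# Hypothetical history matrix: rows are users, columns are articles.
history = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

# Full decomposition: history = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(history, full_matrices=False)

k = 2                                  # number of latent factors kept (arbitrary here)
user_factors = U[:, :k] * s[:k]        # one k-dimensional embedding per user
article_factors = Vt[:k].T             # one k-dimensional embedding per article

# Rank-k reconstruction; high values at unread positions are candidate recommendations.
approx = user_factors @ article_factors.T
print(np.round(approx, 2))
```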
