AN ARTICLE RECOMMENDER PROJECT
Introduction:
Recommender systems are our favourite example of Data Science in action: they help people, in this busy world, choose what suits them best based on their interests. They provide a scalable way of personalizing content for users, inferred largely from their history rather than from anything they explicitly tell the system.
Methodology in use- CRISP-DM:
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is an open standard process model that describes common approaches used by data mining experts.
This process breaks down into six phases:
- Business understanding
- Data Understanding
- Data preparation
- Modelling
- Evaluation
- Deployment
Business understanding:
- OBJECTIVE:
To build a model that automatically recommends the best article to read by analysing a user's history. The model will spare users the effort of deciding what to read next and will strengthen the customer base of the website Vesti.bg.
We aim to make more, better and faster recommendations, since a user is then more likely to read them: studies show that an article has a higher probability of being read when it sits at the top of the page and when recommendations are more numerous and arrive faster.
- PROJECT PLAN:
The model should capture the hidden traits that influence reading patterns and observe their characteristics.
The primary task is to track certain events/activities performed by the user.
We then build a recommender system that predicts the best article for a user's next day, so that the user does not have to think much about what to read. Providing services like this improves customer bonding and thus reduces the risk of a customer switching to a rival website.
- INVENTORY OF RESOURCES:
- We are a team of two members: Shalin Jain and Tanu Agarwal
- 30 days of data from different users, split into 26 buckets, each containing around 800-900 thousand rows (pretty huge!)
- Windows
- Anaconda, Jupyter Notebook, Python, Sci-Kit Learn
- BUSINESS SUCCESS CRITERIA:
The business is expected to benefit from the stronger customer bonding such a system creates, for example:
- Trust: Increases user confidence in the website.
- Effectiveness: Helps the user make a better decision.
- Persuasiveness: Convinces the user to read.
- Efficiency: Helps the user make a faster decision.
- Satisfaction: Increases ease of use and enjoyment.
Data Understanding:
The data is provided in the form of .csv files. It contains the features ["User ID", "Time Stamp", "URL", "Page Title" and "Page Views"].
Retrieving the data itself turned out to be a task as it contains around 21 million rows. The dataset carries the history of users (user details under ‘Visitor’) for 30 days in the form of links to the articles (under ‘page path’) that are read by them along with the time (under ‘time’) at which they did that.
Even though the data provided is almost completely sorted, we still need to perform certain tasks like variable transformation, dealing with missing values and outliers.
- In variable transformation, we realized that each URL carries its own unique number, which led us to build a dictionary of URLs and numbers so that the numbers can represent the links. The right variable transformation can minimize computational resources, and performing a task like this on data this large can be challenging.
- No outliers or missing values have been found so far. False triggers have been removed.
Data Preparation:
- Formation of Dictionary
This holds all the key-link pairs. To build the dictionary, we first extracted all unique links from the dataset and then extracted the 7-digit code from each URL. We saved the code as the key and the rest of the URL as the link. The dictionary is stored in a pandas DataFrame; a toy example of the split follows below.
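A toy illustration of the split, using a made-up URL (the real links come from the pagePath column of the dataset):
url = "https://www.vesti.bg/sample-article-1234567"   # made-up URL, for illustration only

key = url[-7:]    # "1234567" -> seven-digit article code, stored as the key
link = url[:-7]   # "https://www.vesti.bg/sample-article-" -> rest of the URL, stored as the link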
- Removing the Bulgarian Language
The second column of the data was not useful to us, as it contained the page title written in Bulgarian, so we dropped it to free up some memory.
- Extracting the Link Code
We used the pagePath column as our main training data, so we converted every link into its unique seven-digit code. This spared us from using embeddings and saved a lot of time and memory. The seven-digit code is what the model uses to predict the next key.
- Data Featuring Functions – Label Feature, Window, Search
Several customized functions are used in the model. Since the data is a time series, we windowed it and then extracted labels and features.
WINDOW:
This function takes a series of data and converts it into n windows of the given window size, where n = (length of the series − window_size + 1). It returns a 2-D NumPy array.
LABEL FEATURE:
This function takes a 2-D array as input. It splits every row of the array so that the last element of the row becomes the label while the rest of the elements become the features. It returns a tuple containing the features and the labels.
SEARCH:
This function takes a value as an input. It searches for that value in the Dictionary and returns the complete URL linked with that value.
- Lists – [Visitor's ID and User Recommendation]
These two lists hold our solution. Visitor's ID keeps track of every visitor's name, while User Recommendation stores the prediction for that user.
- Formation of Features and Labels
Here we kept the default window size at 11. If the number of articles read by a user is less than 11, the window size shrinks accordingly. We apply the Window and Label Feature functions to the extracted link-code series to build our training dataset; a toy illustration follows below.
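A toy illustration (not from the original notebook) of what the Window and Label Feature functions defined later in the notebook produce for a short, made-up history of article codes:
history = [1000001, 1000002, 1000003, 1000004, 1000005]   # made-up article codes

windows = window(history, window_size=3)
# array([[1000001., 1000002., 1000003.],
#        [1000002., 1000003., 1000004.],
#        [1000003., 1000004., 1000005.]])

features, labels = label_feature(windows)
# features: the first two codes of each window
# labels:   the code that was read immediately afterwards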
Modelling:
MAJOR MODELLING TECHNIQUE: Random Forest Regressor from scikit-learn.
A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
MODELLING ASSUMPTIONS:
Since the training data is a time series, we assume a feature window of 10 articles when a user's history contains more than 10 articles; otherwise the window is "the number of articles read − 1". There is no missing data in the dataset.
TEST DESIGN:
- The data is divided in such a way that the first 10 values are the features (X) and the very next value is the target (Y).
- Training Set is the 30 days history of the user which is windowed using Window and Label Feature Function.
- The last window for each visitor is used as the test data for that user.
- We assume that if no article can be found for a person, we suggest that he/she read the last article again.
PARAMETER SETTING:
All parameters are left at their scikit-learn defaults: n_estimators = 100, max_depth = None, max_leaf_nodes = None, min_samples_split = 2; only verbose is set to 2.
A global variable (j) keeps track of the current row number. An explicit version of this parameter setting is sketched below.
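For reference, a sketch of the equivalent explicit construction of the regressor (every argument except verbose matches the scikit-learn default):
from sklearn.ensemble import RandomForestRegressor

# Explicit version of the regressor used in the model loop below;
# all arguments except verbose are scikit-learn defaults.
rfr = RandomForestRegressor(
    n_estimators=100,
    max_depth=None,
    max_leaf_nodes=None,
    min_samples_split=2,
    verbose=2,
)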
MODEL:
We built the Model using Jupyter Notebook.
This Notebook contains the Process of Data Preparation, Various Functions used and the Main Model.
Article Recommender System for Vesti.bg
This project aims to build a recommender system for Vesti.bg, a website that delivers value to its customers by providing them with the best articles matching their interests. It has a huge customer base, so recommending articles that users are likely to enjoy can translate directly into revenue.
Timeline:
Importing Libraries
General Data Processing
Setting the Global variables for the Model
Formation of Dictionary
Creating Functions
Model
Evaluation and Predictions
Importing Libraries and General Data Processing
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
import seaborn as sns
Setting the Global variables
name = input() # name : ("00" to "25")
train = pd.read_csv('vest0000000000' + name + '.csv')
train = train.drop(['pageTitle'],axis = 1)
train.head()
train.describe()
a = train['pagePath']
a = a.to_list()
a = set(a)                     # unique page paths
a = list(a)
j = 0                          # global row pointer used by the model loop
VisitorName = []               # solution list: visitor IDs
User_recommendation = []       # solution list: predicted next article per visitor
path = train['pagePath']
visitor = train['visitor']
n = len(visitor)               # number of rows in this bucket
Formation of Dictionary
key = []
link = []
for i in a:
    c = i[-7]
    if (ord(c) > 47 and ord(c) < 59):   # seventh-from-last character looks numeric
        key.append(i[-7:])              # seven-digit article code
        link.append(i[:-7])             # rest of the URL
dictionary = pd.DataFrame({'key':key,'url':link})
Creating Functions
#Data Creation Functions
def window(a,window_size=1): # Takes a 1-D Series and a number
    b = len(a)
    c = b-window_size
    data = np.zeros([c+1,window_size])
    for i in range(c+1):
        data[i] = a[i:i+window_size]
    return data
def label_feature(a): # Takes a numpy 2-D Array
    m = a.shape[1]
    copy_a = a.copy()
    copy_a = copy_a.T
    y = copy_a[m-1]
    x = copy_a[0:m-1].T
    return (x,y)
#Search in a dictionary
def Search(value):
    if (dictionary[dictionary['key'] == str(value)]['url'].index.shape[0] == 0):
        return "Not_Found"
    index = dictionary[dictionary['key'] == str(value)]['url'].index[0]
    link = dictionary['url'][index]+dictionary['key'][index]
    return link
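A quick, purely illustrative check of the Search function with a made-up key (which may or may not be present in the data):
example_key = "1234567"        # made-up seven-digit code
print(Search(example_key))     # prints the reconstructed URL, or "Not_Found"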
Model
while(j < n-1):
    x = []        # article codes read by the current visitor
    total = 1     # number of rows belonging to the current visitor
    # collect all rows of the same visitor
    while(visitor[j] == visitor[j+1]):
        c = path[j][-7]
        d = path[j][-1]
        s = path[j][-2]
        if (ord(c) > 47 and ord(c) < 59 and ord(d) < 59 and ord(d) > 47 and ord(s) < 59 and ord(s) > 47):
            x.append(int(path[j][-7:]))
        total = total + 1
        j = j + 1
    # handle the visitor's last row
    e = path[j][-7]
    f = path[j][-1]
    t = path[j][-2]
    if (ord(e) > 47 and ord(e) < 59 and ord(f) < 59 and ord(f) > 47 and ord(t) < 59 and ord(t) > 47):
        x.append(int(path[j][-7:]))
    j = j + 1
    # window the visitor's history and split it into features and labels
    window_size = min(total, 11)
    knn_data = window(x, window_size)
    knn_data = label_feature(knn_data)
    if (knn_data[0].shape[1] == 0 or knn_data[0].shape[0] == 0):
        # not enough history to build features: fall back to the last article
        VisitorName.append(visitor[j-1])
        if (knn_data[1].shape[0] == 0):
            link = path[j-1]
            User_recommendation.append(link)
        else:
            link = Search(int(knn_data[1][0]))
            User_recommendation.append(link)
    else:
        # fit a per-visitor random forest and predict the next article code
        rfr = RandomForestRegressor(verbose = 2)
        rfr.fit(knn_data[0], knn_data[1])
        last = max(-10, -(total-1))
        checker = np.array(x[last:])
        checker = checker.reshape(1, -1)
        array = rfr.predict(checker)
        value = int(array[0])
        link = Search(value)
        VisitorName.append(visitor[j-1])
        User_recommendation.append(link)
    print(j)    # progress: current row number
Evaluation and Predictions
print(j)
len(VisitorName)
len(User_recommendation)
solution = pd.DataFrame({'VistorID':VisitorName,'NextURL':User_recommendation})
solution.head()
solution.describe()
file_name = "Solutin"+name+".csv"
solution.to_csv(file_name,index = False)
DONE !!
MODEL DESCRIPTION:
Our model keeps track of the various clusters that can be recommended one after the other, but works separately for each visitor. The main model splits a particular visitor's data into windows and targets and then predicts his/her choice accordingly. The model works on the principle of content-based filtering.
DETAILED EXPLANATION OF THE MODEL:
The model runs in a while loop that ends when j has traversed the whole bucket. It uses two local variables: x (a list of the articles read by a user) and total (the total number of articles read by that user).
We traverse the visitor list and extract the links for one user at a time. We define the window size and create the data using the Window and Label_feature functions. If a person has read only one article, we recommend that he/she read it again; otherwise we use a Random Forest Regressor to predict the next key. We pass that key to the Search function, which returns the recommended link.
MODEL ASSESSMENT:
Our model worked well, with an accuracy of 95.4% on the training set. It managed to predict the next article for 98% of our users. We used MSPE (Mean Squared Percentage Error) as the loss metric and got a loss of 1.7%.
These numbers gave us confidence that the model performs well and can be used for prediction; a sketch of how such a metric is typically computed follows below.
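The evaluation code is not shown in this write-up; below is a minimal sketch of one common way to compute MSPE, assuming NumPy arrays of true and predicted article codes (illustrative, not the exact code we ran):
import numpy as np

def mspe(y_true, y_pred):
    # Mean Squared Percentage Error, expressed as a percentage
    # (one common definition; purely illustrative).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(((y_true - y_pred) / y_true) ** 2) * 100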
Evaluation:
Once the test data is prepared using the last window for each user, we predict a key with our model. The model runs through the row numbers, so as soon as it sees a change in the visitor ID, it appends the visitor ID to the visitors list and the prediction to the User_recommendation list. The model took an average of 10 minutes to go through 1 million rows.
The Next Day Prediction for every visitor is in the following link-
https://drive.google.com/open?id=1mjGJYkLzLgs3Et10So-MCpSImeYMV7di
This link contains 26 files from solution0.csv to solution25.csv. The nth file contains the predictions for the visitors present in the nth bucket of the Full DataSet.
Deployment:
We ran the algorithm manually, without the loop, on the training data to detect errors (and, as expected, found some, which we corrected by hand) and arrived at the final model, which can handle all the possible cases. The predicted article is then stored in a DataFrame along with the visitor's name and exported as a CSV file.
10 thoughts on “Datathon 2020 – NetInfo Article Recommender – Newbies”
This is a very well written article, with clear description of the model and its motivation, as well as with clear description of the steps taken. The evaluation results are very high; yet, it is hard for me to make sense of them, as there is no comparison to alternative models, or at least to some simple baseline model.
Overall, this is an instance of content-based recommendation: we recommend to the user articles that are similar to articles s/he has seen in the past. How about collaborative filtering: 1. finding users with similar interests, and then 2. recommending articles such users have seen? And, obviously, also combination of both? E.g., as here (shameless self-reference): http://proceedings.mlr.press/v28/georgiev13.html
Thank You so much Sir for your suggestions.
The evaluation results are high because they were taken on the training data itself.
Sir, we will check out this example and try to combine both Content-Based and Collaborative filtering
Hmmm, this is unrealistically high. I was suspicious… OK, did you try to reserve part of the data for testing and evaluate on it?
No, we did not.
Actually, due to lack of time, we went straight on to making predictions.
We checked it now. So it was 77.8% on unseen data
ok, noted.
1. Data Prep
1.1. What is the result from describe method in Pandas?
1.2. Since Pandas is based on dictionaries and numpy arrays (really fast and efficient), I am not sure why the team did not use Pandas' standard methods and spent time writing code to work with dictionaries.
Would you mind elaborating more on this, please?
For example:
a = train['pagePath']
a = a.to_list()
a = set(a)
a = list(a)
This is just list(train['pagePath'].unique())
1.2. Formation of the dictionary could be done by applying a function to the URL – You used column "PagePath" but said that the columns are:
[“User ID”, “Time Stamp”, “URL”, “Page Title” and “Page Views”]
I do not see how you derived this column list
1.3. I do not see sorting by timestamp in your code.
1.4. I do not see sorting by visitor in the code – so all the loops you are doing in the MODEL step are not correct since:
while(visitor[j] == visitor[j+1])
depends on visitor (probably User ID)
1.5. KNN usually stands for K-Nearest Neighbours … What is it in your case?
1.6. Please show a snippet of your data … this code is puzzling:
What are knn_data[1] and knn_data[0]?
2. MODEL
What are you predicting exactly? Why do you use a regression model?
3. Please include charts and samples from your data.
knn_data[0] means features
knn_data[1] means labels.
It is just a representation and has nothing to do with K-Nearest Neighbours.
While preparing the dictionary we forgot to use pandas; it could be used too.
We are predicting per user: we take all the articles read by the user as his/her history and predict the next one.
We used the window concept, as one does with time-series data.
The loop there helps us differentiate one user from another.
I suggest you use NLP to derive topics or categories from the titles (even if they are in Bulgarian).
Very nice work, overall. Here are some notes:
First, I join Zenpanik in the pandas recommendations. Also, if you can't load the CSV into pandas memory, consider setting columns with repetitive data (such as the article URL) as categorical while using pd.read_csv.
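A minimal sketch of this suggestion (the column name pagePath and the file name pattern are taken from the notebook above):
import pandas as pd

# Store the highly repetitive pagePath column as a categorical dtype while
# reading, which can noticeably reduce memory usage for large buckets.
train = pd.read_csv('vest000000000000.csv', dtype={'pagePath': 'category'})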
For me, it was a bit confusing to understand what you were doing.
If I understand correctly, you're creating an embedding of the articles using a sliding window on the user's past reading, and then using a Random Forest Regressor to classify it? If so, why do you recommend to a user with only one past read the same article again? Are you creating a random forest regressor for every user separately?
If that indeed was your approach, it is roughly equivalent to performing singular value decomposition (SVD) on the history matrix.
Yup, that was the approach. Thank you for the notes.
We will consider them and try to improve the model.