AN ARTICLE RECOMMENDER PROJECT
Recommender systems are our favourite example of data science in action: in this busy and cynical world, they help people choose what is best for them based on their interests. They provide a scalable way of personalizing content for users based purely on their history, without the users having to tell the system anything.
Methodology in use: CRISP-DM
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is an open standard process model that describes the common approaches used by data mining experts.
This process breaks down into six phases:
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
Our goal is to build a model that automatically recommends the best article to read by analyzing a user's reading history. The model spares users the effort of deciding what to read next and should strengthen the customer base of the website Vesti.bg.
We aim to make more, better and faster recommendations, because a user is then more likely to read the recommended article. Studies indicate that an article is more likely to be read when it appears at the top of the page.
- PROJECT PLAN:
The model should capture the hidden traits that influence reading patterns and observe their characteristics.
The primary task is to track certain events/activities performed by the user.
We then build a recommender system model that predicts the best article for a user to read the next day, so that the user does not have to think much about what to read. Providing services like this improves customer bonding and thus reduces the risk of a customer switching to a rival website.
- INVENTORY OF RESOURCES:
- We are a team of two members: Shalin Jain and Tanu Agarwal
- 30 days of data for different users, in 26 buckets each containing around 800-900 thousand rows (PRETTY HUGE!)
- Anaconda, Jupyter Notebook, Python, scikit-learn
- BUSINESS SUCCESS CRITERIA:
The business is expected to benefit from stronger customer bonding in the following ways:
- Trust: Increases user confidence in the website.
- Effectiveness: Helps the user make a better decision.
- Persuasiveness: Convinces the user to read.
- Efficiency: Helps the user make a faster decision.
- Satisfaction: Increases ease of use or enjoyment.
The data is provided in the form of a .csv file. It contains five features: [“User ID”, “Time Stamp”, “URL”, “Page Title” and “Page Views”].
Retrieving the data itself turned out to be a task, as it contains around 21 million rows. The dataset carries 30 days of user history (user details under ‘Visitor’) in the form of links to the articles each user read (under ‘page path’), along with the time at which they read them (under ‘time’).
Even though the data provided is almost completely sorted, we still need to perform certain tasks like variable transformation, dealing with missing values and outliers.
- In variable transformation, we noticed that each URL contains its own unique number. This led us to build a dictionary of URLs and numbers, so that the numbers can represent the links. The right variable transformation can minimize the computational resources needed, although performing a task like this on such a large dataset can be challenging.
- No outliers or missing values have been found so far. False triggers have been removed.
- Formation of Dictionary
This holds all the key-link pairs. To build the dictionary, we first extracted all the unique links from the dataset and then extracted the 7-digit code from each URL. We saved the code as the key and the rest of the URL as the link. The dictionary is then stored in a pandas DataFrame.
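The dictionary-building step can be sketched as follows. This is a minimal illustration, not the authors' code: the example page paths, the column names `key` and `link`, and the assumption that the article code is the only standalone run of exactly seven digits in a URL are all hypothetical. For simplicity the sketch stores the full URL as the link.

```python
import re

import pandas as pd

# Hypothetical page paths; the real URL layout on Vesti.bg may differ.
page_paths = pd.Series([
    "/sviat/primerna-statia-1234567",
    "/sport/druga-statia-7654321",
    "/sviat/primerna-statia-1234567",  # repeated visit to the same article
])

# Assumption: the article code is a standalone run of exactly 7 digits.
CODE_PATTERN = re.compile(r"(?<!\d)(\d{7})(?!\d)")


def build_link_dictionary(paths: pd.Series) -> pd.DataFrame:
    """Map each unique URL to its 7-digit code (one row per unique link)."""
    rows = []
    for link in paths.drop_duplicates():
        match = CODE_PATTERN.search(link)
        if match is None:
            continue  # skip links without a recognisable code
        rows.append({"key": int(match.group(1)), "link": link})
    return pd.DataFrame(rows, columns=["key", "link"])


link_dictionary = build_link_dictionary(page_paths)
print(link_dictionary)
```

Deduplicating before the regex search keeps the work proportional to the number of distinct articles rather than the number of rows, which matters on a dataset of this size.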
- Removing the Bulgarian Language
The second column of the data held the page title, written in Bulgarian. We did not use it in the model, so we dropped it to free up some memory.
- Extracting the Link Code
We used the Page Path column as our main training data, converting every link into its unique seven-digit code. This let us avoid using embeddings, which saved a lot of time and data. The model uses these seven-digit codes to predict the next key.
- Data Featuring Functions – Label Feature, Window, Search
The model uses several custom functions. Since this is time-series data, we windowed it and then extracted labels and features.
The Window function takes a series of data and converts it into n windows of the given window size, where n = length of the series - window_size + 1. It returns a 2-D NumPy array.
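A windowing function with this behaviour might look like the sketch below. The name `window` and the error handling are assumptions; only the input/output contract comes from the description above.

```python
import numpy as np


def window(series, window_size):
    """Return all overlapping windows of `window_size` as a 2-D array.

    A series of length m yields m - window_size + 1 rows.
    """
    values = np.asarray(series)
    n_windows = len(values) - window_size + 1
    if window_size < 1 or n_windows < 1:
        raise ValueError("window_size must be between 1 and len(series)")
    return np.stack([values[i:i + window_size] for i in range(n_windows)])


codes = [11, 22, 33, 44, 55]
windows = window(codes, window_size=3)
# 5 - 3 + 1 = 3 rows: [11 22 33], [22 33 44], [33 44 55]
print(windows)
```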
The Label Feature function takes a 2-D array as input. It splits every row of the array so that the last element of the row becomes the label and the remaining elements become the features. It returns a tuple containing the labels and the features.
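The split into labels and features can be sketched like this. The function name `label_feature` and the validation are assumptions; the return order (labels first, then features) follows the description.

```python
import numpy as np


def label_feature(windows):
    """Split each row into features (all but the last value) and a label (last value).

    Returns a tuple (labels, features).
    """
    windows = np.asarray(windows)
    if windows.ndim != 2 or windows.shape[1] < 2:
        raise ValueError("expected a 2-D array with at least two columns")
    features = windows[:, :-1]
    labels = windows[:, -1]
    return labels, features


labels, features = label_feature([[11, 22, 33], [22, 33, 44]])
print(labels)    # last column of each row
print(features)  # remaining columns of each row
```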
The Search function takes a value as input. It looks up that value in the dictionary and returns the complete URL linked to it.
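A lookup of this kind could be sketched as below. The example dictionary, the column names `key` and `link`, the rounding of the input and the `None` return for unknown keys are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical dictionary with the assumed column names "key" and "link".
link_dictionary = pd.DataFrame({
    "key": [1234567, 7654321],
    "link": ["/sviat/primerna-statia-1234567", "/sport/druga-statia-7654321"],
})


def search(value, dictionary=link_dictionary):
    """Return the URL stored for `value`, or None if the key is unknown."""
    # A regressor returns floats, so round to the nearest integer code first.
    key = int(round(value))
    matches = dictionary.loc[dictionary["key"] == key, "link"]
    return matches.iloc[0] if not matches.empty else None


print(search(7654321))  # known key
print(search(42))       # unknown key
```

Returning `None` for an unknown key lets the caller decide on a fallback, such as suggesting the last article read.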
- Lists – [Visitor’s ID and User Recommendation ]
These lists form our solution. The Visitor’s ID list keeps track of every visitor, while the User Recommendation list stores the prediction for that visitor.
- Formation of Features and Labels
Here we kept the default window size at 11, which gives 10 features and 1 label per window. If a user has read fewer than 11 articles, the window size shrinks accordingly. We apply the Window and Label Feature functions to the extracted link code series to obtain our training dataset.
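The arithmetic behind the window size can be made concrete with a small sketch. The helper names are hypothetical; the default of 11 and the shrinking rule come from the text.

```python
DEFAULT_WINDOW_SIZE = 11  # 10 feature articles + 1 label article


def choose_window_size(n_articles, default=DEFAULT_WINDOW_SIZE):
    """Shrink the window when a visitor has read fewer articles than `default`."""
    return min(default, n_articles)


def count_training_windows(n_articles, window_size):
    """Number of windows produced: length of the series - window_size + 1."""
    return n_articles - window_size + 1


for n_articles in (30, 5):
    size = choose_window_size(n_articles)
    # columns: articles read, window size, features per row, training rows
    print(n_articles, size, size - 1, count_training_windows(n_articles, size))
```

A visitor with 30 articles gets 20 training rows of 10 features each, while a visitor with 5 articles gets a single row of 4 features.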
MAJOR MODELLING TECHNIQUE: Random Forest Regressor from Sklearn.
A random forest regressor is a meta estimator that fits several decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
MODELLING ASSUMPTIONS :
Since the training data is a time series, we assumed a window of 10 articles if the user's history contains more than 10 articles; otherwise the window is “the number of articles read - 1”. There is no missing data in the dataset.
- The Data is divided in such a way that the first 10 values are the features(X) and the very next value is the Target(Y).
- Training Set is the 30 days history of the user which is windowed using Window and Label Feature Function.
- The last window for each visitor is used as the test data for that user.
- We have assumed that if no article is found for a person, we suggest that he/she read the last article again.
All parameters are left at their scikit-learn defaults (n_estimators = 100, max_depth = None, max_leaf_nodes = None, min_samples_split = 2), except verbose = 2, which prints progress during training.
A global parameter (j) keeps track of the row number.
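For reference, these settings map onto scikit-learn's `RandomForestRegressor` roughly as shown here. The toy article codes are invented, and `random_state` and `verbose=0` are assumptions added only to keep the sketch repeatable and quiet.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=100,
    max_depth=None,
    max_leaf_nodes=None,
    min_samples_split=2,
    verbose=0,        # the article used verbose=2; 0 keeps this example quiet
    random_state=0,   # assumption: fixed only to make the sketch repeatable
)

# Toy data: each row holds previous article codes, the label is the next code.
features = np.array([
    [1000001, 1000002],
    [1000002, 1000003],
    [1000003, 1000004],
])
labels = np.array([1000003, 1000004, 1000005])

model.fit(features, labels)
predicted_key = model.predict(np.array([[1000003, 1000004]]))[0]
print(predicted_key)
```

Because a regression forest averages the labels it has seen, its output is a float that always lies between the smallest and largest training label, but it is not guaranteed to be a valid article code. That is presumably why a fallback for articles that are not found is needed.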
We built the Model using Jupyter Notebook.
This Notebook contains the Process of Data Preparation, Various Functions used and the Main Model.
Our model keeps track of the various clusters that can be recommended one after the other, but works separately for each visitor. The main model splits the data into windows and targets for a particular visitor and then predicts his/her choice accordingly. The model works on the principle of content-based filtering.
DETAILED EXPLANATION OF THE MODEL:
The model runs in a while loop that ends when j has traversed the whole bucket. It contains two local variables: x (a list of articles read by a user) and total (the total number of articles read by a user).
We then traverse the visitor list and extract the links for one user at a time. We define the window size and create the training data using the Window and Label_feature functions. If a person has read only one article, we recommend that he/she read it again; otherwise we use a Random Forest Regressor to predict the next key. We send that key to the Search function, which returns the recommended link.
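The control flow described above can be sketched as follows. This is an illustration under assumptions, not the authors' notebook code: rows are taken to be `(visitor_id, article_code)` pairs already grouped by visitor, and `predict_next` is a hypothetical callable standing in for the windowing and Random Forest steps.

```python
def recommend_per_visitor(rows, predict_next, window_size=11):
    """Walk (visitor_id, article_code) rows that are grouped by visitor.

    `predict_next` is a hypothetical callable that takes one visitor's
    article codes and a window size, and returns the predicted next code.
    """
    visitors, user_recommendation = [], []
    j = 0  # row counter, playing the role of the global parameter j
    while j < len(rows):
        visitor_id = rows[j][0]
        x = []  # article codes read by the current visitor
        while j < len(rows) and rows[j][0] == visitor_id:
            x.append(rows[j][1])
            j += 1
        total = len(x)  # total articles read by this visitor
        if total == 1:
            prediction = x[0]  # only one article read: recommend it again
        else:
            prediction = predict_next(x, min(window_size, total))
        visitors.append(visitor_id)
        user_recommendation.append(prediction)
    return visitors, user_recommendation


rows = [("u1", 101), ("u1", 102), ("u1", 103), ("u2", 201)]
# Stand-in predictor: repeats the last article instead of fitting a forest.
visitors, recs = recommend_per_visitor(rows, lambda codes, size: codes[-1])
print(visitors, recs)
```

Passing the predictor in as a callable keeps the traversal logic separate from the model, so the loop can be checked on a handful of rows before it is run on a full bucket.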
Our model worked well, with an accuracy of 95.4% on the training set. It managed to predict the next article for 98% of our users. We used MSPE (Mean Squared Percentage Error) as the loss metric and obtained a loss of 1.7%.
These numbers assured us that the model performs well and can be used for prediction.
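The article does not spell out the exact MSPE formula it used. One common definition, the mean of the squared relative errors expressed as a percentage, can be sketched as follows; treat it as an assumption rather than the authors' exact computation.

```python
import numpy as np


def mean_squared_percentage_error(y_true, y_pred):
    """Mean of squared relative errors, expressed as a percentage."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        raise ValueError("MSPE is undefined when a true value is zero")
    return float(np.mean(((y_true - y_pred) / y_true) ** 2) * 100)


# Both predictions are off by 10%, so each squared relative error is 0.01.
print(mean_squared_percentage_error([100, 200], [110, 180]))
```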
Once the test data is prepared using the last window for each user, we predict a key with our model. The model runs through the row numbers, so as soon as it sees a change in the Visitor ID, it appends the Visitor ID to the Visitors list and the prediction to the User_Recommendation list. The model took an average of 10 minutes to go through 1 million responses.
The Next Day Prediction for every visitor is in the following link-
This link contains 26 files from solution0.csv to solution25.csv. The nth file contains the predictions for the visitors present in the nth bucket of the Full DataSet.
We ran the algorithm manually, without the loop, on the training data to detect errors (and, as expected, found some, which we corrected by hand). This gave us the final model, which can deal with all the possible cases. The predicted article is then stored in a DataFrame along with the visitor’s name and exported as a CSV file.
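The export step could look like the sketch below. The column names, the example values and the temporary output directory are assumptions; only the `solution<n>.csv` naming follows the files mentioned earlier.

```python
import os
import tempfile

import pandas as pd


def export_solution(visitors, recommendations, bucket_index, out_dir):
    """Write one bucket's predictions to solution<bucket_index>.csv."""
    solution = pd.DataFrame({
        "visitor": visitors,                # assumed column name
        "recommendation": recommendations,  # assumed column name
    })
    path = os.path.join(out_dir, f"solution{bucket_index}.csv")
    solution.to_csv(path, index=False)
    return path


out_dir = tempfile.mkdtemp()  # throwaway directory for the example
path = export_solution(["u1", "u2"], ["/a-1234567", "/b-7654321"], 0, out_dir)
print(path)
```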