Last Wednesday, we at Data Science Society were happy to organize yet another successful event. This time we were honored to have Maria Mateva as our speaker in her alma mater, the Faculty of Mathematics and Informatics at Sofia University. There she obtained her BSc degree in Computer Science and her MSc degree in Artificial Intelligence. Her professional path is much more diverse – Maria has worked as a software developer at VMware, Comverse, Ontotext and Experian.
What Maria shared with us stems primarily from the experience she gained during her stint at Ontotext and her position as a teaching assistant in Information Retrieval at FMI. As Maria explained, information retrieval is mostly about finding text documents within large collections by predetermined criteria – exactly what Google Search does. In order to find documents efficiently, you need to preprocess them. Using NLP preprocessors, which were explained in detail in a previous lecture by Yasen Kiprov, all the keywords (terms) are extracted from each document and form a term-document matrix, where the terms are the observations (rows) and the documents are the variables (columns). From there, the simplest way to satisfy a query is to return all the documents containing the searched term. This is what the Boolean retrieval model is all about – assigning 1 to the documents that contain the term and 0 to those that don't.
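The Boolean model described above can be sketched in a few lines of Python. The corpus and queried terms here are made-up toy examples, and the term-document incidence structure is stored as a dict of postings sets rather than an explicit 0/1 matrix:

```python
# Toy corpus (illustrative documents, not from the lecture).
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "dogs and cats make friends",
}

# Build the incidence structure: for each term, the set of documents
# that contain it (equivalent to the 1-entries of the Boolean matrix).
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Return the documents containing ALL of the given terms."""
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(boolean_and("the", "cat"))  # → ['d1', 'd2']
```

Production search engines use essentially this idea at scale – an inverted index mapping each term to its postings list.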
As appealing as this approach may be, it has a critical drawback – it doesn't rank the documents. As Maria pointed out, to overcome this we need a metric for how specific each term is to each document. This can be achieved by assigning a weight to each term-document pair, and this is what the term frequency–inverse document frequency (TF-IDF) metric serves for. The term frequency promotes terms that occur often within a specific document, while the inverse document frequency punishes terms that occur in most of the documents – its value tends to zero for terms that appear almost everywhere. The resulting TF-IDF score (weight) is higher for documents that are more relevant to the query.
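A minimal sketch of the TF-IDF weighting just described, using a made-up corpus and naive whitespace tokenization (real systems apply the NLP preprocessing mentioned earlier):

```python
import math

# Illustrative corpus; tokenization is naive whitespace splitting.
docs = [
    "cat sat on the mat",
    "the dog chased the cat",
    "quantum computing with qubits",
]
N = len(docs)
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    """Term frequency: how often the term occurs in this document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    """Inverse document frequency: rare terms score high, ubiquitous ones near zero."""
    df = sum(1 for toks in tokenized if term in toks)  # document frequency
    return math.log(N / df) if df else 0.0

def tfidf(term, doc_index):
    return tf(term, tokenized[doc_index]) * idf(term)
```

The common word "the" gets a low weight because it appears in most documents, while a specific term like "qubits" gets a high weight in the one document where it occurs.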
As often happens in data science, there are several ways to solve a problem, and this one is no exception – Maria demonstrated another way to compare documents: representing them as vectors and computing their cosine similarity in the vector space.
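The vector-space comparison can be sketched directly; the vectors below stand for (hypothetical) TF-IDF weighted documents, and the numbers are made up:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc_a = [0.9, 0.1, 0.0]  # illustrative TF-IDF weight vectors
doc_b = [0.8, 0.2, 0.1]
print(cosine_similarity(doc_a, doc_b))  # close to 1: similar documents
```

Because only the angle matters, cosine similarity is insensitive to document length – a long and a short document about the same topic still come out similar.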
Cosine similarity proves useful in facing another challenge – automatically recommending items to users: friends to follow, products to buy online (on Amazon), videos to watch (on YouTube), articles to read. The first approach that Maria showed us is collaborative filtering. The idea behind it is to recommend items that similar users preferred, because users with similar ratings most probably have similar tastes and will rate items in a common fashion. The first step in the approach is to normalize the ratings by subtracting from each of a user's ratings that user's average rating. From that point, you can apply one of two methods – user-to-user or item-to-item filtering – and both can make use of cosine similarity. In user-to-user filtering, for each user we find the most similar users and predict the user's taste based on those users' ratings. In item-to-item filtering, we find items that are similar to each other, based on the ratings they received. Maria explained that the latter approach does a better job, because items behave more consistently than humans do.
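A small sketch of the first steps of user-to-user collaborative filtering – mean-centering each user's ratings, then comparing users with cosine similarity. All names and ratings are invented for illustration:

```python
import math

# Illustrative ratings (user -> {item: rating}); the data is made up.
ratings = {
    "alice": {"i1": 5, "i2": 3, "i3": 4},
    "bob":   {"i1": 4, "i2": 2, "i3": 5},
    "carol": {"i1": 1, "i2": 5, "i3": 2},
}

def center(user_ratings):
    """Subtract the user's average rating from each of their ratings."""
    mean = sum(user_ratings.values()) / len(user_ratings)
    return {item: r - mean for item, r in user_ratings.items()}

centered = {user: center(r) for user, r in ratings.items()}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    dot = sum(u[i] * v[i] for i in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

sim_ab = cosine(centered["alice"], centered["bob"])    # similar tastes → positive
sim_ac = cosine(centered["alice"], centered["carol"])  # opposite tastes → negative
```

The mean-centering step matters: it makes a generous rater's 4/5 comparable to a harsh rater's 3/5, so similarity reflects taste rather than rating habits.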
Unfortunately, collaborative filtering has two problems, combined under the name "cold start" – no information about the preferences of new users and no ratings for new items. The latter can be resolved by the second major recommendation approach – the content-based one. The idea is to build a user profile based on the content of the rated items. This profile can be represented as a vector of weights in the content representation space. The user's profile can then be examined for proximity to items in this space. As Maria pointed out, this is where the vector space model for comparing documents comes into play, with a twist – the user profile can be viewed as a dynamically updated document.
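A sketch of that idea, under the simplifying assumption that the user profile is just the average of the content vectors of positively rated items (item names and weights are invented):

```python
import math

# Illustrative items as content-weight vectors (e.g. TF-IDF over topics).
items = {
    "article1": [0.9, 0.1, 0.0],   # mostly topic A
    "article2": [0.8, 0.2, 0.1],   # also mostly topic A
    "article3": [0.0, 0.1, 0.9],   # mostly topic C
}
liked = ["article1"]  # items the user rated positively

# The "dynamic document": average of the liked items' vectors.
profile = [sum(items[i][k] for i in liked) / len(liked) for k in range(3)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Rank unseen items by proximity to the profile.
recs = sorted((i for i in items if i not in liked),
              key=lambda i: cosine(profile, items[i]), reverse=True)
print(recs)  # the topic-A article ranks above the topic-C one
```

Each new rating simply updates the profile vector, which is why it behaves like a dynamically updated document.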
But every solution brings new problems and challenges to solve – with millions of users and items, the size of the matrix involved leaves us with a big data problem. Maria revealed the answer: latent semantic indexing, which finds relations between the objects by means of the Singular Value Decomposition (SVD) algorithm. It alleviates the memory consumption problem by substituting three much smaller factor matrices for the original high-dimensional matrix. The model also helps cleanse noise from the original vector space.
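A minimal sketch of that factorization, assuming NumPy is available; the term-document counts below are a toy example. Keeping only the top k singular values yields the best rank-k approximation of the original matrix, which is what makes the three small factors a faithful stand-in for the big one:

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 2],
    [0, 2, 0, 1],
], dtype=float)

# SVD factors A into U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest latent "topics".
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# U[:, :k], s[:k] and Vt[:k, :] together are much smaller than A when
# A has millions of rows and columns, and the rank-k reconstruction
# smooths out noise in the original term-document space.
```

For large sparse matrices one would use a truncated solver (e.g. `scipy.sparse.linalg.svds`) rather than the dense full SVD shown here.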
After so much information in so little space, you certainly have a lot of questions. Some of the answers may be found by taking a look at the presentation and the full video from the event (speech in Bulgarian).
The lecture was followed by the traditional networking drinks. Don't miss our great upcoming events – stay tuned by visiting our website or following our Facebook page, LinkedIn page or Twitter account.
Written by: Vladimir Labov