In our first event for 2015, held this Wednesday, 21.01.2015, Yasen Kiprov expanded our frontier even further by introducing the field of Natural Language Processing (NLP). Yasen has climbed the academic ladder at Sofia University – from a Bachelor's in Informatics through a Master's to a PhD student in Artificial Intelligence – and has gained valuable experience as a software developer at Telerik, Melon, Ontotext, CNsys and BakeForce before landing his current job at Sentiment Metrics.
Yasen started by explaining what NLP is all about – enabling computers to derive meaning from text. This is at the heart of Google Translate, Google Search and smartphone assistants such as Siri. A more specific application is classifying text into different categories – this is how e-mail services such as Gmail and Yahoo detect spam, and it lies at the core of sentiment analysis.
Sentiment analysis is used to determine the writer's attitude and the emotions contained in a text. One simple approach to automatically deducing something about a text is to collect criteria in the form of IF-ELSE rules, often designed by experts in the field, and assign the text to a category if it complies with them. This idea faces hurdles when applied outside narrow tasks.
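The rule-based idea can be sketched in a few lines of Python. This is an illustrative toy, not code from the talk – the keyword lists are made-up examples of the expert-designed criteria mentioned above:

```python
# Hand-written keyword rules assign a sentiment label.
# The word lists below are illustrative, not exhaustive.
POSITIVE = {"great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def rule_based_sentiment(text):
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    elif neg > pos:
        return "negative"
    return "neutral"

print(rule_based_sentiment("I love this great phone"))  # positive
```

The hurdles become apparent immediately: a sentence like "not bad at all" is misclassified, and every new domain needs its own hand-tuned word lists.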
This is where Yasen’s explanation took a turn toward supervised machine learning and its two main representatives – linear and logistic regression. When applying them to the text classification problem, the target variable is the sentiment of a given document, determined by human experts who have read it. The independent variable is a vector (built from one-hot word representations) in which each word or useful word combination is represented by its count in the documents analyzed.
Here Yasen gave a typical illustration of the most common problem in data science – data quality. The text sentiment (the target variable) in the training sample is determined by human annotators, who often disagree on what the overall sentiment of a text is. On top of that, subtle concepts like meaning in context and irony are notoriously difficult for an algorithm to catch.
But not all hope is lost, and at the end of his presentation Yasen enlightened the audience by introducing the concepts of word representations and artificial neural networks in the field. The first uses analogies to detect word pairs similar to already established pairs, and thus overcomes some shortcomings of the one-hot vector representation. The neural network approach can, as usual, deliver some impressive results, but at the cost of transparency – it is difficult to explain to others, and to yourself, what really happens inside this black-box mechanism.
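The analogy idea can be sketched with tiny hand-made vectors (the numbers below are made up for illustration, not real learned embeddings): instead of a sparse one-hot slot per word, each word gets a dense vector, and relationships between words become vector arithmetic – the classic example being king − man + woman landing near queen.

```python
import math

# Made-up 2-dimensional "embeddings"; real ones have hundreds of
# dimensions learned from large text corpora.
embeddings = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.1, 0.5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Vector arithmetic: king - man + woman
target = [k - m + w for k, m, w in
          zip(embeddings["king"], embeddings["man"], embeddings["woman"])]

# The nearest remaining word to the target completes the analogy.
best = max((w for w in embeddings if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(embeddings[w], target))
print(best)  # queen
```

Because similar words end up with similar vectors, the representation captures relationships that a one-hot vector – where every pair of words is equally distant – cannot express.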
The lecture was followed by the traditional networking drinks, for the first time held conveniently at the bar in the presentation room. Our hosts from Questers were even kind enough to provide free drinks! If you do not want to miss such opportunities in the future, join us for our upcoming events – stay tuned by visiting our website, following our Facebook page or following our Twitter account. For those of you who want to delve even deeper into the subject, you can find the presentation slides here.
Written by: Vladimir Labov