This time, the Datathon 2018 presents many cases related to Natural Language Processing (NLP).
One of the cases, for example, involves extracting entities and their activities from unstructured documents and determining the sentiment attached to them.
So, how should one begin working on such a problem?
Let’s break this problem down:
(1) we need to detect which entities are mentioned in an article, and then (2) we need to detect the sentiment that relates specifically to those entities.
Let’s think logically about it for a moment: In such sentences, the entities are normally nouns, and in order to get the sentiment we will probably need to observe the verbs, adverbs and adjectives that describe them.
So a first step could be (a) parsing the document into sentences, and then into words, (b) determining the grammatical role of each word in the sentence, and (c) determining how these words relate to each other.
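Step (a) can be sketched in a few lines of plain Python. This is only a naive illustration that relies on regular expressions; the function names `split_sentences` and `split_words` and the sample text are made up for this example, and real tokenizers handle abbreviations, quotes and other edge cases that this sketch ignores.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Naive word splitter: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# Sample text invented for illustration.
document = "Acme bought a small startup. Investors reacted very positively!"

for sentence in split_sentences(document):
    print(split_words(sentence))
```

Each sentence ends up as a list of words and punctuation marks, which is the input that steps (b) and (c) work on.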
Let’s start giving these actions names:
- In NLP, a collection of documents is often referred to as a corpus, and the parsing process is referred to as preprocessing;
- Breaking up a corpus into sentences and words, as described in (a), is called tokenizing, while assigning a Part Of Speech (POS) to each word (b) is referred to as POS tagging;
- The dependencies between words (c) are commonly represented as a tree (although some prefer to use a graph), which is unsurprisingly called a Dependency Tree;
- The next level, i.e. identifying the subject, the object, the type of verb and other semantic roles, is called Semantic Role Labeling (SRL);
- Detecting that a specific word (an entity) is the name of a company/person/product is called Named Entity Recognition (or NER, since as you can see, we really love abbreviations).
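To make these terms more concrete, here is a minimal sketch of how a tagged and parsed sentence could be stored, and how the words describing an entity could be read off the dependency tree. Everything in it is an assumption for illustration: the sentence is annotated by hand rather than by a real parser, the labels loosely follow Universal Dependencies naming, and `Token`, `children` and `describing_words` are hypothetical helper names.

```python
from collections import namedtuple

# Each token stores its text, its POS tag, the index of its head word
# (None for the root) and the name of the dependency relation.
Token = namedtuple("Token", ["text", "pos", "head", "dep"])

# Hand-annotated parse of "Acme released a great product".
# A real parser would produce this structure automatically.
sentence = [
    Token("Acme", "PROPN", 1, "nsubj"),
    Token("released", "VERB", None, "root"),
    Token("a", "DET", 4, "det"),
    Token("great", "ADJ", 4, "amod"),
    Token("product", "NOUN", 1, "obj"),
]

def children(tokens, index):
    """Indices of the tokens whose head is the token at `index`."""
    return [i for i, tok in enumerate(tokens) if tok.head == index]

def describing_words(tokens, index):
    """Adjectives and adverbs attached to a token, plus the verb governing it."""
    words = [tokens[i].text for i in children(tokens, index)
             if tokens[i].pos in ("ADJ", "ADV")]
    head = tokens[index].head
    if head is not None and tokens[head].pos == "VERB":
        words.append(tokens[head].text)
    return words

# Words related to the noun "product" (index 4).
print(describing_words(sentence, 4))
```

The words collected this way are the natural candidates to pass to a sentiment step, since they are the ones grammatically tied to the entity rather than to the sentence as a whole.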
There are already many tools and frameworks out there that can do this job for you, and even more: tokenizing, tagging, parsing and semantic labeling. A great place to start is the nltk library in Python, which besides offering the basic tools also has a great guide for beginners. spaCy gives you even more power and speed for the tasks mentioned above.
Nevertheless, it’s always encouraged to recombine, hack, tweak and change the current models in order to create a better model. This is often done by retraining a model, and for that one needs a lot of data.
Word embedding is a process in which the words in a sentence are replaced with matching vectors. To do that, a neural-network-based algorithm is first trained on a big corpus, and through training it learns a vector for each word.
Before using these vectors for sentiment analysis, it’s important to remember that these representations are no more than semantic representations: they reflect the contexts in which a word appears, not its polarity. For example, adjectives such as ‘good’ and ‘bad’, which tend to appear in similar contexts, are represented by close vectors in the vector space.
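The point about ‘good’ and ‘bad’ can be illustrated with a small sketch. The three-dimensional vectors below are invented purely for illustration and are not taken from any trained model; real embeddings are learned from a large corpus and usually have a hundred dimensions or more. The helper names `embed` and `cosine_similarity` are also assumptions of this example.

```python
import math

# Toy vectors, made up for illustration only (not from a trained model).
embeddings = {
    "good":  [0.9, 0.8, 0.1],
    "bad":   [0.8, 0.9, 0.2],
    "table": [0.1, 0.2, 0.9],
}

def embed(tokens, table):
    """Replace each known word with its vector; unknown words are skipped."""
    return [table[t] for t in tokens if t in table]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# 'good' and 'bad' point in almost the same direction, 'table' does not.
print(cosine_similarity(embeddings["good"], embeddings["bad"]))
print(cosine_similarity(embeddings["good"], embeddings["table"]))
```

With vectors like these, a similarity measure alone cannot tell a positive adjective from a negative one, which is why a sentiment model needs labeled examples on top of the embeddings.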
Good places to check next are deep-learning-based models, such as convolutional neural networks (CNNs) and BiLSTMs (bi-directional LSTMs).
And if you are interested in going deeper into NLP, check out these great links: