In this article, the mentors give some preliminary guidelines, advice, and suggestions to the participants for the Hack the News Datathon case. Every mentor should write their name and chat name at the beginning of their texts so that there are no mix-ups with the other mentors.
Introduction to NLP
Natural Language Processing (NLP) is the field of computer science concerned with developing algorithms for the analysis of human languages. Artificial Intelligence approaches (e.g., machine learning) have been used to solve many NLP tasks, such as parsing, POS tagging, named entity recognition, word sense disambiguation, document classification, machine translation, textual entailment, question answering, and summarization. Natural languages are notoriously difficult for machines to understand and model, mostly because of ambiguity (e.g., humor, sarcasm, puns), lack of clear structure, and diversity (e.g., models for English are not directly applicable to Chinese). Even so, in recent years we’re witnessing rapid progress in the field of NLP, due to deep learning models, which are becoming more and more complex and able to capture subtleties of human languages.
MENTORS’ GUIDELINES | Propaganda Detection
Before going into details, let us re-visit the three subtasks:
- Given document d, identify whether d is propagandistic or not.
- Given document d, identify which exact sentences in d are propagandistic.
- Given document d, identify which specific phrases are propagandistic and which technique they use to convey their message.
As one can observe, both tasks 1 and 2 are binary classification tasks, whereas task 3 is a multi-class tagging task.
For all three tasks there are multiple alternatives for computing representations, from manually engineered to automatically inferred features. Perhaps the most straightforward representation is the one known as the bag-of-words (BoW) model. In BoW the order of the words is ignored and each word is weighted on the basis of statistics from the single document, from a collection, or from both. Other valuable representations include the occurrence of certain words (e.g., particularly negative or positive ones) or the writing style. Consider, for instance, the MPQA or Bing Liu’s lexicons. Be creative! Try novel representations!
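As a rough illustration of BoW weighting, the sketch below computes TF-IDF weights, one common way of combining single-document statistics (term frequency) with collection statistics (inverse document frequency). The example sentences, the whitespace tokenizer, and the function names are assumptions made for this sketch; they are not part of the datathon material.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase + whitespace tokenizer; real systems use better ones.
    return text.lower().split()

def tf_idf(docs):
    """Return one {word: weight} dict per document (word order is ignored)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # statistics of the single document
        vectors.append({
            # term frequency times inverse document frequency
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return vectors

# Invented toy collection, for illustration only.
docs = [
    "the enemy is evil",
    "the weather is nice",
    "the enemy attacks",
]
vectors = tf_idf(docs)
```

With this weighting, a word that occurs in every document (here "the") gets weight zero, while a word that is rare in the collection (here "evil") gets a higher weight than a more widespread one ("enemy").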
Another option is to consider distributional representations: embeddings. These are models that map words, sentences, or full documents into a vector space. One good property of such vectors is that the representations of semantically similar words appear close to each other in that space. There are multiple pre-computed embedding models available online, so you do not need to train your own model from large volumes of data. For instance, consider GloVe, word2vec, or fastText. See Mikolov et al., 2013 for further details.
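To make the idea of closeness in a vector space concrete, here is a minimal sketch that compares vectors with cosine similarity and builds a document vector by averaging word vectors. The three-dimensional vectors are invented toy values, not taken from any real pre-trained model, and averaging is only one simple way (assumed here) of getting from words to documents.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real pre-trained models typically have tens to hundreds of dimensions.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def document_vector(tokens, embeddings):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

sim_related = cosine(toy_embeddings["king"], toy_embeddings["queen"])
sim_unrelated = cosine(toy_embeddings["king"], toy_embeddings["apple"])
doc_vec = document_vector(["king", "queen", "unknownword"], toy_embeddings)
```

In these toy vectors, "king" and "queen" are far more similar to each other than "king" and "apple", which mimics the behaviour one would hope to see in a real embedding model.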
Usually the computation of such representations requires a number of preprocessing steps, which may include stopword removal, stemming and/or lemmatization, part-of-speech tagging, casefolding, punctuation removal, etc. Multiple libraries exist to perform these tasks (cf. Tools and Frameworks).
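A few of these steps (casefolding, punctuation removal, and stopword removal) can be sketched with the Python standard library alone. The stopword list below is a tiny hypothetical one chosen for this example; the libraries mentioned above ship far more complete lists, as well as stemmers, lemmatizers, and taggers.

```python
import string

# A tiny, hypothetical stopword list used only for this illustration.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def preprocess(text):
    """Casefold, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.casefold()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The Enemy is at the gates, and WE must act!")
```

Whether each step helps is an empirical question: for propaganda detection, casing and punctuation may themselves carry signal, so it can be worth comparing results with and without them.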
One of the simplest classification models is the k-nearest-neighbours algorithm. In this case there is no training stage; instead, a new item is assigned to the majority class among its k closest elements in the representation space. More sophisticated models include naïve Bayes, support-vector machines, and multi-layer perceptrons, among many other alternatives.
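The k-nearest-neighbours idea fits in a few lines. The sketch below assumes Euclidean distance and a simple majority vote; the two-dimensional vectors and the labels are invented for illustration and would be replaced by whatever representation you compute for the documents or sentences.

```python
import math
from collections import Counter

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Assign the query to the majority class among its k closest training items."""
    neighbours = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Invented toy data: 2-dimensional representations with their classes.
train_vectors = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["propaganda", "propaganda", "propaganda",
                "non-propaganda", "non-propaganda"]

pred_a = knn_predict(train_vectors, train_labels, [0.85, 0.15], k=3)
pred_b = knn_predict(train_vectors, train_labels, [0.15, 0.85], k=3)
```

Note that all the work happens at prediction time: every query is compared against the whole training set, which is simple but can become slow for large collections.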
Task 3 is a sequence-labeling task in which each fragment of the text (e.g., a token) has to be labeled with one of the propaganda techniques or with none of them. Perhaps the “standard” task that resembles it most is named entity recognition. There is plenty of material online about this task, including an introduction to the topic and a tutorial using sklearn.
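One common way to frame such a tagging problem, borrowed from named entity recognition, is the BIO scheme: each token is marked as the Beginning of a fragment, Inside one, or Outside all of them. The sketch below converts character-level spans into token-level BIO labels. The span format `(start, end, technique)`, the whitespace tokenizer, the example sentence, and the label name `Loaded_Language` are all assumptions made for illustration; check the official data description for the actual annotation format.

```python
def tokens_with_offsets(text):
    """Whitespace tokenizer that also returns (start, end) character offsets."""
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append((tok, start, end))
        pos = end
    return tokens

def bio_labels(text, spans):
    """spans: list of (start, end, technique) character spans, end exclusive."""
    labels = []
    previous = None  # span covering the previous token, if any
    for _, start, end in tokens_with_offsets(text):
        current = None
        for span in spans:
            if start < span[1] and end > span[0]:  # token overlaps the span
                current = span
                break
        if current is None:
            labels.append("O")
        elif current == previous:
            labels.append("I-" + current[2])  # continues the same fragment
        else:
            labels.append("B-" + current[2])  # starts a new fragment
        previous = current
    return labels

# Invented example; the technique name is a hypothetical label.
text = "They are heartless monsters who hate us"
spans = [(9, 27, "Loaded_Language")]
labels = bio_labels(text, spans)
```

Once the data is in this token-level form, any sequence tagger designed for named entity recognition can, in principle, be trained on it.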
Tools and Frameworks
These are non-exhaustive lists of resources. There are way more out there.
General machine learning
Don’t forget to apply your knowledge and skills in the challenge – The community works with advisors from top institutes in the world who are invited as experts and a jury of the best solutions which will be awarded out of a crowdfunding campaign.
The registration is free but mandatory – Join before 21. January!