Business Understanding
Newspapers, radio, TV, the Internet – all these sources of information surround every person everywhere. Yet all of them are often used to exert some influence on people, their opinions and behavior. This may be the relatively innocuous influence of shops trying to convince someone to buy a thing they do not really need, or it may be an attempt to influence a person politically. A telling and horrifying example is the game “Blue Whale”, which recently spread widely on the social network “VKontakte” and drove a large number of teenagers to suicide.
At the moment, the Internet, including social networks, is the most poorly controlled source of information. Every day, over 2 million posts are published in the blogosphere, not counting various news sites and other sources. Obviously, such an information flow cannot be monitored manually. Therefore, developing automated tools to control this kind of content is a pressing issue.
Data Understanding
The original dataset is a corpus of texts with completely different content. Depending on the specific wording of the task, we need to determine whether propaganda is present in the text as a whole or in each individual sentence. Naturally, for the training, validation and test datasets, the propaganda labels must be present in the dataset itself so that the model can be trained and its quality evaluated.
So, in this case we have a file with texts for training the model, where each line has the form:
text	text_id	true_label
…
All columns are tab-separated.
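As a minimal sketch of reading this file (the file name and the exact column order here are my assumptions, not part of the official data description), pandas handles the tab-separated format directly:

```python
import csv
import pandas as pd

# Hypothetical file name; adjust to the actual dataset path.
# The file has no header row; columns are tab-separated:
# the raw text, a text identifier and the propaganda label.
train_df = pd.read_csv(
    "task1.train.txt",
    sep="\t",
    header=None,
    names=["text", "text_id", "true_label"],
    quoting=csv.QUOTE_NONE,  # do not treat quote characters specially
)

print(train_df.shape)
print(train_df["true_label"].value_counts())
```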
Data Preparation
It is clear that many texts taken from the Internet can be very “dirty”: they contain many unnecessary symbols and emoticons that carry no useful information for classification. Therefore, a standard step in preparing the source data is to clean the texts of such characters, punctuation marks, and links to other articles.
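The exact cleaning rules are not spelled out above, so the regular expressions below are only an assumed illustration of this step for English texts:

```python
import re

def clean_text(text: str) -> str:
    """Remove links, non-letter characters and extra whitespace (assumed rules)."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop links
    text = re.sub(r"[^a-z\s]", " ", text)               # keep only latin letters
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
    return text

print(clean_text("Read more at https://example.com!!! :-)"))
# -> "read more at"
```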
Next, we divide each text into separate words. It should be noted that in many languages words can appear in different lexical forms while having an identical meaning from the point of view of the classifier. Therefore, after splitting the text into separate words, we apply lemmatization to each of them. To avoid overloading the model with long texts, we process only the first 200 words of each text, assuming that its nature can be determined from them alone. Thus, from the point of view of the model, each text is a collection of tokens.
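A possible implementation of the tokenization, lemmatization and 200-word cut-off described above; the choice of NLTK’s WordNet lemmatizer is my assumption, any other lemmatizer would do:

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Required lexical resources for the WordNet lemmatizer.
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

lemmatizer = WordNetLemmatizer()
MAX_WORDS = 200  # keep only the first 200 words of each text

def tokenize_and_lemmatize(text: str) -> list:
    words = text.split()[:MAX_WORDS]  # naive whitespace tokenization
    # e.g. "messages" -> "message", "articles" -> "article"
    return [lemmatizer.lemmatize(w) for w in words]

print(tokenize_and_lemmatize("these articles were spreading messages"))
```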
Modeling
At the moment there is a lot of research on the topic of text classification. Researchers from the University of Washington proposed the ELMo representation (Embeddings from Language Models). ELMo embeddings are one of many great pre-trained models available on TensorFlow Hub. They are learned from the internal states of a bidirectional LSTM and represent contextual features of the input text, and they have been shown to outperform GloVe and Word2Vec embeddings on a wide variety of NLP tasks.
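For reference, this is roughly how the pre-trained ELMo module can be obtained from TensorFlow Hub (TensorFlow 1.x style hub.Module API; the google/elmo/2 module URL is the publicly available one, the exact version used for the competition may differ):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load the pre-trained ELMo module (TensorFlow 1.x / hub.Module API).
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)

sentences = ["the first two hundred words of the article ..."]
# "elmo" output: contextual embeddings of shape [batch, max_words, 1024]
embeddings = elmo(sentences, signature="default", as_dict=True)["elmo"]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(embeddings).shape)
```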
For task #1 I want to use a model with the following topology:
This topology is very similar to the standard topology commonly used for various NLP tasks, with the exception of a few details. One of the two input branches is the ELMo embedding described above. After the branches are merged, a fairly standard LSTM model is used. You can get acquainted with the topology in more detail in the code in my repository.
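Since the full two-branch topology is only shown in the diagram and in the repository, the snippet below is a simplified single-branch sketch of the general idea (ELMo embeddings feeding a bidirectional LSTM classifier); the layer sizes and the Lambda-wrapping of the hub module are my assumptions:

```python
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Lambda, LSTM, Bidirectional, Dense
from tensorflow.keras.models import Model

# Pre-trained ELMo module (TensorFlow 1.x / hub.Module API).
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)

def elmo_layer(x):
    # x: batch of raw strings -> contextual embeddings [batch, max_words, 1024]
    return elmo(tf.squeeze(tf.cast(x, tf.string), axis=1),
                signature="default", as_dict=True)["elmo"]

text_input = Input(shape=(1,), dtype="string")
embedded = Lambda(elmo_layer, output_shape=(None, 1024))(text_input)
lstm_out = Bidirectional(LSTM(128))(embedded)     # assumed layer size
hidden = Dense(64, activation="relu")(lstm_out)   # assumed layer size
prediction = Dense(1, activation="sigmoid")(hidden)

model = Model(inputs=text_input, outputs=prediction)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# With TF 1.x the hub module's lookup tables must be initialized before training.
sess = K.get_session()
sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
model.summary()
```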
Evaluation
Although training this kind of model takes quite a long time, it reaches an accuracy of 98% on the training sample after 3 epochs! Further training only improves this metric (the ideal number of epochs is 7, but such training would take at least a day).
Since I had a limited amount of time, I was forced to stop training after 3 epochs in order to have time to post the results. In this case, the F1-score on the dev sample is only 0.7. If I had one more day in store, I could run several more epochs of training and get even more accurate results. (I would like to note that I will definitely continue training the model and will post the latest results in my repository.)
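For completeness, the dev-set F1-score can be computed with scikit-learn; the labels below are placeholders, in practice they come from the dev file and the model’s binarized predictions:

```python
from sklearn.metrics import f1_score, classification_report

# Placeholder labels; replace with the gold dev labels and model predictions.
dev_true = [1, 0, 1, 1, 0, 0, 1, 0]
dev_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("F1-score:", f1_score(dev_true, dev_pred))
print(classification_report(dev_true, dev_pred,
                            target_names=["non-propaganda", "propaganda"]))
```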
You can find more information and a code sample in my git repository, or download the zip archive directly here.
Thank you for reading my humble work. Good luck to all! Liza
P.S. I want to thank the organizing staff of Hack the News Datathon 2019 for providing the opportunity to participate and try my hand at such a large-scale event. For me, a simple student from Russia, it was a very interesting experience!