Detecting propaganda on sentence level



Team Astea Wombats

Team members

mkraeva, backslash, givanov, jorro, apetkov, fire

Code Repository

Source Code: hack-the-news-master

The code repository structure is described in its file. To summarize, it contains all our models, feature transformer functions and our own pipeline for training, testing and validating our models.


Propaganda is among the most pressing dangers in the current political landscape worldwide. The sheer volume of accessible information, the diminishing means of checking its veracity in a timely fashion, and the shrinking attention span of the general public add up to a field ripe for misinformation, polarization, tribalization and opinion engineering. Traditional media are at a disadvantage, as the effort historically invested in veracity checking does not pay off economically in the landscape that user-generated media is creating.

Using technological means for detecting misinformation and manipulation is therefore seen as an important field of research, one that aligns closely with the public interest and the advancement of informed democracy. Independent and fact-checked information is crucial to the success of any contemporary society. Technological evolution has rapidly contributed to the deterioration of the information landscape, and it is therefore a moral imperative for companies, communities and individuals involved with tech to help combat this perilous development.

It is also important to narrow down particular issues with current media. While misinformation in general is problematic, conscious propaganda and the systematic incitement of particular opinions and emotions in a community deserve a focus of their own. This is because they have historically been important tools for elites, populists and large-scale economic actors to distance public opinion from their own shortcomings and to distort a correct and informed perception of political reality, and hence the democratic process in general.

0. Data

The data used for this research is provided by the Qatar Computing Research Institute and consists of about 300 articles annotated either by article or at sentence level as propaganda or not. Additional information for the data creation process can be found in [1].

1. Business Domain Understanding

Propaganda is information that is not objective and is used primarily to influence an audience and push an agenda. Propaganda is the deliberate spreading of ideas, facts, or allegations with the aim of advancing one’s cause or of damaging an opposing cause. There are at least 18 types of propaganda. [2]

In this analysis, we do not take into account the more detailed annotations on propaganda types and instead focus only on the presence of any type of propaganda.

2. Data Understanding

The analyzed training data consists of 293 articles, annotated as propaganda or non-propaganda at sentence level. Of these, 12 articles contain no propaganda examples and 281 articles contain at least one propaganda sentence.

Excluding empty lines (which are always classified as non-propaganda), the articles contain a total of 14265 sentences, of which 3940 are tagged as containing propaganda.

The median length of sentences containing propaganda is 12 words, while for sentences without propaganda it is 9 words. As the following plot shows, propaganda sentences tend to be longer.

Number of words per propaganda and non-propaganda sentences:
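The median comparison above can be reproduced with a few lines of Python. This is only a minimal sketch: the sentences below are made-up toy data, and splitting on whitespace is an assumed tokenization that may differ from the one used for the reported numbers.

```python
from statistics import median

# Hypothetical toy data: (sentence, label) pairs, where 1 marks propaganda.
labelled_sentences = [
    ("The committee met on Tuesday.", 0),
    ("They will destroy everything we have ever loved and cherished!", 1),
    ("Prices rose slightly.", 0),
    ("Only a fool would trust these corrupt and shameless elites.", 1),
]


def median_length_by_label(pairs):
    """Return {label: median number of whitespace-separated tokens}."""
    lengths = {}
    for sentence, label in pairs:
        lengths.setdefault(label, []).append(len(sentence.split()))
    return {label: median(values) for label, values in lengths.items()}


medians = median_length_by_label(labelled_sentences)
print(medians)
```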

We have also analysed the most commonly found words in propagandistic vs non-propagandistic sentences, after removing stop words using the available nltk corpora for English stop words.

This metric shows that propagandistic sentences commonly mention groups and individuals that are known to take part in arguments and/or strongly defend their opinions. Examples in the data include the frequent appearance of words such as “Trump”, “god”, “church”, “catholics”, “papa francis”.

Most common words found in propaganda and non-propaganda sentences (excluding stop words):
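A word-frequency analysis of this kind might look like the sketch below. The tiny stop-word set and the example sentences are illustrative assumptions; the analysis described above used the nltk English stop-word corpus instead.

```python
import re
from collections import Counter

# Small illustrative stop-word set (an assumption, not the nltk list).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "on", "we", "our"}


def top_words(sentences, n=3):
    """Return the n most frequent non-stop-word tokens with their counts."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Hypothetical propaganda sentences for illustration only.
propaganda = ["The church is under attack", "Attack on the church and on our values"]
top = top_words(propaganda, n=2)
print(top)
```

Running the same function separately on the propaganda and non-propaganda sentences gives the two rankings that can then be compared.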

3. Data Preparation

Article files were already split with one sentence per line. We removed empty sentences because they made the class imbalance even larger. Where necessary, we also removed stop words from the sentences, e.g. when extracting word2vec embeddings and when calculating some features that count classes of words in the sentences.

Feature engineering

We have invested a lot of effort in extracting low level (e.g. part of speech) and high level (e.g. subjectivity, readability) features.

Sentence Length

The longer a sentence is, the harder it is to read. Long sentences can be more confusing for readers and may be intentionally vague.

Lexical Features

Counts of different Parts of Speech:

  • Adjectives and Adverbs

    Research has shown the presence of adjectives and adverbs is usually a good indicator of text subjectivity. [3] In other words, statements that use adjectives like “problematic” and “incredible” might be more likely to convey a subjective point of view than statements that do not include those adjectives.

  • Proper Nouns and plural proper nouns

    Proper nouns may signify various kinds of propaganda. Examples include “appeal to authority”, where popular figures are quoted; slandering a political opponent; flag-waving and incitement of patriotic feeling, where nations or community groups may be cited with plural proper nouns; appeal to fear; etc.

  • Exclamation marks

    Exclamation marks express strong emotions such as joy, enthusiasm, disbelief, surprise, or urgency. These strong emotions contribute to exaggeration which is common in propaganda texts.

  • Question marks

    Question marks can be an indicator for questioning the credibility of someone or something in a propaganda text. The text might be conveying doubt to its readers.
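The counts listed above could be computed roughly as follows. This is a sketch under stated assumptions: the input is a list of (token, tag) pairs with Penn Treebank tags, the format that a tagger such as `nltk.pos_tag` returns, and the function name and the hand-tagged example are hypothetical.

```python
def lexical_features(tagged_tokens):
    """Count lexical cues from (token, Penn Treebank tag) pairs."""
    tokens = [token for token, _ in tagged_tokens]
    tags = [tag for _, tag in tagged_tokens]
    return {
        "adjectives": sum(tag in {"JJ", "JJR", "JJS"} for tag in tags),
        "adverbs": sum(tag in {"RB", "RBR", "RBS"} for tag in tags),
        "proper_nouns": tags.count("NNP"),
        "plural_proper_nouns": tags.count("NNPS"),
        "exclamation_marks": tokens.count("!"),
        "question_marks": tokens.count("?"),
    }


# Hand-tagged example sentence (tags assigned manually for illustration).
example = [
    ("Why", "WRB"), ("do", "VBP"), ("Americans", "NNPS"), ("blindly", "RB"),
    ("trust", "VB"), ("corrupt", "JJ"), ("Washington", "NNP"), ("?", "."),
]
features = lexical_features(example)
print(features)
```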

Loaded Language

Using specific words and phrases with strong emotional implications (either positive or negative) to influence an audience is common in propaganda. We have compiled a list of such phrases and count their frequency in the sentences.
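Counting such phrases can be sketched as below. The three phrases are hypothetical placeholders rather than the compiled list, and whole-word, case-insensitive matching is an assumption about how the counting could be done.

```python
import re

# Hypothetical phrases for illustration; not the list compiled by the team.
LOADED_PHRASES = ["radical", "witch hunt", "disaster"]


def count_loaded_phrases(sentence, phrases=LOADED_PHRASES):
    """Count whole-word, case-insensitive occurrences of the given phrases."""
    text = sentence.lower()
    total = 0
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        total += len(re.findall(pattern, text))
    return total


print(count_loaded_phrases("This witch hunt is a disaster, a total disaster."))
```

The word boundaries keep a phrase such as "radical" from matching inside a longer word such as "radicalized".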


Subjectivity

Objective text sticks to the facts, whereas subjective text is influenced by emotions or opinions. It is common for propaganda texts to be biased and subjective.

We use TextBlob to extract this feature. TextBlob uses a pattern library with a dictionary of words which make the text subjective (e.g. great, awful, etc.). It also accounts for intensifier words like ‘very’ and ‘much’, and polarity changing words like ‘not’. The subjectivity metric varies from 0 to 1, where 0 means that the text is objective and 1 that the text is subjective.


Polarity

The polarity metric measures how positive or how negative the sentiment of the text is. We use TextBlob to extract it. It is a number between -1 and 1. Some propaganda texts can be extremely positive or negative. In our models, such as the stacking ensemble, we rescale the range to [0, 1].
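One simple way to map polarity from [-1, 1] to [0, 1] is a linear shift and scale. The sketch below assumes that linear mapping; the text does not state which rescaling was actually used, and the function name is hypothetical.

```python
def rescale_polarity(polarity):
    """Linearly map a polarity score from [-1, 1] to [0, 1]."""
    return (polarity + 1.0) / 2.0


for score in (-1.0, 0.0, 1.0):
    print(score, "->", rescale_polarity(score))
```

With this mapping a neutral sentence ends up at 0.5, and the feature shares its range with the subjectivity score.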


Emotion

Emotion features are extracted with the IBM Watson Natural Language Understanding API [4]. They include sadness, joy, fear, disgust, and anger. Propaganda texts can be very emotional, and it is common to see a notion of fear or anger in them.

Confusing Words

A popular kind of propaganda involves using carefully selected words with ambiguous or confusing meaning. We implement a proxy for this property by summing, over the words in the sentence, the number of meanings of each word or of its synonym set (grouped by part of speech), using WordNet data.


Readability

We use a selection of popular readability measures, such as SMOG, Flesch-Kincaid, and others.
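As an illustration of one such measure, the sketch below computes the Flesch-Kincaid grade level, 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. The syllable counter is a crude vowel-group heuristic and the sentence splitter is a simple regular expression; both are assumptions, so scores will differ from those of a dedicated readability library.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level of a non-empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = flesch_kincaid_grade("The cat sat. The dog ran.")
dense = flesch_kincaid_grade(
    "Unquestionably, international organizations deliberately misrepresent complicated realities."
)
print(simple, dense)
```

Longer sentences and longer words both push the grade level up, which is why such scores can act as a proxy for text that is hard to follow.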


TF-IDF

We used TF-IDF for our baseline model, and we also combined it with other features to improve model performance.
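For readers unfamiliar with the weighting, here is a minimal sketch of the textbook formula, term frequency times log(N / document frequency), applied to pre-tokenized sentences. The example sentences are made up, and library implementations (scikit-learn's, for instance) use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per non-empty tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        term_counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors


# Hypothetical sentences for illustration only.
docs = [s.lower().split() for s in [
    "the traitors betrayed the nation",
    "the council met today",
    "the nation voted",
]]
weights = tfidf(docs)
print(weights[0])
```

A word that occurs in every sentence, such as "the", gets weight zero, while a word confined to one sentence gets the highest weight.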

Word Embeddings

Word embeddings help us model the semantic meaning of words.

4. Modeling

Model Baseline

A baseline model is useful to determine how much a more advanced model can contribute to improving the overall prediction accuracy. For our baseline model, we used a Logistic Regression with TF-IDF vectors as features for the model.

Single-model approach

We have used Logistic Regression, SVM, Random Forest and Feed Forward Neural Networks as models with various combinations of the features we have extracted.


Word2Vec

We used a pre-trained Word2Vec word embedding model from Google News with gensim, and we averaged the word vectors of each sentence. This average is then fed into a Logistic Regression. This was the first model which improved on our baseline.
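The averaging step can be sketched as follows. The three-dimensional toy embeddings are hypothetical stand-ins for the pre-trained vectors, and returning a zero vector when no token is in the vocabulary is an assumption rather than something stated above.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of all tokens found in the embedding table."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # Assumption: fall back to a zero vector for fully out-of-vocabulary input.
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]


# Hypothetical toy embeddings; real pre-trained vectors have far more dimensions.
toy_embeddings = {"fake": [1.0, 0.0, 2.0], "news": [3.0, 2.0, 0.0]}
sentence_vector = average_vector(["fake", "news", "unknownword"], toy_embeddings, dim=3)
print(sentence_vector)
```

The resulting fixed-length vector is what a classifier such as Logistic Regression receives, regardless of how many words the sentence has.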

We also tried using a Feed Forward Neural Network instead of Logistic Regression, but didn’t have enough time to make it work.

Random Forest

We trained a Random Forest with simple lexical features such as the number of adjectives, adverbs, singular and plural pronouns, questions, exclamation marks and periods. However, the result was worse than Word2Vec with Logistic Regression.

After that we included the readability features alongside the lexical features, and the results were worse.

Finally, we tried a Random Forest with readability features only, and the results were the worst.


BERT

BERT [5], or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations. BERT obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. We used the uncased pre-trained BERT model (the 12-layer architecture) and fine-tuned it for our task. The standalone BERT model gave an F1 score of around 0.59 for the positive class (‘propaganda’).

The hyper-parameters used are as follows:

"do_lower_case": True,  # Whether to lower case the input text. Should be True for uncased models and False for cased models.
"max_seq_length": 64,  # The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded.
"train_batch_size": 16,  # Total batch size for training.
"eval_batch_size": 8,  # Total batch size for eval.
"predict_batch_size": 8,  # Total batch size for predict.
"learning_rate": 5e-5,  # The initial learning rate for Adam.
"num_train_epochs": 3.0,  # Total number of training epochs to perform.
"warmup_proportion": 0.1,  # Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10% of training.
"save_checkpoints_steps": 1000,  # How often to save the model checkpoint.
"iterations_per_loop": 1000,  # How many steps to make in each estimator call.

Ensemble approach

Our best model is a Stacking ensemble. The stacked models include:

  • a model using the TF-IDF features
  • a model using word2vec embeddings
  • a model using BERT with a softmax classification layer
  • a model combining polarity and subjectivity features
  • readability features
  • lexical features
  • emotions features

All models except the one using BERT are Logistic Regression models.

BERT was trained for 3 epochs and was plugged in alongside the models built on our hand-crafted features. For our meta-learning model we also used Logistic Regression. During training, the input of the meta-model consists of the prediction probabilities of all the base models and the respective gold labels. You can see the ensemble in models/
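The stacking idea can be illustrated with a small, dependency-free sketch: each row holds the propaganda probabilities predicted by the base models for one sentence, and a logistic regression meta-model is fitted on those rows and the gold labels. Everything here is an assumption for illustration: the probabilities are made up, there are three base models instead of seven, and the hand-written gradient-descent fit stands in for a library classifier. In practice the base-model probabilities are usually produced on held-out folds so that the meta-model does not learn from leaked labels.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias with plain batch gradient descent on log loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for row, label in zip(rows, labels):
            error = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) - label
            grad_w = [g + error * x for g, x in zip(grad_w, row)]
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict_proba(model, row):
    weights, bias = model
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


# Made-up propaganda probabilities from three hypothetical base models,
# one row per sentence, with the matching gold labels.
meta_features = [
    [0.9, 0.7, 0.8],
    [0.8, 0.4, 0.9],
    [0.2, 0.3, 0.1],
    [0.3, 0.6, 0.2],
    [0.7, 0.8, 0.6],
    [0.1, 0.2, 0.3],
]
gold_labels = [1, 1, 0, 0, 1, 0]

meta_model = fit_logistic_regression(meta_features, gold_labels)
predictions = [int(predict_proba(meta_model, row) >= 0.5) for row in meta_features]
print(predictions)
```

The learned weights show how much the meta-model trusts each base model, which is what lets weaker but complementary feature sets still contribute.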

5. Evaluation

We use confusion-matrix-based metrics: Accuracy, Precision, Recall and F1 score.

BERT achieved an F1 score of 0.59 on the DEV set.

The following table shows the standalone performance of the base models on our validation set. All of the models used a Logistic Regression classifier with an l2 penalty and inverse regularization strength C=0.8.

Train set size: 11380

Validation set size: 2807

Description                accuracy  precision_pos  recall_pos  f1_pos
Subjectivity and polarity  0.5996    0.2977         0.4574      0.3606
Sentiment features         0.5814    0.2908         0.4834      0.3631
TF-IDF                     0.6644    0.3811         0.5758      0.4586
Proper nouns               0.4168    0.2633         0.7576      0.3908
Loaded language            0.6776    0.3695         0.4329      0.3987
Lexical features           0.5661    0.3218         0.684       0.4377
Emotion                    0.601     0.3318         0.6075      0.4292
Confusing words            0.5276    0.2675         0.5253      0.3544
Readability features       0.6249    0.337          0.5368      0.414
word2vec                   0.6858    0.4108         0.6277      0.4966
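For reference, the four reported metrics follow directly from the confusion-matrix counts, and f1_pos is the harmonic mean of precision_pos and recall_pos. The sketch below uses made-up counts (they are not from our experiments) and then checks the F1 formula against the TF-IDF row of the table; the function names are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Metrics for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision_pos": precision,
        "recall_pos": recall,
        "f1_pos": 2 * precision * recall / (precision + recall),
    }


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
toy = classification_metrics(tp=40, fp=60, fn=30, tn=170)
print(toy)

# Consistency check with the TF-IDF row: precision 0.3811, recall 0.5758.
tfidf_f1 = round(f1_from(0.3811, 0.5758), 4)
print(tfidf_f1)
```

Because the classes are imbalanced, accuracy alone can look reasonable even when few propaganda sentences are found, which is why the positive-class F1 is the more informative column.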

6. Deployment

Our proposed solution depends on several external resources:

  • Emotion features are extracted with the IBM Watson Natural Language Understanding API. You need to have an IBM Cloud account with an NLU project. Instructions on the required configuration can be found in the repository.
  • We use pretrained word2vec embeddings from GoogleNews-vectors
  • We use several nltk corpora for stopwords, POS tagging and WordNet

Future improvements

Topic Modelling

Topic Modelling is a technique which would help us to identify if the topic inside an article sentence changes from one to another, in order to detect introduction of irrelevant material. This would help us to identify the Red Herring propaganda technique.

Sentence location inside article

If we had more time, we would investigate if propaganda sentences occurred in specific locations in the article, e.g. in the beginning, middle or the end.

7. Documentation


[1] Identifying Propaganda in the News

[2] Propaganda Definitions

[3] Adjectives as Indicators of Subjectivity in Documents

[4] IBM Watson Natural Language Understanding API

[5] BERT


13 thoughts on “Detecting propaganda on sentence level”

  1. 0

    Really nice article! I am particularly happy to see a combination of different models at work. Also great use of external resources. I’d really like to see a discussion on what do you think the neural net is missing, and the other models bring to the table.

  2. 0

    Hi Laura! The idea behind the stacking ensemble we did is to use different sentence representations, which are not correlated, at least not at first look, and that are also good enough when used with standalone estimators. The meta-model would fit a hypothesis which finds true hidden relations between these different aspects of the propagandistic sentences.
    So the inclusion of a model, that uses sentence vectors from fine-tuned pre-trained general language model (BERT), is meant to do just that – encoding the sentence semantics and finding a connection between them.
    Word embeddings from word2vec can also capture semantics, but they are different in other ways – they are context-free and some type of approximation of the sentences based on the word vectors is needed.
    TF-IDF is meant to catch words that are widely used in propagandistic fashion and less in any other way.
    The rest of the features, which are hand-crafted, result from either the eighteen propaganda definitions, or something else that we think is meaningful. For example the extraction of confusing words, loaded language words, lexical features (such as proper nouns for identifying name calling) are ideas based on the given propaganda definitions. Others such as readability scores (more complicated sentence and word structures to possibly mask intent), subjectiveness, sentiment and emotion scores are all things that we think relate to propaganda and contribute when used in a combination.
    So we use this specific neural net more as a complementary tool giving its fair share, not expecting it to work alone, because it can not get a feel for things like readability, sentence structures and other deeper ideas.

  3. 0

    One thing maybe worth noting is that due to personal hardware limitations, we lowered the maximum input length after WordPiece tokenization from 128 to 64 tokens and batch sizes from 64 to 32 for the uncased BERT fine-tuning. This may have caused some amount of sentences being cut and overall decrease in performance. So with other parameters, maybe overall better results can be reached.

  4. 0

    Hi guys. Good work and nice article. I have some questions for you:
    1. In the loaded language subsection you mention that you generated a list of phrases, but give no further details. Where did you get them from? How did you pick them? Is it part of the release?
    2. I appreciate the narrative of the different subsets of representations and learning models, but I miss a table with numbers. (You only tell that the performance worsened or improved). What are the exact numbers? What performance did your baseline or the other methods get? Not sure if this is supposed to appear in Section 5, but I cannot see it.

    1. 0

      Thank you, @alberto!

      Regarding the loaded language phrases, we searched the Internet for examples of such expressions. We found several promising resources – there was significant overlap in the listed words, but each article added some new words as well. These are all the lists we used:
      The final list of words and phrases is included in the code repository, it can be found in `data/external/loaded_language_phrases.txt`.

      As for the exact scores of our separate models, unfortunately we didn’t have time to produce a table with all of them before submitting our article. We’ll try to do so before today’s deadline.

  5. 0

    Very good article. I would like to see more content inside though. It would be great if you have uploaded the jupyter notebook with results and plots here. I have several comments and suggestions (apologies if these were done but I can not open the project).

    1. Since there are multiple already trained models for english language, the first chart can be extended to:
    – Multiple histograms depending on PoS tagging (part of speech) – for nouns, verbs, etc.
    – Histograms for Plural vs Singular distributions
    – Stacked bar charts (histograms) with count of Stop words vs Other words
    – Names and Special words – Entity recognition models

    2. The second chart of Most common words can be replicated for
    – different PoS
    – Plurals vs Singulars
    – Stop words (why do not you analyze stop words?)

    Then you can bucketize the sentences based on this exploratory analysis

    3. Random Forest model – why word2vec was not included as features in random forest model?

    4. I do not see any metrics or graphs on the models performance. What is the best performance you were able to achieve?

    1. 0

      Hi zenpanik! Regarding 4., our best result on the DEV data set was an F1 score of 0.5979 using a Stacking ensemble with all our hand-crafted features, TF-IDF, word2vec, but without BERT in the ensemble. Our final submission for the TEST data set was with BERT in the ensemble as well.

    2. 0

      Hi Zepanik!
      I agree on point 1 and 2 – there is definitely need for more data visualizations than we have here. We could expand our article in that direction after the datathon.
      We actually first ran the most common words analysis without removing stop words, but – as expected – the top words in both propaganda and non-propaganda sentences were tokens like “the” and “an”. We decided they could not carry much (if any) predictive power for our task and removed them. It will be interesting to see if other teams used stop words as a predictor and achieved any good results!

  6. 0

    Hi Alberto, zenpanik!
    BERT acquired F1 0.59 for the DEV set.
    These are the performances of the base models when used alone. The logistic regression classifier hyper-parameters were not tuned. The split was train 11380, validation 2807.

    “description”: “Subjectivity and polarity”
    “f1_pos”: 0.3606
    “precision_pos”: 0.2977
    “recall_pos”: 0.4574
    “accuracy”: 0.5996
    “description”: “Sentiment features”
    “f1_pos”: 0.3631,
    “accuracy”: 0.5814
    “precision_pos”: 0.2908
    “recall_pos”: 0.4834
    “description”: “TFIDF”
    “f1_pos”: 0.4586
    “precision_pos”: 0.3811
    “recall_pos”: 0.5758
    “accuracy”: 0.6644
    “description”: “Proper nouns”
    “f1_pos”: 0.3908
    “precision_pos”: 0.2633
    “recall_pos”: 0.7576
    “accuracy”: 0.4168
    “description”: “Loaded language”
    “f1_pos”: 0.3987
    “precision_pos”: 0.3695
    “recall_pos”: 0.4329
    “accuracy”: 0.6776
    “description”: “Lexical features”
    “f1_pos”: 0.4377
    “precision_pos”: 0.3218
    “recall_pos”: 0.684
    “accuracy”: 0.5661
    “description”: “Emotion”
    “f1_pos”: 0.4292,
    “precision_pos”: 0.3318
    “recall_pos”: 0.6075
    “accuracy”: 0.601
    “description”: “Confusing words”
    “f1_pos”: 0.3544
    “precision_pos”: 0.2675
    “recall_pos”: 0.5253
    “accuracy”: 0.5276
    “description”: “Readability features”
    “f1_pos”: 0.414
    “precision_pos”: 0.337
    “recall_pos”: 0.5368
    “accuracy”: 0.6249
