Datathons Solutions

Detecting propaganda on sentence level


13 thoughts on “Detecting propaganda on sentence level”

  1. 0

    Really nice article! I am particularly happy to see a combination of different models at work. Also great use of external resources. I’d really like to see a discussion of what you think the neural net is missing and what the other models bring to the table.

  2. 0

    Hi Laura! The idea behind our stacking ensemble is to combine different sentence representations that are not correlated, at least at first glance, and that each work reasonably well with a standalone estimator. The meta-model then fits a hypothesis that captures hidden relations between these different aspects of propagandistic sentences.
    So the inclusion of a model that uses sentence vectors from a fine-tuned, pre-trained general language model (BERT) is meant to do just that: encode the sentence semantics and find connections between them.
    Word embeddings from word2vec can also capture semantics, but they differ in other ways: they are context-free, and some approximation of the sentence based on its word vectors is needed.
    TF-IDF is meant to catch words that are widely used in a propagandistic fashion and less so in any other way.
    The rest of the features are hand-crafted and come either from the eighteen propaganda definitions or from something else we think is meaningful. For example, extracting confusing words, loaded language words, and lexical features (such as proper nouns for identifying name calling) are ideas based on the given propaganda definitions. Others, such as readability scores (more complicated sentence and word structures may mask intent), subjectivity, sentiment, and emotion scores, are all things we think relate to propaganda and contribute when used in combination.
    So we use this specific neural net as a complementary tool that contributes its fair share, without expecting it to work alone, because it cannot get a feel for things like readability, sentence structure, and other deeper ideas.
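
    The stacking idea described in this comment can be sketched roughly as follows. This is only an illustration on synthetic data, assuming scikit-learn is available: the three feature groups, their column ranges, and the choice of logistic regression for both the base models and the meta-model are hypothetical stand-ins, not the team's actual pipeline.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: 12 numeric columns that we pretend come from
# three different sentence representations (assumption for illustration).
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=0
)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hypothetical feature groups: which columns belong to which representation.
feature_groups = {
    "tfidf_like": [0, 1, 2, 3],
    "embedding_like": [4, 5, 6, 7],
    "handcrafted_like": [8, 9, 10, 11],
}


def base_model(columns):
    """Base estimator that only sees the columns of one representation."""
    select = ColumnTransformer([("select", "passthrough", columns)])
    return make_pipeline(select, LogisticRegression(max_iter=1000))


estimators = [(name, base_model(cols)) for name, cols in feature_groups.items()]

# The meta-model is trained on cross-validated predictions of the base models.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
dev_predictions = stack.predict(X_dev)
```

    Each base model is restricted to one representation, so the meta-model can only learn how to weigh their outputs against each other, which is the "complementary tool" role described above.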

  3. 0

    One thing worth noting: due to personal hardware limitations, we lowered the maximum input length after WordPiece tokenization from 128 to 64 tokens and the batch size from 64 to 32 for the uncased BERT fine-tuning. This may have truncated some sentences and decreased overall performance, so better results may be reachable with other parameters.
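
    To get a feel for how many sentences a lower maximum length cuts, a check along these lines could be run on the tokenized data. It is a sketch under assumptions: the helper names are hypothetical, the toy token lists stand in for real WordPiece output, and two positions are reserved for the `[CLS]` and `[SEP]` markers that BERT adds around a single sentence.

```python
def truncate_for_bert(tokens, max_seq_length):
    """Truncate one tokenized sentence and report whether it was cut."""
    budget = max_seq_length - 2  # leave room for [CLS] and [SEP]
    was_cut = len(tokens) > budget
    return ["[CLS]"] + tokens[:budget] + ["[SEP]"], was_cut


def count_truncated(tokenized_sentences, max_seq_length):
    """Count the sentences that lose tokens at the given maximum length."""
    return sum(
        truncate_for_bert(tokens, max_seq_length)[1]
        for tokens in tokenized_sentences
    )


# Toy stand-in for tokenized sentences of 10, 70 and 130 tokens.
toy_sentences = [["tok"] * n for n in (10, 70, 130)]

cut_at_64 = count_truncated(toy_sentences, 64)
cut_at_128 = count_truncated(toy_sentences, 128)
print(cut_at_64, cut_at_128)
```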

  4. 0

    Hi guys. Good work and nice article. I have some questions for you:
    1. In the loaded language subsection you mention that you generated a list of phrases, but give no further details. Where did you get them from? How did you pick them? Is it part of the release?
    2. I appreciate the narrative of the different subsets of representations and learning models, but I miss a table with numbers. (You only say whether the performance worsened or improved.) What are the exact numbers? What performance did your baseline or the other methods get? Not sure if this is supposed to appear in Section 5, but I cannot see it.

    1. 0

      Thank you, @alberto!

      Regarding the loaded language phrases, we searched the Internet for examples of such expressions. We found several promising resources – there was significant overlap in the listed words, but each article added some new words as well. These are all the lists we used:
      The final list of words and phrases is included in the code repository; it can be found in `data/external/loaded_language_phrases.txt`.

      As for the exact scores of our separate models, unfortunately we didn’t have time to produce a table with all of them before submitting our article. We’ll try to do so before today’s deadline.

  5. 0

    Very good article. I would like to see more content inside though. It would be great if you uploaded the Jupyter notebook with results and plots here. I have several comments and suggestions (apologies if these were done, but I cannot open the project).

    1. Since there are multiple already trained models for the English language, the first chart can be extended to:
    – Multiple histograms depending on PoS (part of speech) tags – for nouns, verbs, etc.
    – Histograms for Plural vs Singular distributions
    – Stacked bar charts (histograms) with count of Stop words vs Other words
    – Names and Special words – Entity recognition models

    2. The second chart of Most common words can be replicated for
    – different PoS
    – Plurals vs Singulars
    – Stop words (why don't you analyze stop words?)

    Then you can bucketize the sentences based on this exploratory analysis.

    3. Random Forest model – why was word2vec not included as a feature in the random forest model?

    4. I do not see any metrics or graphs on the models' performance. What is the best performance you were able to achieve?

    1. 0

      Hi zenpanik! Regarding 4., our best result on the DEV data set was an F1 score of 0.5979, using a stacking ensemble with all our hand-crafted features, TF-IDF, and word2vec, but without BERT in the ensemble. Our final submission for the TEST data set included BERT in the ensemble as well.

    2. 0

      Hi zenpanik!
      I agree on points 1 and 2 – there is definitely a need for more data visualizations than we have here. We could expand our article in that direction after the datathon.
      We actually first ran the most common words analysis without removing stop words, but – as expected – the top words in both propaganda and non-propaganda sentences were tokens like “the” and “an”. We decided they could not carry much (if any) predictive power for our task and removed them. It will be interesting to see if other teams used stop words as a predictor and achieved any good results!
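
      The effect described here is easy to reproduce on a toy example. The sketch below uses made-up sentences and a tiny hand-written stop word list (both are assumptions for illustration, not the team's data or list) to compare the most common words with and without stop word removal.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption, not the list the team used).
STOP_WORDS = {"the", "an", "a", "of", "and", "to", "is", "in", "on"}


def most_common_words(sentences, n=3, remove_stop_words=True):
    """Return the n most frequent lowercase word tokens in the sentences."""
    tokens = []
    for sentence in sentences:
        tokens.extend(re.findall(r"[a-z']+", sentence.lower()))
    if remove_stop_words:
        tokens = [token for token in tokens if token not in STOP_WORDS]
    return Counter(tokens).most_common(n)


# Made-up example sentences.
example_sentences = [
    "The enemy is destroying the nation.",
    "The nation must resist the enemy.",
    "An attack on the nation is an attack on freedom.",
]

with_stop_words = most_common_words(example_sentences, remove_stop_words=False)
without_stop_words = most_common_words(example_sentences, remove_stop_words=True)
print(with_stop_words)
print(without_stop_words)
```

      With stop words kept, "the" dominates the ranking; once they are removed, content words such as "nation" rise to the top.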

  6. 0

    Hi Alberto, zenpanik!
    BERT achieved an F1 of 0.59 on the DEV set.
    These are the performances of the base models when used alone. The hyper-parameters of the logistic regression classifier were not tuned. The split was 11380 train, 2807 validation.

    Feature set                 f1_pos   precision_pos   recall_pos   accuracy
    Subjectivity and polarity   0.3606   0.2977          0.4574       0.5996
    Sentiment features          0.3631   0.2908          0.4834       0.5814
    TFIDF                       0.4586   0.3811          0.5758       0.6644
    Proper nouns                0.3908   0.2633          0.7576       0.4168
    Loaded language             0.3987   0.3695          0.4329       0.6776
    Lexical features            0.4377   0.3218          0.684        0.5661
    Emotion                     0.4292   0.3318          0.6075       0.601
    Confusing words             0.3544   0.2675          0.5253       0.5276
    Readability features        0.414    0.337           0.5368       0.6249
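
    The `f1_pos` values above can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. A minimal check for the TFIDF figures follows; the function name is only illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported above for the TFIDF base model.
tfidf_f1 = f1_from_precision_recall(0.3811, 0.5758)
print(round(tfidf_f1, 4))  # matches the reported f1_pos of 0.4586
```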
