
Datathon – HackNews – Solution – Leopards

This is the Leopards team’s submission for the Propaganda Detection datathon. Key findings: the best-performing classifier is logistic regression operating on word2vec representations of the sentences plus several engineered features, such as the proportion of sentiment-bearing words in the sentence.


Business Understanding

  • The team will focus on Task 2.
  • The goal of Task 2 is to automatically detect which sentences in a document contain propaganda.

Data Understanding

  • The classes are imbalanced (propaganda: 4730 vs. non-propaganda: 9534, disregarding labelled empty lines), so a class balancing/oversampling technique may help.
  • A simple baseline, “everything is propaganda”, yields F1=49.8 (P=33.1, R=100.0).
  • Labels in Task 3 (specific types of propaganda, also given at the sentence level) correspond to the labels in Task 2.
  • An exploration of ngrams associated with the Task 3 labels shows that each type of propaganda is characterized by its own set of features, with little overlap between them.
  • For example, the top 5 ngrams per label according to Chi^2 (a sketch of how these scores can be computed follows this list):
    • Exaggeration,Minimization: the most (24.443), most (16.023), important (10.711), worst (10.518), absolutely (9.730)
    • Appeal_to_Fear-Prejudice: threat (18.133), would be (16.317), die (14.830), act (12.543), death (5.955)
    • Flag-Waving: the american (73.349), nation (35.434), americans (29.580), disgrace to (26.282), america (19.111)
  • This suggests the task should be approached with ensemble classifiers which would be able to account for the underlying structure in the data.
  • The features are semantic in nature, so some form of semantic representation may be helpful.
  • Some of the labels (e.g., Name-Calling, Appeal-to-Fear) relate to the sentiment polarity, so features capturing the sentiment of the sentences can also help.
  • Based on these observations, we designed the following features:
    • ngrams
    • word2vec
    • sentiment: the proportion of sentiment-bearing words in the sentence; targeting Loaded Language
    • intensifying words: the proportion of intensifying adverbs (e.g., very, extremely) in the sentence; targeting Loaded Language
    • glittering words: the proportion of “glittering words” in the sentence (e.g., patriotism, justice, truth, democracy); targeting: Flag-waving, Slogans
    • superlatives: the proportion of superlative adverbs and adjectives in the sentence; targeting Loaded Language
    • quotation_marks: presence of quotation marks; targeting Appeal to Authority
    • disjunctives: presence of disjunctive conjunctions (or, either, then); targeting Black-and-White Fallacy
    • causals: presence of causal words (cause, because, therefore, thus, so); targeting Causal Oversimplification
    • modal_verbs: presence of modal verbs; targeting Appeal to Fear
    • generalizing_words: presence of generalizing words (all, everything, everyone, entire, whole; none, nothing, nobody); targeting Slogans, Appeal to Fear
    • imperatives: presence of imperative sentences; targeting Slogans, exclamations
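
The Chi^2 scores quoted above appear to be chi-squared statistics rather than p-values (p-values would lie between 0 and 1). Below is a minimal sketch of how such an exploration could be run, assuming scikit-learn; the file and column names ("task3_train.csv", "sentence", "technique") are illustrative, not the actual schema used in the notebooks.

```python
# Sketch of the chi^2 ngram exploration (illustrative file/column names).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

df = pd.read_csv("task3_train.csv")                      # hypothetical aligned file

vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform(df["sentence"])

# Score every ngram against one Task 3 technique vs. the rest;
# the numbers quoted above are chi^2 statistics, not p-values.
y = (df["technique"] == "Flag-Waving").astype(int)
scores, _pvalues = chi2(X, y)

top5 = sorted(zip(vectorizer.get_feature_names_out(), scores),
              key=lambda pair: pair[1], reverse=True)[:5]
for ngram, score in top5:
    print(f"{ngram}: {score:.3f}")
```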

Data Preparation

Step 1. Align labels with sentences, perform POS tagging and lemmatization, and save the aligned data to a CSV file.
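
A minimal sketch of this step, assuming spaCy as the tagger/lemmatizer (the write-up does not name the tool, so this is an assumption); the file and column names are likewise illustrative.

```python
# Sketch of Step 1: POS-tag and lemmatize each sentence and save to CSV.
# spaCy and the file/column names are assumptions, not the authors' exact setup.
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

df = pd.read_csv("task2_train.csv")          # hypothetical: "sentence" and "label" columns

lemmas, pos_tags = [], []
for doc in nlp.pipe(df["sentence"].fillna("").tolist()):
    lemmas.append(" ".join(tok.lemma_ for tok in doc))
    pos_tags.append(" ".join(tok.pos_ for tok in doc))

df["lemmas"] = lemmas
df["pos"] = pos_tags
df.to_csv("task2_aligned.csv", index=False)
```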

Step 2. Represent each sentence with the word2vec vectors of its individual words, and append the columns to the file.
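
One way this step could look, assuming gensim and pre-trained Google News vectors. The authors note in the comments that they concatenated the vectors of non-stopwords; the sketch below shows the simpler averaging variant purely as an illustration.

```python
# Sketch of Step 2: turn each sentence into a fixed-length word2vec representation
# and append the dimensions as columns. The embeddings, file names, and the use of
# averaging (rather than concatenation) are assumptions for illustration only.
import numpy as np
import pandas as pd
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")   # assumed pre-trained embeddings

def sentence_vector(tokens, model, dim=300):
    vectors = [model[t] for t in tokens if t in model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

df = pd.read_csv("task2_aligned.csv")        # hypothetical output of Step 1
matrix = np.vstack([sentence_vector(s.lower().split(), w2v)
                    for s in df["sentence"].fillna("")])

w2v_cols = pd.DataFrame(matrix, columns=[f"dim{i}" for i in range(matrix.shape[1])])
pd.concat([df, w2v_cols], axis=1).to_csv("task2_w2v.csv", index=False)
```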

Step 3. Feature engineering: extract the features relating to sentiment, intensifying words, etc., and append the columns to the file.
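
A sketch of a few of the engineered features from the list in the Data Understanding section; the small word lists below are illustrative placeholders, not the lexicons that were actually used.

```python
# Sketch of Step 3: simple lexicon-based feature extraction.
# The lexicons below are tiny placeholders; the real ones would be larger.
import pandas as pd

SENTIMENT_WORDS = {"disgrace", "threat", "worst", "evil", "great"}   # placeholder lexicon
INTENSIFIERS    = {"very", "extremely", "absolutely", "totally", "incredibly"}
GLITTERING      = {"patriotism", "justice", "truth", "democracy", "freedom"}

def proportion(tokens, lexicon):
    return sum(tok in lexicon for tok in tokens) / len(tokens) if tokens else 0.0

df = pd.read_csv("task2_w2v.csv")            # hypothetical output of Step 2
tokenized = df["sentence"].fillna("").str.lower().str.split()

df["sentiment"]       = [proportion(toks, SENTIMENT_WORDS) for toks in tokenized]
df["intensifiers"]    = [proportion(toks, INTENSIFIERS) for toks in tokenized]
df["glitter"]         = [proportion(toks, GLITTERING) for toks in tokenized]
df["quotation_marks"] = df["sentence"].fillna("").str.contains('"').astype(int)

df.to_csv("task2_features.csv", index=False)
```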

Modeling

Initial models are trained and tested using 5-fold cross-validation on the provided train set.

Step 1. Using simple classification methods that do not require extensive hyperparameter tuning (e.g., Naive Bayes, KNN, Logistic Regression), we first explore the effect of different ways of representing the tokenized text and preparing the training data: (1) unigrams, (2) unigrams + bigrams, (3) oversampling, (4) feature selection, (5) word2vec vectors.
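
A sketch of the cross-validation harness for this step, assuming scikit-learn and the feature file produced in Data Preparation; the column and label names are illustrative.

```python
# Sketch of Modeling Step 1: 5-fold cross-validation of logistic regression
# with class balancing on the word2vec + engineered features.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("task2_features.csv")       # hypothetical output of Data Preparation
feature_cols = ([c for c in df.columns if c.startswith("dim")]
                + ["sentiment", "intensifiers", "glitter", "quotation_marks"])

X = df[feature_cols].values
y = (df["label"] == "propaganda").astype(int)   # assumed label column/values

lr = LogisticRegression(class_weight="balanced", max_iter=1000)
f1_scores = cross_val_score(lr, X, y, cv=5, scoring="f1")
print(f"mean F1: {f1_scores.mean():.3f}")
```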

Step 2. To the best representation based on tokens, we add additional features relating to sentiment, and then optimize hyperparameters and evaluate ensemble methods (AdaBoost, Gradient Boosting, Random Forest classifiers).
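
A sketch of how the ensemble tuning in this step could look with scikit-learn; the grid values and the use of early stopping are assumptions, not the exact configuration from the notebooks.

```python
# Sketch of Modeling Step 2: grid search over gradient boosting hyperparameters,
# with early stopping on a held-out validation fraction to curb overfitting.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

gb = GradientBoostingClassifier(n_iter_no_change=10, validation_fraction=0.1)

grid = GridSearchCV(
    gb,
    param_grid={
        "n_estimators": [100, 300],
        "max_depth": [2, 3, 4],
        "min_samples_split": [2, 10, 50],
        "max_leaf_nodes": [None, 15, 31],
    },
    scoring="f1",
    cv=5,
)
grid.fit(X, y)                               # X, y as in the Step 1 sketch above
print(grid.best_params_, grid.best_score_)
```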

Evaluation

Step 1. Key findings:

  • Unigrams work just as well as unigrams + bigrams, but the resulting model is much smaller.
  • Feature selection using Fisher’s F-score did not improve on the full set of features, for any of the classification methods.
  • Oversampling (SMOTE) did not help, but the class balancing option in Logistic Regression did.
  • Word2Vec features outperform unigrams.
  • The best classification method is Logistic Regression.

Step 2. Key findings:

  • Ensemble decision trees (Random Forest, AdaBoost, GB) did not perform as well as Logistic Regression.
  • Gradient Boosting worked well but overfit the training data; we tried various ways to reduce the overfitting (early stopping, tuned tree depth, minimum samples per split, maximum leaf size, etc.), but still failed to improve on LR.
  • The most important features along with their Gini importance, obtained with the best configuration of Gradient Boosting:

    feature     importance
    sentiment   0.048
    dim107      0.033
    dim47       0.026
    glitter     0.019
    dim131      0.017
    dim88       0.016
    dim115      0.014
    dim23       0.010
    dim70       0.010
    dim111      0.009

That is, the engineered “sentiment” and “glitter” features appear to be among the most important.
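
For reference, a sketch of how such Gini importances can be read off a fitted scikit-learn Gradient Boosting model; the variable names refer to the earlier sketches.

```python
# Gini-based feature importances of the fitted gradient boosting model (sketch).
import pandas as pd

best_gb = grid.best_estimator_               # fitted model from the Modeling Step 2 sketch
importances = pd.Series(best_gb.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False).head(10))
```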

A comparison of the best-performing LR, GB, and KNN models under 5-fold cross-validation on the training set:

        prec (train)  rec (train)  f1 (train)  prec (test)  rec (test)  f1 (test)
LR      0.47          0.72         0.57        0.46         0.71        0.55
GB      0.97          0.91         0.94        0.54         0.39        0.46
KNN     0.62          0.57         0.59        0.45         0.47        0.46

Code

Below are links to the zipped Jupyter notebooks.

Code task 1: Task-1-Leopards

Code task 2: Task 2 – Leopards


7 thoughts on “Datathon – HackNews – Solution – Leopards”

  1.

    Good article, nice approach, interesting feature analysis. It definitely seems plausible that the quoted words are relevant for various types of propaganda. I am wondering what those scores that you cite are; I was expecting p-values for the Chi2 test: “threat: 18.133, would be: 16.317, die: 14.830, act: 12.543, death: 5.955”.

  2.

    Hi guys. Good work and nice article. I have a few questions for you:
    1. You say that you computed word2vec and added it to represent the sentence. Did you average the vectors for each word, or how did you do the combination?
    2. I like it that in the evaluation section you tell us the impact of different features/decisions. Nevertheless, you do not provide any numbers to better understand such an impact.
    3. It would be nice if you could package the software, rather than just pasting it here (I hope I did not miss the link!)

    1.

      Hi, thank you for your comment. Our answers:
      1. Yes, we took the word2vec vectors of the non-stopwords in each sentence and concatenated them.
      2. There is a small table in the evaluation section. The features are sorted there by their Gini importance, which is output by the Gradient Boosting implementation in scikit-learn. We agree it would be useful to have a better understanding of the importances of the features, e.g., by looking at other ways to measure feature importance, such as feature elimination with different learning methods. A task for future work.
      3. There are now two links to two zipped Jupyter notebooks – for Tasks 1 and 2.

  3.

    What I like most about your article is the feature engineering process; it seems you have invested a lot of time in it, and as a consequence the feature space is well-grounded.
