- Viktor, @vpekar, email@example.com
- Mario, @wariodoor, firstname.lastname@example.org
- The team will focus on Task 2.
- The goal of Task 2 is to automatically detect which sentences in a document contain propaganda.
- The classes are imbalanced (propaganda: 4730 vs. non-propaganda: 9534, disregarding labelled empty lines), so a class balancing/oversampling technique may help.
- A simple baseline that labels every sentence as propaganda (“everything is propaganda”) achieves F1=49.8 (P=33.1, R=100.0).
- Labels in Task 3 (specific types of propaganda, labels also on the sentence level) correspond to labels in Task 2.
- An exploration of ngrams associated with Task 3 labels shows that each type of propaganda is characterized by its own set of features, with little overlap between them.
- For example, top 5 ngrams according to Chi^2:
Labels: Exaggeration,Minimization; Appeal_to_Fear-Prejudice; Flag-Waving
the most: 24.443
would be: 16.317
the american: 73.349
disgrace to: 26.282
- This suggests the task should be approached with ensemble classifiers, which can account for the underlying structure in the data.
- The features are semantic in nature, so some form of semantic representation may be helpful.
- Some of the labels (e.g., Name-Calling, Appeal-to-Fear) relate to the sentiment polarity, so features capturing the sentiment of the sentences can also help.
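The Chi^2 ranking of ngrams mentioned above can be illustrated with a minimal sketch. It scores a bigram against one label using the standard chi-square statistic on a 2x2 table (sentence contains the bigram or not, sentence carries the label or not). The toy sentences, the toy labels, and the helper names `bigrams`, `chi2_score`, and `top_ngrams` are assumptions for illustration; the scores quoted above were presumably computed on the full training set, possibly with a library implementation that differs in detail.

```python
def bigrams(sentence):
    """Return the set of lowercase word bigrams in a sentence."""
    tokens = sentence.lower().split()
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}


def chi2_score(ngram, sentences, labels, target):
    """Chi-square statistic of the 2x2 table (ngram present?) x (label == target?)."""
    n11 = n10 = n01 = n00 = 0
    for sentence, label in zip(sentences, labels):
        has_ngram = ngram in bigrams(sentence)
        is_target = label == target
        if has_ngram and is_target:
            n11 += 1
        elif has_ngram:
            n10 += 1
        elif is_target:
            n01 += 1
        else:
            n00 += 1
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # ngram or label never (or always) occurs: no evidence
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def top_ngrams(sentences, labels, target, k=5):
    """Rank all bigrams in the data by their chi-square score for one label."""
    candidates = set().union(*(bigrams(s) for s in sentences))
    scored = [(g, chi2_score(g, sentences, labels, target)) for g in candidates]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:k]


# Toy data, invented for this example only.
toy_sentences = [
    "the american people deserve better",
    "we stand with the american flag",
    "this is the most shocking lie",
    "he is the most corrupt man alive",
    "the committee met on tuesday",
    "prices rose slightly last month",
]
toy_labels = ["Flag-Waving", "Flag-Waving", "Exaggeration",
              "Exaggeration", "None", "None"]

for ngram, score in top_ngrams(toy_sentences, toy_labels, "Flag-Waving", k=3):
    print(f"{ngram}: {score:.3f}")
```

Note that the statistic is symmetric: an ngram that is strongly associated with the absence of a label also receives a high score, so the sign of the association has to be checked separately.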
- Based on these observations, we designed the following features:
- sentiment: the proportion of sentiment-bearing words in the sentence; targeting Loaded Language
- intensifying words: the proportion of intensifying adverbs (e.g., very, extremely) in the sentence; targeting Loaded Language
- glittering words: the proportion of “glittering words” (e.g., patriotism, justice, truth, democracy) in the sentence; targeting Flag-Waving, Slogans
- superlatives: the proportion of superlative adverbs and adjectives in the sentence; targeting Loaded Language
- quotation_marks: the presence of quotation marks; targeting Appeal to Authority
- disjunctives: the presence of disjunctive conjunctions (or, either, then); targeting Black-and-White Fallacy
- causals: causal words (cause, because, therefore, thus, so); targeting Causal Oversimplification
- modal_verbs: modal verbs; targeting Appeal to Fear
- generalizing_words: generalizing words (all, everything, everyone, entire, whole; none, nothing, nobody); targeting Slogans, Appeal to Fear
- imperatives: imperative sentences; targeting Slogans, exclamations
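A minimal sketch of how some of the lexicon-based features above might be computed. The word lists for intensifiers, glittering words, disjunctives, causals, and generalizing words are taken from the examples in the list; the `SENTIMENT` lexicon, the tokenizer, the function names, and the choice of binary presence flags for causals and generalizing words are assumptions. Features that need POS tags (superlatives, modal verbs, imperatives) are left out.

```python
import re

# Illustrative lexicons. SENTIMENT is a hypothetical stand-in for a real
# sentiment lexicon; the other sets reuse the examples given in the list.
SENTIMENT = {"disgrace", "terrible", "corrupt", "great", "evil"}
INTENSIFIERS = {"very", "extremely"}
GLITTERING = {"patriotism", "justice", "truth", "democracy"}
DISJUNCTIVES = {"or", "either", "then"}
CAUSALS = {"cause", "because", "therefore", "thus", "so"}
GENERALIZING = {"all", "everything", "everyone", "entire", "whole",
                "none", "nothing", "nobody"}


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens (a simplification)."""
    return re.findall(r"[a-z']+", sentence.lower())


def proportion(tokens, lexicon):
    """Share of tokens that belong to the lexicon; 0.0 for an empty sentence."""
    return sum(t in lexicon for t in tokens) / len(tokens) if tokens else 0.0


def presence(tokens, lexicon):
    """1 if any token belongs to the lexicon, else 0."""
    return int(any(t in lexicon for t in tokens))


def extract_features(sentence):
    tokens = tokenize(sentence)
    return {
        "sentiment": proportion(tokens, SENTIMENT),
        "intensifying_words": proportion(tokens, INTENSIFIERS),
        "glittering_words": proportion(tokens, GLITTERING),
        "quotation_marks": int(any(ch in sentence for ch in '"\u201c\u201d')),
        "disjunctives": presence(tokens, DISJUNCTIVES),
        "causals": presence(tokens, CAUSALS),                # assumed binary
        "generalizing_words": presence(tokens, GENERALIZING),  # assumed binary
    }


print(extract_features("Justice or nothing"))
```

Each returned dictionary corresponds to one row of extra feature columns for a sentence.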
Step 1. Align labels with sentences, perform POS tagging and lemmatization, and save the aligned data to a CSV file.
Step 2. Represent each sentence using the word2vec vectors of its individual words, and append the resulting columns to the file.
Step 3. Feature engineering. Extract features relating to sentiment, intensifying words, etc., and append the columns to the file.
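Step 2 can be sketched as follows. The text does not say how the word vectors of a sentence are combined, so averaging them is an assumption here, as are the toy 3-dimensional `EMBEDDINGS`, the `w2v_*` column names, and the function names. A real run would load pretrained word2vec vectors and write to a file on disk rather than to a string.

```python
import csv
import io

# Toy 3-dimensional "embeddings", invented for this example.
EMBEDDINGS = {
    "american": [1.0, 0.0, 2.0],
    "people": [0.0, 1.0, 0.0],
    "disgrace": [-1.0, 0.5, 0.5],
}
DIM = 3


def sentence_vector(tokens, embeddings=EMBEDDINGS, dim=DIM):
    """Average the vectors of in-vocabulary tokens; all zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def append_vector_columns(rows):
    """rows: dicts with 'sentence' and 'label'. Returns CSV text with w2v_* columns."""
    out = io.StringIO()
    fieldnames = ["sentence", "label"] + [f"w2v_{i}" for i in range(DIM)]
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        vector = sentence_vector(row["sentence"].lower().split())
        record = {"sentence": row["sentence"], "label": row["label"]}
        record.update({f"w2v_{i}": round(v, 4) for i, v in enumerate(vector)})
        writer.writerow(record)
    return out.getvalue()


csv_text = append_vector_columns([
    {"sentence": "American people", "label": "propaganda"},
    {"sentence": "Prices rose", "label": "non-propaganda"},
])
print(csv_text)
```

Out-of-vocabulary words are simply skipped in this sketch, so a sentence with no known words gets a zero vector.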
Initial models are trained and tested using 5-fold cross-validation on the provided train set.
Step 1. Using simple classification methods that do not require extensive hyperparameter tuning (e.g., Naive Bayes, KNN, Logistic Regression), we first explore the effect of different text representations and data treatments: (1) unigrams, (2) unigrams + bigrams, (3) oversampling, (4) feature selection, (5) word2vec vectors.
Step 2. We add the features relating to sentiment to the best token-based representation, then optimize hyperparameters and evaluate ensemble methods (AdaBoost, Gradient Boosting, Random Forest classifiers).
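The evaluation setup described above (5-fold cross-validation, token-based features, a simple classifier) could look roughly like the sketch below, assuming scikit-learn is the toolkit in use. The toy sentences, the `evaluate` helper, and the specific settings (`max_iter`, shuffling, `random_state`) are assumptions rather than the team's actual configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline


def evaluate(sentences, labels, ngram_range=(1, 1)):
    """Mean train/test precision, recall and F1 over 5 stratified folds."""
    model = make_pipeline(
        CountVectorizer(ngram_range=ngram_range),  # (1, 2) adds bigrams
        LogisticRegression(max_iter=1000),
    )
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(
        model, sentences, labels, cv=cv,
        scoring=("precision", "recall", "f1"),
        return_train_score=True,
    )
    # Keep only the metric entries (drop fit_time / score_time).
    return {name: values.mean() for name, values in scores.items()
            if name.startswith(("train_", "test_"))}


# Toy data, invented for this example only.
PROPAGANDA = [
    "a disgrace to the american people",
    "the most evil regime in history",
    "they will destroy everything we love",
    "stand up for truth and justice now",
    "nobody is safe from these traitors",
]
NEUTRAL = [
    "the committee met on tuesday",
    "prices rose slightly last month",
    "the report was published online",
    "officials confirmed the schedule",
    "the bridge reopened after repairs",
]
toy_sentences = PROPAGANDA * 2 + NEUTRAL * 2
toy_labels = [1] * 10 + [0] * 10

results = evaluate(toy_sentences, toy_labels)
for name in sorted(results):
    print(f"{name}: {results[name]:.3f}")
```

Calling `evaluate(toy_sentences, toy_labels, ngram_range=(1, 2))` gives the unigrams + bigrams variant; reporting both train and test scores makes overfitting visible.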
Step 1. Key findings:
- Unigrams work just as well as unigrams + bigrams, but the resulting model is much smaller.
- Feature selection using Fisher’s F-score did not improve on the full set of features, for any of the classification methods.
- Oversampling (SMOTE) did not help, but the class balancing option in Logistic Regression did.
- Word2Vec features outperform unigrams.
- The best classification method is Logistic Regression.
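To make the class balancing finding concrete, the sketch below computes per-class weights for the class counts reported earlier (4730 propaganda vs. 9534 non-propaganda sentences). It assumes that the "class balancing option" refers to scikit-learn's `class_weight="balanced"` setting, whose documented heuristic is `n_samples / (n_classes * count_of_class)`; the function name is hypothetical.

```python
def balanced_class_weights(counts):
    """counts: dict mapping label -> number of samples. Returns label -> weight."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


weights = balanced_class_weights({"propaganda": 4730, "non-propaganda": 9534})
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
# propaganda: 1.508
# non-propaganda: 0.748
```

With these weights an error on a propaganda sentence costs about twice as much as an error on a non-propaganda sentence, so both classes contribute equally to the loss without generating synthetic samples as SMOTE does.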
Step 2. Key findings:
- Ensemble decision trees (Random Forest, AdaBoost, GB) did not perform as well as Logistic Regression.
- Gradient Boosting worked well but overfit the training data. We tried various ways to reduce the overfitting (early stopping, tuning tree depth, minimum node split, max leaf size, etc.), but still failed to improve on LR.
- The most important features along with their Gini index, obtained with the best config of Gradient Boosting:
That is, the engineered “sentiment” and “glitter” features appear very important.
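For readers who want to reproduce this kind of analysis, here is a minimal sketch of Gradient Boosting with early stopping and a feature importance ranking, assuming scikit-learn. The data is synthetic, the feature names merely echo the ones discussed above, and the hyperparameter values are placeholders, not the team's best configuration. scikit-learn exposes impurity-based importances (often called Gini importance) through `feature_importances_`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

FEATURE_NAMES = ["sentiment", "glitter", "noise"]  # hypothetical columns

# Synthetic data: the label depends strongly on column 0, weakly on
# column 1, and not at all on column 2.
rng = np.random.default_rng(0)
n = 400
X = rng.random((n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(n) > 0.75).astype(int)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may use fewer
    max_depth=2,              # shallow trees to limit overfitting
    min_samples_split=10,
    n_iter_no_change=5,       # stop when the validation score stalls
    validation_fraction=0.2,  # held-out share used for early stopping
    random_state=0,
)
gb.fit(X, y)

ranked = sorted(zip(FEATURE_NAMES, gb.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
print(f"trees actually fitted: {gb.n_estimators_}")
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances tend to favor continuous or high-cardinality features, so a ranking like this is best read as a rough indication rather than a precise measure.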
A comparison of best-performing models of LR, GB, KNN on 5-fold cross-validation on the training set:
| prec train | rec train | f1 train | prec test | rec test | f1 test |
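As a reference point when reading such scores, the “everything is propaganda” baseline quoted earlier follows directly from the class counts: precision equals the share of propaganda sentences and recall is 1. The sketch below redoes that arithmetic; the function name is hypothetical, and the result matches the quoted figures up to rounding.

```python
def all_positive_baseline(n_positive, n_negative):
    """Precision, recall and F1 of a classifier that labels everything positive."""
    precision = n_positive / (n_positive + n_negative)
    recall = 1.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = all_positive_baseline(4730, 9534)
print(f"P={100 * p:.1f} R={100 * r:.1f} F1={100 * f1:.1f}")
```

Any model in the comparison should clear this F1 to be considered useful.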
Below are zipped Jupyter notebooks.
Code task 1: Task-1-Leopards
Code task 2: Task 2 – Leopards