Team
- Viktor, @vpekar, [email protected]
- Mario, @wariodoor, [email protected]
Business Understanding
- The team will focus on Task 2.
- The goal of Task 2 is to automatically detect which sentences in a document contain propaganda.
Data Understanding
- The classes are imbalanced (propaganda: 4730 vs. non-propaganda: 9534, disregarding labelled empty lines), so a class balancing/oversampling technique may help.
- A simple baseline, “everything is propaganda”, is F1=49.8 (P=33.1, R=100.0).
- Labels in Task 3 (specific types of propaganda, also given at the sentence level) correspond to the propaganda label in Task 2.
- An exploration of ngrams associated with the Task 3 labels shows that each type of propaganda is characterized by its own set of features, with little overlap between them.
- For example, the top 5 ngrams per label according to Chi^2 (a sketch of this exploration is given after the feature list below):

Exaggeration,Minimization | Appeal_to_Fear-Prejudice | Flag-Waving
the most: 24.443 | threat: 18.133 | the american: 73.349
most: 16.023 | would be: 16.317 | nation: 35.434
important: 10.711 | die: 14.830 | americans: 29.580
worst: 10.518 | act: 12.543 | disgrace to: 26.282
absolutely: 9.730 | death: 5.955 | america: 19.111

- This suggests the task should be approached with ensemble classifiers, which would be able to account for the underlying structure in the data.
- The features are semantic in nature, so some form of semantic representation may be helpful.
- Some of the labels (e.g., Name-Calling, Appeal-to-Fear) relate to the sentiment polarity, so features capturing the sentiment of the sentences can also help.
- Based on these observations, we designed the following features:
- ngrams
- word2vec
- sentiment: the proportion of sentiment-bearing words in the sentence; targeting Loaded Language
- intensifying words: the proportion of intensifying adverbs (e.g., very, extremely) in the sentence; targeting Loaded Language
- glittering words: the proportion of “glittering words” in the sentence (e.g., patriotism, justice, truth, democracy); targeting: Flag-waving, Slogans
- superlatives: the proportion of superlative adverbs and adjectives in the sentence; targeting Loaded Language
- quotation_marks: presence of quotation marks; targeting Appeal to Authority
- disjunctives: presence of disjunctive conjunctions (or, either, then); targeting Black-and-White Fallacy
- causals: presence of causal words (cause, because, therefore, thus, so); targeting Causal Oversimplification
- modal_verbs: presence of modal verbs; targeting Appeal to Fear
- generalizing_words: presence of generalizing words (all, everything, everyone, entire, whole; none, nothing, nobody); targeting Slogans, Appeal to Fear
- imperatives: presence of imperative sentences and exclamations; targeting Slogans
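For reference, a minimal sketch of the Chi^2 ngram exploration mentioned above, assuming the sentences and their Task 3 labels are already loaded as Python lists (the helper name and the choice of scikit-learn utilities are ours, not part of the original notebooks):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

def top_ngrams_for_label(sentences, labels, target_label, k=5):
    """Rank unigrams/bigrams by Chi^2 score for a one-vs-rest split on target_label."""
    vec = CountVectorizer(ngram_range=(1, 2), lowercase=True)
    X = vec.fit_transform(sentences)
    y = np.array([1 if target_label in sentence_labels else 0 for sentence_labels in labels])
    scores, _ = chi2(X, y)                      # chi2 returns (statistics, p-values)
    terms = np.array(vec.get_feature_names_out())
    top = np.argsort(scores)[::-1][:k]
    return list(zip(terms[top], scores[top]))

# labels is assumed to be a list of sets of Task 3 labels, one set per sentence
# print(top_ngrams_for_label(sentences, labels, "Flag-Waving"))
```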
Data Preparation
Step 1. Align labels with sentences, perform POS tagging and lemmatization, and save the aligned data to a CSV file.
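A minimal sketch of Step 1, assuming one file with a sentence per line and a parallel file with a label per line, and using spaCy for tagging and lemmatization (the file names and the choice of spaCy are assumptions):

```python
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

# Assumed layout: sentences and labels in two parallel files, one item per line
sentences = open("train.sentences.txt", encoding="utf8").read().splitlines()
labels = open("train.labels.txt", encoding="utf8").read().splitlines()

rows = []
for text, label in zip(sentences, labels):
    doc = nlp(text)
    rows.append({
        "text": text,
        "label": label,
        "lemmas": " ".join(tok.lemma_ for tok in doc),
        "pos": " ".join(tok.pos_ for tok in doc),
    })

pd.DataFrame(rows).to_csv("train_aligned.csv", index=False)
```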
Step 2. Represent each sentence with the word2vec vectors of its individual words, and append the columns to the file.
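A sketch of Step 2, concatenating the word2vec vectors of the non-stopwords in each sentence; the pre-trained model, the 10-word cut-off, and the zero-padding are assumptions of this sketch:

```python
import numpy as np
import gensim.downloader as api
from nltk.corpus import stopwords    # may require nltk.download("stopwords")

w2v = api.load("word2vec-google-news-300")   # any pre-trained word-vector model would do
stop = set(stopwords.words("english"))
DIM, MAX_WORDS = 300, 10                     # assumed: truncate/pad to a fixed number of words

def sentence_vector(tokens):
    """Concatenate vectors of the first MAX_WORDS non-stopwords, zero-padding short sentences."""
    vecs = [w2v[t] for t in tokens if t.lower() not in stop and t in w2v][:MAX_WORDS]
    vecs += [np.zeros(DIM)] * (MAX_WORDS - len(vecs))
    return np.concatenate(vecs)              # yields columns dim0 ... dim(DIM*MAX_WORDS - 1)

# vectors = np.vstack(df["text"].str.split().map(sentence_vector)) gives the dim* columns
```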
Step 3. Feature engineering. Extract features relating to sentiment, intensifying words, etc., and append the columns to the file.
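A sketch of Step 3 for a few of the engineered features; the word lists below are small placeholders for the actual lexicons used:

```python
INTENSIFIERS = {"very", "extremely", "absolutely", "totally", "really"}          # placeholder list
GLITTERING = {"patriotism", "justice", "truth", "democracy", "freedom"}          # placeholder list
SENTIMENT = {"terrible", "horrible", "disgrace", "evil", "great", "wonderful"}   # placeholder lexicon

def proportion(tokens, lexicon):
    tokens = [t.lower() for t in tokens]
    return sum(t in lexicon for t in tokens) / max(len(tokens), 1)

def engineered_features(tokens, raw_text):
    """Return the sentence-level feature columns appended to the CSV."""
    lowered = [t.lower() for t in tokens]
    return {
        "sentiment": proportion(tokens, SENTIMENT),
        "intensifiers": proportion(tokens, INTENSIFIERS),
        "glitter": proportion(tokens, GLITTERING),
        "quotation_marks": int('"' in raw_text or "\u201c" in raw_text),
        "disjunctives": int(any(t in {"or", "either"} for t in lowered)),
        "causals": int(any(t in {"cause", "because", "therefore", "thus", "so"} for t in lowered)),
        "generalizing_words": int(any(t in {"all", "everything", "everyone", "nothing", "nobody"} for t in lowered)),
    }
```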
Modeling
Initial models are trained and tested using 5-fold cross-validation on the provided train set.
Step 1. Using simple classification methods that do not require extensive hyperparameter tuning (e.g., Naive Bayes, KNN, Logistic Regression), we first explore the effect of different ways of representing the tokenized text and handling the data: (1) unigrams, (2) unigrams + bigrams, (3) oversampling, (4) feature selection, (5) word2vec vectors.
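A sketch of the Step 1 comparison in scikit-learn, assuming sentences (the lemmatized texts) and y (0/1 labels) come from the prepared CSV; the pipelines and scoring choice are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

representations = {
    "unigrams": CountVectorizer(ngram_range=(1, 1)),
    "unigrams + bigrams": CountVectorizer(ngram_range=(1, 2)),
}

for name, vectorizer in representations.items():
    pipe = make_pipeline(vectorizer, LogisticRegression(max_iter=1000, class_weight="balanced"))
    f1 = cross_val_score(pipe, sentences, y, cv=5, scoring="f1")
    print(f"{name}: F1 = {f1.mean():.3f}")
```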
Step 2. To the best representation based on tokens, we add additional features relating to sentiment, and then optimize hyperparameters and evaluate ensemble methods (AdaBoost, Gradient Boosting, Random Forest classifiers).
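A sketch of the Step 2 hyperparameter search for Gradient Boosting; X is assumed to be the matrix of token-based features plus the engineered columns, and the grid values are illustrative:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    GradientBoostingClassifier(n_iter_no_change=10, validation_fraction=0.1),  # early stopping
    param_grid={
        "n_estimators": [100, 300],
        "learning_rate": [0.05, 0.1],
        "max_depth": [2, 3, 4],
        "min_samples_split": [2, 10, 50],
    },
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```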
Evaluation
Step 1. Key findings:
- Unigrams work just as well as unigrams + bigrams, but the resulting model is much smaller.
- Feature selection using Fisher’s F-score did not improve on the full set of features, for any of the classification methods.
- Oversampling (SMOTE) did not help, but the class balancing option in Logistic Regression did (see the sketch after this list).
- Word2Vec features outperform unigrams.
- The best classification method is Logistic Regression.
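A sketch of the two balancing options compared above, assuming imbalanced-learn is installed for SMOTE; only the class_weight option improved the scores in our runs:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Option 1: oversample the minority (propaganda) class inside each CV fold
smote_lr = ImbPipeline([("smote", SMOTE()), ("lr", LogisticRegression(max_iter=1000))])

# Option 2: reweight the classes directly in Logistic Regression
balanced_lr = LogisticRegression(max_iter=1000, class_weight="balanced")

for name, model in [("SMOTE + LR", smote_lr), ("class_weight LR", balanced_lr)]:
    f1 = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: F1 = {f1.mean():.3f}")
```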
Step 2. Key findings.
- Ensemble decision trees (Random Forest, AdaBoost, GB) did not perform as well as Logistic Regression.
- Gradient Boosting worked well but overfit the training data; we tried various ways to reduce the overfitting (early stopping, tuned tree depth, minimum node split, max leaf size, etc.), but still failed to improve on LR.
- The most important features along with their Gini importance, obtained with the best configuration of Gradient Boosting:

Feature | Gini importance
sentiment | 0.048
dim107 | 0.033
dim47 | 0.026
glitter | 0.019
dim131 | 0.017
dim88 | 0.016
dim115 | 0.014
dim23 | 0.010
dim70 | 0.010
dim111 | 0.009
That is, the engineered “sentiment” and “glitter” features appear to be very important.
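A sketch of how such a ranking can be read off a fitted model in scikit-learn, assuming grid is the search from the Modeling section and feature_names lists the engineered and dim* columns in order:

```python
import numpy as np

importances = grid.best_estimator_.feature_importances_   # impurity-based (Gini) importances
top = np.argsort(importances)[::-1][:10]
for i in top:
    print(f"{feature_names[i]:<12} {importances[i]:.3f}")
```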
A comparison of the best-performing LR, GB, and KNN models under 5-fold cross-validation on the training set:

Model | prec train | rec train | f1 train | prec test | rec test | f1 test
LR | 0.47 | 0.72 | 0.57 | 0.46 | 0.71 | 0.55
GB | 0.97 | 0.91 | 0.94 | 0.54 | 0.39 | 0.46
KNN | 0.62 | 0.57 | 0.59 | 0.45 | 0.47 | 0.46
Code
Below are links to the zipped Jupyter notebooks.
Code task 1: Task-1-Leopards
Code task 2: Task 2 – Leopards
7 thoughts on “Datathon – HackNews – Solution – Leopards”
Good article, nice approach, interesting feature analysis. It definitely seems plausible that the quoted words are relevant for various types of propaganda. I am wondering what the scores you cite are; I was expecting p-values for the Chi2 test.
```
threat: 18.133
would be: 16.317
die: 14.830
act: 12.543
death: 5.955
```
Thank you for the comment. These values are the actual Chi^2 statistics, not p-values.
Nice article, with a lot of detail and a feature-rich approach that makes a lot of sense.
Thank you!
Hi guys. Good work and nice article. I have a few questions for you:
1. You say that you computed word2vec and added it to represent the sentence. Did you average the vectors for each word, or how did you do the combination?
2. I like it that in the evaluation section you tell us the impact of different features/decisions. Nevertheless, you do not provide any numbers to better understand such an impact.
3. It would be nice if you could package the software, rather than just pasting it here (I hope I did not miss the link!)
Hi, thank you for your comment. Our answers:
1. We took the word2vec vectors of the non-stopwords in each sentence and concatenated them.
2. There is a small table in the evaluation section. The features are sorted there by their Gini importance, which is output by the Gradient Boosting implementation in scikit-learn. We agree it would be useful to have a better understanding of the feature importances, e.g., by measuring importance in other ways, such as feature elimination with different learning methods. A task for future work.
3. There are now two links to two zipped Jupyter notebooks – for Tasks 1 and 2.
What I like most about your article is the feature engineering process. It seems you’ve invested a lot of time in it, and as a consequence the feature space is well-grounded.