Team Members (datachat user name)
- Thomas Arnold (thomasarnold)
- Gisela Vallejo (gvallejo)
- Yang Gao (yg211)
- Tilman Beck (tbtuda)
- Nils Reimers (reimers)
- Jonas Pfeiffer (jopfeiff)
Toolset
- Keras – GitHub
- PyTorch – Site
- scikit-learn – Site
- Flair – Paper, GitHub
- BERT – Paper, GitHub
- Stanford Politeness API – GitHub
- Sentence Specificity Predictor – GitHub
- InferSent – GitHub
The task comprises three sub-tasks: 1) document classification, 2) sentence classification, and 3) sequence labeling. The data is provided by the task organizers.
Besides sentence splitting and tokenization, no task-specific data preparation was performed.
Task 1. Document Classification
- Dependency Based Embeddings (Komninos, A., & Manandhar, S. (2016)) – Download
- Distributed Representations of Words (Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013)) – Download
- Global Vectors for Word Representation (Pennington, J., Socher, R., & Manning, C. (2014)) – Download
- Dependency-based word embeddings (Levy, O., & Goldberg, Y. (2014)) – Download
- GloVe + sentiment embeddings (Urban Dictionary) (Tang, D., et al. (2016)) – Download
- FastText embeddings (Joulin, A., et al. (2016)) – Download
- CNN + LSTM
We tested every combination of embeddings and methods. In addition, we tried an ensemble approach that combines these predictions via a majority vote or an additional CNN on top.
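The majority-vote combination can be sketched in a few lines of plain Python (a minimal sketch; `majority_vote` is a hypothetical helper for illustration, not code from our repository):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the label predictions of several models by majority vote.

    predictions: a list of equally long label lists, one per model.
    Returns one label per instance; ties go to the label that reaches
    the top count first in Counter's ordering.
    """
    combined = []
    for labels in zip(*predictions):  # labels for one instance across models
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined
```

For example, with three models voting on two documents, `majority_vote([["prop", "non"], ["prop", "prop"], ["non", "prop"]])` yields `["prop", "prop"]`.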
Task 2. Sentence Classification
We explored both hand-crafted features with classic classification algorithms and modern neural methods.
For hand-crafted features, we used the logarithm of the sentence length (log_length), the politeness score of the sentence (plt), and the specificity score of the sentence (spec). plt is computed with the Stanford Politeness API, and spec with the Sentence Specificity Predictor; links to both tools can be found in the ‘Toolset’ section. We used logistic regression and an SVM as classifiers.
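The hand-crafted feature vector for a single sentence can be sketched as follows (a minimal sketch: `sentence_features` is a hypothetical helper, we assume log_length is the log of the token count, and plt/spec are precomputed by the external tools):

```python
import math

def sentence_features(sentence, plt_score, spec_score):
    """Build the three hand-crafted features for one sentence:
    log_length, politeness (plt), and specificity (spec).
    plt_score and spec_score are assumed to come from the Stanford
    Politeness API and the Sentence Specificity Predictor.
    """
    tokens = sentence.split()  # naive whitespace tokenization for the sketch
    return [math.log(len(tokens)), plt_score, spec_score]
```

The resulting vectors can then be fed directly to a logistic regression or SVM classifier.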
For neural methods, we tried several techniques:
- InferSent sentence embeddings: we use the pretrained InferSent model to represent sentences, concatenate these embeddings with our hand-crafted features, and feed the combined feature vector to a logistic-regression classifier.
- CNN: we use the text-classification CNN architecture proposed by Kim (2014). The pretrained GloVe + Urban Dictionary embeddings are used as our word representations.
- BERT: we used the Hugging Face library to fine-tune and evaluate a BERT language model on the task.
Task 3. Sequence Labeling
Here is our repository for NER.
We use Flair (Paper, GitHub) for sequence labeling. 200-dimensional GloVe word embeddings were extended with Urban Dictionary embeddings. Further, we added several manual features for each token, which were one-hot-encoded and appended to the embedding vector. A total of 30 categories were added, including:
- ethnic slur
- potentially offensive
- racial slur
- by synecdoche
- figure of speech
- rhetorical question
(Taken from Christian Meyer’s Dissertation, citation will follow)
As Flair achieves state-of-the-art results on many NER tasks, we used it largely out of the box and only changed the embeddings. This gave a small boost on our own training/dev/test split compared to plain GloVe embeddings, so we went forward using only our combined embeddings. A major problem we faced is the label ‘repetition’: since we only consider single sentences, it is impossible for the model to identify entities that were mentioned in earlier sentences of a document. We did not tackle this problem due to time constraints. We also faced a class-imbalance problem: some labels occur very infrequently, and our models were not able to learn good representations of these classes. We did not tackle this problem either, for the same reason.
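The token-feature augmentation described above (one-hot category features appended to the GloVe + Urban Dictionary embedding) can be sketched as follows; the `CATEGORIES` list here is only a subset of the 30 categories, and `augment_embedding` is an illustrative helper, not our actual code:

```python
# Illustrative subset of the 30 manual token categories.
CATEGORIES = ["ethnic slur", "potentially offensive", "racial slur",
              "by synecdoche", "figure of speech", "rhetorical question"]

def augment_embedding(embedding, token_categories):
    """Append a one-hot (multi-hot) category vector to a token embedding.

    embedding: list of floats (e.g. the 200-dim GloVe vector extended
    with Urban Dictionary dimensions).
    token_categories: set of category names that apply to this token.
    """
    one_hot = [1.0 if cat in token_categories else 0.0 for cat in CATEGORIES]
    return embedding + one_hot
```

Each token's vector thus grows by one dimension per category, set to 1.0 where the category applies.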
1. Document Classification
For this binary classification task with a very unbalanced class distribution, we evaluate precision, recall, and F1 score.
2. Sentence Classification
We compute the F1 score for both the propaganda and non-propaganda classes. In addition, we compute the accuracy. As suggested by the guidelines, we use the propaganda F1 as the primary metric for measuring the performance of our models.
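Per-class precision, recall, and F1 (as used for the propaganda class) can be computed as follows (a minimal sketch; `f1_for_label` is a hypothetical helper):

```python
def f1_for_label(gold, pred, label="propaganda"):
    """Precision, recall, and F1 for a single class (e.g. propaganda)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```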
3. Sequence Labeling
We compute and optimize the per-token accuracy and F1 score.
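Per-token accuracy over a set of labeled sequences reduces to a token-level match count (a minimal sketch; `per_token_accuracy` is a hypothetical helper):

```python
def per_token_accuracy(gold_seqs, pred_seqs):
    """Fraction of tokens whose predicted label equals the gold label."""
    correct = total = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        for g, p in zip(gold, pred):
            correct += (g == p)
            total += 1
    return correct / total if total else 0.0
```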
Data Analysis (sentence-level)
We analyze the distributions of plt, spec, and length for both propaganda and non-propaganda sentences.
- label non-propaganda, politeness: min -0.746, max 0.870, mean 0.127, median 0.120, std 0.186
- label propaganda, politeness: min -0.569, max 0.802, mean 0.114, median 0.112, std 0.203
- Observation: both non-propaganda and propaganda sentences’ politeness scores are approximately normally distributed, and the two distributions are very similar.
- Observation: non-propaganda sentences are on average slightly more polite than propaganda sentences. This observation is consistent with the definition of propaganda adopted in this hackathon.
- label propaganda, specificity: min 0.000233, max 1.0, mean 0.552, median 0.619, std 0.389
- label non-propaganda, specificity: min 0.000151, max 1.0, mean 0.470, median 0.409, std 0.383
- Observation: propaganda sentences are on average more specific than non-propaganda sentences. Both classes show a U-shaped distribution: a high fraction of sentences receive scores near the maximum or minimum, and only a small fraction receive mid-range scores.
- label propaganda, length: min 1, max 153, mean 28.78, median 25.0, std 18.27
- label non-propaganda, length: min 1, max 132, mean 21.43, median 18.0, std 14.90
- Observation: propaganda sentences are on average longer than non-propaganda sentences.
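Summary statistics like those listed above can be reproduced with Python’s standard statistics module (an illustrative sketch with toy values, not the task data; std here is the sample standard deviation):

```python
import statistics

def summarize(scores):
    """min / max / mean / median / sample std of a list of scores."""
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "std": statistics.stdev(scores),
    }
```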