Team Members (datachat user name)
- Thomas Arnold (thomasarnold)
- Gisela Vallejo (gvallejo)
- Yang Gao (yg211)
- Tilman Beck (tbtuda)
- Nils Reimers (reimers)
- Jonas Pfeiffer (jopfeiff)
Toolset
- Keras – GitHub
- PyTorch – Site
- scikit-learn – Site
- Flair – Paper, GitHub
- BERT – Paper, GitHub
- Stanford Politeness API: GitHub
- Sentence Specificity Predictor: GitHub
- InferSent: GitHub
The data is split into three sub-tasks: 1) document classification, 2) sentence classification, and 3) sequence labeling. The data is provided by the task organizers.
Besides sentence splitting and tokenization, no specific data preparation was performed.
Task 1. Document Classification
- Dependency Based Embeddings (Komninos, A., & Manandhar, S. (2016)) – Download
- Distributed Representations of Words (Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013)) – Download
- Global Vectors for Word Representation (Pennington, J., Socher, R., & Manning, C. (2014)) – Download
- Dependency-based word embeddings (Levy, Omer, and Yoav Goldberg (2014)) – Download
- GloVe + Sentiment Embeddings (urban dictionary) (Tang, Duyu, et al. (2016)) – Download
- FastText embeddings (Joulin, Armand, et al. (2016)) – Download
- CNN + LSTM
We tested every combination of embeddings and methods. In addition, we tried an ensemble approach that combines these predictions either by majority vote or with another CNN on top.
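The majority-vote variant of the ensemble can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `majority_vote`, the 0/1 label encoding, and the tie-breaking rule (the earliest model in the list wins) are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per document.

    predictions_per_model: list of equally long label lists, one per model
    (hypothetical input format). Ties go to the label predicted by the
    earliest model in the list (an assumption, not taken from the article).
    """
    if not predictions_per_model:
        raise ValueError("need predictions from at least one model")
    n_docs = len(predictions_per_model[0])
    if any(len(p) != n_docs for p in predictions_per_model):
        raise ValueError("all models must predict the same number of documents")

    combined = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # Walk the votes in model order so ties are broken deterministically.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


# Three hypothetical models voting on three documents (1 = propaganda).
ensemble = majority_vote([[1, 0, 1], [1, 1, 0], [0, 1, 0]])
```

With an odd number of models and binary labels no ties can occur, so the tie-breaking rule only matters for an even number of models.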
Task 2. Sentence Classification
We explored both hand-crafted features with classic classification algorithms and modern neural methods.
For the hand-crafted features, we used the logarithm of the length of the sentence (log_length), the politeness score of the sentence (plt), and the specificity score of the sentence (spec). plt is computed with the Stanford Politeness API, and spec is computed by the Sentence Specificity Predictor. Links for both tools can be found in the ‘Toolset’ section. We used logistic regression and SVM as the classifiers.
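The three hand-crafted features could be assembled into one vector per sentence roughly like this. It is only a sketch under stated assumptions: the function name is hypothetical, length is measured in tokens, the natural logarithm is used, and the plt and spec values are taken as already computed by the external tools (their APIs are not reproduced here).

```python
import math


def handcrafted_features(tokens, plt, spec):
    """Return [log_length, plt, spec] for one tokenized sentence.

    tokens: list of token strings (length in tokens is an assumption).
    plt, spec: scores assumed to come from the politeness and
    specificity tools; they are passed in, not computed here.
    """
    if not tokens:
        raise ValueError("sentence must contain at least one token")
    log_length = math.log(len(tokens))  # natural log, an assumption
    return [log_length, plt, spec]


# Hypothetical sentence with made-up scores, for illustration only.
features = handcrafted_features(["This", "is", "an", "example", "."], 0.12, 0.55)
```

Vectors of this shape can then be passed to a classifier such as logistic regression or an SVM.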
For the neural methods, we tried several techniques:
- InferSent sentence embedding: we use the pre-trained InferSent embedding to represent sentences, augment it with our hand-crafted features, and feed the concatenated features to a logistic-regression classifier.
- CNN: we use the text classification CNN architecture proposed by Kim (2014). The pre-trained GloVe + Urban Dictionary embeddings are used as our word representation.
- BERT: we used the Hugging Face interface to train and evaluate the BERT language model on the task.
Task 3. Sequence Labeling
Here is our Repository for NER
We use Flair (Paper, GitHub) for sequence labeling. The 200-dimensional GloVe word embeddings were extended with Urban Dictionary embeddings. Further, we added some manual features for each token, which were one-hot encoded and appended to the embedding vector. The following 30 categories were added:
- ethnic slur
- potentially offensive
- racial slur
- by synecdoche
- figure of speech
- rhetorical question
(Taken from Christian Meyer’s Dissertation, citation will follow)
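Appending the category indicators to a token's embedding can be sketched as below. This is an illustration only: it uses just the six categories listed above rather than all 30, the function names are hypothetical, and allowing a token to carry several categories at once is an assumption. It does not show how the features were wired into Flair.

```python
# Only the six categories named in the text; the full set has 30.
CATEGORIES = [
    "ethnic slur",
    "potentially offensive",
    "racial slur",
    "by synecdoche",
    "figure of speech",
    "rhetorical question",
]


def category_indicator(token_categories):
    """Return one 0/1 entry per category, in the order of CATEGORIES."""
    active = set(token_categories)
    return [1.0 if name in active else 0.0 for name in CATEGORIES]


def extend_embedding(embedding, token_categories):
    """Concatenate a word embedding with the category indicator vector."""
    return list(embedding) + category_indicator(token_categories)


# A toy 2-dimensional embedding stands in for the 200-dimensional one.
extended = extend_embedding([0.5, -0.2], ["racial slur"])
```

The extended vector has the original embedding dimensionality plus one dimension per category.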
Because Flair achieves state-of-the-art results on many NER tasks, we used it largely out of the box and changed only the embeddings. This gave a small boost on our own training/dev/test split compared to plain GloVe embeddings, so we went forward with only our combined embeddings. A major problem was the label ‘repetition’: because we only consider single sentences, the model cannot identify entities that were mentioned in earlier sentences of a document. We did not tackle this problem due to time constraints. A second problem was that the classes are not uniformly distributed, and some labels occur very infrequently. Our models were not able to learn a good representation of these classes. We did not tackle this problem either, again due to time constraints.
1. Document Classification
For this binary classification task with a highly imbalanced class distribution, we evaluate on precision, recall, and F1 score.
2. Sentence Classification
We compute the F1 score for both propaganda and non-propaganda sentences. In addition, we compute the accuracy. As suggested by the guideline, we use propaganda F1 as the primary metric for measuring the performance of our models.
3. Sequence Labeling
We compute and optimize the per-token accuracy and F1 score.
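For reference, precision, recall, and F1 for a single class can be computed as in the following sketch. The function name and the label strings are assumptions chosen for the example; the official scorer of the task organizers may differ in details such as how empty predictions are handled (here they yield 0.0).

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 of the class `positive`.

    gold, predicted: equally long lists of labels.
    Undefined ratios (no predicted or no gold positives) are set to 0.0,
    which is a convention chosen for this sketch.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy sentence-level example with hypothetical labels.
gold = ["propaganda", "non-propaganda", "propaganda", "non-propaganda"]
pred = ["propaganda", "propaganda", "non-propaganda", "non-propaganda"]
scores = precision_recall_f1(gold, pred, positive="propaganda")
```

Calling the function once per class gives the propaganda F1 and the non-propaganda F1 separately.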
Data Analysis (sentence-level)
We analyze the distribution of the plt, spec and length of both propaganda and non-propaganda sentences.
- label non-propaganda, politeness min -0.7457724, max 0.8695088, mean 0.12661527089278823, median 0.120053355, std 0.18571401615227837
- label propaganda, politeness min -0.5694593, max 0.8022774, mean 0.11355040703585392, median 0.112377345, std 0.20276169406821773
- both non-propaganda and propaganda sentences’ politeness scores are normally distributed, and their distributions are very similar.
- non-propaganda sentences on average are more polite than propaganda sentences. This observation is consistent with the definition of propaganda sentences adopted in this hackathon.
- label propaganda, specificity min 0.000233, max 1.0, mean 0.552421702536998, median 0.6192215, std 0.3894592453286332
- label non-propaganda, specificity min 0.000151, max 1.0, mean 0.4703012315921964, median 0.4089935, std 0.38341124768256996
- Observation: propaganda sentences are on average more specific than non-propaganda sentences. In particular, both propaganda and non-propaganda sentences have the U-shape distribution: a high fraction of the sentences receive the maximum and minimum scores, and only a small fraction receive the middle-range scores.
- label propaganda, length min 1, max 153, mean 28.77674418604651, median 25.0, std 18.2704849951449
- label non-propaganda, length min 1, max 132, mean 21.42825676526117, median 18.0, std 14.900982278994249
- Observation: propaganda sentences are longer.
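Summary statistics of this kind can be reproduced with a few lines of standard-library Python. The sketch below is not the original analysis script: the function name is hypothetical, and it uses the population standard deviation, which is an assumption since the text does not say which variant was reported.

```python
import statistics


def summarize(values):
    """Return min, max, mean, median, and standard deviation of `values`.

    The population standard deviation (pstdev) is an assumption; the
    sample standard deviation would give slightly larger numbers.
    """
    if not values:
        raise ValueError("need at least one value")
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),
    }


# Made-up sentence lengths, for illustration only.
summary = summarize([1, 2, 3, 4])
```

Running it once per label and per feature (plt, spec, length) yields one line of statistics for each group.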
Nice article, nice approach, and great results on Task 3!
Just one thing: it is unclear what resources need to be downloaded to make the attached code work. The code has many hardcoded paths to files that do not exist. E.g., where do we get the Urban Dictionary from?
Hello Preslav, I did not upload any data because I was not sure if I am allowed to upload the datathon data. I will make my whole repository including the data and embeddings available in google drive rep
The whole rep which also includes the winning model can be downloaded here:
The winning model, which was used for both the dev and the test predictions, can be found in `resources/taggers/best_model/*`.
For Task 1, the article was updated with all links where you can download the pre-trained embeddings used in our experiments.
Hi guys. Good work and nice article. I have a question for you:
You mention that one problem for your model was the class imbalance (the fact that some classes have very few representatives). Where is the threshold for the frequency? I mean, how many instances of a class do you consider you would need in order to come out with a reasonable predictor?
This is very difficult to define because, as always, it depends, in this case also on the difficulty of the problem. The easier a class is to predict, the less data you need. Much more important, however, is to have clean class labels, which often did not seem to be the case in this data set. I would suggest having more annotators and calculating the inter-annotator agreement. It seems to me that different annotators worked on each document separately and that their understanding of the task often differed, which led to noisy labels.
Hello, you are a clear winner in the hardest Task 3, and you did reasonably well in Task 1: even though there are quite a few teams ahead of you, the difference to the top team is just 0.05 or so. You also seem to have built one of the most complex models. Yet, on Task 2, you did not place in the top 8. How do you explain that given that you did reasonably well in Task 1, which is similar?
You actually answer the question in your video, slightly before the three minute mark. You used a CNN model instead of BERT due to some implementation difficulties. The winners in Task 2 did use BERT but it is commendable that you are aware of the model and tried it.
Thanks for your comment. We chose different approaches for each task. Regarding Task 2, we were not able to identify the error in our BERT implementation, and in the end there was not enough time to adapt the architectures used in Task 1. However, as you said, we are aware that BERT performs very well on a range of problems, and we will definitely use it in the future.