We are developing machine learning-based natural language processing tools to identify propaganda-related information in news articles within the scope of the Hack the News Datathon. The challenge is divided into three tasks: news article classification, sentence classification, and token-level information extraction.
Our exploration of the data showed that it is imbalanced in terms of label distribution, document length, sentence length, and annotated text span length. For instance, the training data for task 1 consists of 4,021 propaganda and 31,972 non-propaganda news articles. The training data for task 2 contains a similar imbalance. Notably, the task 2 data contains 907 empty instances.
Since we do not control the labeling and annotation processes, we mostly did not do any preparation other than converting the data to the format required by the scripts we use to train the machine learning models.
We used the data for task 1 as provided. For task 2, the empty instances were excluded from every step of the analysis and training, and the empty sentences in the development and test sets were assigned as non-propaganda directly. Finally, we eliminated stop words for task 3.
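As a rough illustration, the sketch below shows the shape of these preparation steps in Python. The file format, column layout, and function names here are hypothetical; the actual scripts are in our repository.

```python
import csv

from nltk.corpus import stopwords  # assumes the NLTK stopword list is installed

STOP_WORDS = set(stopwords.words("english"))


def load_task2_training(path):
    """Read task 2 training sentences, excluding the empty instances."""
    instances = []
    with open(path, encoding="utf-8") as f:
        for sentence, label in csv.reader(f, delimiter="\t"):  # hypothetical layout
            if sentence.strip():  # skip the empty sentences entirely
                instances.append((sentence, label))
    return instances


def label_task2_sentence(sentence, model_prediction):
    """Empty dev/test sentences are assigned non-propaganda directly."""
    return model_prediction if sentence.strip() else "non-propaganda"


def remove_stop_words(tokens):
    """Stop word elimination used before the task 3 keyword extraction."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]
```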
Please find the details of the data preparation and processing steps in the code repository linked at the bottom of this article.
TASK 1 & TASK 2
We used Google's recently published Bidirectional Encoder Representations from Transformers (BERT) pretrained model (Devlin et al., 2018). For each task, after processing the given training data, we fine-tuned the "bert-base-uncased" pretrained model with the appropriate data and selected the best-scoring model on a previously separated held-out set, which is 10% of the training data. We preserved the label ratio of the training set in the held-out sample. Using the fine-tuned model, we generated predictions for the development and test sets and produced output files following the given examples.
BERT-base consists of 12 layers, 12 attention heads, and 110M parameters. We used a maximum sequence length of 256 for both tasks. This configuration was chosen in light of our experience in classifying news articles for a different task, where a sequence length of 512 had not yielded better performance than 256. The task-specific optimization was performed for task 1 and task 2 separately. Consequently, we created two models, one for task 1 and one for task 2.
The challenge of this optimization was the time it takes to train a model and generate predictions. Therefore, we could only draw on our previous experience to choose a configuration that was likely to perform well. Another iteration of optimization that takes the characteristics of this particular dataset and task into account could possibly improve the overall performance.
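The following is a minimal sketch of this fine-tuning setup, assuming the HuggingFace transformers and scikit-learn libraries. The loader name and file path are hypothetical; the exact training scripts are in our repository.

```python
from sklearn.model_selection import train_test_split
from transformers import BertForSequenceClassification, BertTokenizer

# Hypothetical loader; the actual data handling lives in our repository.
texts, labels = load_articles("task1.train.txt")

# Separate a 10% held-out set, preserving the label ratio via stratification.
train_texts, heldout_texts, train_labels, heldout_labels = train_test_split(
    texts, labels, test_size=0.10, stratify=labels, random_state=42
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # propaganda vs. non-propaganda
)

# Inputs are truncated and padded to a maximum sequence length of 256.
encodings = tokenizer(
    train_texts,
    truncation=True,
    padding="max_length",
    max_length=256,
    return_tensors="pt",
)
# The fine-tuning loop itself (optimizer, epochs, selection of the
# best-scoring checkpoint on the held-out set) follows the standard
# BERT sequence classification recipe and is omitted here.
```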
TASK 3
We used a keyword-counter-based system to detect keywords that occur frequently in the training dataset for each label. First, the frequently occurring words for each label are extracted from the training set. Second, these keywords are weighted according to their orthographic features (capitalization etc.) and their frequency of occurrence in the dataset. In order to obtain stable weights, the log of the inverse frequency is used for each keyword.
During prediction, we used a sentence-level approach and evaluated the score of each label on each sentence individually. This approach has the advantage of producing results quickly, with the tradeoff of ignoring predictions made for neighboring sentences. The highest-scoring label is returned as the prediction for a given fragment.
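The sketch below illustrates this keyword weighting and sentence-level scoring. The keyword cutoff and the capitalization bonus factor are simplified assumptions for illustration, not the exact values used in our system.

```python
import math
from collections import Counter


def build_keyword_weights(sentences_by_label):
    """For each label, weight its frequent keywords by log inverse frequency."""
    weights = {}
    for label, sentences in sentences_by_label.items():
        counts = Counter(tok for s in sentences for tok in s.split())
        total = sum(counts.values())
        label_weights = {}
        for token, count in counts.most_common(100):  # assumed keyword cutoff
            weight = math.log(total / count)  # log of the inverse frequency
            if token[:1].isupper():  # orthographic feature: capitalization
                weight *= 1.5  # assumed bonus factor
            label_weights[token] = weight
        weights[label] = label_weights
    return weights


def predict_label(sentence, weights):
    """Score every label on a single sentence and return the highest scorer."""
    scores = {
        label: sum(kw.get(tok, 0.0) for tok in sentence.split())
        for label, kw in weights.items()
    }
    return max(scores, key=scores.get)
```

In use, `predict_label` is applied to each sentence of an article in turn, which is what makes the approach fast but blind to neighboring sentences.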
[Figure: 15 example keywords extracted for the causal oversimplification label from the training set.]
RESULTS
All scores are computed on the positive class, since the official evaluation metric provided by the Datathon organizers is the F1 score on that class. The F1 score for task 3 takes partial matches into account as well.
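As an illustration of the idea of partial matching, the sketch below computes an F1 score over character spans in which each span is credited in proportion to its overlap. This is only an illustration; the organizers' official scoring script defines the exact metric.

```python
def overlap(a, b):
    """Length of the character overlap between two (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def partial_f1(predicted, gold):
    """F1 where every span is credited in proportion to its best overlap.

    Spans are (start, end) pairs with start < end.
    """
    if not predicted or not gold:
        return 0.0
    # Precision: how much of each predicted span is covered by some gold span.
    precision = sum(
        max(overlap(p, g) for g in gold) / (p[1] - p[0]) for p in predicted
    ) / len(predicted)
    # Recall: how much of each gold span is covered by some predicted span.
    recall = sum(
        max(overlap(g, p) for p in predicted) / (g[1] - g[0]) for g in gold
    ) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```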
Our results for each task and evaluation set are provided below in terms of F1 score. The results on the held-out set were calculated by us; the rest were calculated by the online submission system of the Datathon.
- TASK 1
  - Held-out set: 0.8685
  - Development set: 0.8631
  - Test set: 0.8530
- TASK 2
  - Held-out set: 0.6279
  - Development set: 0.6307
  - Test set: 0.6336
- TASK 3
  - Development set: 0.0398
  - Test set: 0.0291
Our submissions ranked second, first, and fifth on the test set for task 1, task 2, and task 3, respectively.
The project is open source and any collaborative effort is more than welcome. The code we have developed can be found in our GitHub repository. Please contact us if you would like to try the models we created yourself.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.