Business Understanding
We are developing machine-learning-based natural language processing tools to identify propaganda-related information in news articles within the scope of the Hack the News Datathon. The challenge is divided into three tasks: news article classification, sentence classification, and token-level information extraction.
The project is open source and any collaborative effort is more than welcome. The code we have developed can be found in our GitHub repository. Please contact us if you would like to try the models yourself.
Data Understanding
Our exploration showed that the data is imbalanced in terms of label distribution, document length, sentence length, and annotated text span length. For instance, the training data for task 1 consists of 4,021 propaganda and 31,972 non-propaganda news articles, and the training data for task 2 contains a similar imbalance. Remarkably, the task 2 data contains 907 empty instances.
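As a quick illustration, the label distribution can be inspected with a few lines of pandas; the file name and column layout below are hypothetical placeholders for the actual Datathon files.

```python
import pandas as pd

# File name and column layout are hypothetical placeholders for the Datathon files.
train = pd.read_csv("task1_train.txt", sep="\t", header=None,
                    names=["text", "article_id", "label"])

counts = train["label"].value_counts()
print(counts)                       # e.g. non-propaganda: 31972, propaganda: 4021
print(counts.max() / counts.min())  # an imbalance ratio of roughly 8:1
```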
Data Preparation
Since we do not control the labeling and annotation processes, we mostly did no preparation other than converting the data to the format required by the scripts we use to train the machine learning models.
We used the data for task 1 as provided. For task 2, the empty instances were excluded from every step of the analysis and training, and the empty sentences in the development and test sets were directly labeled as non-propaganda. Finally, we removed stop words for task 3.
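A minimal sketch of these steps is given below, assuming the sentences and labels are held in parallel Python lists and that NLTK's English stop-word list is used; the actual helpers in our repository may differ.

```python
from nltk.corpus import stopwords  # assumes the NLTK stop-word corpus has been downloaded

STOP_WORDS = set(stopwords.words("english"))

def drop_empty_training_sentences(sentences, labels):
    """Task 2: exclude empty instances from analysis and training.
    `sentences` and `labels` are parallel lists (hypothetical structure)."""
    kept = [(s, l) for s, l in zip(sentences, labels) if s.strip()]
    sentences, labels = zip(*kept)
    return list(sentences), list(labels)

def label_empty_sentence(sentence, model_prediction):
    """Task 2: empty dev/test sentences are labeled non-propaganda without running the model."""
    return "non-propaganda" if not sentence.strip() else model_prediction

def remove_stop_words(tokens):
    """Task 3: drop stop words before keyword extraction."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```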
Please find the details of the data preparation and processing steps in the code repository that is provided at the bottom of this article.
Modeling
TASK 1 & TASK 2
We used Google’s recently published Bidirectional Encoder Representations from Transformers (BERT) [1]. For each task, after processing the given training data, we fine-tune the “bert-base-uncased” pretrained model on the appropriate data and keep the best-scoring model on a previously separated held-out set, which is 10% of the training data. We preserved the label ratio of the training set in the held-out sample. Using the fine-tuned model, we predict the dev and test sets and write out the predictions in the format of the given examples.
BERT-base consists of 12 layers, 12 attention heads, and 110M parameters. We use a maximum sequence length of 256 for both tasks. This configuration was chosen in light of our experience in classifying news articles for a different task; for instance, a sequence length of 512 had not yielded better performance than 256. The task-specific optimization was performed separately for task 1 and task 2, so we created two models, one for each task.
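As a rough illustration of this setup, the sketch below fine-tunes “bert-base-uncased” for binary sequence classification with the Hugging Face transformers library rather than with our actual training scripts; the variable names (texts, labels), the number of epochs, and the batch size are placeholders, while the stratified 10% held-out split and the 256-token limit follow the description above.

```python
import torch
from sklearn.model_selection import train_test_split
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          Trainer, TrainingArguments)

# `texts` and `labels` stand in for the task-specific training data (placeholders).
train_texts, held_texts, train_labels, held_labels = train_test_split(
    texts, labels, test_size=0.10, stratify=labels, random_state=42)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

class NewsDataset(torch.utils.data.Dataset):
    """Tokenized texts, truncated/padded to a maximum sequence length of 256."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding="max_length", max_length=256)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
args = TrainingArguments(output_dir="bert_task1", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=NewsDataset(train_texts, train_labels),
                  eval_dataset=NewsDataset(held_texts, held_labels))
trainer.train()
print(trainer.evaluate())  # loss on the stratified held-out split
```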
The challenge in this optimization is the time it takes to create a model and perform predictions. Therefore, we relied on our previous experience to choose a configuration that was likely to perform well. Another iteration of optimization that takes into account the characteristics of this particular dataset and task could further improve the overall performance.
TASK 3
We used a keyword-counting system that detects keywords occurring frequently in the training dataset for each label. First, frequently occurring words are extracted from the training set for each label. Second, the extracted keywords are given weights according to their orthographic features (capitalization, etc.) and their frequency of occurrence in the dataset. To obtain stable results, the log of the inverse frequency is used as the weight of each keyword.
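A minimal sketch of this weighting scheme is given below; the input structure, the frequency cut-off, and the capitalization bonus are illustrative assumptions, while the log-inverse-frequency weight follows the description above.

```python
import math
from collections import Counter

def build_keyword_weights(label_fragments, min_count=3):
    """Build one keyword dictionary per propaganda label.

    `label_fragments` maps each label to a list of token lists taken from the
    annotated training spans (hypothetical structure). Weights use the log of
    the inverse relative frequency; the capitalization bonus is illustrative.
    """
    weights = {}
    for label, token_lists in label_fragments.items():
        counts = Counter(tok for tokens in token_lists for tok in tokens)
        total = sum(counts.values())
        weights[label] = {}
        for tok, count in counts.items():
            if count < min_count:        # keep only frequently occurring words
                continue
            w = math.log(total / count)  # log of the inverse frequency
            if tok[:1].isupper():        # simple orthographic feature
                w *= 1.2
            weights[label][tok] = w
    return weights
```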
During prediction we used a sentence-level approach and evaluated the score of each label on a single sentence. This approach has the advantage of producing fast results, at the cost of ignoring the predictions made for neighboring sentences. The highest-scoring label is returned as the prediction for a given fragment.
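A corresponding sketch of the prediction step, using the keyword weights built above, could look as follows; the fallback for sentences without any matching keyword is an illustrative choice.

```python
def predict_fragment_label(sentence_tokens, weights):
    """Score each label on a single sentence and return the highest-scoring one.

    `weights` is the per-label keyword dictionary sketched above; returning
    None for sentences with no matching keyword is an illustrative fallback.
    """
    scores = {label: sum(kw.get(tok, 0.0) for tok in sentence_tokens)
              for label, kw in weights.items()}
    best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_label if best_score > 0 else None
```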
Below is an example of 15 keywords extracted for the causal oversimplification label from the training set.
Evaluation
All scores are reported on the positive class, as the official evaluation metric provided by the Datathon organizers is the F1 score on that class. The F1 score for task 3 takes partial matches into account as well.
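For tasks 1 and 2 this is the standard positive-class F1, which can be computed, for example, with scikit-learn (the label name below is illustrative); the partial-match F1 for task 3 uses the organizers’ own scorer and is not reproduced here.

```python
from sklearn.metrics import f1_score

# y_true and y_pred hold the gold and predicted labels for the held-out set.
score = f1_score(y_true, y_pred, pos_label="propaganda")
```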
Our results for each task and evaluation set are provided below in terms of F1 score. The results on the held-out set were calculated by us; the rest were calculated by the online submission system of the Datathon.
- TASK 1
- Held-out set: 0.8685
- Development set: 0.8631
- Test set: 0.8530
- TASK 2
- Held-out set: 0.6279
- Development set: 0.6307
- Test set: 0.6336
- TASK 3
- Development set: 0.0398
- Test set: 0.0291
Our submissions were ranked second, first, and fifth for task 1, task 2, and task 3 on the test set respectively.
Deployment
The project is open source and any collaborative effort is more than welcome. The code we have developed can be found in our GitHub repository. Please contact us if you would like to try the models we created yourself.
References
[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
10 thoughts on “Datathon – HackNews – Solution – LAMAs”
LAMA:
* Summary:
– The source code is made publicly available on GitHub.
– The article is somewhat short, but gives sufficient detail.
– The approach is standard but efficient (for tasks 1 and 2).
– This is the best-ranked team overall:
– DEV: 3-4th, 1st, and 5th for task 1, task 2, and task 3
– TEST: 2nd, 1st, and 5th for task 1, task 2, and task 3
– Remarkably, on task 2, the team wins by a large margin.
* Detailed comments:
This is an exercise in using BERT (for tasks 1 and 2):
– paper: https://arxiv.org/abs/1810.04805
– code: https://github.com/google-research/bert
– other code: https://github.com/hanxiao/bert-as-service
BERT is a state-of-the-art model for Natural Language Processing (NLP), and beats earlier advancements such as ELMo. See more here:
https://medium.com/syncedreview/best-nlp-model-ever-google-bert-sets-new-standards-in-11-language-tasks-4a2a189bc155
The authors used fine-tuning based on parameters they have found in earlier experiments for other tasks. Fine-tuning BERT takes a lot of time…
* Questions:
1. Which model did you use for tasks 1 and 2? Is it model (b) from Figure 3? https://arxiv.org/pdf/1810.04805.pdf
2. Why did you use the uncased version of BERT?
3. Do you think that the large BERT model would help?
4. Did you try BERT without fine-tuning? If so, how much did you gain from fine-tuning?
5. Do you think you could be losing something by truncating the input to 256 for task 1?
Dear Ramybaly and Preslav,
Thank you for your remarks and questions. Regarding your questions :
1. Yes, indeed I used the 3b fine-tuning schema. This is the one used for single sentence (any chunk of text in this case) classification tasks.
2. Since this is not a NER task (or any other task that would need case information), we assume a cased model would not contribute to the outcome; in fact, it may even harm it.
3. As the authors of BERT showed in their experiments, the large BERT model indeed gets higher results than BERT-base. But in our opinion the gain is not significant, and it is also impossible for us to use it at the moment without a TPU (or a high number of GPUs), since the large model is huge. We also needed to consider the time restriction.
4. The pretrained BERT model is only trained using Masked LM and Next Sentence Prediction tasks, which are unsupervised tasks. Therefore, out of the box, it is not suitable to use for classification tasks, or any other task for that matter.
5. We have been using BERT for other tasks as well, for submissions to Hyperpartisan News Detection, a SemEval 2019 shared task, and also in preliminary experiments for our own CLEF 2019 lab https://emw.ku.edu.tr/clef-protestnews-2019/ . The document type in these two tasks is also news articles, as in this Datathon. In those previous experiments we tried 128, 256, and 512 as the maximum sequence length and found that 256 gives us the best results, hence we use it here. This may be because using the lead sentences of a news article has been found effective for some NLP tasks [1,2]. From our experiments, we can argue that, for news articles, the first 128 tokens do not carry enough information, while the first 512 tokens contain too much irrelevant information.
Hope my answers were satisfactory. Please let me know if you have any further questions.
[1] Brandow, R., Mitze, K., & Rau, L. F. (1995). Automatic condensation of electronic publications by sentence selection. Information Processing & Management, 31(5), 675-685.
[2] Wasson, M. (1998, August). Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 2 (pp. 1364-1368). Association for Computational Linguistics.
Very good job. It is good to see how BERT performs in brand-new tasks. I just have a few questions:
1. I’m interested in the fact that heavily truncating the article does not have an impact. This can be task-specific behavior. Can you please indicate which task led you to that conclusion?
2. Regarding task 3, are the dictionaries you generated across the labels mutually exclusive, or you allowed some overlap?
3. How would you spot the location of the propaganda technique? Did you assume it is used over the whole sentence?
Thank you very much for your questions.
Regarding questions 2 and 3:
2) We have allowed overlaps among the dictionaries, yet each dictionary has its own weight for a given keyword based on the term frequency. So, even if a keyword found in a given sentence is included in multiple dictionaries, its contribution to each label type’s score will (with high probability) be unique.
3) We have assumed it is used over the whole sentence. Our initial explorations for trying to detect the fragment boundaries showed that this simplification gives the best results with this approach. Yet, we believe that a similar approach for learning the frequently occurring words in the fragment boundaries for each label type can improve the results.
Please let me know if you have any other questions regarding task 3 🙂
Dear Ramybaly,
You can refer to my answer to the 5th question of Preslav’s comment for your first question.
Thanks
It is admirable that you’ve put effort into solving all three tasks rather than focusing on a single one. Even though the report is concise, it clearly presents the major results.
Hi guys. Good work and nice article. I have a concern.
You mention that you have been using (roughly) this same model for other news-related tasks, and also that you did not perform any tuning on this task, instead opting to take advantage of your previous experience. In order to make this as self-contained as possible, it would be nice to explain what that other task is and what you did about it (tuning, decisions). I hope you have included this in your video.
Hi Alberto, thanks for the comment.
Let me be more explicit about how we utilized the BERT base models. We did not use the same model we created in our previous work. We started from the base BERT models, which are trained on only two unsupervised tasks that are, more or less, about context modeling. That is all; the rest is fine-tuned for this task. The experience we used concerns some configurations/parameters that we think work well for classification tasks and the news domain under this paradigm. As with an SVM model, if you have some experience with it, you mostly know which hyperparameters to optimize when creating a new model for a new task. In other words, we do not do a full hyperparameter search over all possible parameters of a machine learning algorithm when we know or have experience with that algorithm, domain, or task. Given that the time was limited and deep learning models require a lot of time to be created, the options we could explore unfortunately remained limited. But that does not mean nothing was done; you can see all steps of this hard work in the GitHub repository. Unfortunately, the video is also too short to explain the details of our submissions for the three tasks. Please let us know if you think we should improve this article or provide more details during the live panel discussion.
An additional point as a response to BERT vs. other algorithms: We have been experimenting with other algorithms such as LSTM, SVM, and Random Forests for tackling classification tasks as well. However, in all our experiments, BERT has outperformed the rest.
Nice work!