
Datathon – HackNews – Solution – LAMAs


10 thoughts on “Datathon – HackNews – Solution – LAMAs”

  1. 2


    * Summary:

    – The source code is made publicly available on GitHub.

    – The article is somewhat short, but gives sufficient detail.

    – The approach is standard but efficient (for tasks 1 and 2).

    – This is the best-ranked team overall:
    – DEV: 3rd–4th, 1st, and 5th on tasks 1, 2, and 3
    – TEST: 2nd, 1st, and 5th on tasks 1, 2, and 3
    – Remarkably, on task 2, the team wins by a large margin.

    * Detailed comments:

    This is an exercise in using BERT (for tasks 1 and 2):
    – paper:
    – code:
    – other code:

    BERT is a state-of-the-art model for Natural Language Processing (NLP) that outperforms earlier models such as ELMo. See more here:

    The authors fine-tuned BERT using parameters they had found in earlier experiments on other tasks. Fine-tuning BERT takes a lot of time…

    * Questions:

    1. Which model did you use for tasks 1 and 2? Is it model (b) from Figure 3?

    2. Why did you use the uncased version of BERT?

    3. Do you think that the large BERT model would help?

    4. Did you try BERT without fine-tuning? If so, how much did you gain from fine-tuning?

    5. Do you think you could be losing something by truncating the input to 256 for task 1?

    1. 1

      Dear Ramybaly and Preslav,
      Thank you for your remarks and questions. Regarding your questions:
      1. Yes, indeed I used the 3b fine-tuning schema. This is the one used for single-sentence classification tasks (any chunk of text, in this case).
      2. Since this is not a NER task (or any other task that needs case information), we assumed a cased model would not improve the outcome; in fact, it might even harm it.
      3. As the authors of BERT showed in their experiments, BERT-large indeed achieves higher results than BERT-base. However, the gain is not significant in our opinion, and it is currently impossible for us to use BERT-large without a TPU (or a large number of GPUs), since the model is huge. We also had to consider the time restriction.
      4. The pre-trained BERT model is trained only on Masked LM and Next Sentence Prediction, which are unsupervised tasks. Therefore, out of the box, it is not suitable for classification tasks, or any other downstream task for that matter.
      5. We have used BERT for other tasks as well: for submissions to Hyperpartisan News Detection, a SemEval 2019 shared task, and in preliminary experiments for our own CLEF 2019 lab. The documents in these two tasks are also news articles, as in this datathon. In those experiments we tried 128, 256, and 512 as the maximum sequence length and found that 256 gave the best results, hence we use it here. This may be because the lead sentences of a news article have been found to be more effective for some NLP tasks [1,2]. From our experiments, we can argue that, for news articles, the first 128 tokens do not carry enough information, while the first 512 tokens include too much irrelevant information.

      Hope my answers were satisfactory. Please let me know if you have any further questions.

      [1] Brandow, R., Mitze, K., & Rau, L. F. (1995). Automatic condensation of electronic publications by sentence selection. Information Processing & Management, 31(5), 675-685.
      [2] Wasson, M. (1998, August). Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 2 (pp. 1364-1368). Association for Computational Linguistics.
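The sequence-length choice described in answer 5 amounts to keeping only an article's leading tokens. A minimal sketch of such a truncation step (illustrative only, not the team's actual code; `prepare_input` and the literal token strings are our assumptions):

```python
MAX_SEQ_LENGTH = 256  # best of {128, 256, 512} in the earlier news experiments

def prepare_input(tokens, max_seq_length=MAX_SEQ_LENGTH):
    """Keep only the leading tokens of an article, BERT-style.

    Two positions are reserved for BERT's special [CLS] and [SEP]
    tokens, so at most max_seq_length - 2 word pieces survive.
    """
    body = tokens[: max_seq_length - 2]
    return ["[CLS]"] + body + ["[SEP]"]
```

For a long news article, everything after the lead ~254 word pieces is simply dropped, which matches the observation that lead text carries most of the signal.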

  2. 1

    Very good job. It is good to see how BERT performs in brand-new tasks. I just have a few questions:

    1. I’m interested in the fact that heavily truncating the article does not have an impact. This could be task-specific behavior. Can you please indicate based on which task you came to that conclusion?

    2. Regarding task 3, are the dictionaries you generated across the labels mutually exclusive, or did you allow some overlap?

    3. How would you spot the location of the propaganda technique? Did you assume it is used over the whole sentence?

    1. 0

      Thank you very much for your questions.

      Regarding questions 2 and 3:

      2) We allowed overlaps among the dictionaries, but each dictionary has its own weight for a given keyword, based on term frequency. So, even if a keyword found in a given sentence appears in multiple dictionaries, its contribution to each label type’s score will be unique (with high probability).

      3) We assumed it is used over the whole sentence. Our initial explorations of detecting the fragment boundaries showed that this simplification gives the best results with this approach. Still, we believe that a similar approach that learns the words frequently occurring at the fragment boundaries for each label type could improve the results.

      Please let me know if you have any other questions regarding task 3 🙂
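The weighted-dictionary scoring described in answers 2 and 3 could look roughly like the sketch below. This is a hypothetical reconstruction under the assumption that weights are plain per-label term frequencies; the function names are ours, not the team's.

```python
from collections import Counter

def build_dictionaries(labeled_sentences):
    """Build one keyword dictionary per propaganda label.

    labeled_sentences: iterable of (label, tokens) pairs.
    Overlap across labels is allowed; each label weights a shared
    keyword by its own term frequency, so the same word can
    contribute differently to different labels.
    """
    counts = {}
    for label, tokens in labeled_sentences:
        counts.setdefault(label, Counter()).update(tokens)
    weights = {}
    for label, c in counts.items():
        total = sum(c.values())
        weights[label] = {word: n / total for word, n in c.items()}
    return weights

def predict_label(tokens, dictionaries):
    # Score the whole sentence against every label (the whole-sentence
    # simplification from answer 3) and pick the best-scoring label.
    scores = {
        label: sum(w.get(t, 0.0) for t in tokens)
        for label, w in dictionaries.items()
    }
    return max(scores, key=scores.get)
```

Because each label normalizes by its own total, a keyword shared across dictionaries still contributes a distinct score per label, as the reply notes.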

  3. 1

    It is admirable that you’ve put effort into solving all three tasks rather than focusing on a single one. Even though the report is concise, it clearly presents the major results.

  4. -1

    Hi guys. Good work and a nice article. I have a concern.
    You mention that you have been using (roughly) this same model for other news-related tasks, and that you skipped tuning on this task in favor of your previous experience. To make this as self-contained as possible, it would be nice to explain what the other task is and what you did about it (tuning, decisions). I hope you have included this in your video.

    1. 1

      Hi Alberto, thanks for the comment.
      Let me be more explicit about how we used the BERT base models. We did not reuse the model we created in our previous work. We started from the base BERT models, which are pre-trained on only two unsupervised tasks (masked language modeling and next-sentence prediction) that essentially capture context. Everything else was fine-tuned for this task.

      The experience we drew on concerns configuration and parameter choices that we believe work well for classification tasks in the news domain under this paradigm. As with an SVM model, if you have some experience with it, you mostly know which hyperparameters to optimize when creating a new model for a new task. In other words, we do not do a full hyperparameter search over all possible parameters of a machine learning algorithm when we already have experience with that algorithm, domain, or task.

      Given that time was limited and deep learning models take a long time to train, the options we could explore unfortunately remained limited. But that does not mean nothing was done: you can see all the steps of this work in the GitHub repository. Unfortunately, the video is also too short to explain the details of our submissions for the three tasks. Please let us know if you think we should improve this article or provide more details during the live panel discussion.

      1. 0

        An additional point as a response to BERT vs. other algorithms: We have been experimenting with other algorithms such as LSTM, SVM, and Random Forests for tackling classification tasks as well. However, in all our experiments, BERT has outperformed the rest.
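The "experience over full search" point in the replies above amounts to starting from a known-good configuration rather than sweeping every hyperparameter. For reference, the original BERT paper recommends searching the learning rate over {5e-5, 3e-5, 2e-5}, the batch size over {16, 32}, and 2–4 epochs; a hypothetical starting configuration consistent with this thread (illustrative values, not the team's exact settings) might be:

```python
# Illustrative fine-tuning configuration: typical BERT-paper search
# ranges plus choices mentioned in this thread, NOT the team's exact
# settings.
FINE_TUNE_CONFIG = {
    "learning_rate": 2e-5,    # from the paper's {5e-5, 3e-5, 2e-5}
    "train_batch_size": 32,   # from {16, 32}
    "num_train_epochs": 3,    # from {2, 3, 4}
    "max_seq_length": 256,    # best of {128, 256, 512} per the thread above
    "do_lower_case": True,    # uncased model, since casing is not needed here
}
```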
