Datathons Solutions

Datathon – HackNews – Solution – PIG (Propaganda Identification Group)

4 votes

9 thoughts on “Datathon – HackNews – Solution – PIG (Propaganda Identification Group)”

  1. (2 votes)

    Nice article, nice approach, and great results on Task 3!

    Just one thing: it is unclear what resources need to be downloaded to make the attached code work. The code has many hardcoded paths to files that do not exist. E.g., where do we get the Urban Dictionary from?

    1. (1 vote)

      Hello Preslav, I did not upload any data because I was not sure whether I was allowed to upload the datathon data. I will make my whole repository, including the data and embeddings, available in a Google Drive repository.

  2. (2 votes)

    Hi guys. Good work and nice article. I have a question for you:
    You mention that one problem for your model was the class imbalance (the fact that some classes have very few representatives). Where is the threshold for the frequency? I mean, how many instances of a class do you think you would need in order to build a reasonable predictor?

    1. (2 votes)

      This is very difficult to define because, as always, it depends — in this case also on the difficulty of the problem. The easier a class is to predict, the less data you need. Much more important, however, is to have clean, non-noisy class labels, which in this dataset often did not seem to be the case. I would suggest using more annotators and calculating the inter-annotator agreement. It seems to me that different annotators worked on each document separately, and the annotators often understood the task differently, which led to noisy labels.
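
      One practical way to quantify the label noise mentioned above is to compute inter-annotator agreement. A minimal sketch (Cohen's kappa for two annotators; the label names here are hypothetical and not from the actual datathon annotation scheme):

      ```python
      from collections import Counter

      def cohen_kappa(labels_a, labels_b):
          """Cohen's kappa: agreement between two annotators, corrected for chance."""
          assert len(labels_a) == len(labels_b)
          n = len(labels_a)
          # Observed agreement: fraction of items both annotators labelled the same.
          p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
          # Expected chance agreement from each annotator's label distribution.
          count_a, count_b = Counter(labels_a), Counter(labels_b)
          p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
          return (p_o - p_e) / (1 - p_e)

      # Hypothetical sentence-level labels from two annotators.
      a = ["prop", "prop", "none", "none", "prop", "none"]
      b = ["prop", "none", "none", "none", "prop", "prop"]
      print(round(cohen_kappa(a, b), 3))  # → 0.333 (only fair agreement)
      ```

      A kappa well below ~0.6 would support the suspicion that annotators interpreted the task differently; scikit-learn's `cohen_kappa_score` gives the same result off the shelf.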

  3. (0 votes)

    Hello, you are a clear winner in the hardest Task 3, and you did reasonably well in Task 1: even though there are quite a few teams ahead of you, the difference from the top team is only about 0.05. You also seem to have built one of the most complex models. Yet, on Task 2, you did not place in the top 8. How do you explain that, given that you did reasonably well in Task 1, which is similar?

    1. (1 vote)

      You actually answer the question in your video, slightly before the three-minute mark: you used a CNN model instead of BERT due to some implementation difficulties. The winners in Task 2 did use BERT, but it is commendable that you are aware of the model and tried it.

      1. (2 votes)

        Hello vsenderov!
        Thanks for your comment! We each chose different approaches for the tasks. Regarding Task 2, we were not able to identify the error in our BERT implementation, and in the end there was not enough time to adapt the architectures used in Task 1. However, as you said, we are aware that BERT outperforms other solutions on many problems, and we will definitely use it in the future.
