
Datathon – HackNews – Solution – Leopards

This is the Leopards team’s submission for the Propaganda Detection datathon. Key finding: the best-performing classifier is logistic regression operating on Word2vec representations of the sentences plus several hand-designed features, such as the proportion of sentiment-bearing words in the sentence.
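As a rough illustration of that setup, here is a minimal sketch that builds one feature vector per sentence and fits a logistic regression on it. The tiny embedding table, the stopword list, the sentiment lexicon and the training sentences are all hypothetical stand-ins, not the team's data. Averaging the word vectors is likewise an assumption made to get a fixed-length input; how the team combined them is discussed in the comments below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 3-dimensional vectors standing in for real Word2vec embeddings (hypothetical values).
EMBEDDINGS = {
    "threat": np.array([0.9, 0.1, 0.0]),
    "death": np.array([0.8, 0.2, 0.1]),
    "peace": np.array([0.0, 0.9, 0.3]),
    "talks": np.array([0.1, 0.5, 0.7]),
}
STOPWORDS = {"the", "of", "a", "is", "to"}       # hypothetical stopword list
SENTIMENT_WORDS = {"threat", "death", "peace"}   # hypothetical sentiment lexicon
DIM = 3


def sentence_features(sentence):
    """Return [mean word vector, proportion of sentiment-bearing words]."""
    tokens = sentence.lower().split()
    content = [t for t in tokens if t not in STOPWORDS]
    vectors = [EMBEDDINGS[t] for t in content if t in EMBEDDINGS]
    # Assumption: average the vectors so every sentence has the same length.
    mean_vec = np.mean(vectors, axis=0) if vectors else np.zeros(DIM)
    sentiment_ratio = (
        sum(t in SENTIMENT_WORDS for t in tokens) / len(tokens) if tokens else 0.0
    )
    return np.concatenate([mean_vec, [sentiment_ratio]])


# Made-up training data: 1 = propaganda, 0 = not propaganda.
train_sentences = ["the threat of death", "peace talks", "a threat to peace", "the talks"]
train_labels = [1, 0, 1, 0]

X_train = np.vstack([sentence_features(s) for s in train_sentences])
clf = LogisticRegression().fit(X_train, train_labels)

features = sentence_features("the threat of death")
print(features)
print(clf.predict_proba(features.reshape(1, -1)))
```

With real data, the embedding lookup would come from a trained Word2vec model and the lexicon from a sentiment resource; only the overall shape of the pipeline is meant to carry over.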


7 thoughts on “Datathon – HackNews – Solution – Leopards”

  1. 1

    Good article, nice approach, interesting feature analysis. It definitely seems plausible that the quoted words are relevant for various types of propaganda. I am wondering what those scores that you cite are; I was expecting p-values for the Chi2 test.

    ```
    threat: 18.133
    would be: 16.317
    die: 14.830
    act: 12.543
    death: 5.955
    ```
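
On the question of scores versus p-values: scikit-learn's `chi2` function returns both the chi-squared statistic and the matching p-value for each feature, so numbers like those quoted may be the raw statistics. The article does not confirm which function produced them, so treat that as an assumption. A minimal sketch on made-up sentences and labels:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Made-up sentences and labels (1 = propaganda, 0 = not propaganda).
sentences = [
    "this act is a threat",
    "they would die for it",
    "a threat of death",
    "the meeting starts at noon",
    "the report was published today",
    "the committee will meet today",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# chi2 returns two arrays: the test statistic and the p-value per feature.
scores, p_values = chi2(X, labels)

# Recover the words in column order from the fitted vocabulary.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Print words from highest to lowest chi-squared statistic.
for idx in np.argsort(-scores):
    print(f"{words[idx]}: score={scores[idx]:.3f}, p={p_values[idx]:.3f}")
```

A higher statistic corresponds to a lower p-value here, so both give the same ranking; the p-value only adds a significance reading.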

  2. 1

    Hi guys. Good work and nice article. I have a few questions for you:
    1. You say that you computed word2vec and added it to represent the sentence. Did you average the vectors for each word, or how did you do the combination?
    2. I like it that in the evaluation section you tell us the impact of different features/decisions. Nevertheless, you do not provide any numbers to better understand such an impact.
    3. It would be nice if you could package the software, rather than just pasting it here (I hope I did not miss the link!)

    1. 1

      Hi, thank you for your comment. Our answers:
      1. Yes, we took word2vec vectors of non-stopwords in each sentence and concatenated them.
      2. There is a small table in the evaluation section. The features are sorted there by their Gini importance index, which is output by the Gradient Boosting implementation in scikit-learn. We agree it would be useful to have a better understanding of the importances of the features, for example by looking at other ways to measure feature importance, such as feature elimination with different learning methods. A task for future work.
      3. There are now two links to two zipped Jupyter notebooks – for Tasks 1 and 2.
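
To make the second answer concrete, the sketch below reads the impurity-based importances (what scikit-learn's documentation also calls Gini importance) from a fitted `GradientBoostingClassifier`, and compares them with permutation importance as one alternative measure. The data is synthetic and the feature names are hypothetical; none of it comes from the notebooks.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Hypothetical feature names; the values are random, not real measurements.
feature_names = ["sentiment_ratio", "quoted_word_ratio", "sentence_length"]
X = rng.normal(size=(200, 3))
# Synthetic label that depends only on the first feature.
y = (X[:, 0] > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalised to sum to 1.
impurity_importance = model.feature_importances_

# Permutation importance: drop in score when one column is shuffled.
# Computed on the training data for brevity; a held-out set is preferable.
perm = permutation_importance(model, X, y, n_repeats=5, random_state=0)

for name, imp, p in zip(feature_names, impurity_importance, perm.importances_mean):
    print(f"{name}: impurity={imp:.3f}, permutation={p:.3f}")
```

Because the label is built from the first feature only, both measures should rank it first; on real features the two rankings can disagree, which is one reason to look at more than one measure.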

  3. 2

    What I like most about your article is the feature engineering process. It seems you’ve invested a lot of time in it, and as a consequence the feature space is well-grounded.
