
Datathon – HackNews – Solution – Team: UoM-NLP

Business Understanding

Fake news is a massive problem for multiple industries and governments, and it needs to be addressed in a more automated way. Providing an automated method to examine text and classify it as propaganda or not can help reduce the impact of fake news. This is easier said than done: even companies such as Facebook and Google have not solved the problem, which creates widespread confusion and causes controversy in events such as the latest US presidential election.

While there is no complete solution to date, progress is made through rigorous research and incremental improvements. Artificial intelligence and deep learning could contribute to such a solution. We explore this potential through various machine learning and deep learning methods, using inductive transfer learning: pre-training language models and fine-tuning them to classify articles as propaganda or non-propaganda.

Data Understanding

The dataset we used contains over 60,000 documents, each labeled as either propaganda or non-propaganda.

Data Preparation

We created a pandas DataFrame from the texts and labels and performed a 90% holdout split: 90% of the documents for training, with the remaining 10% held out for validation.
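For illustration, a minimal sketch of this step; the file name, separator, and column names are assumptions, not necessarily the exact layout of the shared task data:

```python
# Minimal sketch of the data preparation step. The file name, separator,
# and column names are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split

# One document per row: raw text plus a propaganda/non-propaganda label.
df = pd.read_csv("task1_train.tsv", sep="\t", names=["text", "label"])

# 90% holdout: train on 90% of the documents, validate on the remaining 10%.
train_df, valid_df = train_test_split(
    df, test_size=0.10, stratify=df["label"], random_state=42
)
```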

Modeling and Evaluation

Model for task 1: In order to leverage large amounts of unsupervised data (a form of transfer learning), we use the universal language model fine-tuning approach and adapt it to the task at hand [1]. The broad idea is to learn a language model, a 3-layer LSTM architecture, on a vast amount of text (Wikipedia in this case), then fine-tune that language model on the propaganda news text. The fine-tuned model is then used for the document classification task, with a softmax layer on top. This model achieved 0.9057 on the dev set.
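The sketch below shows this pipeline with the fastai library, which implements ULMFiT [1] on top of a Wikipedia-pretrained AWD-LSTM (a 3-layer LSTM). It reuses train_df and valid_df from the split above; the epoch counts and learning rates are illustrative, not our exact settings:

```python
# Sketch of ULMFiT with fastai (v1 API); hyperparameters are illustrative.
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

# 1. Fine-tune the Wikipedia-pretrained language model on the news text.
data_lm = TextLMDataBunch.from_df(".", train_df, valid_df, text_cols="text")
lm_learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
lm_learn.fit_one_cycle(1, 1e-2)
lm_learn.save_encoder("ft_enc")          # keep the fine-tuned encoder

# 2. Reuse the encoder in a classifier with a softmax head on top.
data_clas = TextClasDataBunch.from_df(".", train_df, valid_df,
                                      vocab=data_lm.vocab,
                                      text_cols="text", label_cols="label")
clf_learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf_learn.load_encoder("ft_enc")
clf_learn.fit_one_cycle(4, 1e-2)
```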

Model for task 2: For task 2, we evaluated simple bag-of-words (n-gram, n = 1, 2, 3) supervised classifiers: logistic regression, Gaussian Naive Bayes, random forest, AdaBoost, and SVM. Among these, Gaussian Naive Bayes with uniform priors gave the best performance (0.5196 on the dev set). We also evaluated the universal language model approach for this sentence-level task, but it was not fruitful; we believe further investigation is needed to see whether its performance can be improved.
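A sketch of the best-performing configuration among these baselines; the toy sentences and variable names are invented for illustration, and the count matrix is densified because GaussianNB requires dense input:

```python
# Bag-of-words baseline sketch for task 2; the data here is a toy stand-in.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

train_sents = ["they are traitors", "the meeting is at noon"]  # toy examples
train_labels = [1, 0]
dev_sents = ["these traitors must go"]

# Unigram, bigram, and trigram counts over the training sentences.
vec = CountVectorizer(ngram_range=(1, 3))
X_train = vec.fit_transform(train_sents).toarray()  # GaussianNB needs dense input
X_dev = vec.transform(dev_sents).toarray()

# Uniform class priors, the best-performing configuration in our runs.
clf = GaussianNB(priors=[0.5, 0.5])
clf.fit(X_train, train_labels)
dev_pred = clf.predict(X_dev)
```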

Model for task 3: Here, an NER-inspired model is built with spaCy [2]. It is a document-level model in which, for each document, the spans of tokens and their propaganda types are used as the supervision signal.
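A condensed sketch of the training loop, adapted from the spaCy v2 NER example referenced in [2]; the sample span and technique label are invented for illustration:

```python
# NER-style span tagger for propaganda techniques with spaCy v2 [2].
import random
import spacy

# (text, {"entities": [(start_char, end_char, technique)]}) - toy example.
TRAIN_DATA = [
    ("They are traitors to the nation!",
     {"entities": [(9, 17, "Name_Calling")]}),
]

nlp = spacy.blank("en")                      # blank English pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)                 # one label per propaganda type

optimizer = nlp.begin_training()
for itn in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, ann in TRAIN_DATA:
        nlp.update([text], [ann], sgd=optimizer, drop=0.35, losses=losses)
```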

 

References:

[1] Howard, Jeremy, and Sebastian Ruder. “Universal Language Model Fine-tuning for Text Classification.” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.

[2] https://github.com/explosion/spaCy/blob/master/examples/training/train_ner.py


