Hack the News Challenge
Data Science Life Cycle
- Business Problem Formulation
The current political landscape is shaped by extreme polarization of opinions and by the proliferation of fake news. For example, a recent study published in Science found that rumours and fake news tend to spread six times faster than truthful information. This situation damages the reputation of respectable news outlets and undermines the very foundations of democracy, which needs a free and reliable press to thrive. It is therefore in the interest of the public, as well as of news organizations, to be able to detect and fight disinformation in all its forms. While most previous work has focused on “fake news”, here we are interested in propaganda.
Propaganda is the deliberate spreading of ideas, facts, or allegations with the aim of advancing one’s cause or of damaging an opposing cause. While it may include falsehoods, it does not have to; rather, propaganda can be seen as a form of extreme bias. Propagandistic news articles typically rely on techniques such as Whataboutism, Red Herring, and Name Calling, among many others. Detecting the use of such techniques can help identify potentially propagandistic articles, and here we try to create a tool that can help with this endeavour.
Note that any news article, even one coming from a reputable source, can reflect the unconscious bias of the author, and thus it could be propagandistic.
Our objective here is:
(i) to flag the article as a whole
(ii) to detect the potentially propagandistic sentences in a news article
(iii) to identify the exact type and span of the propaganda techniques used
Data Understanding
Three sets of documents were provided:
- PAL – Propaganda detection at the article level
- PSL – Propaganda detection at the sentence level
- PTR – Propaganda type recognition
Platform to perform Data Analysis and Modelling: Python (Jupyter Notebook), Tableau
Packages used: pandas, numpy, matplotlib, seaborn, sklearn
Task 1
Propaganda detection at the article level (PAL) is a classical supervised document classification problem. Given a set of news articles, we classify each article into one of two possible classes: “propagandistic article” vs. “non-propagandistic article.”
Data Preprocessing
In Task 1, the training file was first imported into Excel, where the data cleaning was performed.
The cleaned Excel file was then imported into a Jupyter notebook.
The dataset has 35955 rows and 3 columns. The attributes are: news_text, news_number, news_type
There were no null values in the data.
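A minimal sketch of this loading step; the file name task1_train_cleaned.xlsx is hypothetical:

```python
import pandas as pd

# Load the cleaned Excel file (the file name is an assumption)
df = pd.read_excel("task1_train_cleaned.xlsx")

print(df.shape)             # expected: (35955, 3)
print(df.columns.tolist())  # news_text, news_number, news_type
print(df.isnull().sum())    # confirms there are no null values
```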
Vectorization
Text data requires special preparation before it can be used for predictive modelling.
The text must first be split into words, a step called tokenization. The words then need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization).
For our analysis, the vectorizers used are:
- Count vectorizer – The CountVectorizer tokenizes a collection of text documents, builds a vocabulary of known words, and encodes new documents using that vocabulary.
- Tfidf vectorizer – The TfidfVectorizer tokenizes documents, learns the vocabulary and the inverse-document-frequency weightings, and encodes new documents using those weights.
- Hashing vectorizer – The HashingVectorizer maps tokens directly to feature indices with a hash function, so no vocabulary needs to be stored; it is used in Method 3 below.
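A minimal sketch of fitting the first two vectorizers in scikit-learn, with invented sample documents:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two invented documents, for illustration only
docs = ["Propaganda can be a form of extreme bias.",
        "A free and reliable press helps democracy thrive."]

# CountVectorizer: tokenize, build a vocabulary, encode documents as counts
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)    # sparse document-term matrix

# TfidfVectorizer: same vocabulary step, but terms are weighted by TF-IDF
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)

print(count_vec.get_feature_names_out())
print(X_counts.toarray())
```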
Modelling
The data were split into train and validation sets.
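For instance, reusing the df loaded above (the 80/20 split ratio and the random seed are assumptions):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of articles as a validation set, stratified by class
X_train, X_val, y_train, y_val = train_test_split(
    df["news_text"], df["news_type"],
    test_size=0.2, random_state=42, stratify=df["news_type"])
```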
For modelling purposes, the algorithms we used are:
1. Naïve Bayes Multinomial Classifier Model
2. Passive Aggressive Classifier Model
3. Decision Tree Classifier Model
4. Random Forest Classifier Model
5. k-NN Classifier Model
6. Logistic Model
7. Support Vector Machine
8. XGBoost Model
9. Multiple Layer Perceptron Classification Model (Feed Forward Neural Network)
Method 1 – Using Count vectorizer and applying the above given models
Method 2 – Using Tfidf vectorizer and applying the above models
Method 3 – Using Hash vectorizer and applying the above models
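A condensed sketch of how these three methods can be run over the validation set, reusing X_train, X_val, y_train, y_val from the split above. Only three of the nine models are shown, and all hyperparameters are scikit-learn defaults, which is an assumption since the write-up does not list them:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer, HashingVectorizer)
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score

# Methods 1/2/3: swap the vectorizer, keep the same set of models.
# alternate_sign=False keeps hashed features non-negative for MultinomialNB.
vectorizers = {"count": CountVectorizer(),
               "tfidf": TfidfVectorizer(),
               "hash": HashingVectorizer(alternate_sign=False)}

models = {"multinomial_nb": MultinomialNB(),
          "passive_aggressive": PassiveAggressiveClassifier(),
          "logistic": LogisticRegression(max_iter=1000)}

for v_name, vectorizer in vectorizers.items():
    for m_name, model in models.items():
        pipe = make_pipeline(vectorizer, model)
        pipe.fit(X_train, y_train)
        preds = pipe.predict(X_val)
        # Macro F1 is used here because the exact label encoding is not given
        print(v_name, m_name,
              accuracy_score(y_val, preds),
              f1_score(y_val, preds, average="macro"))
```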
Jupyter Notebook
Visualization of results using Tableau
For both the Count vectorizer and the Tfidf vectorizer, the following were plotted for each model:
- Accuracy and F1 score
- Sensitivity and specificity
- True positives, true negatives, false positives, false negatives
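For reference, the sensitivity and specificity shown in these dashboards can be derived from the confusion-matrix counts; a minimal sketch for any one of the fitted models above, with variable names following the earlier sketches:

```python
from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()

sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```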
Important Results
- For the Count vectorizer, the Decision Tree, Random Forest, and SVM classifiers give high specificity because they predict most articles as non-propagandistic. The negative class is therefore predicted correctly, but the positive class is not.
- Balancing sensitivity and specificity, the MLP classifier performs best, with an accuracy of 96.07 and an F1 score of 83.98.
Task 2
Propaganda detection at the sentence level (PSL). This is another classification task, but at a different granularity: the objective is to classify each sentence in a news article as either “sentence that contains propaganda” or “sentence that does not contain propaganda.”
Data Cleaning
The Task 2 folder contained three patterns of files: task2labels files, task3labels files, and text files. The files were separated into different folders, cleaned in Python, and finally merged into a single file.
The merged file has 14264 rows and 5 columns. The attributes of the data are:
Source.Name, article_id, line_number, news_text, news_type
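A rough sketch of this merge step, assuming one plain-text file of sentences per article with a matching label file; the folder layout and file-naming convention here are assumptions, since the write-up does not show them:

```python
import glob
import os
import pandas as pd

frames = []
for text_path in sorted(glob.glob("task2/text/*.txt")):   # assumed folder layout
    # Matching label file; this naming convention is an assumption
    label_path = text_path.replace("/text/", "/task2labels/")
    with open(text_path, encoding="utf-8") as f:
        sentences = f.read().splitlines()
    with open(label_path, encoding="utf-8") as f:
        labels = f.read().splitlines()
    frames.append(pd.DataFrame({
        "Source.Name": os.path.basename(text_path),
        "article_id": os.path.basename(text_path).split(".")[0],
        "line_number": range(1, len(sentences) + 1),
        "news_text": sentences,
        "news_type": labels,
    }))

merged = pd.concat(frames, ignore_index=True)   # 14264 rows, 5 columns
merged.to_csv("task2_merged.csv", index=False)
```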
Jupyter Notebook
Visualization of results using Tableau
For both the Count vectorizer and the Tfidf vectorizer, the following were plotted for each model:
- Accuracy and F1 score
- Sensitivity and specificity
- True positives, true negatives, false positives, false negatives
Important Results
- From the sensitivity and specificity graphs of the different models, specificity is quite high for XGBoost, SVM, and Decision Tree. The high accuracy seen here is due to the imbalanced classes. Considering sensitivity as well, the Passive Aggressive and Multinomial NB classifiers work better for predicting the positive class.
- Using the Count vectorizer, the top three F1 scores are given by the Passive Aggressive classifier, the Multinomial NB classifier, and the MLP classifier.
- We used the Passive Aggressive classifier to predict on the test data; its F1 score is 59.6.
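A sketch of this final step, continuing from the merged DataFrame built in the data-cleaning sketch above (the test-file name is hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Train the chosen model on all of the sentence-level training data
pipe = make_pipeline(CountVectorizer(), PassiveAggressiveClassifier())
pipe.fit(merged["news_text"], merged["news_type"])

# Predict on the test sentences; the file name is an assumption
test = pd.read_csv("task2_test.csv")
test["prediction"] = pipe.predict(test["news_text"])
test.to_csv("submission-task2-predictions_test.csv", index=False)
```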
Task 3
Propaganda type recognition (PTR) is a task akin to Named Entity Recognition, but applied in the propaganda detection setting. The goal is to detect the occurrences of propaganda techniques and to correctly assign the propaganda type to each text fragment.
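As we explain in the comments below, we did not build a true sequence tagger here; instead we reused the Task 2 approach and treated the labelled fragments as a multi-class text classification problem. A rough sketch of that framing, with invented example fragments (the real ones come from the task3labels files):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny invented fragments and technique labels, for illustration only;
# the real fragments and their types come from the task3labels files
fragments = ["you are a traitor to this country",
             "but what about their own crimes"]
techniques = ["Name_Calling", "Whataboutism"]

pipe = make_pipeline(TfidfVectorizer(), RandomForestClassifier())
pipe.fit(fragments, techniques)
print(pipe.predict(["such a pathetic liar"]))
```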
Jupyter Notebook
Important Results
- Comparing the F1 scores of the various classifier models, the Random Forest classifier, Passive Aggressive classifier, and Multinomial NB classifier achieve the maximum F1 score of 0.46.
Output/Prediction Files
These are the prediction files for Tasks 1, 2, and 3:
Task 1
submission-task1-predictions_test
Task 2
submission-task2-predictions_test
Task 3
example-submission-task3-predictions-test
Doc Files
Here is the zipped file containing Jupyter notebooks
6 thoughts on “Datathon – HackNews – Solution – data_monks”
This is an amazing article! So much detail!
Thanks for the compliment
I have a question about how you modeled task 3. It is different from Task1 and Task2, it is NOT a document classification type of problem. It is a sequence tagging problem. You don’t explain how you modify your algorithms to be able to solve task 3. Out of the box, a Random Forest classifier can’t – correct me if I am wrong – tag sequences.
Frankly speaking, we don’t have much knowledge of sequence tagging. We followed the same approach as Task 2 and solved it that way; we still don’t know whether our approach was correct. Task 3 was totally new for us. Maybe that’s why, even though we were able to upload an output file, we didn’t get a proper score.
The article is very well-organized. The results are presented in a comprehensive manner.
Thanks for the compliment.