Datathon – HackNews – Solution – Dream_Invaders


Business understanding:

Because social discussion in the political space has become extremely polarized, rumors and fake news spread like wildfire, and readers find it difficult to tell them apart from the truth.

What are we going to achieve?

Detect propaganda at the article level and at the sentence level, and recognize its type.

Using a supervised machine learning technique, a model is created to identify and flag fake-news propaganda.

Task 1: Detecting Propaganda at Article Level

Data Understanding:

The ‘Train’ data is a text file of 35993 news articles, each with an article id and a label (“propaganda” or “non-propaganda”), and it has been used to train the model. Each article appears on a single line, with newlines replaced by two spaces. Below is a snapshot of the training data set.

 

This training data set has been converted into a .csv file and read into Python.

1) The first column contains the title and the contents of the article.

2) The second column is the article id.

3) The third column is the label of the article.

The data as read is messy: it contains NA values, irregularly placed columns, unnamed columns, and so on.

Data Cleansing

The next step of the analysis is to clean the data and make it ready for use in the model, which is done using various built-in functions available in Python.

The NA values are removed, the unnamed columns are dropped, the columns are arranged in a fixed order, and the index is set.
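The cleansing steps above can be sketched with pandas as follows. This is a minimal illustration, not the team's actual script: the sample rows, the column names (`text`, `article_id`, `label`) and the choice of `article_id` as the index are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical sample mimicking a messy CSV export: a stray unnamed
# column, columns out of order, and one row with a missing article text.
raw_csv = io.StringIO(
    ",label,text,article_id\n"
    "0,propaganda,Title one  Body of article one.,111\n"
    "1,non-propaganda,,222\n"
    "2,non-propaganda,Title three  Body of article three.,333\n"
)

df = pd.read_csv(raw_csv)

# Drop the columns that pandas auto-named "Unnamed: N".
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]

# Remove rows that contain NA values.
df = df.dropna()

# Arrange the columns in a fixed order and set the index
# (using the article id as the index is an assumption).
df = df[["text", "article_id", "label"]].set_index("article_id")

print(df)
```

After these steps the frame holds only complete rows, with the article text and label indexed by article id.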

 

Modelling:

We have considered nine machine learning algorithms, namely:

1) Passive Aggressive Classifier

2) Naïve Bayes

3) Random Forest

4) Decision Tree

5) k-NN Classifier

6) Support Vector Machine (SVM)

7) XGBoost

8) Logistic Model

9) Multilayer Perceptron Classification Model

Text data requires special preparation before it can be used for modelling. The text must first be split into words, which is called tokenization. The words then need to be encoded as integers or floating point values for use as input to a machine learning algorithm, which is called feature extraction (or vectorization). The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction on text data.

Three vectorization methods have been employed, namely:

1) CountVectorizer

2) TfidfVectorizer

3) HashingVectorizer
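The three vectorizers share the same interface, so they can be swapped freely. The sketch below, using made-up sentences and an arbitrarily chosen `n_features` value, shows the main practical difference: `CountVectorizer` and `TfidfVectorizer` learn a vocabulary from the data, while `HashingVectorizer` maps words into a fixed number of columns without storing a vocabulary.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Made-up sentences used only for illustration.
docs = [
    "They are destroying our country.",
    "The council met on Tuesday.",
    "Only a traitor would vote for this.",
]

vectorizers = {
    "count": CountVectorizer(),      # raw word counts
    "tfidf": TfidfVectorizer(),      # counts reweighted by inverse document frequency
    "hashing": HashingVectorizer(n_features=2**10),  # fixed width, no vocabulary
}

shapes = {}
for name, vec in vectorizers.items():
    X = vec.fit_transform(docs)
    shapes[name] = X.shape
    print(name, X.shape)
```

The count and TF-IDF matrices have one column per distinct word in the documents, whereas the hashing matrix always has `n_features` columns.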

The data set has been split into training data and test data, with 25% of the data assigned to the test set and the remaining 75% used for training. This is done using the ‘train_test_split’ function of the scikit-learn library (as shown below).
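A minimal sketch of such a 75/25 split is given here. The placeholder texts and labels, the `random_state` value and the use of `stratify` are assumptions for the example; the write-up only states the 25% test size.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the article texts and their labels.
texts = [f"article {i}" for i in range(8)]
labels = ["propaganda", "non-propaganda"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.25,    # 25% of the data goes to the test set
    random_state=42,   # assumed: fixed seed for a reproducible split
    stratify=labels,   # assumed: keep the class ratio in both parts
)

print(len(X_train), len(X_test))
```

Stratifying can matter when one class is much rarer than the other, since it keeps that class represented in both parts.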

 Task 1 Program and Output Files: Task 1

List of external Libraries used: 

  • import pandas as pd
  • import numpy as np
  • from sklearn import metrics
  • from sklearn.model_selection import train_test_split
  • from sklearn.feature_extraction.text import TfidfVectorizer
  • from sklearn.linear_model import PassiveAggressiveClassifier

Evaluation:

Accuracy and F1 score have been calculated for the 9 models with each vectorization method, as shown below.

| Model | Count Vectorization Accuracy | Count Vectorization F1 | HashingVectorizer Accuracy | HashingVectorizer F1 | TfidfVectorizer Accuracy | TfidfVectorizer F1 |
|---|---|---|---|---|---|---|
| I. Naïve Bayes Multinomial Classifier Model | 0.92709 | 0.69172 | 0.89308 | NaN | | |
| II. Passive Aggressive Classifier Model | 0.95128 | 0.7637 | 0.956 | 0.79175 | 0.96452 | 0.82111 |
| III. Decision Tree Classifier Model | 0.92229 | 0.63505 | 0.91358 | 0.60348 | 0.91784 | 0.620778 |
| IV. Random Forest Classifier Model | 0.91531 | 0.32164 | 0.90797 | 0.25876 | 0.91231 | 0.33846 |
| V. k-NN Classifier Model | 0.90264 | 0.18435 | 0.89786 | 0.09813 | 0.89686 | 0.07753 |
| VI. Logistic Model | 0.95799 | 0.78457 | 0.94609 | 0.68689 | 0.94565 | 0.67551 |
| VII. Support Vector Machine | 0.8946 | 0.042 | 0.89247 | NaN | 0.89247 | |
| VIII. XGBoost Model | 0.89142 | 0.01213 | 0.89264 | 0.01226 | 0.89164 | 0.01215 |
| IX. Multilayer Perceptron Classification Model (Feed Forward Neural Network) | 0.95732 | 0.78737 | 0.96397 | 0.815201 | 0.96288 | 0.80804 |

Rows I and VII list fewer than six values in the reported results; they are shown left to right in the order given, with the remaining cells left blank.

Based on the accuracy and F1-score, it was determined that the PassiveAggressiveClassifier algorithm provided the best fit for the data when used with the TF-IDF vectorizer.
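The best-performing combination can be sketched with the libraries listed above. This is an illustrative reconstruction on made-up sentences, not the team's program: the example texts, the `max_iter` and `random_state` settings, and the use of `make_pipeline` are all assumptions.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Made-up examples standing in for the article texts.
train_texts = [
    "They are traitors who want to destroy our nation",
    "Wake up before the corrupt elite ruins everything",
    "Only a fool would trust these lying enemies",
    "The city council approved the new budget on Monday",
    "Rain is expected across the region this weekend",
    "The museum opens a new exhibition next month",
]
train_labels = ["propaganda"] * 3 + ["non-propaganda"] * 3

test_texts = [
    "The corrupt traitors want to destroy everything",
    "The council meets again next month",
]
test_labels = ["propaganda", "non-propaganda"]

# TF-IDF features feeding a Passive Aggressive classifier.
model = make_pipeline(
    TfidfVectorizer(),
    PassiveAggressiveClassifier(max_iter=1000, random_state=0),
)
model.fit(train_texts, train_labels)
pred = model.predict(test_texts)

accuracy = metrics.accuracy_score(test_labels, pred)
f1 = metrics.f1_score(test_labels, pred, pos_label="propaganda")
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```

Reporting F1 for the “propaganda” class alongside accuracy is useful here because several models in the table reach about 0.89 accuracy while having an F1 close to zero.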

 

Task 2: Detecting Propaganda at Sentence Level

Data Understanding:

There are around 293 text files, one per news article. Each file contains the title and the content of the article, split into sentences. The classification of each sentence in each file is available in a separate label file. The article text files and the label files are merged, either with the NLTK package or with simple hand-written code: the .txt files and the label files are read separately into two vectors, and each sentence is then joined with its label.
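One way the merge could be done with plain Python and pandas is sketched below. The file layout is an assumption for the example: each article file is taken to hold one sentence per line, and each label file is taken to be tab-separated with an article id, a 1-based sentence number and a label. The file names and the helper `merge_article` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def merge_article(text_path: Path, label_path: Path) -> pd.DataFrame:
    """Join each sentence of one article with its label (hypothetical helper)."""
    sentences = text_path.read_text(encoding="utf-8").split("\n")
    labels = pd.read_csv(
        label_path, sep="\t", names=["article_id", "sentence_id", "label"]
    )
    # Sentence numbers are assumed to be 1-based line numbers.
    labels["sentence"] = [sentences[i - 1] for i in labels["sentence_id"]]
    return labels


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Tiny made-up article and label file, written only for the demo.
    (folder / "article1.txt").write_text(
        "A title\n\nThey are traitors!\n", encoding="utf-8"
    )
    (folder / "article1.labels").write_text(
        "1\t1\tnon-propaganda\n1\t2\tnon-propaganda\n1\t3\tpropaganda\n",
        encoding="utf-8",
    )

    frames = [
        merge_article(path, path.with_suffix(".labels"))
        for path in sorted(folder.glob("*.txt"))
    ]
    merged = pd.concat(frames, ignore_index=True)

print(merged)
```

Blank lines in an article are kept as empty sentences at this stage, so the row count matches the label file.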

Data Cleansing

After consolidation, the final file has 15170 records; after the blank sentences are removed, 14263 records remain. This file is saved as a .csv file.
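Removing the blank records could look like the sketch below. The small frame and its column names are made up for the example, and treating whitespace-only sentences as blank is an assumption.

```python
import pandas as pd

# Made-up consolidated data with one empty and one whitespace-only sentence.
merged = pd.DataFrame(
    {
        "sentence": ["A title", "", "They are traitors!", "   "],
        "label": ["non-propaganda", "non-propaganda", "propaganda", "non-propaganda"],
    }
)

# A record counts as blank if its sentence is missing, empty or only whitespace.
is_blank = merged["sentence"].fillna("").str.strip() == ""
cleaned = merged[~is_blank].reset_index(drop=True)

# Passing a file path to to_csv writes the file; without a path it
# returns the CSV text, which keeps this example free of side effects.
csv_text = cleaned.to_csv(index=False)
print(csv_text)
```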

Modelling:

As in Task 1, the modelling process is repeated using 9 algorithms × 3 vectorization methods.

Evaluation:

Record the accuracy and F1 scores of all 9 × 3 = 27 models and select the best model to apply to the test data.
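The comparison loop can be sketched as follows. To keep the example short and dependent only on scikit-learn, it covers four of the nine algorithms on made-up sentences; the data, the hyperparameters and the `results` layout are assumptions. `HashingVectorizer` is created with `alternate_sign=False` because `MultinomialNB` rejects the negative feature values that the default setting can produce.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up sentences; label 1 = propaganda, 0 = non-propaganda.
train_texts = [
    "They are traitors who hate our nation",
    "The corrupt elite will destroy everything",
    "Only fools trust these lying enemies",
    "Wake up before the invaders ruin us",
    "The council approved the budget on Monday",
    "Rain is expected across the region",
    "The museum opens a new exhibition",
    "The train leaves at nine tomorrow",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = [
    "The corrupt traitors will ruin our nation",
    "The council opens the exhibition on Monday",
    "These lying enemies hate us",
    "Rain is expected tomorrow",
]
test_labels = [1, 0, 1, 0]

vectorizers = {
    "CountVectorizer": CountVectorizer(),
    "HashingVectorizer": HashingVectorizer(n_features=2**12, alternate_sign=False),
    "TfidfVectorizer": TfidfVectorizer(),
}
models = {
    "Naive Bayes": MultinomialNB(),
    "Passive Aggressive": PassiveAggressiveClassifier(max_iter=1000, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic": LogisticRegression(max_iter=1000),
}

# results[(model, vectorizer)] = (accuracy, F1)
results = {}
for model_name, model in models.items():
    for vec_name, vec in vectorizers.items():
        pipeline = make_pipeline(clone(vec), clone(model))
        pipeline.fit(train_texts, train_labels)
        pred = pipeline.predict(test_texts)
        results[(model_name, vec_name)] = (
            accuracy_score(test_labels, pred),
            f1_score(test_labels, pred, zero_division=0),
        )

# Pick the combination with the highest F1 score.
best = max(results, key=lambda key: results[key][1])
for key, (acc, f1) in results.items():
    print(key, round(acc, 3), round(f1, 3))
print("best by F1:", best)
```

Adding the remaining algorithms means adding entries to `models`; the loop itself stays the same.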

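| Model | Count Vectorization Accuracy | Count Vectorization F1 | HashingVectorizer Accuracy | HashingVectorizer F1 | TfidfVectorizer Accuracy | TfidfVectorizer F1 |
|---|---|---|---|---|---|---|
| I. Naïve Bayes Multinomial Classifier Model | 0.75883 | 0.50058 | 0.72966 | 0.06589 | | |
| II. Passive Aggressive Classifier Model | 0.72742 | 0.4329 | 0.73808 | 0.45571 | 0.738081 | 0.5139 |
| III. Decision Tree Classifier Model | 0.69798 | 0.39596 | 0.69181 | 0.40433 | 0.68367 | 0.39807 |
| IV. Random Forest Classifier Model | 0.74369 | 0.24711 | 0.73948 | 0.19427 | 0.73107 | 0.19479 |
| V. k-NN Classifier Model | 0.7162 | 0.12758 | 0.7106 | 0.19751 | 0.72237 | 0.019801 |
| VI. Logistic Model | 0.75771 | 0.47762 | 0.74565 | 0.25959 | 0.75042 | 0.31007 |
| VII. Support Vector Machine | 0.72153 | | 0.721536 | | 0.721536 | |
| VIII. XGBoost Model | 0.73667 | 0.30392 | 0.72349 | 0.24617 | 0.72181 | 0.231 |
| IX. Multilayer Perceptron Classification Model (Feed Forward Neural Network) | 0.71592 | 0.50416 | 0.72125 | 0.49285 | | |

Rows I and IX list only two accuracy/F1 pairs in the reported results, shown left to right in the order given; row VII lists three values, taken here to be the accuracies.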

Based on the accuracy and F1-score, it was determined that the PassiveAggressiveClassifier algorithm provided the best fit for the data when used with the TF-IDF vectorizer (the same result as in Task 1).

 Task 2 Program and Output Files: Task 2

Thanks!

 

 

 
