Datathon – HackNews – Solution – Dream_Invaders
Business understanding:
Due to the extreme divergence of social discussion in the political space, rumours and fake news spread like wildfire, making it difficult for readers to distinguish them from the truth.
What are we going to achieve?
To detect propaganda at the article level and the sentence level, and to recognize its type.
Using supervised machine learning techniques, models will be created to identify and flag propagandist news.
Task 1: Detecting Propaganda at Article Level
Data Understanding:
A text file of ‘Train’ data with 35,993 articles, each carrying an article id and a label (“propaganda” or “non-propaganda”), has been used to train the model. Each article appears on a single line, with newlines replaced by two spaces.
This training data set has been converted into a .csv file and read with Python:
1) the first column contains the title and the contents of the article;
2) the second column is the article id;
3) the third column is the label of the article.
The data obtained is messy, i.e. it contains NA values, irregular column placement, unnamed columns, and so on.
Data Cleansing
The next step of the analysis is to clean the data and make it ready for use in the model, which is done using the various built-in functions available in Python.
The NA values are removed, the unnamed columns are dropped, the columns are arranged in the desired order, and the index is set (a sketch follows).
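A minimal pandas sketch of this cleaning step (the file name and column names below are assumptions, not the exact ones used in our code):

```python
import pandas as pd

# Read the converted training file (file and column names here are assumptions)
df = pd.read_csv("train.csv")

# Remove rows with NA values and drop auto-generated "Unnamed" columns
df = df.dropna()
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]

# Arrange the columns in order and set the article id as the index
df = df[["article_id", "text", "label"]].set_index("article_id")
```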
Modelling:
We have considered nine machine learning algorithms, namely:
1) Passive Aggressive Classifier
2) Naïve Bayes
3) Random Forest
4) Decision Tree
5) k-NN Classifier Model
6) Support Vector Machine (SVM)
7) XGBoost Model
8) Logistic Model, and
9) Multiple Layer Perceptron Classification Model.
Text data requires special preparation before it can be used for modelling. The text must first be parsed into words, a step called tokenization. The words then need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization). The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of text data.
Three vectorization methods have been employed (a brief sketch follows the list), namely:
1) CountVectorizer
2) TfidfVectorizer
3) HashingVectorizer
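All three vectorizers expose the same fit/transform interface in scikit-learn; a brief illustrative sketch:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer, HashingVectorizer)

docs = ["Propaganda spreads fast.", "Balanced reporting informs."]

# Raw term counts per document
count_X = CountVectorizer(stop_words="english").fit_transform(docs)

# Term counts re-weighted by inverse document frequency
tfidf_X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Stateless hashing of tokens into a fixed number of features
hash_X = HashingVectorizer(n_features=2**16).fit_transform(docs)
```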
The data set has been split into training and testing data, with 25% of the data assigned to the test set and the remaining 75% used as training data. This is done using the ‘train_test_split’ function of the sklearn library (as shown below).
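A sketch of the split, continuing from the cleaned DataFrame above (column names remain assumptions):

```python
from sklearn.model_selection import train_test_split

# 75% train / 25% test, as described above; random_state is an assumption for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.25, random_state=42)
```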
Task 1 Program and Output Files: Task 1
List of external libraries used:
- import pandas as pd
- import numpy as np
- from sklearn import metrics
- from sklearn.model_selection import train_test_split
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.linear_model import PassiveAggressiveClassifier
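Putting these imports together, here is a minimal end-to-end sketch for the combination that performed best (TF-IDF + PassiveAggressiveClassifier); the hyperparameters shown are assumptions, since the exact settings of the datathon run are not reproduced here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn import metrics

# Vectorize: fit on the training text only, then transform both splits
tfidf = TfidfVectorizer(stop_words="english", max_df=0.7)  # max_df is an assumption
tfidf_train = tfidf.fit_transform(X_train)
tfidf_test = tfidf.transform(X_test)

# Train the classifier and predict on the held-out split
clf = PassiveAggressiveClassifier(max_iter=50)  # max_iter is an assumption
clf.fit(tfidf_train, y_train)
pred = clf.predict(tfidf_test)

# Evaluate with the same two metrics reported in the table below
print("Accuracy:", metrics.accuracy_score(y_test, pred))
print("F1:", metrics.f1_score(y_test, pred, pos_label="propaganda"))
```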
Evaluation:
Using the nine models, accuracy and F1 score have been calculated for each vectorization method, as shown below.
| Model | CountVectorizer Accuracy | CountVectorizer F1 | HashingVectorizer Accuracy | HashingVectorizer F1 | TfidfVectorizer Accuracy | TfidfVectorizer F1 |
|---|---|---|---|---|---|---|
| I. Naïve Bayes Multinomial Classifier Model | 0.92709 | 0.69172 | – | – | 0.89308 | NaN |
| II. Passive Aggressive Classifier Model | 0.95128 | 0.7637 | 0.956 | 0.79175 | 0.96452 | 0.82111 |
| III. Decision Tree Classifier Model | 0.92229 | 0.63505 | 0.91358 | 0.60348 | 0.91784 | 0.620778 |
| IV. Random Forest Classifier Model | 0.91531 | 0.32164 | 0.90797 | 0.25876 | 0.91231 | 0.33846 |
| V. k-NN Classifier Model | 0.90264 | 0.18435 | 0.89786 | 0.09813 | 0.89686 | 0.07753 |
| VI. Logistic Model | 0.95799 | 0.78457 | 0.94609 | 0.68689 | 0.94565 | 0.67551 |
| VII. Support Vector Machine | 0.8946 | 0.042 | 0.89247 | NaN | 0.89247 | – |
| VIII. XGBoost Model | 0.89142 | 0.01213 | 0.89264 | 0.01226 | 0.89164 | 0.01215 |
| IX. Multiple Layer Perceptron Classification Model (Feed Forward Neural Network) | 0.95732 | 0.78737 | 0.96397 | 0.815201 | 0.96288 | 0.80804 |
Based on the accuracy and F1-score, it was determined that the PassiveAggressiveClassifier algorithm provided the best fit for the data when using the TF-IDF vectorizer.
Task 2: Detecting Propaganda at Sentence Level
Data Understanding:
There are around 293 text files, one per news article. Each file contains the title and the content of the article, split into sentences. The classification of each sentence of each file is available in a separate label file. The news text files and the classification files are merged, either with the NLTK package or with simple raw code, by reading the .txt files and the label files separately, building two vectors, and merging the respective files (a sketch is given below).
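A minimal raw-Python sketch of the merge, assuming each article file `article<id>.txt` has a matching label file `article<id>.labels` with one label per sentence line (this naming convention and the directory path are assumptions):

```python
import glob
import os
import pandas as pd

records = []
for txt_path in glob.glob("articles/*.txt"):  # directory name is an assumption
    label_path = txt_path.replace(".txt", ".labels")
    with open(txt_path, encoding="utf-8") as f:
        sentences = f.read().splitlines()
    with open(label_path, encoding="utf-8") as f:
        labels = f.read().splitlines()
    # One label per sentence line; pair them up and keep the article id
    article_id = os.path.basename(txt_path).replace(".txt", "")
    records.extend(
        {"article_id": article_id, "sentence": s, "label": l}
        for s, l in zip(sentences, labels))

df = pd.DataFrame(records)
```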
Data Cleansing
After consolidation, the final file has 15,170 records; after removing blank news sentences, 14,263 records remain. This file is saved as a .csv (see the snippet below).
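A short snippet for the blank removal and export, continuing from the merged DataFrame above:

```python
# Drop rows with blank sentences (15,170 -> 14,263 records) and save for modelling
df = df[df["sentence"].str.strip() != ""]
df.to_csv("task2_sentences.csv", index=False)  # file name is an assumption
```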
Modelling:
As mentioned in Task 1, the modelling process is repeated over the 9 algorithms × 3 vectorization methods (a sketch of the grid is given below).
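A sketch of how such a grid can be run, with illustrative subsets of the models and vectorizers (the full run covers all nine models and all three vectorizers; `X_train`, `X_test`, `y_train`, `y_test` are the Task 2 sentence-level splits):

```python
from sklearn import metrics
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer, HashingVectorizer)
from sklearn.linear_model import PassiveAggressiveClassifier, LogisticRegression
from sklearn.naive_bayes import MultinomialNB

vectorizers = {  # illustrative subset of the 3 vectorizers
    "count": CountVectorizer(),
    # alternate_sign=False keeps features non-negative, which MultinomialNB requires
    "hashing": HashingVectorizer(alternate_sign=False),
    "tfidf": TfidfVectorizer(),
}
models = {  # illustrative subset of the 9 models
    "nb": MultinomialNB(),
    "pac": PassiveAggressiveClassifier(),
    "logit": LogisticRegression(max_iter=1000),
}

scores = {}
for v_name, vec in vectorizers.items():
    Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)
    for m_name, model in models.items():
        pred = model.fit(Xtr, y_train).predict(Xte)
        scores[(v_name, m_name)] = (
            metrics.accuracy_score(y_test, pred),
            # pos_label assumed to match the "propaganda" label string
            metrics.f1_score(y_test, pred, pos_label="propaganda"))
```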
Evaluation:
The accuracy and F1 scores of all 9 × 3 = 27 models are recorded, and the best model is selected to fit the test data.
| Model | CountVectorizer Accuracy | CountVectorizer F1 | HashingVectorizer Accuracy | HashingVectorizer F1 | TfidfVectorizer Accuracy | TfidfVectorizer F1 |
|---|---|---|---|---|---|---|
| I. Naïve Bayes Multinomial Classifier Model | 0.75883 | 0.50058 | – | – | 0.72966 | 0.06589 |
| II. Passive Aggressive Classifier Model | 0.72742 | 0.4329 | 0.73808 | 0.45571 | 0.738081 | 0.5139 |
| III. Decision Tree Classifier Model | 0.69798 | 0.39596 | 0.69181 | 0.40433 | 0.68367 | 0.39807 |
| IV. Random Forest Classifier Model | 0.74369 | 0.24711 | 0.73948 | 0.19427 | 0.73107 | 0.19479 |
| V. k-NN Classifier Model | 0.7162 | 0.12758 | 0.7106 | 0.19751 | 0.72237 | 0.019801 |
| VI. Logistic Model | 0.75771 | 0.47762 | 0.74565 | 0.25959 | 0.75042 | 0.31007 |
| VII. Support Vector Machine | 0.72153 | – | 0.721536 | – | 0.721536 | – |
| VIII. XGBoost Model | 0.73667 | 0.30392 | 0.72349 | 0.24617 | 0.72181 | 0.231 |
| IX. Multiple Layer Perceptron Classification Model (Feed Forward Neural Network) | 0.71592 | 0.50416 | 0.72125 | 0.49285 | – | – |
Based on the accuracy and F1-score, it was again determined that the PassiveAggressiveClassifier algorithm provided the best fit for the data when using the TF-IDF vectorizer (the same result as in Task 1).
Task 2 Program and Output Files: Task 2
Thanks!