News Classification: Real or Fake
Team: Abhineet Singh, N.P. Ganesh, Nandan Gowda, Pooja Mohan, Rithvik Vorkady
News classification is carried out at several levels, with increasing complexity and different objectives, as described below:
- Propaganda detection at the article level (PAL). This is the easiest task, albeit not easy in absolute terms. It is a classical supervised document classification problem. You are given a set of news articles, and you have to classify each article into one of two possible classes: “propagandistic article” vs. “non-propagandistic article.”
- Propaganda detection at the sentence level (PSL). This is another classification task, but of different granularity. The objective is to classify each sentence in a news article as either “sentence that contains propaganda” or “sentence that does not contain propaganda.”
- Propaganda type recognition (PTR). This task is similar to the task of Named Entity Recognition, but applied in the propaganda detection setting. The goal is to detect the occurrences and to correctly assign the type of propaganda to text fragments.
Propaganda detection at the article level (PAL):
Methodology Used:
Our evaluation metric is the F1 score, computed as follows: first, we compute the precision P as the ratio of the number of true positives (documents/sentences that the system labeled as propagandistic and that are indeed propagandistic) to the total number of documents/sentences that the system classified as propagandistic. Second, we compute the recall R as the ratio of the number of true positives to the number of all documents/sentences that are indeed propagandistic. Then, F1 = 2PR / (P + R). This measure takes the class imbalance in the test dataset into account. All of this was computed in Python.
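To make the metric concrete, here is a small sketch (the label strings and toy predictions are illustrative, not taken from our notebook) that computes P, R and F1 by hand and cross-checks the result against scikit-learn's f1_score:

```python
from sklearn.metrics import f1_score

def f1_propaganda(y_true, y_pred, positive="propaganda"):
    """Compute F1 for the 'propaganda' class from true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)   # labeled propagandistic
    actual_pos = sum(1 for t in y_true if t == positive)      # truly propagandistic
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

y_true = ["propaganda", "non-propaganda", "propaganda", "non-propaganda"]
y_pred = ["propaganda", "propaganda", "propaganda", "non-propaganda"]
print(f1_propaganda(y_true, y_pred))                      # manual computation
print(f1_score(y_true, y_pred, pos_label="propaganda"))   # scikit-learn equivalent
```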
PACKAGES USED:
We used the scikit-learn package for Python, which provides the CountVectorizer, TfidfVectorizer and HashingVectorizer classes.
COUNTVECTORIZER():
Convert a collection of text documents to a matrix of token counts.
This implementation produces a sparse representation of the counts using scipy.sparse.coo_matrix.
If you do not provide an a-priori dictionary, and you do not use an analyzer that does some kind of feature selection, then the number of features will be equal to the vocabulary size found by analyzing the data.
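A minimal sketch of CountVectorizer on a toy two-document corpus (the example sentences are illustrative only, not taken from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Propaganda spreads fear and fear sells.",
          "Reporters verify facts before publishing."]

count_vec = CountVectorizer()               # vocabulary is learned from the data itself
X_counts = count_vec.fit_transform(corpus)  # sparse document-term matrix of raw counts

print(sorted(count_vec.vocabulary_))        # one feature per vocabulary term
print(X_counts.toarray())                   # token counts per document
```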
TFIDFVECTORIZER():
Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer, which applies term frequency-inverse document frequency (TF-IDF) normalization to a sparse matrix of occurrence counts.
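The same toy corpus vectorized with TfidfVectorizer, which yields TF-IDF weights instead of raw counts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Propaganda spreads fear and fear sells.",
          "Reporters verify facts before publishing."]

tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(corpus)  # sparse matrix of TF-IDF weights

print(X_tfidf.shape)                       # (n_documents, vocabulary_size)
print(X_tfidf.toarray())                   # rows are L2-normalized by default
```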
HASHINGVECTORIZER():
Convert a collection of text documents to a matrix of token occurrences.
It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected on the Euclidean unit sphere if norm='l2'.
This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.
This strategy has several advantages:
- it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory
- it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
- it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
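A minimal sketch of HashingVectorizer; because it is stateless, no fit step is required and the number of features is fixed up front (the n_features value below is an illustrative choice, not necessarily the one used in our notebook):

```python
from sklearn.feature_extraction.text import HashingVectorizer

corpus = ["Propaganda spreads fear and fear sells.",
          "Reporters verify facts before publishing."]

hash_vec = HashingVectorizer(n_features=2**18, norm='l2')  # tokens are hashed to column indices
X_hashed = hash_vec.transform(corpus)                      # stateless: transform() needs no prior fit

print(X_hashed.shape)  # (n_documents, n_features), independent of the actual vocabulary size
```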
Business understanding:
With the plethora of varying opinions in public discourse right now, news plays a critical role in informing people about current developments. It is more critical than ever that the news being circulated is accurate and close to the truth rather than a propaganda piece.
Our aim is to detect propaganda first at the article level, then at the sentence level, and finally to classify it by propaganda type.
We use supervised machine learning techniques to build models that can identify and flag propagandistic news.
Data Understanding and Prepping:
The first step in any data classification task is to convert the data into a workable format. The development data is not in CSV format: the news text, the article number, and the label (propaganda or non-propaganda) appear side by side in an unstructured layout. It needs to be converted into a CSV file so that machine learning algorithms can be used to predict the type of news.
The training data is in textual format, with 35,993 articles, each with an article id and a label (“propaganda” or “non-propaganda”). This training set has been converted into a .csv file with the following columns:
1) The first column contains the title and the contents of the article;
2) The second column is the article id;
3) The third column is the label of the article.
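A minimal sketch of this conversion, assuming the raw training file is tab-separated with the article text, article id and label in that order (the file names and column order here are assumptions, not the exact layout of the released data):

```python
import pandas as pd

# Assumed layout: article text <TAB> article id <TAB> label, one article per line.
raw = pd.read_csv("train.txt", sep="\t", header=None,
                  names=["article_text", "article_id", "label"])

# Write out the three columns described above as a CSV file.
raw.to_csv("train.csv", index=False)
```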
Data Cleansing
The next step of the analysis is to clean the data and make it ready for use in the model, which is done using the various built-in functions available in Python.
The NA values are removed, the unnamed columns are dropped, the columns are arranged in the desired order, and the index is set.
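A sketch of these cleansing steps with pandas (the file and column names are the assumed ones from the conversion step above):

```python
import pandas as pd

df = pd.read_csv("train.csv")

df = df.dropna()                                        # remove rows with NA values
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]   # drop stray unnamed index columns
df = df[["article_text", "article_id", "label"]]        # arrange columns in the desired order
df = df.set_index("article_id")                         # use the article id as the index
```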
Modelling:
The F1 score and accuracy level were calculated using the following ML classifier algorithms:
- Naïve Bayes Multinomial Classifier Model
- Passive Aggressive Classifier Model
- Decision Tree Classifier Model
- Random Forest Classifier Model
- k-NN Classifier Model
- Logistic Model
- Support Vector Machine
- XGBoost Model
- Multiple Layer Perceptron Classification Model (Feed Forward Neural Network)
Text data requires special preparation before you can start using it for modelling. The text must first be parsed into words (tokens), a step called tokenization. Then the words need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization). The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data.
The data set has been split into training and testing data, with 30% of the data assigned to the test set and the remaining 70% used for training. This is done using the ‘train_test_split’ function of the sklearn library (as shown in the sketch below).
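A minimal sketch of this split, assuming the cleansed DataFrame and column names from the previous step:

```python
from sklearn.model_selection import train_test_split

X = df["article_text"]   # raw article text, to be vectorized afterwards
y = df["label"]          # "propaganda" / "non-propaganda"

# 70% train / 30% test, with a fixed random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
```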
| Model | CountVectorizer Accuracy (%) | CountVectorizer F1 (absolute) | HashingVectorizer Accuracy (%) | HashingVectorizer F1 (absolute) | TfidfVectorizer Accuracy (%) | TfidfVectorizer F1 (absolute) |
|---|---|---|---|---|---|---|
| I. Naïve Bayes Multinomial Classifier Model | 92.81 | 0.717 | – | – | 88.3 | NaN |
| II. Passive Aggressive Classifier Model | 95.00 | 0.771 | 95.71 | 0.805 | 96.10 | 0.820 |
| III. Decision Tree Classifier Model | 92.90 | 0.677 | 92.40 | 0.665 | 91.71 | 0.632 |
| IV. Random Forest Classifier Model | 90.55 | 0.335 | 89.9 | 0.231 | 90.41 | 0.321 |
| V. k-NN Classifier Model | 89.41 | 0.195 | 92.21 | 0.580 | 92.1 | 0.555 |
| VI. Logistic Model | 95.41 | 0.785 | 93.91 | 0.674 | 93.61 | 0.645 |
| VII. Support Vector Machine | 88.51 | 0.375 | 88.31 | NaN | 88.3 | NaN |
| VIII. XGBoost Model | 94.19 | 0.699 | 93.61 | 0.679 | 93.71 | 0.691 |
| IX. MLP Classification (Feed Forward Neural Network) (5) | 95.80 | 0.801 | 96.15 | 0.819 | 96.2 | 0.817 |
| IX. MLP Classification (Feed Forward Neural Network) (5,5) | 95.90 | 0.804 | 96.14 | 0.819 | 96.11 | 0.817 |
| IX. MLP Classification (Feed Forward Neural Network) (5,5,5) | 96.01 | 0.801 | 95.83 | 0.796 | 96.10 | 0.814 |
INFERENCE:
As per the results obtained, the Passive Aggressive Classifier provided the best fit for the data when used with the TF-IDF vectorizer, as the F1 score obtained was the highest in this setup and the accuracy was among the highest.
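A minimal sketch of this best-performing setup, continuing from the train/test split sketch above (hyperparameters are scikit-learn defaults here; the actual notebook may have used different settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, f1_score

# Vectorize the article text with TF-IDF features.
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Train the Passive Aggressive Classifier and evaluate it on the held-out 30%.
clf = PassiveAggressiveClassifier(random_state=42)
clf.fit(X_train_tfidf, y_train)
y_pred = clf.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred, pos_label="propaganda"))  # label string assumed
```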
NOTE: For the train/test split we used random_state = 42; the random state needs to be a fixed integer so that the split is reproducible, otherwise a different split is generated on every run.
CODE:
https://dss-www-production.s3.amazonaws.com/uploads/2019/01/Task1-Code-indianhunters.ipynb
OUTPUT:
Propaganda detection at the Sentence level (PSL):
The objective is to classify each sentence in a news article as either “sentence that contains propaganda” or “sentence that does not contain propaganda.”
| Model | CountVectorizer Accuracy (%) | CountVectorizer F1 (absolute) | HashingVectorizer Accuracy (%) | HashingVectorizer F1 (absolute) | TfidfVectorizer Accuracy (%) | TfidfVectorizer F1 (absolute) |
|---|---|---|---|---|---|---|
| I. Naïve Bayes Multinomial Classifier Model | 71.50 | 0.495 | – | – | 73.30 | 0.117 |
| II. Passive Aggressive Classifier Model | 71.21 | 0.442 | 73.10 | 0.460 | 71.70 | 0.462 |
| III. Decision Tree Classifier Model | 69.91 | 0.462 | 70.10 | 0.424 | 69.70 | 0.422 |
| IV. Random Forest Classifier Model | 72.51 | 0.402 | 73.50 | 0.265 | 73.50 | 0.313 |
| V. k-NN Classifier Model | 71.82 | 0.128 | 71.70 | 0.503 | 72.00 | 0.052 |
| VI. Logistic Model | 75.42 | 0.451 | 74.10 | 0.250 | 74.12 | 0.242 |
| VII. Support Vector Machine | 71.94 | NaN | 71.90 | NaN | 71.90 | NaN |
| VIII. XGBoost Model | 73.18 | 0.250 | 72.90 | 0.239 | 73.00 | 0.255 |
| IX. MLP Classification (Feed Forward Neural Network) (5) | 70.34 | 0.444 | 70.30 | 0.448 | 70.60 | 0.456 |
| IX. MLP Classification (Feed Forward Neural Network) (5,5) | 70.00 | 0.442 | 70.30 | 0.482 | 69.80 | 0.495 |
| IX. MLP Classification (Feed Forward Neural Network) (5,5,5) | 69.2 | 0.446 | 70.10 | 0.442 | 70.15 | 0.448 |
INFERENCE:
As per the results obtained, the k-NN classifier provided the best fit for the data when used with the Hashing vectorizer, as the F1 score obtained was the highest in this setup.
NOTE: For the train/test split we used random_state = 42; the random state needs to be a fixed integer so that the split is reproducible, otherwise a different split is generated on every run.
CODE:
https://dss-www-production.s3.amazonaws.com/uploads/2019/01/indianhunters_Task2.ipynb
OUTPUT:
https://dss-www-production.s3.amazonaws.com/uploads/2019/01/indianhunters_Task2_test_predicted.txt
CONCLUSION:
Fake news is a problem that is heavily affecting society and our perception of not only the media but also facts and opinions themselves. I believe that this problem is solvable using AI and ML, but it will only be possible if the different communities with expertise about this work together, namely journalists, machine learning experts and product developers. In addition, I strongly believe that very little can be done without dividing the problem into smaller problems and then combining each one of the potential solutions.
From a more general perspective, I believe that the technology will allow a change in the information consumption habits by showing us different points of view for an interesting event and then empowering the user to decide what to believe. This will not only improve our understanding of the world, but also minimize the polarization in society.