FAKE NEWS: Detecting Media Bias and Propaganda
Names: Ananya Jena, Gayathri Pamuluru, Apoorva R, T S S N Sainadh, Phanikrishna V
1. Business Understanding
Propaganda has been used ever since government-based systems were institutionalized in society. It is information, especially of a biased or misleading nature, used to promote a particular cause or point of view. As a form of communication it may or may not be accurate, and it is used to spread information and ideas to advance or damage a cause. Used in the right way and at the right time, propaganda is highly effective. Fake news also tends to spread faster than truthful information, and it distracts people from thinking deeply about major issues. It is therefore in the interest of both the public and news organizations to be able to detect disinformation in all its forms.
Given a set of news articles, the task is to use classification techniques to separate propaganda from non-propaganda news.
To analyse the dataset, we carry out a series of tasks using machine learning models.
2. Data Understanding
We work with several text datasets consisting of news articles.
Datasets provided:
Task 1: Propaganda detection at the article level (PAL).
We have to classify each article as propagandistic or non-propagandistic.
The given training dataset contains 35986 rows and 3 columns in TAB-separated format:
1) First column contains the title and the contents of the article
2) Second column is the article id
3) Third column is the label of the article. Values are: “propaganda”, “non-propaganda”
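A minimal sketch of loading a file in this layout with pandas. The column names (`text`, `article_id`, `label`, `target`) and the two sample rows are illustrative assumptions, not taken from the dataset; a real run would pass the path of the training file instead of the in-memory buffer.

```python
import io

import pandas as pd

# Two made-up rows in the described layout: text <TAB> article id <TAB> label.
sample = (
    "Title one. Body of the first article.\t1001\tpropaganda\n"
    "Title two. Body of the second article.\t1002\tnon-propaganda\n"
)

# No header row is assumed, so column names are supplied here (assumed names).
train = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["text", "article_id", "label"],
    quoting=3,  # csv.QUOTE_NONE: keep quote characters in article text literal
)

# Binary target: 1 for "propaganda", 0 for "non-propaganda".
train["target"] = (train["label"] == "propaganda").astype(int)
print(train.shape)
```

Disabling quote handling is a design choice for free-form article text, where stray quotation marks could otherwise merge several rows into one.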
Task 2: Propaganda detection at the sentence level (PSL).
This is another classification task, but of different granularity.
The objective is to classify each sentence in a news article as either using at least one propaganda technique or containing no propaganda.
After converting the training text data to CSV format, the dataframe contains 15170 rows and 3 columns, including null values.
Task 3: Propaganda type recognition (PTR).
The goal of this task is to detect the occurrences of propagandistic techniques in the text and to correctly assign a type of propaganda to each text fragment.
After converting the training text data to CSV format, the dataframe contains 5114 rows and 3 columns, including null values.
The test dataset is the same for all three tasks and contains 10152 rows and 3 columns.
3. Algorithms Used
# Vectorizers used:
- Count
- Tfidf
- Hash
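As a rough illustration of how the three vectorizers differ, the sketch below applies scikit-learn's `CountVectorizer`, `TfidfVectorizer` and `HashingVectorizer` to two made-up sentences. That the report used these scikit-learn classes is an assumption, and the `n_features` value for the hashing vectorizer is chosen only for the example.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Two made-up documents, not taken from the dataset.
docs = [
    "the senator praised the new policy",
    "the new policy is a total disaster",
]

vectorizers = {
    "Count": CountVectorizer(),  # raw term counts
    "Tfidf": TfidfVectorizer(),  # counts reweighted by inverse document frequency
    "Hash": HashingVectorizer(n_features=2**10),  # stateless hashing, fixed width
}

shapes = {}
for name, vec in vectorizers.items():
    X = vec.fit_transform(docs)  # sparse matrix with one row per document
    shapes[name] = X.shape
    print(name, X.shape)
```

Count and Tfidf learn a vocabulary from the training text, so their column count depends on the corpus; the hashing vectorizer keeps a fixed width and stores no vocabulary.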
# ML Algorithms used:
- Passive Aggressive
- MLP
- Logistic Regression
- AdaBoost
- Decision Tree
- Random Forest
- KNN
- SVM
- Naive Bayes
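A minimal sketch of how the nine classifiers could be set up with scikit-learn and trained behind a common vectorizer. The specific classes (for example `SVC` for SVM and `MultinomialNB` for Naive Bayes), the default hyperparameters, and the toy sentences are assumptions for illustration; the attached notebook may configure them differently.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed scikit-learn counterparts of the algorithms listed above.
classifiers = {
    "Passive Aggressive": PassiveAggressiveClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=300, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(),
    "Naive Bayes": MultinomialNB(),
}

# Made-up training sentences, three per class.
train_texts = [
    "our glorious leader will crush the traitors",
    "the enemy wants to destroy our way of life",
    "only a fool would doubt this heroic victory",
    "the council approved the budget on tuesday",
    "rainfall was below average in the region last month",
    "the company reported quarterly earnings today",
]
train_labels = ["propaganda"] * 3 + ["non-propaganda"] * 3

predictions = {}
for name, clf in classifiers.items():
    # Each classifier gets its own vectorizer so no state is shared between runs.
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(["the traitors want to destroy us"])[0]

for name, label in predictions.items():
    print(f"{name}: {label}")
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on training text only, which avoids leaking test data into the features.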
# Evaluation measures:
- Accuracy
- F1 Score (used for evaluation)
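The two measures can diverge sharply when one class is rare, which is why a high accuracy can sit next to a low F1 Score. A small worked sketch on made-up labels, treating propaganda as the positive class; the counts are illustrative, and whether the report computes a binary or an averaged F1 is not stated, so the binary form here is an assumption.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# 100 made-up articles: 10 propaganda (1) followed by 90 non-propaganda (0).
y_true = [1] * 10 + [0] * 90
# A classifier that finds only 2 of the 10 propaganda articles and raises 1 false alarm.
y_pred = [1] * 2 + [0] * 8 + [1] * 1 + [0] * 89

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}  f1={f1:.3f}")  # accuracy=0.910  f1=0.308
```

Here 91 of 100 predictions are correct, yet precision is 2/3 and recall is 2/10, so F1 is only 4/13.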
4. Task 1
Problem Statement: For a given test article, identify whether it is propagandistic or not.
MACHINE LEARNING CLASSIFIERS & F1 SCORE CALCULATION AND RESULTS :
(The Detailed Code is included in the attached Jupyter Notebook)
The aim of our work is to investigate the performance of different classification methods on a large dataset. We used the 9 algorithms listed above and computed both the accuracy and the F1 Score for each.
Count Vectorizer:
Metric | Passive Aggressive | MLP | Logistic Regression | AdaBoost | Decision Tree | Random Forest | KNN | SVM | Naive Bayes |
Model Accuracy | 0.953000 | 0.954000 | 0.956000 | 0.940000 | 0.908000 | 0.914000 | 0.898000 | 0.801000 | 0.925000 |
F1 Score | 0.772463 | 0.780531 | 0.781235 | 0.679602 | 0.287554 | 0.379611 | 0.160673 | 0.113036 | 0.698696 |
Based on the above table, the highest accuracy is 95.60 % and the highest F1 Score is 0.781235. Both belong to the Logistic Regression classifier.
Tfidf Vectorizer:
Metric | Passive Aggressive | MLP | Logistic Regression | AdaBoost | Decision Tree | Random Forest | KNN | SVM | Naive Bayes |
Model Accuracy | 0.960000 | 0.961000 | 0.943000 | 0.939000 | 0.922000 | 0.912000 | 0.925000 | 0.936000 | 0.946000 |
F1 Score | 0.832773 | 0.811738 | 0.664488 | 0.678082 | 0.641860 | 0.343967 | 0.553425 | 0.589001 | 0.729867 |
In this table, the highest accuracy is 96.10 %, which belongs to the MLP classifier, and the highest F1 Score is 0.832773, which belongs to the Passive Aggressive classifier.
Hash Vectorizer:
Metric | Passive Aggressive | MLP | Logistic Regression | AdaBoost | Decision Tree | Random Forest | KNN | SVM | Naive Bayes |
Model Accuracy | 0.957000 | 0.956000 | 0.944000 | 0.941000 | 0.908000 | 0.906000 | 0.929000 | 0.938000 | 0.925000 |
F1 Score | 0.788732 | 0.764325 | 0.684211 | 0.687470 | 0.287554 | 0.253875 | 0.594937 | 0.618639 | 0.490920 |
In this table, the highest accuracy is 95.70 % and the highest F1 Score is 0.788732. Both belong to the Passive Aggressive classifier.
Summary :
We met our objective of evaluating 9 selected classification algorithms on the data. The best result on the news data comes from the Passive Aggressive classifier with the Tfidf vectorizer, with an F1 Score of 0.832773. Among the machine learning algorithms tested, it is therefore the most promising choice for identifying propaganda news in the given data.
The predicted dataset will be uploaded separately to the portal.
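The comparison behind this summary can be reproduced from the F1 Scores reported in the three Task 1 tables. The sketch below only restates those reported numbers and picks the maximum; the variable names are illustrative.

```python
models = [
    "Passive Aggressive", "MLP", "Logistic Regression", "AdaBoost",
    "Decision Tree", "Random Forest", "KNN", "SVM", "Naive Bayes",
]

# F1 Scores copied from the Task 1 tables, in the column order of `models`.
f1_scores = {
    "Count": [0.772463, 0.780531, 0.781235, 0.679602, 0.287554,
              0.379611, 0.160673, 0.113036, 0.698696],
    "Tfidf": [0.832773, 0.811738, 0.664488, 0.678082, 0.641860,
              0.343967, 0.553425, 0.589001, 0.729867],
    "Hash": [0.788732, 0.764325, 0.684211, 0.687470, 0.287554,
             0.253875, 0.594937, 0.618639, 0.490920],
}

# Flatten to (score, vectorizer, model) tuples and take the largest score.
best_score, best_vectorizer, best_model = max(
    (score, vec, model)
    for vec, scores in f1_scores.items()
    for model, score in zip(models, scores)
)
print(best_vectorizer, best_model, best_score)  # Tfidf Passive Aggressive 0.832773
```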
OUTPUT PREDICTED FILE : Task_1.test
5. Task 2
Problem Statement: For a given test phrase, identify whether it is Propagandistic or not.
To create the corpus dataset for further analysis in this task, we performed the following steps.
Data preprocessing steps :-
- Read all article text files and stored the articles in a data-frame.
- Read the .Label files of Task 2 and stored them in another data-frame.
- Concatenated data-frame 1 (articles) and data-frame 2 (label files).
- Wrote the concatenated data-frame to a .CSV file for further processing.
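The steps above can be sketched with pandas as follows. The layout of the article and label files is not described in this report, so the two small data-frames, their column names and the label values are hypothetical stand-ins; only the concatenate-then-write pattern is the point.

```python
import io

import pandas as pd

# Hypothetical stand-in for data-frame 1: sentences read from article files.
# None marks a blank line, one possible source of null values in the result.
articles = pd.DataFrame(
    {
        "article_id": [111, 111, 111],
        "sentence": [
            "The minister spoke on Monday.",
            None,
            "Only a traitor would disagree with him.",
        ],
    }
)

# Hypothetical stand-in for data-frame 2: one label per sentence, in the same order.
labels = pd.DataFrame({"label": ["non-propaganda", "non-propaganda", "propaganda"]})

# Column-wise concatenation aligns rows by position (both frames share index 0..2).
corpus = pd.concat([articles, labels], axis=1)

# Write the combined frame as CSV; a real run would pass a file path instead.
buffer = io.StringIO()
corpus.to_csv(buffer, index=False)

# Rows with a missing sentence cannot be vectorized, so one option is to drop them.
usable = corpus.dropna(subset=["sentence"])
print(corpus.shape, usable.shape)
```

Positional concatenation only gives correct labels if both frames list sentences in the same order, so checking that the two row counts match before concatenating is a cheap safeguard.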
MACHINE LEARNING CLASSIFIERS & F1 SCORE CALCULATION AND RESULTS :
(The Detailed Code is included in the attached Jupyter Notebook)
The aim of our work is to investigate the performance of different classification methods on a large dataset. We used the 9 algorithms listed above and computed both the accuracy and the F1 Score for each.
Count Vectorizer:
Metric | Passive Aggressive | MLP | Logistic Regression | AdaBoost | Decision Tree | Random Forest | KNN | SVM | Naive Bayes |
Model Accuracy | 0.717000 | 0.714000 | 0.760000 | 0.734000 | 0.725000 | 0.741000 | 0.718000 | 0.678000 | 0.760000 |
F1 Score | 0.481370 | 0.469467 | 0.463224 | 0.280480 | 0.043938 | 0.411671 | 0.136006 | 0.374035 | 0.503145 |
Based on the above table, the highest accuracy is 76 %, shared by the Logistic Regression and Naive Bayes classifiers, and the highest F1 Score is 0.503145, which belongs to Naive Bayes. Naive Bayes therefore has both the highest accuracy and the highest F1 Score.
Tfidf Vectorizer:
Metric | Passive Aggressive | MLP | Logistic Regression | AdaBoost | Decision Tree | Random Forest | KNN | SVM | Naive Bayes |
Model Accuracy | 0.733000 | 0.714000 | 0.725000 | 0.729000 | 0.710000 | 0.743000 | 0.720000 | 0.703202 | 0.738000 |
F1 Score | 0.511945 | 0.479353 | 0.487146 | 0.232804 | 0.427453 | 0.319109 | 0.063963 | 0.498420 | 0.129082 |
In this table, the highest accuracy is 74.30 %, which belongs to Random Forest, and the highest F1 Score is 0.511945, which belongs to the Passive Aggressive classifier.
Summary :
Across all 9 classification algorithms evaluated on the news data, the best F1 Score again comes from the Passive Aggressive classifier with the Tfidf vectorizer, with an accuracy of 73.30 % and an F1 Score of 0.511945. Among the machine learning algorithms tested, it is therefore the most promising choice for identifying propagandistic sentences in the given data.
The predicted dataset will be uploaded separately to the portal.
OUTPUT PREDICTED FILE : Task2_output
6. Task 3
Problem Statement: For a given propagandistic test phrase, identify the type of propaganda.
To create the corpus dataset for further analysis in this task, we performed the following steps.
Data preprocessing steps :-
- Took the start and end values from the .LABELS file.
- Read the text between the extracted start and end values from the article text file and stored the resulting fragments in a data-frame.
- Wrote the data-frame to a .CSV file for further processing.
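A minimal sketch of the fragment extraction described above, using plain string slicing. The article text, the technique names and the offsets are made up, and treating the end offset as exclusive is an assumption; the actual .LABELS layout is not given in this report.

```python
# Made-up article text, not taken from the dataset.
article_text = "Our brave heroes will crush the cowardly enemy. Taxes rise next year."

# Hypothetical label rows: (technique, start offset, end offset), end exclusive.
label_rows = [
    ("Loaded_Language", 4, 16),
    ("Name_Calling", 32, 46),
]


def extract_fragments(text, rows):
    """Slice `text` at each (start, end) pair and keep the technique alongside it."""
    fragments = []
    for technique, start, end in rows:
        fragments.append(
            {
                "technique": technique,
                "start": start,
                "end": end,
                "fragment": text[start:end],
            }
        )
    return fragments


fragments = extract_fragments(article_text, label_rows)
for row in fragments:
    print(row["technique"], "->", repr(row["fragment"]))
```

Each resulting row pairs a text fragment with its technique label, which is the shape a classifier for this task needs; the list of dictionaries can be passed directly to a data-frame constructor.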
MACHINE LEARNING CLASSIFIERS & F1 SCORE CALCULATION AND RESULTS :
(The Detailed Code is included in the attached Jupyter Notebook)
The aim of our work is to investigate the performance of different classification methods on a large dataset. We used 9 algorithms, the same as in the earlier tasks except that an SGDC classifier takes the place of SVM, and computed both the accuracy and the F1 Score for each.
Count Vectorizer:
Metric | Passive Aggressive | MLP | Logistic Regression | AdaBoost | Decision Tree | Random Forest | KNN | SGDC | Naive Bayes |
Model Accuracy | 0.466101 | 0.325293 | 0.498044 | 0.363755 | 0.488266 | 0.485007 | 0.370926 | 0.508474 | 0.499348 |
F1 Score | 0.228497 | 0.128223 | 0.224045 | 0.043522 | 0.255098 | 0.264961 | 0.135333 | 0.272964 | 0.199505 |
Based on the above table, the highest accuracy is 50.85 % and the highest F1 Score is 0.272964. Both belong to the SGDC classifier.
Tfidf Vectorizer:
Metric | Passive Aggressive | MLP | Logistic Regression | AdaBoost | Decision Tree | Random Forest | KNN | SGDC | Naive Bayes |
Model Accuracy | 0.462842 | 0.353976 | 0.468057 | 0.363748 | 0.388266 | 0.488918 | 0.294003 | 0.510450 | 0.464146 |
F1 Score | 0.249072 | 0.119122 | 0.139001 | 0.043522 | 0.255269 | 0.252027 | 0.115815 | 0.254536 | 0.110229 |
In this table, the highest accuracy is about 51.0 % (0.510450), which belongs to the SGDC classifier, and the highest F1 Score is 0.255269, which belongs to the Decision Tree classifier.
Summary :
Across all 9 classification algorithms evaluated on the news data, the best result comes from the SGDC classifier with the Count vectorizer, with an F1 Score of 0.272964. Among the machine learning algorithms tested, it is therefore the most promising choice for recognising the type of propaganda in the given data, although every F1 Score for this task remains below 0.3.
The predicted dataset will be uploaded separately to the portal.
7. Conclusion
This solution addresses three main problems:
- Detecting whether an article is propagandistic or not
- Detecting whether an article phrase is propagandistic or not
- Detecting the type of propaganda in a propagandistic article phrase
For the code, refer to the links below:
Task1 :-
INPUT SCRIPTS: Task1_ F1_Score
OUTPUT SCRIPTS: Task1_Test_Prediction
Task2 :-
INPUT SCRIPTS: Task2_ F1_Score
OUTPUT SCRIPTS: Task2_Test_Prediction
Task3 :-
INPUT SCRIPTS: Task3_ F1_Score