Business Understanding:- Propaganda is a form of communication used to manipulate or influence the opinion of groups in order to promote a particular cause or belief. Over the centuries, propaganda has appeared in the form of artwork, films, speeches, and music, though it is not limited to these forms of communication.
The first major task is to classify each document into one of two categories: Propaganda or Non-Propaganda. To do this we first have to clean and understand the text; it is very important to understand the text documents we are working with.
The second and final task is to identify, within each document, the particular sentences and label each one as propaganda or not.
Data Understanding:-
Each row of the file provides a piece of news text followed by its corresponding news number and its type (Propaganda or Non-Propaganda). The data was provided in a text format delimited by spaces and tabs, so we had to find a way to split it and build a data frame with the following columns (a parsing sketch follows the list below).
- News_Text
- News_Number
- News_Type
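As a minimal sketch of this parsing step, the data frame could be built with pandas; the file name train.txt, the tab delimiter, and the exact column order are assumptions for illustration, not confirmed details of the provided data.

    import csv

    import pandas as pd

    # Assumed: merged training file is tab-delimited with text, number, label.
    df = pd.read_csv(
        "train.txt",
        sep="\t",
        header=None,
        names=["News_Text", "News_Number", "News_Type"],
        quoting=csv.QUOTE_NONE,   # treat quote characters as ordinary text
        on_bad_lines="skip",      # skip rows that do not split into three fields
    )

    # Rows with a missing news number or label need an explicit decision;
    # the simplest option is to drop them.
    df = df.dropna(subset=["News_Number", "News_Type"])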
The data contains a lot of filler tokens that had to be removed, as well as some rows where the news number and type were missing. Important decisions had to be made about how to handle this missing data.
Task 2:- The data was spread across separate text and label files. The procedure was to merge all the files in sequence. We used the Windows command prompt to perform the merge with the following command.
'for %f in (*.labels) do type "%f" >> c:\Test\output_file.txt' This command collects all the .labels files and appends them into a single output file; the same pattern works for the text files. It has to be run from the directory where the files are located.
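For reference, the same merge could also be done in a cross-platform way with a short Python script; the output file name here is an assumption.

    from pathlib import Path

    # Append every .labels file in the current directory into one output file.
    with open("output_file.txt", "w", encoding="utf-8") as out:
        for path in sorted(Path(".").glob("*.labels")):
            out.write(path.read_text(encoding="utf-8"))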
Exploring the training data after filtering for type == 'Propaganda', certain words clearly dominate: 'american', 'nation', 'country', and 'people' occur frequently, along with 'one' and 'said'. The image displayed above is free from tense variations and stop words; these were handled with the nltk package using stemming and stop-word filtration.
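A rough sketch of how such a frequency count could be produced, assuming the data frame df from the earlier parsing sketch and assuming the label value is stored as the lowercase string 'propaganda':

    from collections import Counter

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")
    nltk.download("stopwords")

    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    # Join the text of the rows labelled as propaganda and count the stemmed,
    # stop-word-free tokens.
    propaganda_text = " ".join(df.loc[df["News_Type"] == "propaganda", "News_Text"])
    tokens = [
        stemmer.stem(t.lower())
        for t in word_tokenize(propaganda_text)
        if t.isalpha() and t.lower() not in stop_words
    ]
    print(Counter(tokens).most_common(20))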
Data Preparation:- As discussed in the previous step, the data has to be cleaned. To clean it, we removed the filler words using NLTK stop-word filtration and then tokenized the text using word_tokenize from the nltk package.
The next important step was to lemmatize/stem the data to remove tense and normalize the words. Once the data was prepared, we applied CountVectorizer, which converts a collection of text documents into a matrix of token counts; this implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix. A sketch of the whole pipeline follows below.
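A minimal sketch of the cleaning and vectorization pipeline, assuming the df data frame and column names from the earlier sketches; the lemmatizer choice and the Clean_Text column name are illustrative.

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from sklearn.feature_extraction.text import CountVectorizer

    nltk.download("punkt")
    nltk.download("stopwords")
    nltk.download("wordnet")

    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))

    def clean(text):
        # Tokenize, lowercase, drop stop words, and lemmatize to normalize tense.
        tokens = word_tokenize(text.lower())
        return " ".join(
            lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words
        )

    df["Clean_Text"] = df["News_Text"].apply(clean)

    # CountVectorizer turns the cleaned documents into a sparse token-count
    # matrix (scipy.sparse.csr_matrix).
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df["Clean_Text"])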
The third and final step is to prepare the data for fitting the various machine learning models: the data had to be separated into train and test sets, which are then fed into the models.
We can use stratified sampling to handle the highly imbalanced training data set and avoid the model being biased towards the predominant Non-Propaganda class (see the split sketch below).
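A minimal sketch of the stratified split, assuming the count matrix X from the previous sketch; the 80/20 split ratio is an illustrative choice.

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder

    # Encode the labels (non-propaganda / propaganda) as integers.
    le = LabelEncoder()
    y = le.fit_transform(df["News_Type"])

    # stratify=y keeps the class ratio identical in the train and test splits,
    # which matters when the target is highly imbalanced.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )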
Modeling:- Various classification models were built to classify the data. There were promising results from the Passive Aggressive Classifier, Support Vector Machine, Neural Network (MLP), Logistic Regression, and XGBoost; the best and most optimal algorithm among these was XGBoost. Even though it was time-consuming, the results were promising.
XGBoost can handle the imbalanced dataset through the scale_pos_weight parameter. We can also apply threshold probability throttling to increase sensitivity while sacrificing a reasonable amount of specificity (see the sketch below).
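A minimal sketch of both ideas, assuming the split from the previous step and assuming that class 1 corresponds to the propaganda label after encoding; the 0.35 threshold is purely illustrative and would be tuned on a validation set.

    import numpy as np
    from xgboost import XGBClassifier

    # scale_pos_weight is commonly set to (negative count / positive count) so
    # the minority class carries more weight during training.
    pos = int(np.sum(y_train == 1))
    neg = int(np.sum(y_train == 0))

    model = XGBClassifier(scale_pos_weight=neg / pos, eval_metric="logloss")
    model.fit(X_train, y_train)

    # Threshold throttling: lower the decision threshold below the default 0.5
    # to trade some specificity for higher sensitivity on the minority class.
    threshold = 0.35  # illustrative value
    proba = model.predict_proba(X_test)[:, 1]
    y_pred = (proba >= threshold).astype(int)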
Evaluation:- This process is somewhat tricky for the provided training data set: the data was highly imbalanced, i.e. the dependent variable had imbalanced classes, so the accuracy score would not be an appropriate metric. We had to rely on the F1 score.
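A short sketch of the F1-based evaluation, reusing y_test, y_pred, and the label encoder from the earlier sketches:

    from sklearn.metrics import classification_report, f1_score

    # F1 balances precision and recall, so it is a more honest yardstick than
    # accuracy when one class dominates the data.
    print("F1 (propaganda class):", f1_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, target_names=le.classes_))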
List of External Libraries Used.
- sklearn.feature_extraction.text (to perform CountVectorizer)
- sklearn.model_selection (to perform the train/test split)
- sklearn.preprocessing (to perform label encoding)
- sklearn.linear_model (PassiveAggressiveClassifier)
- xgboost (XGBoost)
- sklearn.model_selection GridSearchCV (to fine-tune the models)
- sklearn.neural_network (MLP approach)
- sklearn.ensemble (bagging and tree ensembles)
- nltk.stem (to perform stemming)
- nltk.tokenize (to perform word_tokenize)
The Test Scores:-
XGB - XGBoost
NB - Multinomial Naive Bayes
PAC - PassiveAggressiveClassifier
MLP - Multi-Layer Perceptron (neural network)
OUTPUT FILES:-
Task 1 output:- task1output
Task 2 output:- task2output
Python code for text cleaning and building models:-
Task 3 output:- task_3 Pred