Abstract¶
This notebook classifies news articles into two classes: propaganda and non-propaganda. Three types of models are tried: a Naive Bayes Classifier, a Linear Support Vector Classifier and a Recurrent Neural Network. The Linear SVC shows the best results. The neural network comes close, while the Naive Bayes Classifier predicts only one class.
The following packages have been used:
- numpy
- matplotlib.pyplot
- pandas
- sklearn
- keras
Table of contents¶
- Preprocessing and EDA
- Naive Bayes Classifier
- Linear Support Vector Classifier
- Neural Network
- Conclusion
- References
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn.feature_extraction import text
from sklearn.utils import shuffle
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from collections import Counter
Preprocessing and EDA¶
We read the text file task1_train.txt and store it as data_train, then preview the top 5 rows. Afterwards we label the columns Article, Code and Type.
data_train = pd.read_table('task1_train.txt' , header = None)
data_train.head()
data_train.columns = ['Article', 'Code', 'Type']
data_train.head()
data_train['Article'].count()
The total number of articles is 35986.
length = []
for i in range(data_train['Article'].count()):
    length.append(len(data_train['Article'][i]))
We store the length of each article in a list. The minimum length is 45 characters, the maximum is 100000 and the average is about 3556 characters per article.
max(length)
min(length)
round(sum(length)/len(length),0)
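As a cross-check, the same statistics can be computed directly with pandas string methods; this is just a sketch assuming data_train is loaded as above.
# Character length of every article, computed with pandas instead of a loop
article_lengths = data_train['Article'].str.len()
print(article_lengths.min(), article_lengths.max(), round(article_lengths.mean()))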
All the cells of the table are filled with data, none are empty.
data_train.isnull().sum()
Now we explore how many of the articles are propaganda and how many are not. 4021 of the articles are of the propaganda type, while 31965 are non-propaganda.
data_train[data_train['Type'] == 'propaganda'].count()
data_train[data_train['Type'] == 'non-propaganda'].count()
89% of the articles are marked as non-propaganda, which means that our dataset is highly imbalanced. The pie chart below visualizes the split.
((data_train['Article'][data_train['Type'] == 'non-propaganda']).count())/((data_train['Article']).count())
plt.pie([(data_train['Article'][data_train['Type'] == 'non-propaganda']).count(),(data_train['Article'][data_train['Type'] == 'propaganda']).count()], labels = ['Non-propaganda', 'Propaganda'])
plt.title('Propaganda vs Non-Propaganda Articles')
plt.show()
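The class counts and proportions can also be obtained in one step with value_counts; a small sketch assuming data_train is already loaded.
# Counts and relative frequencies of the two classes
print(data_train['Type'].value_counts())
print(data_train['Type'].value_counts(normalize=True))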
Now we read the test set and store it as data_test, using the same column names as before. The number of articles here is 10153.
data_test = pd.read_table('task1_test.txt' , header = None)
We add the column names to data_test. Unfortunately the Type column is empty: the labels are not provided publicly for this particular Datathon.
data_test.columns = ['Article', 'Code', 'Type']
data_test.head()
len(data_test)
The number of rows in data_test is 10153. The unique values of the Type column show that all the rows are marked with '?'.
data_test['Type'].unique()
Pandas does not read all the rows: five rows that are present in the dataset are missing from the dataframe. We will build a new dataframe called data_test5 and append it to data_test. We use 'This is an article' as placeholder text; since 5 out of 10153 is less than 0.5% of the articles, this has a minimal effect on the overall result.
data_test5 = pd.DataFrame([['This is an article', 177241, '?'],
                           ['This is an article1', 177240, '?'],
                           ['This is an article2', 112831, '?'],
                           ['This is an article3', 177242, '?'],
                           ['This is an article4', 112824, '?']], columns = ['Article', 'Code', 'Type'])
data_test5
Now we append data_test5 to the data_test set, which grows by five rows.
# DataFrame.append has been removed in newer pandas, so we use pd.concat instead
data_test = pd.concat([data_test, data_test5], ignore_index=True)
len(data_test)
Naive Bayes Classifier¶
We shuffle the training data and store the 'Article' column in X_train and the 'Type' column in y_train. Finally we store the 'Article' column of the data_test dataframe in X_test.
X_train, y_train= shuffle(data_train['Article'], data_train['Type'])
X_test = data_test['Article']
Now we create a pipeline object for the MultinomialNB classifier. We will use the TfIdf vectorizer to encode the words in the Article column to a numeric matrix form. Afterwards we fit the model with the training data.
# Naïve Bayes:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
('clf', MultinomialNB()),
])
text_clf_nb.fit(X_train, y_train)
Then we generate the predictions on the X_test data. Printing the set of predictions shows that all of them are of the 'non-propaganda' type, so this is clearly not a good predictor.
# Form a prediction set
predictions = text_clf_nb.predict(X_test)
print(predictions)
print(set(predictions))
len(predictions)
Now we add the predictions to the data_test in a new column 'Predicted_type'. We can see the first 5 rows below. Later we remove the Type column.
data_test['Predicted_type'] = predictions
data_test.head()
data_test = data_test.drop(['Type'], axis = 1)
data_test.head()
Finally we export the data to a new test_predicted text file containing only the Code and Predicted_type columns, using a tab separator and removing the index.
data_test.to_csv('test_predicted.txt', header=None, columns = ['Code', 'Predicted_type'], index = False, sep = '\t')
We would like to evaluate the model, so we split the training data with train_test_split using a 30% test ratio.
X_train1, X_test1, y_train1, y_test1 = train_test_split(
    X_train, y_train, test_size=0.30, random_state=42)
text_clf_nb.fit(X_train1, y_train1)
# Form a prediction set
predictions1 = text_clf_nb.predict(X_test1)
We print the confusion matrix, which confirms that only one class is being predicted. Below it we print the classification report: the results for the non-propaganda type are good and the overall averages are somewhat lower but not terrible, while the results for the propaganda type are very poor. Finally the accuracy score shows 89%, which simply reflects the highly skewed class distribution. To sum things up, the Naive Bayes Classifier performs poorly.
print(metrics.confusion_matrix(y_test1,predictions1))
# Print a classification report
print(metrics.classification_report(y_test1,predictions1))
# Print the overall accuracy
print(metrics.accuracy_score(y_test1,predictions1))
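The 89% accuracy is essentially the majority-class baseline. As a sanity check, a DummyClassifier that always predicts the most frequent class should score about the same on this split; this is only a sketch reusing X_train1, y_train1, X_test1 and y_test1 from above.
from sklearn.dummy import DummyClassifier

# Baseline that always predicts the majority class ('non-propaganda')
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train1, y_train1)
print(metrics.accuracy_score(y_test1, baseline.predict(X_test1)))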
Linear Support Vector Classifier¶
Now we will try a new model. The steps are pretty much the same, so they will not be discussed as thoroughly as for the last classifier; we focus on the results and use different variable names.
# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),
('clf', LinearSVC()),
])
text_clf_lsvc.fit(X_train, y_train)
# Form a prediction set
predictions2 = text_clf_lsvc.predict(X_test)
data_test1 = data_test.copy()  # copy so that data_test itself is not modified
data_test1['Predicted_type'] = predictions2
data_test1.head()
This time we have predictions from both classes.
data_test1['Predicted_type'].unique()
len(data_test1)
Again we export the data into the test_predicted1 text file.
data_test1.to_csv('test_predicted1.txt', header=None, columns = ['Code', 'Predicted_type'], index = False, sep = '\t')
text_clf_lsvc.fit(X_train1, y_train1)
# Form a prediction set
predictions3 = text_clf_lsvc.predict(X_test1)
According to the confusion matrix the numbers of true positives and true negatives are relatively high. The F1 score for the non-propaganda articles is very good; for the propaganda articles it is worse, but significantly better than with the Naive Bayes Classifier. The accuracy score of 96% is also better. This model is substantially better than the Naive Bayes Classifier.
print(metrics.confusion_matrix(y_test1,predictions3))
# Print a classification report
print(metrics.classification_report(y_test1,predictions3))
# Print the overall accuracy
print(metrics.accuracy_score(y_test1,predictions3))
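Because the classes are imbalanced, it may also be worth trying class_weight='balanced' in the LinearSVC, which usually trades a little overall accuracy for better recall on the minority (propaganda) class. The pipeline name text_clf_lsvc_w below is a hypothetical variant of the one used above, evaluated on the same split.
# Same pipeline as before, but with class weights inversely proportional to class frequencies
text_clf_lsvc_w = Pipeline([('tfidf', TfidfVectorizer()),
                            ('clf', LinearSVC(class_weight='balanced')),
                            ])
text_clf_lsvc_w.fit(X_train1, y_train1)
print(metrics.classification_report(y_test1, text_clf_lsvc_w.predict(X_test1)))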
Now we apply a similar approach as for the data_test dataset to a new file named dev. We will use the Linear SVC to generate the predictions. Everything is analogous.
dev = pd.read_table('task1_dev.txt' , header = None, encoding = 'utf-8')
dev.head()
dev.columns = ['Article', 'Code', 'Type']
dev.head()
len(dev)
print(type(dev['Code']))
dev[dev['Code'] == 219277]
#We need to add 2 new rows.
dev1 = pd.DataFrame([['This is an article', 219277, '?'], ['This is an article1', 219318, '?']], columns = ['Article', 'Code', 'Type'])
dev1
dev = pd.concat([dev, dev1], ignore_index=True)
X_test2 = dev['Article']
predictions4 = text_clf_lsvc.predict(X_test2)
dev['Generated_Type'] = predictions4
dev.head()
len(dev)
dev.to_csv('test_predicted2.txt', header=None, columns = ['Code', 'Generated_Type'], index = False, sep = '\t')
Now we will use cross validation with 5 folds to get a better picture of the model's performance.
accuracies = cross_val_score(text_clf_lsvc, X_train, y_train, cv=5)
All the folds have very similar accuracies, so overfitting is unlikely. The situation is similar with the F1 scores.
print(accuracies)
scores = cross_val_score(text_clf_lsvc,X_train, y_train, cv=5, scoring='f1_macro')
print(scores)
# no overfitting
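To summarise the folds in a single figure, the mean and standard deviation of the cross-validation scores can be printed as well; a small addition using the arrays computed above.
# Mean and spread of the cross-validation results
print('Accuracy: {:.3f} +/- {:.3f}'.format(accuracies.mean(), accuracies.std()))
print('F1 (macro): {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))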
Now we inspect the steps of the fitted pipeline. We extract the TF-IDF vectorizer and the classifier to get the coefficients of the individual words together with their names.
svc = text_clf_lsvc.steps[1][1]
tfidf = text_clf_lsvc.steps[0][1]
svc.coef_[0]
w2score = { w: score for w, score in zip(tfidf.get_feature_names(), svc.coef_[0])}
w2score = Counter(w2score)
We get the 50 words with the lowest weights and the 50 words with the highest weights for classifying an article as propaganda. They are printed below, lowest first and highest second.
w2score.most_common(len(w2score))[-50:]
w2score.most_common(len(w2score))[:50]
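Note that on scikit-learn 1.2 and later get_feature_names() has been removed in favour of get_feature_names_out(); a small compatibility sketch for newer versions:
# On newer scikit-learn the vocabulary is exposed via get_feature_names_out()
w2score_new = Counter(dict(zip(tfidf.get_feature_names_out(), svc.coef_[0])))
print(w2score_new.most_common(10))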
We will also try to remove the most common English words (stop words) to see if our model improves further. The words can be seen below.
print(text.ENGLISH_STOP_WORDS)
X_train1, X_test1, y_train1, y_test1 = train_test_split(
    X_train, y_train, test_size=0.30, random_state=42)
# RUN THIS CELL TO ADD STOPWORDS TO THE LINEAR SVC PIPELINE:
text_clf_lsvc2 = Pipeline([('tfidf', TfidfVectorizer(stop_words=text.ENGLISH_STOP_WORDS)),
('clf', LinearSVC()),
])
text_clf_lsvc2.fit(X_train1, y_train1)
predictions5 = text_clf_lsvc2.predict(X_test1)
Removing stop words does not have a big impact.
print(metrics.confusion_matrix(y_test1,predictions5))
print(metrics.classification_report(y_test1,predictions5))
print(metrics.accuracy_score(y_test1,predictions5))
Lastly we will try to predict the type of some random articles copied from the internet. Two of the articles are classified as non-propaganda and the third one as propaganda.
test_text1 = """The White House sent a statement notifying reporters that the bill had been signed by the president. No press was present for the bill signing, which ended the 35-day partial government shutdown – the longest in history.
Earlier in the evening, House Speaker Nancy Pelosi signed the bill with eight pens in a public ceremony and gave the pens as souvenirs to her House Democratic colleagues.
US Government shutdown: auto, airline industries feeling squeeze amid standoff
The hastily passed bill will fully fund the government for three weeks and provide retroactive pay for federal government employees.
On Twitter, the president defended his decision to back down from his demands for border wall funding before he would sign a bill to fund the government.
This was in no way a concession, he wrote. It was taking care of millions of people who were getting badly hurt by the Shutdown with the understanding that in 21 days if no deal is done, it’s off to the races!
Congress has three weeks to come up with a plan to fund border security, as a bi-partisan committee was formed in the Senate on Friday to come up with a compromise.
White House Press Secretary Sarah Sanders also defended the president on Twitter.
In 21 days President Donald Trump is moving forward building the wall with or without the Democrats, she wrote. The only outstanding question is whether the Democrats want something or nothing.
The president warned Congress Friday that if they failed to reach a compromise, he would declare a State of Emergency on the Southern border, allowing him the flexibility to fund border barriers using executive authority.
I think we have a good chance, Trump told reporters at the White House. We’ll work with the Democrats and negotiate and if we can’t do that, then obviously we’ll do the emergency because that’s what it is. It’s a national emergency."""
print(text_clf_lsvc.predict([test_text1]))
test_text2 = """What about Obamas bonmbing of Lybea, fucking impossible task to stop the country from collapsing. They called them
babies"""
print(text_clf_lsvc.predict([test_text2]))
test_text3 = """ History stopped in 1936 – after that, there was only propaganda. So said George Orwell of an era when the multiple miseries of the Great Depression were compounded by the ruthless media strategies of Hitler and Stalin
An international collection of propaganda posters from before and during the second world war.
An international collection of propaganda posters from before and during the second world war. Composite: UIG/VGC via Getty Images
Truth was the first casualty of the Great Depression. Reflecting the anguish of the time, propaganda was manufactured on an unprecedented scale. As economic disaster threatened to trigger shooting wars so,
as George Orwell said, useful lies were preferred to harmful truths. He went further, declaring that history stopped in 1936; after that there was only propaganda.
This was a characteristic exaggeration but it points to the universality of state deception. The very term Depression aimed to mislead: President Hoover employed it as a euphemism for the standard American word for financial crisis,
“Panic”. Hence the poet WH Auden’s verdict that this was a “low dishonest decade”, a conclusion he reached in a New York dive on 1 September 1939 while attempting to “undo the folded lie … the lie of Authority.”
It was the end of a decade in which, as Auden wrote elsewhere: “We have seen a myriad faces / ecstatic from one lie.”
Of course, to lie is human, and official mendacity had been practised throughout the ages. But it was developed intensively during the first world war, notably under the direction of Lord Northcliffe,
founder of the popular press in Britain and portrayed in Germany as “the father of lies”.
Particularly effective were his attacks on the Kaiser, who was portrayed (in a leaflet dropped behind German lines) as marching with his six sons, all in full military regalia, past a host of outstretched skeletal arms,
the caption reading: “One family which has not lost a single member. """
print(text_clf_lsvc.predict([test_text3]))
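To get an idea of how confident the model is about these three articles, we can also look at the signed distance to the decision boundary; in scikit-learn's binary case positive values correspond to the second class in classes_, here 'propaganda'. A small sketch reusing the texts above.
# Signed distance from the separating hyperplane for each pasted article
for t in [test_text1, test_text2, test_text3]:
    print(text_clf_lsvc.decision_function([t]))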
Neural Network¶
We will try to implement a neural network with Keras. This was done in a different notebook, so the data is imported again. There will be less discussion of the steps.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.optimizers import RMSprop
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping
data = pd.read_table('task1_train.txt' , header = None)
data.columns = ['Article', 'Code', 'Type']
We use the LabelEncoder to encode the Type column as integer labels.
le = LabelEncoder()
y = data['Type']
y = le.fit_transform(y)
y = y.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(
    data['Article'], y, test_size=0.30, random_state=42)
max_words = 10000
max_len = 150
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(X_train)
sequences = tok.texts_to_sequences(X_train)
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)
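Only the 10000 most frequent words are kept and every article is truncated or padded to 150 tokens, so each article becomes a fixed-length vector of word indices; a quick check of the resulting shape:
# Should print (number_of_training_articles, 150)
print(sequences_matrix.shape)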
The function RNN is defined to create a recurrent neural network architecture. We can see the layers and activation functions in the summary.
def RNN():
    inputs = Input(name='inputs', shape=[max_len])
    layer = Embedding(max_words, 50, input_length=max_len)(inputs)
    layer = LSTM(64)(layer)
    layer = Dense(256, name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.5)(layer)
    layer = Dense(1, name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs, outputs=layer)
    return model
model = RNN()
model.summary()
model.compile(loss='binary_crossentropy',optimizer=RMSprop(),metrics=['accuracy'])
We will train for 10 epochs, with early stopping on the validation loss. The accuracy increases rapidly during the first epoch; later we see smaller increases.
model.fit(sequences_matrix,y_train,batch_size=128,epochs=10,
validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)])
test_sequences = tok.texts_to_sequences(X_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)
accr = model.evaluate(test_sequences_matrix,y_test)
The accuracy is lower than that of the Linear SVC. Increasing the number of epochs or changing the architecture could probably give us some more accuracy.
print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(accr[0],accr[1]))
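Accuracy alone hides the per-class behaviour, so it may be worth thresholding the sigmoid outputs at 0.5 and printing a classification report as well; this is only a sketch reusing model, test_sequences_matrix, y_test and le from above.
from sklearn import metrics

# Convert the sigmoid probabilities to hard 0/1 labels and report per-class scores
y_pred = (model.predict(test_sequences_matrix) > 0.5).astype(int).ravel()
print(metrics.classification_report(y_test.ravel(), y_pred, target_names=le.classes_))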
Conclusion¶
Comparing the 3 models, the Linear SVC performs best, closely followed by the RNN, while the Naive Bayes Classifier scores poorly. For further development, other classification models such as logistic regression, a decision tree classifier or a convolutional neural network could be tried.
References¶
- https://www.datasciencesociety.net/
- https://stackoverflow.com/
- https://www.kaggle.com/
- Applied Text Analysis with Python, O'Reilly Media, 2018