Datathons Solutions


We are back to participate in another Datathon hosted by the Data Science Society. This time the theme is Text Analytics.
We will not be able to devote ourselves completely to the cause this time because our exams start next week. But we will try to keep the article as simple and detailed as possible so that it is helpful to any new data science enthusiast looking for a little help. So let's roll.


They Shoot Horses, Don’t They?


Propaganda is the new weapon which influences people’s opinions or beliefs about a certain ideology, whether that ideology is right or wrong.

While propaganda influences the behavior of individuals, it is important to bear in mind that it is only one of the means by which man’s behavior is influenced. There are other forms of inducement employed in winning assent or compliance. In limited or wholesale degree, depending upon the political organization of a given country, men have used force or violence to control people. They have resorted to boycott, bribery, passive resistance, and other techniques. Bribes, bullets, and bread have been called symbols of some of the actions that men have taken to force people into particular patterns of behavior.

This time the case focuses on detecting the use of propaganda in news articles. The purpose of this Datathon is to develop intelligent systems, using the training data sets provided, that can classify entire articles as well as text fragments as propagandistic or not. Two training sets of news articles written in English are provided. One is annotated at the article level as propagandistic or not. The other is annotated at the fragment level with one of 18 propaganda techniques. We have been assigned three levels. So enough talking; let's move on to the first one, shall we?


Given a news article, we are required to build an intelligent system that can detect whether the article is propagandistic or not.

For starters, we will try to understand the granularity of the data. Granularity refers to the level of detail that each row of the data represents. Here, every row represents one article, which has a unique Article ID and a label stating whether it is propaganda or not.
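A quick way to confirm this kind of granularity is to check that the identifier column never repeats. The sketch below uses a made-up three-row frame; the column name `article_id` and the sample rows are assumptions for illustration, not the Datathon data.

```python
import pandas as pd

# Hypothetical sample frame; column names and rows are illustrative only
articles = pd.DataFrame({
    "article_id": [101, 102, 103],
    "Text": ["first article", "second article", "third article"],
    "Labels": ["non-propaganda", "propaganda", "non-propaganda"],
})

# If each row is one article, every ID appears exactly once
rows_per_article = articles.groupby("article_id").size()
one_row_per_article = articles["article_id"].is_unique

print(rows_per_article)
print("One row per article:", one_row_per_article)
```

If the check came back false, the rows would be at a finer level than the article (for example, one row per paragraph) and would need aggregating first.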

We perform a couple of EDA steps to understand the patterns in the data. Some key points to note are:

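The label class is highly imbalanced, with non-propaganda making up 88.82% of the values. A model that always predicted "non-propaganda" would therefore already reach 88.82% accuracy without detecting a single propagandistic article. Hence we should not consider only the accuracy of the model but should also look at its sensitivity (recall) and specificity. Check the image below for a better understanding of those terms.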
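To make the point about accuracy concrete, here is a minimal sketch in plain Python. The ten toy labels are made up (1 = propaganda, 0 = non-propaganda) and the helper names are hypothetical; the "model" simply predicts the majority class every time.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Share of actual positives that were detected (recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Share of actual negatives that were correctly left alone."""
    return tn / (tn + fp) if (tn + fp) else 0.0


# Toy data: 9 non-propaganda articles and 1 propaganda article
y_true = [0] * 9 + [1]
y_pred = [0] * 10  # always predict the majority class

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)

print("accuracy:", accuracy)
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
```

The accuracy looks respectable while the sensitivity shows that no propaganda was caught, which is why all three numbers are worth reporting on imbalanced data.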

There are a lot of stop words such as "for", "in", "he" and "she" that are not really useful in our text analysis. Hence we remove them before applying count vectorization.

Frequency of Stop Words
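A stop-word frequency count like the one charted above can be sketched with the standard library alone. The small `STOP_WORDS` set and the two sample sentences are assumptions for illustration; a real run would use a full stop-word list and the article text.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption, not a complete list)
STOP_WORDS = {"for", "in", "he", "she", "the", "a", "of", "and", "to", "is"}


def stop_word_frequency(texts):
    """Count how often each stop word occurs across a collection of texts."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in STOP_WORDS:
                counts[token] += 1
    return counts


# Made-up sample sentences
sample_texts = [
    "He said the vote is rigged",
    "She voted for the bill in the senate",
]

freq = stop_word_frequency(sample_texts)
for word, count in freq.most_common():
    print(word, count)
```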

We have missing values to deal with in each column. Let's find them and then drop the affected rows, as there is no sensible value to replace them with.
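Finding and dropping missing values might look roughly like this in pandas. The frame `news` and its rows are hypothetical stand-ins for the raw data; `news_new` is the name the later snippets use for the cleaned frame.

```python
import pandas as pd

# Hypothetical raw frame with a few gaps (illustrative rows only)
news = pd.DataFrame({
    "Article ID": [1, 2, 3, 4],
    "Text": ["some text", None, "more text", "other text"],
    "Labels": ["propaganda", "non-propaganda", None, "non-propaganda"],
})

# Number of missing values in each column
missing_per_column = news.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
news_new = news.dropna().reset_index(drop=True)
print(len(news), "rows before,", len(news_new), "rows after")
```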



After this little pre-processing we will dive straight into machine learning, but first we still have to create the train and test sets.
We use 80% of the data for training and 20% for testing.

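from sklearn.model_selection import train_test_split

x = news_new.drop('Labels', axis=1)  # all columns except the label
y = news_new['Labels']

# 80% train / 20% test, split on the raw article text
train_x, test_x, train_y, test_y = train_test_split(x['Text'], y, test_size=0.2, random_state=53)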

We will use the count vectorization technique to create a sparse matrix out of the unstructured text.

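from sklearn.feature_extraction.text import CountVectorizer

count_vectorisation = CountVectorizer(stop_words="english")

# Learn the vocabulary on the training text only, then reuse it for the test text
count_train = count_vectorisation.fit_transform(train_x)
count_test = count_vectorisation.transform(test_x)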

Count vectorization gives us a sparse matrix which can be used to fit a model.
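To see what that sparse matrix contains, here is a small self-contained sketch on made-up sentences (the documents and variable names are illustrative assumptions). Each row is a document, each column a vocabulary word, and each cell a word count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for article text
docs_train = ["the enemy lies", "the vote was fair"]
docs_test = ["the enemy vote"]

vec = CountVectorizer(stop_words="english")

# Vocabulary is learned from the training documents only
X_train = vec.fit_transform(docs_train)
# The test documents are mapped onto that same vocabulary
X_test = vec.transform(docs_test)

print(sorted(vec.vocabulary_))   # learned words, stop words excluded
print(X_train.toarray())         # dense view, for inspection only
print(X_train.shape, X_test.shape)
```

Because the test matrix reuses the training vocabulary, both matrices have the same number of columns, which is what a fitted model expects.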

We have used Multinomial Naive Bayes, Passive Aggressive, XGBoost, K Nearest Neighbors, MLP classifier and Gradient Boosting.




Multinomial Naive Bayes gives us the best result, with 92.6% accuracy.
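For readers who want to try this, a minimal sketch of fitting and scoring Multinomial Naive Bayes on count vectors is shown below. The toy sentences and labels (1 = propaganda, 0 = non-propaganda) are invented for illustration, default hyperparameters are assumed, and the scores it prints say nothing about the Datathon data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, recall_score
from sklearn.naive_bayes import MultinomialNB

# Invented toy data: 1 = propaganda, 0 = non-propaganda
train_text = [
    "traitors destroy our glorious nation",
    "enemy traitors threaten glorious nation",
    "council approves new budget",
    "council schedules budget meeting",
]
train_labels = [1, 1, 0, 0]
test_text = ["traitors threaten nation", "budget meeting"]
test_labels = [1, 0]

# Turn text into word counts, fitting the vocabulary on training text only
vectorizer = CountVectorizer(stop_words="english")
count_train = vectorizer.fit_transform(train_text)
count_test = vectorizer.transform(test_text)

# Fit the classifier and predict on the held-out text
model = MultinomialNB()
model.fit(count_train, train_labels)
pred = model.predict(count_test)

print("accuracy:", accuracy_score(test_labels, pred))
print("recall (sensitivity):", recall_score(test_labels, pred))
```

Reporting recall next to accuracy keeps the class imbalance in view when comparing models.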
It should be noted that we could also use TF-IDF or embeddings such as Google's word2vec to enhance the performance. Unfortunately, we are not working on systems that can cope with them.
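As a rough idea of what the TF-IDF variant could look like, scikit-learn's `TfidfVectorizer` can be used the same way as a count vectorizer. This is only a sketch on invented sentences, not something we ran for the Datathon.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sentences for illustration
train_text = [
    "traitors destroy our glorious nation",
    "council approves new budget",
    "council schedules budget meeting",
]
test_text = ["glorious council budget"]

tfidf = TfidfVectorizer(stop_words="english")
tfidf_train = tfidf.fit_transform(train_text)  # learn vocabulary and IDF weights
tfidf_test = tfidf.transform(test_text)        # reuse them for unseen text

# Words that occur in fewer documents receive a higher IDF weight
idf_by_term = {term: tfidf.idf_[index] for term, index in tfidf.vocabulary_.items()}
for term in sorted(idf_by_term, key=idf_by_term.get):
    print(term, round(idf_by_term[term], 3))
```

The resulting matrices can be passed to the same classifiers as the count matrices.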

Thank you.


