
Datathon – HackNews – Solution – data_monks

The word propaganda designates any attempt to influence the opinions or actions of others to some predetermined end by appealing to their emotions or prejudices or by distorting the facts. We are fooled by propaganda chiefly because it appeals to our emotions rather than to our reason: it makes us believe and do things we would not otherwise believe or do. And because it appeals to our emotions, we often do not recognize it when we see it.
Here, we are trying to create a tool that can identify propagandistic articles using predictive analytics.


Hack the News Challenge

 Data Science Life Cycle

  1. Business Problem Formulation

The current political landscape is shaped by extreme polarization of opinions and by the proliferation of fake news. For example, a recent study published in Science found that rumours and fake news tend to spread six times faster than truthful information. This situation both damages the reputation of respectable news outlets and undermines the very foundations of democracy, which needs a free and reliable press to thrive. Therefore, it is in the interest of the public as well as of the news organizations to be able to detect and fight disinformation in all its forms. While most previous work has focused on “fake news”, here we are interested in propaganda.

Propaganda is the deliberate spreading of ideas, facts, or allegations with the aim of advancing one’s cause or of damaging an opposing cause. While it may include falsehoods, this is not really necessary; rather, propaganda can be seen as a form of extreme bias. Yet, propagandistic news articles usually use certain techniques such as Whataboutism, Red Herring, and Name Calling, among many others. Detecting the use of such techniques can help identify potentially propagandistic articles. Here, we are trying to create a tool that can help with this endeavour.

Note that any news article, even one coming from a reputable source, can reflect the unconscious bias of the author, and thus it could possibly be propagandistic.

Our objectives here are:

(i)  to flag the article as a whole

(ii)  to detect the potentially propagandistic sentences in a news article

(iii) to identify the exact type and span of use of propagandistic techniques

Data Understanding

Two sets of documents were provided:

PAL – Propaganda detection at the article level (Task 1), and

PSL/PTR – Propaganda detection at the sentence level (PSL, Task 2) and propaganda type recognition (PTR, Task 3).

Platform to perform data analysis and modelling: Python (Jupyter Notebook), Tableau

Packages used: pandas, numpy, matplotlib, seaborn, sklearn

 Task 1

Propaganda detection at the article level (PAL) is a classical supervised document classification problem. We are given a set of news articles, and we classify each article into one of two possible classes: “propagandistic article” vs. “non-propagandistic article.”

Data Preprocessing

In Task 1, the training file was imported directly into an Excel file, where the data cleaning was performed.

The cleaned Excel file was then imported into a Jupyter notebook.

The dataset has 35,955 rows and 3 columns. The attributes are: news_text, news_number, news_type.

There were no null values in the data.
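
A minimal sketch of this loading step (the file name used here is our assumption, not one given in the write-up):

```python
import pandas as pd

# Load the cleaned Excel file produced in the preprocessing step.
# The file name is an assumption; the write-up does not give it.
df = pd.read_excel("task1_cleaned.xlsx")

print(df.shape)                 # expected: (35955, 3)
print(df.columns.tolist())      # ['news_text', 'news_number', 'news_type']
print(df.isnull().sum())        # confirm there are no null values
```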

Vectorization

Text data requires special preparation before we start using it for predictive modelling.

The text must first be parsed into words, a step called tokenization. Then the words need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization).

For our analysis, the vectorizers used are:

  1. Count vectorizer – The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.
  2. Tfidf vectorizer – The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.
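
As a quick illustration, here is a minimal sketch of both vectorizers on a toy two-sentence corpus (the example sentences are ours, not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "They want to destroy everything we stand for.",
    "The committee published its annual report on Tuesday.",
]

# CountVectorizer: tokenize, build a vocabulary, encode documents as raw counts.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(corpus)
print(X_counts.shape)  # (2, vocabulary size)

# TfidfVectorizer: same tokenization/vocabulary, but terms weighted by TF-IDF.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(corpus)

# New documents are encoded with the learned vocabulary via transform().
X_new = tfidf_vec.transform(["They published a report."])
```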

Modelling

The data were split into train and validation sets.

For modelling purposes, the algorithms we used are listed below (a minimal fitting sketch follows the list):

1. Naïve Bayes Multinomial Classifier Model

2. Passive Aggressive Classifier Model

3. Decision Tree Classifier Model

4. Random Forest Classifier Model

5. k-NN Classifier Model

6. Logistic Model

7. Support Vector Machine

8. XGBoost Model

9. Multiple Layer Perceptron Classification Model (Feed Forward Neural Network)
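
A minimal fitting sketch for two of these models (the 80/20 split proportions and the positive-label string "propaganda" are our assumptions; the write-up does not state them):

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, f1_score

# df is the cleaned Task 1 dataframe loaded earlier.
X_train, X_val, y_train, y_val = train_test_split(
    df["news_text"], df["news_type"], test_size=0.2, random_state=42)

vec = CountVectorizer()
X_train_vec = vec.fit_transform(X_train)
X_val_vec = vec.transform(X_val)

# Two of the nine listed models; the others follow the same fit/predict pattern.
for model in (MultinomialNB(), PassiveAggressiveClassifier(max_iter=1000)):
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_val_vec)
    print(type(model).__name__,
          "accuracy:", accuracy_score(y_val, preds),
          "F1:", f1_score(y_val, preds, pos_label="propaganda"))
```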

Method 1 – Using the Count vectorizer and applying the models listed above

Method 2 – Using the Tfidf vectorizer and applying the same models

Method 3 – Using the Hashing vectorizer and applying the same models (a pipeline sketch covering all three methods follows)
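
Since the three methods differ only in the vectorizer, they can be expressed as interchangeable front-ends of a scikit-learn Pipeline. A sketch, reusing the raw-text splits from the previous snippet (the classifier choice here is illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfVectorizer,
                                             HashingVectorizer)
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import f1_score

# alternate_sign=False keeps hashed features non-negative, so the same
# front-end would also work with MultinomialNB.
vectorizers = {
    "count": CountVectorizer(),
    "tfidf": TfidfVectorizer(),
    "hashing": HashingVectorizer(alternate_sign=False),
}

for name, vec in vectorizers.items():
    pipe = Pipeline([("vec", vec),
                     ("clf", PassiveAggressiveClassifier(max_iter=1000))])
    pipe.fit(X_train, y_train)
    print(name, f1_score(y_val, pipe.predict(X_val), pos_label="propaganda"))
```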

Jupyter Notebook

Datathon-Task 1

Visualization of results using Tableau

[Tableau dashboards, for the Count vectorizer and for Tfidf: per-model charts of Accuracy and F1 Score; Sensitivity and Specificity; and True Positive, True Negative, False Positive, and False Negative counts.]

Important Results

  • For the Count vectorizer, the Decision Tree, Random Forest, and SVM classifiers give high specificity, because they predict most articles as non-propagandistic. The negative class is therefore predicted correctly, but the positive class is not.
  • Balancing sensitivity and specificity, we observe that the MLP classifier performs best, with an accuracy of 96.07 and an F1 score of 83.98 (see the sketch below for how these quantities are computed).
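
For reference, the sensitivity and specificity reported above can be derived from a model's confusion matrix. A minimal sketch, assuming "propaganda" and "non-propaganda" are the label strings and reusing y_val and preds from the earlier snippet:

```python
from sklearn.metrics import confusion_matrix

# Rows/columns follow the order of the `labels` argument: negatives first,
# so the layout is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(
    y_val, preds, labels=["non-propaganda", "propaganda"]).ravel()

sensitivity = tp / (tp + fn)  # true positive rate: propaganda caught
specificity = tn / (tn + fp)  # true negative rate: non-propaganda kept
print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```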

Task 2

Propaganda detection at the sentence level (PSL). This is another classification task, but at a different granularity. The objective is to classify each sentence in a news article as either “sentence that contains propaganda” or “sentence that does not contain propaganda.”

Data Cleaning

The Task 2 folder had three patterns of files: task2 label files, task3 label files, and text files. The data files were separated into different folders, cleaned in Python, and then merged into a single file (a sketch of the merge follows the attribute list).

The new file has 14,264 rows and 5 columns. The attributes of the data are:

Source.Name, article_id, line_number, news_text, news_type
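
A sketch of this merge, assuming each article comes as a plain-text file with one sentence per line plus a matching tab-separated label file, one label row per sentence (the exact file naming below is our assumption):

```python
import glob
import os
import pandas as pd

# Assumed layout: article<id>.txt holds one sentence per line, and
# article<id>.task2.labels is tab-separated with columns
# (article_id, line_number, label), one row per sentence line.
frames = []
for text_path in glob.glob("task2/*.txt"):
    labels = pd.read_csv(text_path.replace(".txt", ".task2.labels"),
                         sep="\t", header=None,
                         names=["article_id", "line_number", "news_type"])
    with open(text_path, encoding="utf-8") as f:
        labels["news_text"] = f.read().splitlines()
    labels["Source.Name"] = os.path.basename(text_path)
    frames.append(labels)

merged = pd.concat(frames, ignore_index=True)
merged = merged[["Source.Name", "article_id", "line_number",
                 "news_text", "news_type"]]
print(merged.shape)  # expected: (14264, 5)
```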

Jupyter Notebook

Datathon-Task 2

Visualization of results using Tableau

[Tableau dashboards, for the Count vectorizer and for Tfidf: per-model charts of Accuracy and F1 Score; Sensitivity and Specificity; and True Positive, True Negative, False Positive, and False Negative counts.]

Important Results

  • From the sensitivity and specificity graphs of the different models, we can see that specificity is quite high for XGBoost, SVM, and Decision Tree. The high accuracy seen here is due to the imbalanced classes. But considering sensitivity as well, we see that the Passive Aggressive and MultinomialNB classifiers work better for predicting the positive class.
  • Using the Count vectorizer, the top three F1 scores are given by the Passive Aggressive classifier, the MultinomialNB classifier, and the MLP classifier.
  • We use the Passive Aggressive classifier to predict on the test data. Its F1 score is 59.6.

Task 3

Propaganda type recognition (PTR) is a task akin to Named Entity Recognition, but applied in the propaganda detection setting. We detected the occurrences of propaganda techniques and assigned the correct technique type to text fragments.
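
For context, the fragment-level annotations can be loaded much like the Task 2 labels. A sketch, assuming a tab-separated layout of (article_id, technique, begin_offset, end_offset) per annotated fragment; this layout is our assumption, as the write-up does not detail the format:

```python
import glob
import pandas as pd

# Assumed format: one tab-separated row per annotated fragment, giving the
# article id, technique name, and character span of the fragment.
frames = [
    pd.read_csv(path, sep="\t", header=None,
                names=["article_id", "technique",
                       "begin_offset", "end_offset"])
    for path in glob.glob("task3/*.task3.labels")
]
spans = pd.concat(frames, ignore_index=True)

# Distribution of technique types (e.g. Name_Calling, Red_Herring,
# Whataboutism) across the training fragments.
print(spans["technique"].value_counts())
```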

Jupyter Notebook

Datathon-Task 3

Important Results

  • Observing the F1 scores from the various classifier models, the RandomForest, Passive Aggressive, and MultinomialNB classifiers achieve the maximum F1 score of 0.46.

Output/Prediction Files

These are the prediction files for Tasks 1, 2, and 3:

Task 1

submission-task1-predictions_test

Task 2

submission-task2-predictions_test

Task 3

example-submission-task3-predictions-test

Doc Files

Here is the zipped file containing Jupyter notebooks

output


6 thoughts on “Datathon – HackNews – Solution – data_monks”


    I have a question about how you modeled Task 3. It is different from Task 1 and Task 2: it is NOT a document classification type of problem, but a sequence tagging problem. You don’t explain how you modified your algorithms to be able to solve Task 3. Out of the box, a Random Forest classifier can’t – correct me if I am wrong – tag sequences.


      Frankly speaking, we don’t have much knowledge of sequence tagging. We followed the same approach as Task 2 and solved it that way; we still don’t know if our approach was correct. Task 3 was totally new for us. Maybe that’s the reason why, even though we were able to upload an output file, we didn’t get a proper score.
