
Datathon-HackNews-Solutions-Data Titans


Team Name : Data Titans

Team Members : M.HEMANTH KUMAR, A.PAVAN SHANKAR, B.MANOHAR, V. LITHIN CHOWDARY,  E.V.S.SAI RAM

PROBLEM STATEMENT :

Hack the News: determine whether a news item is propaganda or non-propaganda.

INTRODUCTION:

Propaganda is a point of view presented in a way that can mislead readers into false assumptions. It has become a significant issue in many sectors because it distorts the facts. This datathon gave us a chance to identify propaganda in news articles.

Problem Understanding:

The problem concerns propaganda in news articles: writers use the news platform to influence people through their writing, whether what they say is right or wrong. Our goal is to build a solution that detects such content.

Technologies and Packages Used:

Python, Excel, pandas, numpy, sklearn, Tableau.

Data Understanding:

When we first received the data, it was a complicated text file that was not in a structured format, although its fields were separated by a comma (,). It contained the articles, the article id, and the propaganda label, all of which had to be structured before they could be analyzed.

Data Preparation :

The raw text file was difficult to analyze directly, so we converted it into a .CSV file with comma (,) separated fields. This gave us structured data with 3 columns that we could analyze.

Data Modeling:

Task 1:

To predict whether an article is propaganda or non-propaganda.

We were given one training data set in text format. With the help of Excel we imported the text data and converted it into a structured format, using tab as the delimiter. This gave us 3 columns: news_number, news_text, and news_type (whether the article is propaganda or not). We performed some Exploratory Data Analysis to identify any null values and removed them.
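The import and null-removal step can also be sketched in pandas. This is only an illustration: the column names come from the text above, but the sample rows, the label strings, and the absence of a header row are assumptions, since the original file is not shown.

```python
import io

import pandas as pd

# Tiny stand-in for the tab-delimited training file (assumed layout, made-up rows).
raw = (
    "1\tThe council approved the budget on Monday.\tnon-propaganda\n"
    "2\tOnly traitors would question our glorious leader!\tpropaganda\n"
    "3\t\tpropaganda\n"
)

columns = ["news_number", "news_text", "news_type"]

# Read the tab-delimited text into a structured table.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=columns)

# Count null values per column, then drop rows with missing text or label.
null_counts = df.isnull().sum()
print(null_counts)
df = df.dropna(subset=["news_text", "news_type"]).reset_index(drop=True)
print(df)
```

With a real file, the `io.StringIO(raw)` argument would be replaced by the file path.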

We used three types of vectorisation methods to prepare the text for the models: Count Vector, Hash Vector, and TfIdf Vector (Term Frequency and Inverse Document Frequency).
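As a rough sketch, the three vectorisation methods map onto scikit-learn's `CountVectorizer`, `HashingVectorizer`, and `TfidfVectorizer`. The sample texts and the `n_features` value below are assumptions chosen for illustration, not the settings used in our notebooks.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Made-up example texts; the real input is the news_text column.
texts = [
    "The council approved the budget on Monday.",
    "Only traitors would question our glorious leader!",
    "The enemy is destroying everything we hold dear.",
]

vectorizers = {
    "count": CountVectorizer(),                   # raw term counts
    "hash": HashingVectorizer(n_features=2**10),  # hashed terms, fixed width
    "tfidf": TfidfVectorizer(),                   # counts reweighted by IDF
}

# Each vectorizer turns the texts into a sparse document-term matrix.
features = {name: vec.fit_transform(texts) for name, vec in vectorizers.items()}

for name, matrix in features.items():
    print(name, matrix.shape)
```

The hashing variant keeps no vocabulary, so its width is fixed by `n_features` rather than by the number of distinct words.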

We split the training data into train and test sets for model building. We first built Machine Learning models for this binary text classification and got good accuracy. After that we tried some Neural Networks.
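A minimal sketch of the split-and-fit step, using the Passive Aggressive Classifier as the example Machine Learning model. The texts, labels, split ratio, and hyperparameters are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up labelled examples standing in for news_text / news_type.
texts = [
    "The council approved the budget on Monday.",
    "Rainfall was above average this spring.",
    "The museum opens a new exhibition next week.",
    "The company reported its quarterly earnings.",
    "Only traitors would question our glorious leader!",
    "The enemy is destroying everything we hold dear.",
    "Wake up, they are lying to you about everything!",
    "Our nation alone stands for truth and justice!",
]
labels = ["non-propaganda"] * 4 + ["propaganda"] * 4

# Hold out a quarter of the data, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Vectorise and classify in one pipeline so the test set never leaks into fitting.
model = make_pipeline(
    TfidfVectorizer(),
    PassiveAggressiveClassifier(max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("accuracy:", accuracy)
```

On a toy set this small the accuracy figure is meaningless; the point is only the order of the steps.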

We used a Multi Layer Perceptron neural network to build the model. It gave us better accuracy than the ML models, with a high F1 score.

The code file below shows the process of importing, exploring, and building models for Task 1.
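For comparison, a Multi Layer Perceptron can be fitted in the same way with scikit-learn's `MLPClassifier` and scored with `f1_score`. The hidden layer size, iteration limit, and sample data here are assumptions, not our actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Made-up labelled examples standing in for news_text / news_type.
texts = [
    "The council approved the budget on Monday.",
    "Rainfall was above average this spring.",
    "The museum opens a new exhibition next week.",
    "The company reported its quarterly earnings.",
    "Only traitors would question our glorious leader!",
    "The enemy is destroying everything we hold dear.",
    "Wake up, they are lying to you about everything!",
    "Our nation alone stands for truth and justice!",
]
labels = ["non-propaganda"] * 4 + ["propaganda"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# One hidden layer of 32 units (an assumed size) on top of TF-IDF features.
mlp = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)

mlp_predictions = mlp.predict(X_test)

# F1 for the propaganda class; zero_division=0 avoids a warning if it is never predicted.
mlp_f1 = f1_score(y_test, mlp_predictions, pos_label="propaganda", zero_division=0)
print("F1:", mlp_f1)
```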

Datathon_task1

Here are the visualisation results we got in Task 1.

[Figure: algorithm predictions using Count Vector]

[Figure: algorithm predictions using Hash Vector]

[Figure: algorithm predictions using Tf-Idf Vector]

So for Task 1, we used the Passive Aggressive Classifier algorithm to predict whether an article is propaganda or not.

Task 2:

To predict whether a sentence is propaganda or non-propaganda.

At the beginning of Task 2, we found it difficult to understand the problem statement because the data was not well organised. We received the data labels in one text file and the output data in another. We used Excel to arrange all the label fields (article_name, line_number, news_type), gathered from the different labels, into one CSV file.

The next challenge was to get each sentence from the text files into a structured format that exactly matched the labeled data. With the help of Python we combined all the sentence files into a structured format.

The cleaning of the text data and appending it to a data frame is shown in the code below.
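The combining step might look roughly like the sketch below. It assumes one `.txt` file per article with one sentence per line, and that `line_number` counts from 1; the helper `sentences_to_frame`, the file names, and the sample sentences are hypothetical and are not taken from our notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def sentences_to_frame(folder):
    """Collect every line of every .txt file in folder as one row (hypothetical helper)."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        with open(path, encoding="utf-8") as handle:
            for line_number, line in enumerate(handle, start=1):
                rows.append(
                    {
                        "article_name": path.stem,
                        "line_number": line_number,
                        "sentence": line.strip(),
                    }
                )
    return pd.DataFrame(rows)


# Build two tiny article files so the sketch runs on its own.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "article1.txt").write_text(
        "First sentence.\nSecond sentence!\n", encoding="utf-8"
    )
    Path(folder, "article2.txt").write_text("Only sentence here.\n", encoding="utf-8")
    sentences = sentences_to_frame(folder)

# Made-up labels in the article_name / line_number / news_type layout.
sentence_labels = pd.DataFrame(
    {
        "article_name": ["article1", "article1", "article2"],
        "line_number": [1, 2, 1],
        "news_type": ["non-propaganda", "propaganda", "non-propaganda"],
    }
)

# Match each sentence to its label on the (article, line) key.
merged = sentences.merge(sentence_labels, on=["article_name", "line_number"], how="inner")
print(merged)
```

Merging on both `article_name` and `line_number` is what keeps each sentence aligned with its label; a mismatch in line counting would silently drop rows from an inner join, so comparing row counts before and after is a useful check.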

task_2_clean

The same methodology we used in Task 1 was applied to Task 2, predicting whether a sentence is propaganda or not. The code file below shows the process of importing, exploring, and building models for Task 2.

datathon_new_task_2

Here are the visualisation results we got in Task 2.

[Figure: algorithm predictions using Count Vector]

[Figure: algorithm predictions using Hash Vector]

[Figure: algorithm predictions using Tf-Idf Vector]

So, for Task 2 we used Neural Networks to predict whether a sentence is propaganda or not.

Task 3:

Here is the prediction file code for Task 3.

datathon_new_task3

Summary:

For the three given tasks of predicting whether the given news text is propaganda or non-propaganda, we used three vectorisation methods and passed the resulting vectors to the models for fitting. Of all the algorithms we tried, we used the Multi Layer Perceptron and Passive Aggressive Classifier models for prediction.
