The team considered the following properties of the data when designing the solution:
Repetition of text
Length of words
Lexical analysis of words
Frequency of words
Bigrams and trigrams of words
Sentiments conveyed by the text
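The surface features in this list (word length, frequency, n-grams, repetition) can be sketched with the Python standard library. This is a toy illustration of the kinds of features named above, not the team's actual extraction code, which the article does not show:

```python
from collections import Counter

def surface_features(text):
    """Toy sketch of the surface features listed above (illustrative only)."""
    words = text.lower().split()
    freq = Counter(words)                                       # frequency of words
    avg_len = sum(len(w) for w in words) / max(len(words), 1)   # length of words
    bigrams = list(zip(words, words[1:]))                       # bigrams of words
    trigrams = list(zip(words, words[1:], words[2:]))           # trigrams of words
    repetition = 1 - len(freq) / max(len(words), 1)             # repetition of text
    return {"freq": freq, "avg_len": avg_len,
            "bigrams": bigrams, "trigrams": trigrams,
            "repetition": repetition}

feats = surface_features("the enemy will stop at nothing the enemy must be stopped")
```

A sentiment feature would additionally need a lexicon or sentiment model, which is omitted here.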
The main modeling approaches included:
LSTM (Long Short-Term Memory) with fastText embeddings.
Bidirectional LSTMs with trainable embeddings initialized from GloVe, for propaganda detection at the article level.
Business Understanding: Fake news is a massive problem for multiple industries and governments, and it needs to be addressed in a more automated way. Providing an automated method to examine text and classify it as fake news or propaganda can help reduce the effect of fake news. It is easier said than done, though, as even […]
To do this we had to go through text cleaning and understanding the text. We had to find a way to split the data and form a data frame consisting of the following columns: News_Text, News_Number, News_Type. The data contained many fillers that had to be removed, and some rows were missing the news number and type. To clean the data we removed the fillers using NLTK stop-word filtration, then tokenized the data using word_tokenize from the NLTK package. The next important step was to lemmatize/stem the data to remove tense from the words and normalize them; even though this was a time-consuming process, the results were promising.

XGBoost can handle an imbalanced dataset via the scale_pos_weight parameter, and we can throttle the probability threshold to increase sensitivity while sacrificing a reasonable amount of specificity.

Evaluation: this process is somewhat tricky for the provided training set, as the data was highly imbalanced; the dependent variable had imbalanced classes.
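The two imbalance-handling ideas mentioned above, weighting the positive class by the negative-to-positive ratio (what XGBoost exposes as scale_pos_weight) and throttling the probability threshold to trade specificity for sensitivity, can be sketched as follows. This is a minimal illustration on synthetic data, using scikit-learn's LogisticRegression with class_weight as a lighter stand-in for an XGBoost classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic imbalanced dataset standing in for the news data: positives are rare.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 1.2).astype(int)

# XGBoost's scale_pos_weight is conventionally neg/pos; the same ratio can be
# passed to scikit-learn's class_weight as an analogous correction.
ratio = (y == 0).sum() / max((y == 1).sum(), 1)
clf = LogisticRegression(class_weight={0: 1.0, 1: ratio}).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

def sensitivity(threshold):
    """Recall of the positive class at a given probability threshold."""
    pred = (proba >= threshold).astype(int)
    tp = ((pred == 1) & (y == 1)).sum()
    fn = ((pred == 0) & (y == 1)).sum()
    return tp / max(tp + fn, 1)

# Lowering the threshold below 0.5 can only add positive predictions,
# so sensitivity is non-decreasing as the threshold drops.
print(sensitivity(0.5), sensitivity(0.3))
```

In practice the threshold would be tuned on a validation split against the specificity one is willing to give up, not on the training data as in this sketch.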
We are back to participate in another Datathon hosted by the Data Science Society. This time the theme is text analytics.
We will not be able to devote ourselves completely to the cause this time because of exams, which start next week. But we will try to keep the article as simple and well detailed as possible, so that it is helpful to any new data science enthusiast looking for a little help. So let's roll.
News is the lifeline of human society: it underlines all the important events and influences public opinion like no other tool. But with the recent advent of electronic media, the sheer amount of news being churned out, and the current political climate, it is hard to figure out what is genuine news and what is propaganda. This is where intelligent systems that can classify news articles and text fragments as propagandistic or non-propagandistic come into play. This Datathon is focused on developing such a system using various algorithms and methods. The levels of the challenge are:
A system that can classify whether a news article is propaganda or not.
A system that can classify whether a sentence in an article is propaganda or not.
A system that can classify the propaganda technique used in the news piece.
This article describes our submission for the Hack the News Datathon 2019; it focuses on Task 2, propaganda sentence classification. It outlines our exploratory data analysis, methodology, and future work. Our work revolves around the BERT model, as we believe it offers an excellent language model that is also good at attending to context, an important aspect of propaganda detection.
In 1938 the Institute for Propaganda Analysis defined propaganda as: “The expression of an opinion or an action by individuals or groups deliberately designed to influence the opinions or the actions of other individuals or groups with reference to predetermined ends”. The point of view, highlights, and storytelling expressed in […]
This is the Leopards team's submission for the Propaganda Detection Datathon. Key finding: the best-performing classifier is logistic regression operating on Word2vec representations of the sentences plus several designed features, such as the proportion of sentiment-bearing words in the sentence.
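The pipeline described here, averaged Word2vec vectors concatenated with designed features and fed to logistic regression, can be sketched as follows. The word vectors, sentiment lexicon, sentences, and labels below are all made up for illustration; in the team's setup the vectors would come from a trained Word2vec model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-trained word vectors (stand-ins for Word2vec output).
rng = np.random.default_rng(42)
vocab = ["the", "glorious", "leader", "weather", "is", "terrible", "report", "rain"]
vectors = {w: rng.normal(size=4) for w in vocab}

SENTIMENT_WORDS = {"glorious", "terrible"}  # toy sentiment lexicon (assumption)

def featurize(sentence):
    """Average the word vectors, then append the designed feature:
    the proportion of sentiment-bearing words in the sentence."""
    words = sentence.lower().split()
    vecs = [vectors[w] for w in words if w in vectors]
    avg = np.mean(vecs, axis=0) if vecs else np.zeros(4)
    sent_prop = sum(w in SENTIMENT_WORDS for w in words) / max(len(words), 1)
    return np.concatenate([avg, [sent_prop]])

sentences = ["the glorious leader", "the weather report",
             "terrible glorious leader", "rain is rain"]
labels = [1, 0, 1, 0]  # toy labels: 1 = propaganda
X = np.stack([featurize(s) for s in sentences])
clf = LogisticRegression().fit(X, labels)
```

The concatenation means the classifier can weigh the dense sentence representation and the hand-designed feature jointly, which is presumably why the combination outperformed either signal alone.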
Edit: the link to the GitHub repository is here: https://github.com/mboyanov/propaganda-deteciton