Introduction to NLP
Natural Language Processing (NLP) is the field of computer science concerned with developing algorithms for the analysis of human languages. Artificial Intelligence approaches (e.g. Machine Learning) have been used to solve many NLP tasks, such as parsing, POS tagging, Named Entity Recognition, word sense disambiguation, document classification, machine translation, textual entailment, question answering, and summarization. Natural languages are notoriously difficult for machines to understand and model, mostly because of ambiguity (e.g. humor, sarcasm, puns), lack of clear structure, and diversity (e.g. models for English are not directly applicable to Chinese). Even so, recent years have seen rapid progress in NLP, driven by deep learning models that are becoming more complex and better able to capture the subtleties of human languages.
Dina Zaychik, dzay; Sergey Sedov, Sianur. Task 1. The hypothesis is that propaganda versus non-propaganda at the article level can be detected using distributional semantics features. That is why we performed thorough preprocessing: removing URLs, hashtags, unusual symbols, unusual article beginnings, non-English first paragraphs (using the open-source langid package), and short texts. After that we trained a fastText supervised model (the […]
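The cleaning steps named in this excerpt (removing URLs, hashtags, unusual symbols, and short texts) could look roughly like the sketch below. It is not the authors' code: the regular expressions, the set of allowed characters, and the `MIN_WORDS` cutoff are all assumptions chosen for illustration, and the language filtering step is left out.

```python
import re

# All patterns below are illustrative assumptions, not the team's actual rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Keep letters, digits, whitespace and basic punctuation; drop everything else.
UNUSUAL_RE = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()\-]")

MIN_WORDS = 5  # hypothetical cutoff for what counts as a "short text"


def clean_text(text):
    """Strip URLs, hashtags and unusual symbols, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = UNUSUAL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(articles):
    """Clean every article and drop those shorter than MIN_WORDS words."""
    cleaned = (clean_text(a) for a in articles)
    return [c for c in cleaned if len(c.split()) >= MIN_WORDS]
```

The cleaned texts would then be written out in whatever format the chosen classifier expects.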
The team considered the following properties of the data when coming up with the solution:
Repetition of text
Length of words
Lexical analysis of words
Frequency of words
Trigrams and bigrams of words
Sentiments conveyed by the
The main modeling included:
LSTM (long short-term memory) with embeddings from fastText.
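Several of the properties listed above (word length, word frequency, bigrams and trigrams, repetition) can be computed with the standard library alone. The sketch below is an illustration under simple assumptions: tokens are obtained by lowercasing and splitting on whitespace, and `repetition_ratio` is a hypothetical measure (the share of tokens that repeat an earlier token), not something defined in the excerpt.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all consecutive n-token tuples from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


def text_features(text):
    """Compute simple word-level features for one text.

    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    n_tokens = len(tokens)
    return {
        "avg_word_length": sum(len(t) for t in tokens) / n_tokens if tokens else 0.0,
        "word_freq": counts,
        "bigrams": Counter(ngrams(tokens, 2)),
        "trigrams": Counter(ngrams(tokens, 3)),
        # Hypothetical repetition measure: share of tokens that are repeats.
        "repetition_ratio": 1 - len(counts) / n_tokens if tokens else 0.0,
    }
```

Features like these could be fed to a classical classifier, or used alongside the embeddings that the LSTM consumes.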
Using Bidirectional LSTMs and trainable embeddings initialized with GloVe for propaganda detection at the article level
Task 1 Part 1 Task 1 Part 2
Business Understanding: Fake news is a massive problem for multiple industries and governments, and it needs to be addressed in a more automated way. Providing an automated method to examine text and classify it as fake or propaganda can help reduce the effect of fake news. This is easier said than done, though, as even […]
To do this, we had to go through the process of cleaning and understanding the text. We had to find a way to split the data and form a data frame consisting of the following columns: News_Text, News_Number, and News_Type.
The data has lots of fillers, which had to be removed, and some rows where News_Number and News_Type were missing. To clean the data, we removed the fillers using NLTK stop-word filtering. We then tokenized the data using the word_tokenize function from the nltk package. The next important step was to lemmatize/stem the data to remove tense from the words and normalize them. Even though this was a time-consuming process, the results were promising.
XGBoost can handle an imbalanced dataset through the scale_pos_weight parameter. We can also adjust the probability threshold to increase sensitivity at the cost of a reasonable amount of specificity.
Evaluation: This process is tricky for the training dataset provided, because the data was highly imbalanced: the dependent variable had imbalanced classes.
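The two imbalance-handling ideas mentioned here can be illustrated without any model at all. The sketch below assumes binary labels with 1 as the positive (minority) class. It computes the negative-to-positive ratio that is commonly used as a starting value for XGBoost's scale_pos_weight, and shows how lowering the decision threshold trades specificity for sensitivity. The function names are hypothetical, and the sketch assumes both classes are present.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive examples (assumes at least one positive)."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return neg / pos


def sensitivity_specificity(labels, probs, threshold):
    """Sensitivity and specificity when predicting 1 for probs >= threshold.

    Assumes both classes occur in labels, so neither denominator is zero.
    """
    tp = fn = tn = fp = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)
```

Sweeping the threshold over a validation set and inspecting both numbers is one simple way to pick an operating point.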
We are back to participate in another Datathon hosted by the Data Science Society. This time the theme is Text Analytics.
We will not be able to devote ourselves completely to the cause this time because of exams that start next week. But we’ll try to keep the article as simple and detailed as possible so that it is helpful for any new data science enthusiast looking for a little help. So let’s roll.
Propaganda is a form of communication aimed at influencing the attitude of a community toward some cause or position. It often presents facts selectively to encourage a particular synthesis. Disinformation damages the reputation of respectable news outlets and organisations, and is very bad for business. The objective of the Hackathon is to distinguish propaganda from non-propaganda news and to develop a model that can help with this task. The other objectives of this work include detecting propagandistic phrases and identifying the type of propaganda used. The algorithms we draw on are Passive Aggressive, Multilayer Perceptron, Logistic Regression, AdaBoost, Decision Tree, Random Forest, KNN, SVM, and Naive Bayes, which we use to detect potentially propagandistic and non-propagandistic sentences in a news article. For evaluation, we calculate the F1 score, which remains informative despite the class imbalance in the testing dataset. We used the best model for detecting propagandistic and non-propagandistic articles and phrases, as well as the type of propaganda.
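To see why F1 is a common choice when classes are imbalanced, compare it with accuracy on a toy example. The sketch below is a plain-Python illustration of the standard definitions, not the evaluation code used in this work, and it assumes the positive class is labeled 1.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A classifier that labels every article as non-propaganda can reach high accuracy on a dataset with few propaganda articles, yet its F1 score for the propaganda class is zero.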
News is the lifeline of human society: it highlights all the important events and influences public opinion like no other tool. But with the recent advent of electronic media, the sheer amount of news being churned out, and the current political climate, it is hard to figure out what is genuine news and what is propaganda. This is where intelligent systems that can classify news articles and text fragments as propagandistic or non-propagandistic come into play. This Datathon is focused on developing such a system using various algorithms and methods. The levels of the challenge are:
A system that is able to classify whether a news article is propaganda or not.
A system that is able to classify whether a sentence in an article is propaganda or not.
A system that is able to classify the propaganda technique used in the news piece.
In recent years, deceptive content such as fake news and fake reviews, also known as opinion spam, has increasingly become a danger for online users. Fake reviews have affected consumers and stores alike. Furthermore, the problem of fake news gained attention in 2016, especially in the aftermath of the last U.S. presidential election. Fake reviews and fake news are closely related phenomena, as both consist of writing and spreading false information or beliefs. The opinion spam problem was formulated for the first time a few years ago, but it has quickly become a growing research area due to the abundance of user-generated content. It is now easy for anyone to write fake reviews or fake news on the web. The biggest challenge is the lack of an efficient way to tell the difference between a real review and a fake one; even humans are often unable to tell the difference. We are implementing seven machine learning classification techniques here.