Datathons Solutions

Datathon – HackNews – Solution – The Blind Scientists

Introduction to NLP
Natural Language Processing (NLP) is the field of computer science concerned with developing algorithms for the analysis of human languages. Artificial Intelligence approaches (e.g., Machine Learning) have been used to solve many NLP tasks such as parsing, POS tagging, Named Entity Recognition, word sense disambiguation, document classification, machine translation, textual entailment, question answering, and summarization. Natural languages are notoriously difficult for machines to understand and model, mostly because of ambiguity (e.g., humor, sarcasm, puns), lack of clear structure, and diversity (e.g., models for English are not directly applicable to Chinese). Even so, in recent years we are witnessing rapid progress in the field of NLP, due to deep learning models, which are becoming more and more complex and able to capture the subtleties of human languages.


Propaganda: is it real or not? Let’s find out.

“Propaganda is the new weapon that influences people’s opinions or beliefs toward a certain ideology, whether that ideology is right or wrong.”


  1. Business Objective:- News has nowadays become an important part of human civilization, as it gives us the freedom to express what we think and to perceive things as they are. It is therefore very important to get authentic news instead of fake news. Here we have been given a news data set, which we analyse to determine which news is fake and which isn’t, and to flag the fake items, thus making the news ecosystem better.
  2. Assess Situation:- By applying a prediction model and performing time series analysis on the news data, we can predict beforehand which news is fake and which isn’t. As a result, news organisations can identify problematic sources early and take corrective action before misinformation spreads. This saves huge resources in terms of time and money.
  3. Determine Data Mining Goals:- By analysing data from every news source, we can identify and pinpoint fake or propagandistic articles and measure the reliability of each source. This allows staff to address issues even before misinformation starts to spread.
  4. Produce Project Plan:- The project is implemented by us using the following tools and techniques:-
  • Python
  • Machine Learning algorithm: Passive Aggressive Classifier
  • NLTK
  • MS-Excel
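As a sketch of how these tools could fit together (the column names come from the data description later in this write-up, while the pipeline itself and the toy texts are our assumptions), a minimal Python setup might look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Minimal sketch of the planned tool stack: TF-IDF features feeding a
# Passive Aggressive Classifier. Texts and labels are toy placeholders.
texts = ["the economy is collapsing, they are hiding the truth",
         "the central bank raised interest rates by 0.25 percent",
         "wake up, the mainstream media lies to you every day",
         "parliament passed the new budget on Tuesday"]
labels = ["propaganda", "non-propaganda", "propaganda", "non-propaganda"]

model = make_pipeline(TfidfVectorizer(), PassiveAggressiveClassifier(max_iter=1000))
model.fit(texts, labels)

print(model.predict(["they are hiding the truth from you"]))
```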


a) Collect Initial Data:- The data was provided to us by the Data Science Society and consists of fake news articles.

b) Describe Data:- Since news has become one of the major deciding factors in almost everything nowadays, it is better to get genuine news instead of fake news, so that we are not influenced toward the wrong side and wrong decisions are not made. To predict further flaws in the system and correct them, we need an algorithm that does that task for us, instead of us doing it by hand. As we dig deep into this data, we will find valuable insights that help us reduce the failure rate, improve the success rate, and provide a better working system in the near future.

Lately the fact-checking world has been in a bit of a crisis. Sites like Politifact and Snopes have traditionally focused on specific claims, which is admirable but tedious; by the time they’ve gotten through verifying or debunking a fact, there’s a good chance it’s already traveled across the globe and back again.

Social media companies have also had mixed results limiting the spread of propaganda and misinformation. Facebook plans to have 20,000 human moderators by the end of the year, and is putting significant resources into developing its own fake-news-detecting algorithms.

c) Explore Data:- The data set contains 35,993 rows and 3 columns.

d) Verify Data Quality:- The data did contain some null values. The news_text column, a variable we considered from the start, had 1 null value; the ID column had 36 null values, followed by the type column, also with 36 null values. The test data had zero null values.
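A quick way to reproduce this kind of check (assuming the data is loaded into a pandas DataFrame with the columns ID, news_text, and type; the toy frame below stands in for the real file) is:

```python
import pandas as pd

# Toy frame standing in for the training set; in practice this would be
# pd.read_csv(...) on the provided file.
df = pd.DataFrame({
    "ID": [1, 2, None],
    "news_text": ["some article text", None, "another article"],
    "type": ["fake", "real", None],
})

# Count missing values per column, as done in the data-quality step.
null_counts = df.isnull().sum()
print(null_counts)
```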

3. Data Preparation:-

a) Data selection: The data set we are going to analyse is the Fake News data set, from which we mine results based on the articles given to us. As news is an essential part of human existence, and lives depend on regular updates from newspapers and media channels, it is necessary to find the failures, so as to improve them, make the news system better, predict further flaws in the system, and correct them. The file chosen is a formatted file of 53 MB containing 35,993 rows and 3 columns.

b) Clean data: The overall structure of the data was already clean enough to carry out the different analytical methodologies.

c) Construct data: The raw text, however, was not that clean, so we had to clean it first; we did that using regular expressions, and then we started working on it.
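The exact cleaning rules are not listed in this write-up; a typical regular-expression pass of the kind described (lowercasing, stripping URLs, punctuation, and extra whitespace; all of these specific rules are our assumptions) might look like:

```python
import re

def clean_text(text: str) -> str:
    """Clean a news article with regular expressions."""
    text = text.lower()                         # casefold
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters only
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

print(clean_text("BREAKING!!! Read more at http://example.com NOW."))
```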

d) Integrate data: The data was already well integrated, so we focused on grouping the specific columns to obtain adequate results for both the EDA part and the modelling part.

e) Format data: The format of the data was well structured and grouped for further exploratory analysis in both parts.


4. Modelling:-

a) Select Modelling Technique: The Machine Learning model used for the modelling was the Passive Aggressive Classifier.

b) Generate Test Design: The test data was generated by grouping the two columns of the data provided, news_text and type.

c) Build Model / Parameter Setting:- Model: Passive Aggressive Classifier. Passive Aggressive algorithms are a family of online learning algorithms (for both classification and regression) proposed by Crammer et al. The idea is very simple and their performance has been proven to be superior to many alternative methods such as the Online Perceptron and MIRA.


Let’s suppose we have a data set:

D = \{ (\bar{x}(t), y(t)) \}, \quad \bar{x}(t) \in \mathbb{R}^n, \; y(t) \in \{-1, +1\}

The index t has been chosen to mark the temporal dimension. In this case, in fact, the samples can keep arriving for an indefinite time. Of course, if they are drawn from the same data-generating distribution, the algorithm will keep learning (probably without large parameter modifications), but if they are drawn from a completely different distribution, the weights will slowly forget the previous one and learn the new distribution. For simplicity, we also assume we’re working with a binary classification based on bipolar labels.

Given a weight vector \bar{w}, the prediction is simply obtained as:

\hat{y}(t) = \mathrm{sign}(\bar{w} \cdot \bar{x}(t))

All these algorithms are based on the hinge loss function (the same used by SVMs):

L(\bar{w}; \bar{x}(t), y(t)) = \max(0, \; 1 - y(t)(\bar{w} \cdot \bar{x}(t)))

d) Assess Modelling: The modelling is assessed using the test data mentioned above, and the evaluation parameters were obtained.


After the model was run, the results were evaluated.

Firstly, to assess the data and gain insights from it, we used the machine learning algorithm (the Passive Aggressive Classifier). We were facing some difficulties finding the time series day-wise, so we assessed the model monthly; the output is as follows:




Plan Deployment

Passive-Aggressive Active (PAA) learning algorithms are obtained by adapting the Passive-Aggressive algorithms to online active learning settings. Unlike conventional Perceptron-based approaches that employ only the misclassified instances for updating the model, the PAA learning algorithms not only use the misclassified instances to update the classifier, but also exploit correctly classified examples with low prediction confidence. Specifically, several variants of the PAA algorithms have been proposed to tackle three types of online learning tasks: binary classification, multi-class classification, and cost-sensitive classification. Mistake bounds for these algorithms have been given in theory, and extensive experiments on both standard and large-scale data sets validate their empirical effectiveness.

The value of L is bounded between 0 (meaning a perfect match) and K, depending on f(x(t),θ), with K > 0 (a completely wrong prediction). A Passive-Aggressive algorithm works generically with this update rule:

\bar{w}(t+1) = \underset{\bar{w}}{\mathrm{argmin}} \; \frac{1}{2}\|\bar{w} - \bar{w}(t)\|^2 + C\xi, \quad \text{subject to } L(\bar{w}; \bar{x}(t), y(t)) \le \xi

To understand this rule, let’s assume the slack variable ξ=0 (and L constrained to be 0). If a sample x(t) is presented, the classifier uses the current weight vector to determine the sign. If the sign is correct, the loss function is 0 and the argmin is w(t). This means that the algorithm is passive when a correct classification occurs. Let’s now assume that a mis-classification occurred:

The angle θ > 90°; therefore, the dot product is negative and the sample is classified as -1; however, its label is +1. In this case, the update rule becomes very aggressive, because it looks for a new w which must be as close as possible to the previous one (otherwise the existing knowledge is immediately lost), but which must satisfy L = 0 (in other words, the classification must be correct).

The introduction of the slack variable allows us to have soft margins (as in SVM) and a degree of tolerance controlled by the parameter C. In particular, the loss function is constrained to L <= ξ, allowing a larger error. Higher C values yield stronger aggressiveness (with a consequently higher risk of destabilization in the presence of noise), while lower values allow a better adaptation. In fact, this kind of algorithm, when working online, must cope with the presence of noisy samples (with wrong labels). Good robustness is necessary; otherwise, too-rapid changes produce consequently higher misclassification rates.

After solving both update conditions, we get the closed-form update rule:

\bar{w}(t+1) = \bar{w}(t) + \frac{\max(0, \; 1 - y(t)(\bar{w}(t) \cdot \bar{x}(t)))}{\|\bar{x}(t)\|^2 + \frac{1}{2C}} \, y(t)\bar{x}(t)

This rule confirms our expectations: the weight vector is updated with a factor whose sign is determined by y(t) and whose magnitude is proportional to the error. Note that if there’s no misclassification the numerator becomes 0, so w(t+1) = w(t), while, in case of misclassification, w rotates towards x(t) and stops with a loss L <= ξ. In the next figure, the effect has been exaggerated to show the rotation; normally, it is as small as possible:

After the rotation, θ < 90° and the dot product becomes positive, so the sample is correctly classified as +1. Scikit-Learn implements Passive Aggressive algorithms, but I preferred to implement the code myself, just to show how simple it is. In the next snippet (also available in this GIST), I first create a data set, then compute the score with a Logistic Regression, and finally apply the PA algorithm and measure the final score on a test set:
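The snippet itself is missing from this write-up, so what follows is a sketch of what it plausibly looked like (the data-set parameters, the value of C, and the added bias feature are our assumptions); it implements the closed-form PA update rule described above and compares it against Logistic Regression:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

np.random.seed(1000)

# Synthetic binary data with bipolar labels {-1, +1}
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=1000)
y = 2 * y - 1
X = np.hstack([X, np.ones((X.shape[0], 1))])  # bias feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1000)

# Baseline score with Logistic Regression
lr = LogisticRegression().fit(X_train, y_train)
print("Logistic Regression score:", lr.score(X_test, y_test))

# Passive-Aggressive online training (soft margin, parameter C)
C = 0.01
w = np.zeros(X_train.shape[1])

for x_t, y_t in zip(X_train, y_train):
    loss = max(0.0, 1.0 - y_t * np.dot(w, x_t))        # hinge loss
    tau = loss / (np.dot(x_t, x_t) + 1.0 / (2.0 * C))  # closed-form step size
    w += tau * y_t * x_t                               # aggressive update

pa_score = np.mean(np.sign(X_test @ w) == y_test)
print("Passive-Aggressive score:", pa_score)
```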


For all three tasks there are multiple alternatives to compute representations, from manually-engineered to automatically-inferred features. Perhaps the most straightforward representation is the one known as the bag-of-words model (BoW). In BoW the order of the words is neglected and each of them is weighted either on the basis of statistics of the single document, a collection, or both. Other valuable representations include the occurrence of certain words (e.g., particularly negative/positive ones) or the style of the writing. Consider for instance MPQA’s or Bing Liu’s lexicons. Be creative! Try novel representations!

Another option is considering distributional representations: embeddings. These are models that map words, sentences, or full documents into a vector space. One good property of such vectors is that representations of semantically-similar words appear close to each other in that space. There are multiple pre-computed embedding models available online, so you do not need to train your own model from large volumes of data. For instance, consider GloVe, word2vec, or fastText.
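To sketch how such embeddings are typically used to represent a document (the tiny hand-written 3-dimensional vectors below are placeholders; in practice they would be 50-300 dimensional vectors loaded from a pre-trained GloVe or word2vec file), one can average the word vectors of a sentence:

```python
import numpy as np

# Placeholder "embeddings"; real models provide vectors loaded from a file.
embeddings = {
    "fake":  np.array([0.9, 0.1, 0.0]),
    "news":  np.array([0.4, 0.5, 0.1]),
    "truth": np.array([0.1, 0.8, 0.3]),
}

def embed(sentence: str) -> np.ndarray:
    """Represent a sentence as the mean of its known word vectors."""
    vecs = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

print(embed("fake news"))  # mean of the two word vectors
```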

Usually the computation of such representations requires a number of preprocessing steps, which may include stopword removal, stemming and/or lemmatization, part-of-speech tagging, casefolding, punctuation removal, etc. Multiple libraries exist to perform these tasks (cf. Tools and Frameworks).
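A minimal preprocessing pass of this kind (the tiny stopword list is only a stand-in; libraries such as NLTK ship far more complete resources, stemmers, and taggers) could be:

```python
import re

STOPWORDS = {"the", "is", "a", "of", "to", "and"}  # tiny stand-in list

def preprocess(text: str) -> list:
    text = text.lower()                    # casefolding
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation removal
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("The truth of the matter is simple."))
```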

One of the simplest classification models is the k-nearest-neighbours algorithm. In this case, there is no training stage; instead, a new item is assigned to the majority class among its k closest elements in the representation space. More sophisticated models include naïve Bayes, support-vector machines, or the multi-layer perceptron, among many other alternatives.
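A k-nearest-neighbours classifier over BoW-style vectors can be put together in a few lines (the toy texts, labels, and the choice of k = 1 are ours):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy labeled documents; a new item takes the class of its nearest neighbour.
texts = ["they are lying to you", "rates rose by a quarter point",
         "the elites control everything", "the committee met on Monday"]
labels = ["propaganda", "neutral", "propaganda", "neutral"]

knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
knn.fit(texts, labels)

print(knn.predict(["they control everything"]))
```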

Task three is a sequential task in which each fragment of the text (e.g., a token) has to be labeled as one of the propagandistic techniques or none of them. This is perhaps the task that most resembles the "standard" task of named entity recognition. There is plenty of material online about this technique, including an introduction to the topic and a tutorial using sklearn.
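As a very simplified sketch of such token-level labeling (a real system would use contextual features or a sequence model such as a CRF; the toy tokens, labels, and features here are entirely ours):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy token-level training data: each token gets a label, as in NER-style
# sequence tagging (here: part of a "loaded language" span or not).
tokens = ["the", "corrupt", "elites", "met", "on", "monday"]
labels = ["O", "LOADED", "LOADED", "O", "O", "O"]

def features(tok: str) -> dict:
    """Simple per-token features; real taggers also use neighbouring tokens."""
    return {"word": tok, "suffix2": tok[-2:], "is_long": len(tok) > 5}

clf = make_pipeline(DictVectorizer(), LogisticRegression(C=10.0))
clf.fit([features(t) for t in tokens], labels)

print(clf.predict([features("corrupt")]))
```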
