Team Name : Stark
Team Members : Vasil Dimitrov
PROBLEM STATEMENT :
Hack the News is all about propaganda versus non-propaganda.
Propaganda is a view that can mislead us into false assumptions,
so here we got a chance to identify propaganda in news articles.
Propaganda has become a significant issue in many sectors because it distorts the facts.
The problem here is propaganda in news articles: writers use this platform to influence people with their writing, whether it is right or wrong, so we have to find a solution with the required measures.
Technologies and Packages Used:
Python, pandas, numpy, sklearn, XGBoost
As a single-person team, I did not have enough time to cover all 3 tasks and focused on just the first one.
Task 1: Propaganda detection at the article level (PAL).
We have to classify each article as propagandistic or non-propagandistic.
The given train dataset contains 35986 rows and 3 columns in TAB-separated format:
1) First column contains the title and the contents of the article
2) Second column is the article id
3) Third column is the label of the article. Values are: “propaganda”, “non-propaganda”
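The three-column layout described above can be read with pandas roughly as follows. This is a minimal sketch, not the code from the attached ZIP: the column names, the absence of a header row, and the two sample rows are assumptions made for illustration.

```python
import csv
import io

import pandas as pd

# Two made-up rows in the described layout: text <TAB> article id <TAB> label.
SAMPLE_TSV = (
    "Title one. Body of the first article.\t1001\tpropaganda\n"
    "Title two. Body of the second article.\t1002\tnon-propaganda\n"
)


def load_articles(source):
    """Read the TAB-separated data into a DataFrame with named columns."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,  # assumption: the file has no header row
        names=["text", "article_id", "label"],  # hypothetical column names
        quoting=csv.QUOTE_NONE,  # treat quote characters in article text literally
    )


df = load_articles(io.StringIO(SAMPLE_TSV))

# Binary target: 1 for "propaganda", 0 for "non-propaganda".
df["target"] = (df["label"] == "propaganda").astype(int)
```

For the real data, a file path would be passed to `load_articles` instead of the in-memory sample.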
Approach and Algorithms used:
# Vectorizers used –
- Count Vectorizer
- TF-IDF Vectorizer
# ML Algorithms used –
- Logistic Regression
- Decision Tree
- Random Forest
- Naive Bayes
- SVM
- XGBoost
# Accuracy Measures
- F1 score: used for evaluation
I decided to start with simple algorithms that are proven to run fast, stay competitive, and sometimes even outperform bleeding-edge, computationally expensive algorithms, and to spend more time on fine-tuning parameters.
So I tested Count and TF-IDF vectorisers (in most cases TF-IDF worked better) and experimented with n-grams (1-4), min-df and max-df; removing stop words also improved performance.
The classification methods on top were: Logistic Regression, Decision Tree (dismal performance), SVM, XGBoost and Naive Bayes.
SVM had a clear advantage over Logistic Regression, Naive Bayes and XGBoost. I spent the rest of the time experimenting with values for the SVC parameters.
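The vectoriser options mentioned here map onto scikit-learn's `TfidfVectorizer` parameters as sketched below. The parameter values and the four sample sentences are illustrative assumptions chosen to work on a tiny corpus; they are not the settings used in the submission.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences standing in for article texts.
docs = [
    "The government hides the shocking truth from the people",
    "The council approved the new budget on Tuesday",
    "They want to destroy our way of life, the truth is hidden",
    "The committee published the annual budget report",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams; the write-up explored up to 4-grams
    min_df=1,              # keep terms seen in at least 1 document (toy value)
    max_df=0.9,            # drop terms present in more than 90% of documents
    stop_words="english",  # scikit-learn's built-in English stop word list
)

# Sparse matrix with one row per document and one column per kept n-gram.
X = vectorizer.fit_transform(docs)
vocabulary = vectorizer.vocabulary_
```

Swapping `TfidfVectorizer` for `CountVectorizer` with the same arguments gives the raw-count variant for comparison.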
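Experimenting with SVC parameter values can be organised as a grid search over a TF-IDF plus SVM pipeline. The sketch below is one possible setup under stated assumptions: the linear kernel, the grid of `C` values, the 2-fold cross-validation and the toy texts are all illustrative and not taken from the submitted code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up texts; label 1 = propaganda, 0 = non-propaganda.
texts = [
    "They are destroying our nation and hiding the shocking truth",
    "The city council approved the annual budget on Tuesday",
    "Wake up, the corrupt elites are lying to every patriot",
    "The committee published its quarterly transport report",
    "Only traitors would question our glorious leader",
    "Rainfall this month was slightly above the seasonal average",
    "The enemy wants to destroy everything our patriots built",
    "The museum will extend its opening hours during summer",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Vectoriser and classifier in one pipeline, so each CV fold fits TF-IDF
# on its own training split only.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svc", SVC(kernel="linear")),  # assumption: linear kernel
])

# Hypothetical grid over the regularisation strength C.
param_grid = {"svc__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

best_params = search.best_params_
predictions = search.predict(["The traitors are hiding the truth from patriots"])
```

Vectoriser settings such as `tfidf__ngram_range` or `tfidf__min_df` can be added to the same grid so that representation and classifier are tuned together.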
The best F1 score achieved on the dev input set on the leaderboard was 0.8569 (with precision 0.9414).
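For reference, F1 is the harmonic mean of precision and recall, so the recall implied by a reported F1 and precision can be recovered by rearranging the formula. The sketch below assumes that "propaganda" is the positive class and that both reported figures come from the same evaluation set; the five-label example is made up.

```python
def precision_recall_f1(y_true, y_pred, positive="propaganda"):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def recall_from_f1(f1, precision):
    """Invert F1 = 2PR / (P + R) to get R from F1 and P."""
    return f1 * precision / (2 * precision - f1)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = ["propaganda", "propaganda", "propaganda", "non-propaganda", "non-propaganda"]
y_pred = ["propaganda", "propaganda", "non-propaganda", "non-propaganda", "propaganda"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

# Recall implied by the reported leaderboard figures (F1 0.8569, precision 0.9414).
implied_recall = recall_from_f1(0.8569, 0.9414)
```

Under these assumptions the implied recall is roughly 0.79, which suggests the system favoured precision over recall.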
The code is provided as an attached ZIP containing an ipynb file.
A very simple, yet effective approach based on TF.IDF and character n-grams.
This is the winning system for Task 1.
Thanks for explaining the approach in your article. I am wondering how the choice of the parameters of the representation (min-df, max-df, stopwording) was made. Was it based purely on performance, or was there some intuition or analysis of the data that guided it? In the latter case, it would be nice to read about it in the article.
Hello Giovanni!
The choice of parameters was a mixture of experience, checking research papers, performance analysis of similar cases, and testing on the specific data set. I decided that lemmatisation and stemming were not a good idea in this case, as we would lose some context, while removing stopwords was a must. I regret not trying the nltk stopword corpus; the default sklearn corpus is known to have some issues.
You’re on to something as you have the best score! I’m also a proponent of SVM “on top.” Do you think you would have done even better with a neural network as a feature extractor? If yes, what was the limiting factor, and why didn’t you try?
There is a chance that a neural network will do a better job as an extractor, but given the time constraint I preferred to make a safe bet and use simple and fast methods. I intended to experiment with a neural network as well, but … I will do this in the coming days and share the results as a follow-up to the article.
Good work. It is nice that you show that a simple model can achieve top performance in this task.
Thank you Alberto!
This was also one of my points: to show that complicated algorithms do not always perform best.
But still, I did not expect to score the best F1!