Team Name : Stark
Team Members : Vasil Dimitrov
PROBLEM STATEMENT :
Hack the News is about telling propaganda apart from non-propaganda.
Propaganda is a point of view presented in a way that can mislead readers into false assumptions,
so this challenge is a chance to identify propaganda in news articles.
Propaganda has become a significant issue in many sectors because it distorts the facts.
The problem here is propaganda in news articles: writers use this platform to influence people with their writing, whether it is right or wrong, so we need a solution that detects it.
Technologies and Packages Used:
Python, pandas, numpy, sklearn, XGBoost
As a one-person team, I didn’t have time to cover all 3 tasks, so I focused on just the first one.
Task 1: Propaganda detection at the article level (PAL).
We have to classify each article as propagandistic or non-propagandistic.
The given train dataset contains 35986 rows and 3 columns in TAB-separated format:
1) First column contains the title and the contents of the article
2) Second column is the article id
3) Third column is the label of the article. Values are: “propaganda”, “non-propaganda”
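A minimal sketch of loading a file with this layout using pandas. The column names (`text`, `article_id`, `label`) and the absence of a header row are assumptions made for illustration, and the two sample rows are made up to stand in for the real training file.

```python
import io

import pandas as pd

# Hypothetical column names; the file is assumed to have no header row.
COLUMNS = ["text", "article_id", "label"]


def load_articles(source):
    """Read TAB-separated rows of (text, article id, label)."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=COLUMNS,
        quoting=3,  # csv.QUOTE_NONE: keep quote characters in article text as-is
        dtype={"article_id": str},
    )


# Tiny made-up sample standing in for the real training file.
sample = io.StringIO(
    "Title one. Body of the first article.\t1001\tpropaganda\n"
    "Title two. Body of the second article.\t1002\tnon-propaganda\n"
)
df = load_articles(sample)
print(df.shape)
print(df["label"].value_counts())
```

On the real file, passing its path instead of `sample` should give a frame with 35986 rows and 3 columns if the layout assumptions hold.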
Approach and Algorithms used:
# Vectorizers used –
- Count Vectorizer
- TF-IDF Vectorizer
# ML Algorithms used –
- Logistic Regression
- Decision Tree
- Random Forest
- Naive Bayes
- SVM
- XGBoost
# Accuracy Measures
- F1 score: used for evaluation
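For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes all three from label lists in plain Python; treating "propaganda" as the positive class is an assumption here, and the labels in the example are made up.

```python
def precision_recall_f1(y_true, y_pred, positive="propaganda"):
    """Return (precision, recall, F1) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = ["propaganda", "propaganda", "propaganda", "non-propaganda", "non-propaganda"]
y_pred = ["propaganda", "propaganda", "non-propaganda", "propaganda", "non-propaganda"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```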
I decided to start with simple algorithms that are known to train fast and to be competitive with, or even outperform, some bleeding-edge and computationally expensive algorithms, and to spend more time on fine-tuning parameters.
So I tested Count and TF-IDF vectorizers (in most cases TF-IDF worked better) and experimented with n-grams (1-4), min-df and max-df; stop-word handling also improved performance.
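The vectorizer options mentioned above map onto scikit-learn parameters roughly as sketched below. The specific values (`ngram_range=(1, 2)`, `min_df=1`, `max_df=0.9`, English stop words) are illustrative assumptions, not the settings used in the submission, and the four documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents standing in for article texts.
docs = [
    "The government hides the shocking truth from citizens",
    "Council approves budget for road repairs",
    "Traitors destroy our great nation, wake up",
    "Weather service forecasts rain this weekend",
]

# Illustrative settings only; the tuned values are not given in this write-up.
settings = dict(
    ngram_range=(1, 2),    # unigrams and bigrams (the write-up explored 1 to 4)
    min_df=1,              # keep terms that appear in at least 1 document
    max_df=0.9,            # drop terms that appear in more than 90% of documents
    stop_words="english",  # remove common English stop words
)

count_vec = CountVectorizer(**settings)
tfidf_vec = TfidfVectorizer(**settings)

X_count = count_vec.fit_transform(docs)  # raw term counts
X_tfidf = tfidf_vec.fit_transform(docs)  # counts reweighted by inverse document frequency

print(X_count.shape, X_tfidf.shape)
```

Both vectorizers build the same vocabulary under identical settings, so swapping one for the other only changes the feature weights, which makes them easy to compare with the same classifier.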
The classifiers on top were: Logistic Regression, Decision Tree (dismal performance), SVM, XGBoost and Naive Bayes.
SVM had a clear advantage over Logistic Regression, Naive Bayes and XGBoost. I spent the rest of the time experimenting with values for the SVC parameters.
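One way to experiment with SVC parameters is a grid search over a TF-IDF plus SVC pipeline, scored by F1. This is only a sketch under assumptions: the write-up does not say which parameters or values were tried, so the grid over `C` and `kernel` is hypothetical, as are the step names `tfidf` and `svm` and the toy texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up training texts and labels, four per class.
texts = [
    "Traitors are destroying our great nation",
    "The corrupt elite hide the shocking truth",
    "Wake up before the enemy takes everything",
    "Only a fool trusts the lying establishment",
    "Council approves budget for road repairs",
    "Weather service forecasts rain this weekend",
    "Local team wins the regional championship",
    "University opens a new research library",
]
labels = ["propaganda"] * 4 + ["non-propaganda"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", SVC()),
])

# Hypothetical grid; the values actually explored are not stated.
param_grid = {
    "svm__C": [0.1, 1.0, 10.0],
    "svm__kernel": ["linear", "rbf"],
}

# F1 on the "propaganda" class (assumed to be the positive class).
scorer = make_scorer(f1_score, pos_label="propaganda")

search = GridSearchCV(pipeline, param_grid, scoring=scorer, cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["The elite are lying to our nation"]))
```

Keeping the vectorizer inside the pipeline means it is refit on each training fold, so the cross-validated F1 is not inflated by vocabulary learned from the held-out fold.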
The F1 score achieved on the dev input set on the leaderboard was up to 0.8569 (with precision 0.9414).
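Since F1 is the harmonic mean of precision and recall, the recall implied by these two figures can be recovered by rearranging F1 = 2PR / (P + R) into R = F1 * P / (2P - F1). The sketch below does that arithmetic, assuming both numbers come from the same leaderboard run and refer to the same class; the result is roughly 0.79, meaning the model was more precise than it was complete.

```python
def recall_from_f1(f1, precision):
    """Solve F1 = 2PR / (P + R) for R, given F1 and P."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this F1/precision pair")
    return f1 * precision / denominator


# Figures reported above for the dev set.
recall = recall_from_f1(0.8569, 0.9414)
print(round(recall, 4))
```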
The code is provided as an attached ZIP with an ipynb file.