Business Understanding
With the rise of social media, everybody is free to share their thoughts and ideas. At the same time, various news outlets engage in the unethical practice of delivering deliberate disinformation. Therefore, with great freedom of online speech comes an even greater responsibility: the ability to differentiate between fake, i.e. propagandistic, text and an objective, evidence-based view.
Catching fake news and propaganda is an important open-source cause to which we would like to contribute. What follows is our case study for the 2019 Datathon – Hack the News.
Since our team consists of 7 people, we aim to address all 3 tasks.
TASK 1 – technical documentation can be found in the Task1.zip file.
Summary
This approach is based on a Recurrent Neural Network with an LSTM layer. The text fed into the network is represented as word embeddings. We tried different approaches for the embeddings: using pretrained Word2Vec vectors and learning the embeddings inside the network itself. Several network architectures with different hyperparameters were fine-tuned.
- Data Prep
1.1 Text standardization
The text preparation includes a few standardization steps: special characters and punctuation are removed, each word is lemmatized, and, as a final step, stop words are removed.
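A minimal sketch of these steps, assuming nltk's lemmatizer and stop-word list (the exact libraries are an assumption on our part):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet"); nltk.download("stopwords")  # one-off corpus downloads
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def standardize(text):
    # remove special characters and punctuation, lowercase the text
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # lemmatize each word, then drop stop words as the final step
    words = [lemmatizer.lemmatize(w) for w in text.split()]
    return " ".join(w for w in words if w not in stop_words)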
1.2 Random splits
The other part of the data preparation is to split the data into two random samples. The first is used for model training and the second to check for model overfitting.
1.3 Tokenization
Each article is represented as a sequence of tokens corresponding to the words in the article; tokenization transforms the raw text into sequences of word indexes. The first 6000 tokens are used to represent each article. The number of tokens taken from an article is a hyperparameter that was fine-tuned for optimal performance.
1.4 Padding
Since each article has a different number of words, i.e. tokens, we had to standardize their lengths, because the neural network requires inputs of equal length. The average number of tokens in the training split is 348: articles with more than 348 tokens are truncated to the first 348, and shorter articles are padded with zeros.
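The two steps can be sketched with the keras text utilities; train_texts stands for the standardized articles, and we read the 6000-token figure above as the tokenizer's vocabulary cap (an assumption on our part):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 348      # average token count in the training split
VOCAB_SIZE = 6000  # assumption: interpreted as the vocabulary cap

tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(train_texts)               # train_texts: standardized articles
sequences = tokenizer.texts_to_sequences(train_texts)

# keep the first MAX_LEN tokens of long articles; zero-pad the short ones
X = pad_sequences(sequences, maxlen=MAX_LEN, padding="post", truncating="post")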
- Model
This is the final architecture of the neural network; the keras library is used to build the model.
The first layer is an Embedding layer. We tried a few techniques here: feeding in pretrained word2vec models, as well as testing different values of embed_size (100, 200, 300). The final architecture, in which the network learns the embeddings itself rather than using pretrained ones, shows better results.
The second layer is a Bidirectional LSTM. We experimented with the number of layers and neurons; again, the current architecture is the best one we found.
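Since the exact layer sizes are not reproduced here, the following is only an illustrative sketch of this architecture family (learned embeddings feeding a Bidirectional LSTM); the embedding size and unit count are placeholders, not our tuned values:

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential()
# embedding weights are learned jointly with the network (no pretrained vectors)
model.add(Embedding(input_dim=VOCAB_SIZE, output_dim=200, input_length=MAX_LEN))
model.add(Bidirectional(LSTM(64)))          # placeholder unit count
model.add(Dense(1, activation="sigmoid"))   # propaganda vs. non-propaganda
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])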
- Performance
The performance on the training random sample: F1-score: 0.93
The performance on the second random sample: F1-score: 0.81
The performance on development sample: F1-score: 0.816176
TASK 2 – technical documentation can be found in the Task2.zip file.
For task 2 we assembled different ideas for identifying propaganda instances: neural networks (considered by some to be 'black-box' models) and a feature-derivation approach. Finally, we combined them in an ensemble model tuned for the optimal F1 score.
What follows is a discussion of the algorithms fed into the ensemble model.
Component 1 – average vector of w2v embeddings and XGBoost model
Summary
This approach is based on pretrained word2vec word embeddings. The text is split into words/tokens and each token is replaced by its word2vec vector representation. For each sentence, all vectors are combined into one average vector. These vectors are then fed into an XGBoost model.
1.2 Random splits
Here the splitting is a little more involved: simply splitting all sentences into two samples would not be objective, since sentences from the same article could end up in both. It is more robust to split at the article level, i.e. the articles themselves are divided into two parts.
1.3 Average vector
As explained earlier, each token in an article is replaced by its vector representation, and for each sentence all vectors are combined into one average vector.
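A minimal sketch of the averaging, assuming gensim's KeyedVectors and the standard Google News word2vec release (both assumptions):

import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def sentence_vector(tokens, dim=300):
    # average the word2vec vectors of the tokens present in the vocabulary
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)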
Model
An XGBoost model is used for the modelling part. The inputs are the average vector representations of the sentences, and random search is used for hyperparameter optimization.
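A sketch of this step with xgboost and scikit-learn; the search space shown is illustrative, not the grid we actually explored:

from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {                           # illustrative search space
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
    "subsample": [0.7, 0.9, 1.0],
}
search = RandomizedSearchCV(XGBClassifier(), param_dist, n_iter=20,
                            scoring="f1", cv=3, random_state=42)
search.fit(X_train, y_train)             # X_train: averaged sentence vectors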
- Performance
The performance on the training random sample: F1-score: 0.85
The performance on the second random sample: F1-score: 0.53
Component 2 – average vector of w2v embeddings and NN model
The same data preparation was conducted and the same neural network architecture as in Task 1 was employed, with a few parameters updated in line with the Task 2 data: the padding parameter was changed to suit sentence-level input, and the size of the vector representation in the embedding layer was re-optimized.
Component 3 – approach with semantic feature engineering
Data Understanding
Data Preparation
The team generated a number of semantic features from the train set text corpus.
- Sentiment analysis
Two approaches were applied (sketched after the list):
- Use a lexicon with an emotion value for each word (in the range [-5, 5]) and take the mean of the values of the words in the sentence
- Use the pretrained model from the textblob package in Python, which assigns a polarity score in [-1, 1] and a subjectivity score in [0, 1]
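A sketch of both approaches; emotion_lexicon below is a hypothetical word-to-score dictionary, while the second function uses TextBlob's built-in sentiment model:

from textblob import TextBlob

def lexicon_sentiment(tokens, emotion_lexicon):
    # mean of the [-5, 5] emotion values of the words found in the lexicon
    scores = [emotion_lexicon[t] for t in tokens if t in emotion_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

def textblob_sentiment(sentence):
    # polarity in [-1, 1], subjectivity in [0, 1]
    blob = TextBlob(sentence)
    return blob.sentiment.polarity, blob.sentiment.subjectivity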
- Entities tagging
The types and number of entities mentioned in a sentence are believed to be important for deciding whether it is propagandistic. The spaCy model en_core_web_sm was used and the following categories were tagged (a sketch follows the list):
GPE, CARDINAL, ORG, NORP, PERSON, DATE, TIME, LOC, ORDINAL, EVENT, WORK_OF_ART, FAC, LAW, PRODUCT, MONEY, QUANTITY, PERCENT, LANGUAGE
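A sketch of how the entity counts can be derived with spaCy:

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

LABELS = ["GPE", "CARDINAL", "ORG", "NORP", "PERSON", "DATE", "TIME",
          "LOC", "ORDINAL", "EVENT", "WORK_OF_ART", "FAC", "LAW",
          "PRODUCT", "MONEY", "QUANTITY", "PERCENT", "LANGUAGE"]

def entity_counts(sentence):
    # one feature per category: how many entities of that type the sentence mentions
    counts = Counter(ent.label_ for ent in nlp(sentence).ents)
    return [counts.get(label, 0) for label in LABELS]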
- Readability scores
The readability of a text or sentence can be used to predict whether it is propagandistic: unusual readability levels are expected to point to propagandistic sentences. As a source for this analysis we used https://pypi.org/project/textstat/. The following readability scores were calculated using the textstat module in Python (a sketch follows the list):
flesch_reading_ease
flesch_kincaid_grade
gunning_fog
smog_index
automated_readability_index
coleman_liau_index
linsear_write_formula
dale_chall_readability_score
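All eight scores come straight from the textstat API, one call per score:

import textstat

def readability_features(sentence):
    return [
        textstat.flesch_reading_ease(sentence),
        textstat.flesch_kincaid_grade(sentence),
        textstat.gunning_fog(sentence),
        textstat.smog_index(sentence),
        textstat.automated_readability_index(sentence),
        textstat.coleman_liau_index(sentence),
        textstat.linsear_write_formula(sentence),
        textstat.dale_chall_readability_score(sentence),
    ]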
- Tokenization
For every sentence in the train set the following operations were performed (a minimal sketch follows the list):
- Split into tokens
- Removed stop words (standard stop-word list from nltk)
- Lemmatized the tokens; we chose the verb form to represent them
- Reduced each token to its initial form (stemming)
- Created bi-grams and tri-grams
- Extracted phrases from the sentences (e.g. west_world_bad)
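A minimal sketch of this pipeline, assuming nltk throughout; here the underscore-joined n-grams stand in for the extracted phrases (as in west_world_bad):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.util import ngrams

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def preprocess(sentence):
    tokens = [t for t in sentence.lower().split() if t not in stop_words]
    tokens = [lemmatizer.lemmatize(t, pos="v") for t in tokens]  # verb form
    tokens = [stemmer.stem(t) for t in tokens]                   # initial form
    bigrams = ["_".join(g) for g in ngrams(tokens, 2)]
    trigrams = ["_".join(g) for g in ngrams(tokens, 3)]
    return tokens + bigrams + trigrams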
- Vectorization
- A word2index vocabulary representation was used; the vocabulary contains 238k unique words and phrases
- After tokenization, the tf-idf weight matrix was calculated
- Latent Semantic Analysis (LSA) was performed to reduce the 238k tf-idf features to 600 components (see the sketch after the list)
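A sketch of the vectorization with scikit-learn; the TruncatedSVD step implements the LSA reduction, and preprocess is the tokenizer sketched above (train_sentences is a stand-in for the train-set sentences):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

vectorizer = TfidfVectorizer(tokenizer=preprocess, lowercase=False)
tfidf = vectorizer.fit_transform(train_sentences)  # ~238k tf-idf features

lsa = TruncatedSVD(n_components=600, random_state=42)
X_lsa = lsa.fit_transform(tfidf)                   # 600 LSA components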
- Length of sentence
The length of the sentence (number of words after stop words were removed) was used as a feature.
Modeling
The team chose a kernel SVM classifier with parameters selected via grid search, cross-validated on 5 folds.
The model was built on a train/test split with proportions 0.7/0.3.
On the training sample we used synthetic oversampling to bring the proportion of propagandistic sentences closer to that of the test sample, which is expected to boost model performance.
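A sketch of the modelling step; we name the oversampler SMOTE here as an assumption, since only 'synthetic oversampling' is stated above, and the parameter grid is illustrative:

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from imblearn.over_sampling import SMOTE  # assumption: SMOTE as the synthetic oversampler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}  # illustrative grid
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="f1", cv=5)
search.fit(X_res, y_res)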
Component 4- Random Forest
Data Prep
The data preparation step was conducted similarly to the XGBoost data prep in Component 1.
Model
A Random Forest model is used for the modelling part. The inputs are the average vector representations of the sentences, and random search is used for hyperparameter optimization.
Component 5
The model implemented for task 3 was also used as Component 5 in task 2. For details please see the description below.
TASK 3 – technical documentation can be found in the Task3.zip file.
- High Level Idea
We are using a Deep Learning approach to classify text fragments as propaganda. The general approach is as follows:
- Split each article into sentences
- Tokenize each sentence into words
- Feed the tokens into a sequential model using a pretrained word-embedding model and get a prediction for each token
- Consolidate the results at article level
The advantage of this approach is that there are many standard LSTM and CNN models for sentiment analysis of text.
The main challenges come from the data preparation. Before modelling, the text has to be converted to tokens while keeping their relative positions and outcomes, which is crucial when the tokenizer drops words from a sentence. After modelling, the results have to be converted back to text positions. A sketch of this bookkeeping follows.
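Tokenizing with character offsets keeps enough information to project token-level predictions back onto the original text; the helper below is illustrative, not our exact code:

import re

def tokenize_with_offsets(text):
    # keep each token with its (start, end) character span so that
    # per-token predictions can be mapped back to text positions
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

# tokenize_with_offsets("Fake news!") -> [("Fake", 0, 4), ("news", 5, 9)]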
- Considered scenarios
We considered two scenarios for addressing the approach described above.
- End-To-End modelling
There are 18 kinds of propaganda, as follows:
- Appeal_to_Authority
- Appeal_to_fear-prejudice
- Bandwagon
- Black-and-White_Fallacy
- Causal_Oversimplification
- Doubt
- Exaggeration,Minimisation
- Flag-Waving
- Loaded_Language
- Name_Calling,Labeling
- Obfuscation,Intentional_Vagueness,Confusion
- Red_Herring
- Reductio_ad_hitlerum
- Repetition
- Slogans
- Straw_Men
- Thought-terminating_Cliches
- Whataboutism
One idea is to use multi-categorical classification for each token, with 19 categories: the 18 propaganda techniques plus non-propaganda. Unfortunately, the propaganda fragments overlap, which means that some tokens can belong to several categories simultaneously.
Therefore, we decided to run a separate classification for each kind of propaganda. Unfortunately, as can be seen from the table below, not all techniques are well populated.
Propaganda Type | # instances
--- | ---
Loaded_Language | 1627
Name_Calling,Labeling | 839
Repetition | 427
Doubt | 359
Exaggeration,Minimisation | 350
Flag-Waving | 182
Appeal_to_fear-prejudice | 162
Causal_Oversimplification | 133
Slogans | 110
Black-and-White_Fallacy | 83
Appeal_to_Authority | 81
Thought-terminating_Cliches | 57
Whataboutism | 52
Reductio_ad_hitlerum | 35
Red_Herring | 18
Obfuscation,Intentional_Vagueness,Confusion | 9
Straw_Men | 8
Bandwagon | 7
In practice, we obtained good models only for the top 2 techniques.
- Two Stage modelling
In order to cover all kinds of propaganda, we split the modelling task into two phases.
Phase 1. Detect a propagandistic phrase.
Phase 2. Classify a propagandistic phrase.
4.1 Word embedding
We used a pretrained GloVe model built on a Wikipedia corpus of 400k words. We tested vector sizes of 50, 100, 200, and 300; based on these tests, we selected word embeddings of size 200.
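A sketch of loading the GloVe vectors into the embedding_matrix used by the models below; the file name assumes the standard 400k-word Wikipedia/Gigaword GloVe release, and num_words and tokenizer come from the data preparation:

import numpy as np

EMBEDDING_DIM = 200
embeddings = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:  # assumed GloVe release
    for line in f:
        values = line.split()
        embeddings[values[0]] = np.asarray(values[1:], dtype="float32")

# rows follow the tokenizer's word index; words without a GloVe vector stay all-zero
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    if i < num_words and word in embeddings:
        embedding_matrix[i] = embeddings[word]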
4.2 Final model used
On the training side we considered two approaches: Bidirectional LSTM models and 1D CNN models. We selected the CNN models because the LSTMs were much slower on the hardware we used, which made selecting an appropriate network architecture and hyperparameters take much longer.
4.2.1 Propaganda identification
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, TimeDistributed, Dense
from keras.optimizers import Adam

model = Sequential()
# frozen pretrained GloVe embeddings (see 4.1)
model.add(Embedding(num_words, EMBEDDING_DIM, weights=[embedding_matrix], trainable=False))
model.add(Conv1D(filters=LATENT_DIM, kernel_size=5, padding="same"))
model.add(MaxPooling1D(pool_size=3, strides=1, padding="same"))
model.add(Conv1D(filters=LATENT_DIM, kernel_size=4, padding="same"))
model.add(MaxPooling1D(pool_size=4, strides=1, padding="same"))
model.add(Conv1D(filters=LATENT_DIM, kernel_size=3, padding="same"))
model.add(MaxPooling1D(pool_size=5, strides=1, padding="same"))
# per-token propaganda / non-propaganda prediction
model.add(TimeDistributed(Dense(20, activation="relu")))
model.add(Dense(1, activation="sigmoid"))
model.compile(
    loss="binary_crossentropy",
    optimizer=Adam(lr=0.01),
    metrics=["accuracy"],
)
Layer (type) Output Shape Param #
=================================================================
embedding_12 (Embedding) (None, None, 200) 3786200
_________________________________________________________________
conv1d_19 (Conv1D) (None, None, 32) 32032
_________________________________________________________________
max_pooling1d_15 (MaxPooling (None, None, 32) 0
_________________________________________________________________
conv1d_20 (Conv1D) (None, None, 32) 4128
_________________________________________________________________
max_pooling1d_16 (MaxPooling (None, None, 32) 0
_________________________________________________________________
conv1d_21 (Conv1D) (None, None, 32) 3104
_________________________________________________________________
max_pooling1d_17 (MaxPooling (None, None, 32) 0
_________________________________________________________________
time_distributed_10 (TimeDis (None, None, 20) 660
_________________________________________________________________
dense_20 (Dense) (None, None, 1) 21
=================================================================
Total params: 3,826,145
Trainable params: 39,945
Non-trainable params: 3,786,200
F1 score on the train-dev set is 0.25
4.2.2 Propaganda classification
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPool1D, Dropout, Dense
from keras.optimizers import Adam

model = Sequential()
# frozen pretrained GloVe embeddings (see 4.1)
model.add(Embedding(num_words, EMBEDDING_DIM, weights=[embedding_matrix], trainable=False))
model.add(Conv1D(filters=LATENT_DIM, kernel_size=5, padding="same"))
model.add(MaxPooling1D(pool_size=3, strides=1, padding="same"))
model.add(Conv1D(filters=LATENT_DIM, kernel_size=4, padding="same"))
model.add(MaxPooling1D(pool_size=4, strides=1, padding="same"))
model.add(Conv1D(filters=LATENT_DIM, kernel_size=3, padding="same"))
model.add(GlobalMaxPool1D())
model.add(Dropout(0.2))
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.2))
# 18-way softmax over the propaganda techniques
model.add(Dense(18, activation="softmax"))
model.compile(
    # binary_crossentropy was used here; categorical_crossentropy is the
    # more conventional loss for an 18-way softmax
    loss="binary_crossentropy",
    optimizer=Adam(lr=0.01),
    metrics=["accuracy"],
)
Layer (type) Output Shape Param #
=================================================================
embedding_15 (Embedding) (None, None, 200) 1359600
_________________________________________________________________
conv1d_7 (Conv1D) (None, None, 32) 32032
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, None, 32) 0
_________________________________________________________________
conv1d_8 (Conv1D) (None, None, 32) 4128
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, None, 32) 0
_________________________________________________________________
conv1d_9 (Conv1D) (None, None, 32) 3104
_________________________________________________________________
global_max_pooling1d_10 (Glo (None, 32) 0
_________________________________________________________________
dropout_19 (Dropout) (None, 32) 0
_________________________________________________________________
dense_19 (Dense) (None, 128) 4224
_________________________________________________________________
dropout_20 (Dropout) (None, 128) 0
_________________________________________________________________
dense_20 (Dense) (None, 18) 2322
=================================================================
Total params: 1,405,410
Trainable params: 45,810
Non-trainable params: 1,359,600
F1 score on the train-dev set is 0.35
- Options for further research
During the brainstorming sessions, two additional ideas came up that we did not have time to work on.
- Base task 3 on the results of task 2: all non-propagandistic sentences could then be filtered out, and task 3 would focus only on finding the propagandistic phrase within a sentence already labelled as propaganda.
- Adapt the YOLO object-detection model to this task.
Comments on "Datathon – HackNews – Solution – FlipFlops"
Nice stuff – I used a similar architecture but with 2 dropouts and more epochs (which took forever); I also used GloVe embeddings, which didn't help much.
If you are interested: https://www.datasciencesociety.net/fighting-propaganda-task1/
Very thorough analysis, I like that you looked into all three tasks. The approaches are very reasonable, and I wonder if you have any explanation why your Task1 and Task2 models were not among the top scoring.
Hi Laura,
Thank you for your comment. It’s great to see that you have found time to review it.
Unfortunately, for Task 2 we did not manage to apply the 'Component 3' part due to a data preparation issue on the test set. In the paper we described all the models we built and trained, but we did not apply all of them to the test sample. We believe this could explain the poor performance on task 2.
Thank you for participating in the event as a mentor and expert 🙂
For us as a team, it would be very beneficial and highly appreciated if you shared your thoughts on improvements and on what you would do differently.
Thanks,
Ognyan
Hi! Thanks for the kind words!
I believe that the reason our model under-performed on Task 2 is that we ran out of time and couldn't make the standard checks for over-fitting. By the time we had the final ensemble, the DEV set was already offline, so we couldn't really see its performance on out-of-sample data. This meant that we either had to go with the best single component (evaluated on DEV) or take a leap of faith and submit the ensemble on TEST.
We decided that whether we win or lose, we’ll do it as a team, so we went with the combined model. Sadly, the ship sank with everyone on board 🙂
Hi Laura,
Thank you for the good words. I am happy to understand that our work is appreciated.
For Task 1, I am thinking about the following things that could lead to a better performing model.
We didn't play enough with the hyperparameters and the architecture of the neural network.
For example, the text is padded to the mean number of tokens in article, which means that for half of the articles, some information is dropped.
If the padding parameter is adjusted, we can feed more information to the neural network.
On the other hand, the LSTMs are not working very well with long sequences, so experimenting with different layers here could be beneficial.
For example, an attention mechanism added to the LSTM layer could be tested here.
Using different types of text representations, models, and feature engineering should capture different connections in the texts.
Similar to the approach we have in Task 2, creating a stacked ensemble should give a better performing final model.
Really great article, great analysis, and great in modeling the task in a way that makes a lot of sense. The features tried should help further research on the problem.
Thank you very much for your comment. It’s great to understand that our work could help further the research.
Regarding the features, we were influenced by your presentation to the DSS community half a year ago. In every datathon we take part in, we try to keep a good balance between performance and interpretability of the models.
Hi guys. Good work and nice article. I have some questions for you, all regarding task 3:
1. You mention that, due to overlapping, you opted for running multiple binary classifiers. Did you consider trying multi-task learning? Any idea what the outcome would have been?
2. You say that you only obtained good models for 2 techniques. May I ask for which ones?
3. You report F-measures of 0.25 and 0.35. May I assume this is for the singleton tasks of spotting a propagandistic event and then classifying it with one of the techniques? Otherwise, any justification for the huge drop wrt the test set?
Hi Alberto,
Thank you for your questions. Our answers are as follows:
1. We did not consider it. Not sure why; maybe all the chaos and stress of organizing the tasks and quickly starting to produce output blinded us to this option.
2. They are the 2 most populated techniques: Loaded_Language and Name_Calling,Labeling.
3. You are correct, those figures are for the singleton tasks. When the models are applied one after the other, the performance drops significantly (essentially a pure multiplication of the two scores). Our test set score was close to the dev and train-dev scores (just a 0.005-point drop). Something we tried was to combine the two approaches: we used the individual models for the top 2 propaganda types and combined them with a joint model for the other 16 techniques. Although we got better F1 scores for each propaganda type, the overall score was lower.