
Datathon – HackNews – Solution – FlipFlops

Catching fake news and other types of propaganda is an essential open-source cause to which we would like to contribute. What follows is our case study for the 2019 Datathon – Hack the News.


 

 

Business Understanding

With the rise of social media, everybody is free to share their thoughts and ideas. In addition, some news media engage in the unethical practice of deliberately spreading disinformation. With great freedom of online speech comes an even greater responsibility: being able to tell fake, i.e. propagandistic, text apart from an objective, evidence-based view.


Since our team consists of 7 people we aim to address all 3 tasks.

 

TASK 1 – technical documentation can be found in the Task1.zip file.

Task1

Summary

This approach is based on a Recurrent Neural Network with an LSTM layer. The text fed into the neural network is represented as word embeddings. We tried two approaches for the embeddings: using pretrained Word2Vec vectors and training the embeddings inside the neural network. Several network architectures with different hyperparameters were fine-tuned.

 

  1. Data Prep

1.1 Text standardization

The data prep of the text includes a few standardization techniques: special characters and punctuation are removed from the text, each word is lemmatized, and, as a final step, stop words are removed.
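The standardization steps above can be sketched with the Python standard library alone. The stop-word set and the lemma lookup below are tiny hypothetical stand-ins, since the write-up does not name the NLP library that was actually used:

```python
import re

# Hypothetical, deliberately tiny resources; a real pipeline would use a full
# stop-word list and a proper lemmatizer from an NLP library.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}
LEMMAS = {"running": "run", "cats": "cat", "were": "be"}


def standardize(text):
    """Lowercase, strip punctuation/special characters, lemmatize, drop stop words."""
    # Replace everything that is not a lowercase ASCII letter, digit or whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Lemmatize each word via the lookup table (unknown words are kept as-is).
    lemmas = [LEMMAS.get(word, word) for word in cleaned.split()]
    # Stop words are removed as the final step, as described above.
    return [word for word in lemmas if word not in STOP_WORDS]


print(standardize("The cats were running, in the garden!"))
# ['cat', 'be', 'run', 'garden']
```

The order of the steps follows the description; swapping lemmatization and stop-word removal could change the result for words whose lemma is itself a stop word.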

 

1.2 Random splits

The other part of the data prep is to split the data into two random samples. The first one is used for model training and the second one is used to check for model overfitting.

 

1.3 Tokenization

Each article is represented as tokens, which correspond to the different words in the article. The first 6000 tokens are used to represent each article. Tokenization transforms the raw text into sequences of indexes of the words found in the text. The number of tokens to use from an article is a hyper-parameter that was tuned to achieve optimal performance.

 

1.4 Padding

Since each article has a different number of words, i.e. tokens, we had to standardize their length, because the neural network requires inputs of equal length. The average number of tokens in the random split used for training the model is 348. Articles with more than 348 tokens are cut to their first 348 tokens, and for articles with fewer than 348 tokens the sequence is filled with the index 0.
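A minimal sketch of the tokenization and padding steps, in plain Python. Two points are assumptions rather than facts from the write-up: `max_words` is read here as a cap on the vocabulary size (one possible interpretation of the 6000 figure), and the zeros are appended at the end of short sequences, since the text does not say on which side the padding goes:

```python
from collections import Counter


def build_word_index(token_lists, max_words):
    """Map the max_words most frequent words to indexes 1..max_words (0 is kept for padding)."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(max_words))}


def to_padded_sequence(tokens, word_index, max_len):
    """Convert tokens to indexes, keep only the first max_len, and fill the rest with 0."""
    seq = [word_index[t] for t in tokens if t in word_index][:max_len]
    return seq + [0] * (max_len - len(seq))


docs = [["fake", "news", "spreads", "fast"], ["news", "travels"]]
word_index = build_word_index(docs, max_words=6000)
padded = [to_padded_sequence(doc, word_index, max_len=3) for doc in docs]
print(padded)  # every sequence now has exactly 3 entries
```

In the setting described above, `max_len` would be 348, the average article length in the training split.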

 

  2. Model

 

This is the final architecture of the neural network. The Keras library is used for the model creation.

The first layer is an Embedding layer. We tried a few techniques here: feeding in pre-trained word2vec models, as well as testing different values of embed_size (100, 200, 300). The current architecture, where the neural network learns the embeddings by itself rather than using pre-trained embeddings, shows better results.

The second layer is a Bidirectional LSTM layer. We experimented with several layers and with different numbers of neurons. Again, the current architecture is the best one found.

 

  3. Performance

The performance on the training random sample: F1-score: 0.93

The performance on the second random sample: F1-score: 0.81

The performance on development sample: F1-score: 0.816176
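For reference, the F1-score reported above is the harmonic mean of precision and recall. The sketch below assumes binary labels with 1 for the propaganda class; it is not the datathon's scoring script:

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 2 true positives, 1 false positive, 1 false negative: precision = recall = 2/3
print(round(f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]), 3))  # 0.667
```

A gap such as 0.93 on the training sample versus 0.81 on the second sample is the kind of difference the overfitting check is meant to surface.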

 

 

TASK 2 – technical documentation can be found in the Task2.zip file.

Task2

 

For Task 2 we assembled different ideas for identifying propaganda instances: neural networks (considered by some a ‘black-box’ model) and a feature-derivation approach. Finally, we combine them in an ensemble model where we search for the optimal F1 score.

What follows next is a discussion of the algorithms that are fed into the ensemble model.

 

Component 1 – average vector of w2v embeddings and XGBoost model

Summary

This approach is based on pretrained word2vec word embeddings. The text is represented as words/tokens and each token is replaced by its word2vec vector representation. For each sentence, all the vectors are combined into one average vector. These vectors are then fed into an XGBoost model.

 

1.2 Random splits

Here the splitting needs more care, because it would not be objective to simply split all the sentences in two: sentences from the same article could end up in both samples. It is more robust to split on article level, i.e. all the articles are split into two parts.
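A minimal sketch of an article-level split, assuming each sentence is stored as an `(article_id, text, label)` tuple. The tuple layout, the 50/50 default and the seed are illustrative choices, not details from the write-up:

```python
import random


def split_by_article(sentences, train_fraction=0.5, seed=42):
    """Split sentences so that all sentences of one article land in the same sample."""
    article_ids = sorted({s[0] for s in sentences})
    rng = random.Random(seed)
    rng.shuffle(article_ids)
    cut = int(len(article_ids) * train_fraction)
    train_ids = set(article_ids[:cut])
    train = [s for s in sentences if s[0] in train_ids]
    holdout = [s for s in sentences if s[0] not in train_ids]
    return train, holdout


data = [("a1", "s1", 0), ("a1", "s2", 1), ("a2", "s3", 0), ("a3", "s4", 1), ("a4", "s5", 0)]
train, holdout = split_by_article(data)
# No article contributes sentences to both samples.
print({s[0] for s in train} & {s[0] for s in holdout})  # set()
```

Shuffling the article identifiers rather than the sentences is what keeps the second sample free of sentences whose neighbours were seen during training.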

 

1.3 Average vector

As explained earlier, each token in an article is replaced by its vector representation. For each sentence all the vectors are combined to one average vector.
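The averaging step can be sketched as follows. The embedding lookup is a plain dictionary here, and two choices are assumptions: tokens missing from the embeddings are skipped, and a sentence with no known tokens gets a zero vector:

```python
def average_vector(tokens, embeddings, dim):
    """Average the embedding vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # assumption: all-zero vector when nothing is known
    # zip(*vectors) groups the i-th components of all vectors together.
    return [sum(component) / len(vectors) for component in zip(*vectors)]


embeddings = {"fake": [1.0, 0.0], "news": [0.0, 1.0]}  # toy 2-dimensional vectors
print(average_vector(["fake", "news", "unknownword"], embeddings, dim=2))  # [0.5, 0.5]
```

Each sentence is thus reduced to one fixed-length vector, which is what lets a tabular model such as XGBoost consume it.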

 

Model

An XGBoost model is used for the modelling part. The inputs are the average vector representations of the sentences. Random search is used for hyper-parameter optimization.
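The idea behind random search can be sketched in a few lines. The search space below is hypothetical, and `evaluate` stands in for training a model with the sampled parameters and returning its validation F1; a toy scoring function is used so the example runs on its own:

```python
import random


def random_search(param_space, evaluate, n_iter=20, seed=0):
    """Sample n_iter random parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; the values actually searched are not given in the write-up.
space = {"max_depth": [3, 6, 9], "learning_rate": [0.01, 0.1, 0.3]}


def toy_score(params):
    """Stand-in for 'train the model and return the validation F1'."""
    return -abs(params["max_depth"] - 6) - abs(params["learning_rate"] - 0.1)


print(random_search(space, toy_score, n_iter=30))
```

Unlike grid search, the number of evaluations is fixed in advance and does not grow with the number of hyper-parameters.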

 

  1. Performance

The performance on the training random sample: F1-score: 0.85

The performance on the second random sample: F1-score: 0.53

 

Component 2 – average vector of w2v embeddings and NN model

The same data prep and the same neural network architecture as in Task 1 were used. A few parameters were updated in line with the Task 2 data: the padding length was changed to suit sentence-level data, and the size of the vector representation in the embedding layer was also optimized.

 

 

Component 3 – approach with semantic feature engineering

Data Understanding

Data Preparation

The team generated a number of semantic features from the train set text corpus.

  1. Sentiment analysis

Two approaches were applied:

  • Use a lexicon that assigns an emotion value to each word (in the range [-5, 5]) and calculate the mean of these values over the words in the sentence
  • Use the pre-trained model from the textblob package in Python, which assigns a number in the range [-1, 1] to two semantic features: polarity and subjectivity
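The lexicon-based feature can be sketched as below. The tiny lexicon is made up for illustration, and counting words that are missing from the lexicon as 0 is an assumption; the write-up does not say how such words were handled:

```python
def lexicon_sentiment(tokens, lexicon):
    """Mean emotion value of the words in a sentence; unknown words count as 0."""
    if not tokens:
        return 0.0
    return sum(lexicon.get(token, 0) for token in tokens) / len(tokens)


# Hypothetical lexicon with values in the range [-5, 5].
lexicon = {"disaster": -4, "great": 3}
print(lexicon_sentiment(["a", "great", "disaster"], lexicon))  # (0 + 3 - 4) / 3
```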
  2. Entities tagging

The types and number of entities mentioned in a sentence are believed to be important for deciding whether the sentence is propagandistic or not. The Python model en_core_web_sm from spaCy was used and the following categories were tagged:

‘GPE’, ‘CARDINAL’, ‘ORG’, ‘NORP’, ‘PERSON’, ‘DATE’, ‘TIME’, ‘LOC’, ‘ORDINAL’, ‘EVENT’, ‘WORK_OF_ART’, ‘FAC’, ‘LAW’, ‘PRODUCT’, ‘MONEY’, ‘QUANTITY’, ‘PERCENT’, ‘LANGUAGE’

  3. Readability scores

The readability of a text/sentence can be used to predict whether it is propagandistic or not: particular levels of readability are expected to point to propagandistic sentences. The following readability scores were calculated using the textstat module in Python (https://pypi.org/project/textstat/):

flesch_reading_ease

flesch_kincaid_grade

gunning_fog

smog_index

automated_readability_index

coleman_liau_index

linsear_write_formula

dale_chall_readability_score

  4. Tokenization

For every sentence in the train set the following operations were performed:

  • Split into tokens
  • Removed stop words (using the standard stop word list from nltk)
  • Lemmatized the tokens, using the verb form to represent each token
  • Reduced each token to its stem (stemming)
  • Created bi-grams and tri-grams
  • Extracted phrases from the sentences (e.g. west_world_bad)
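The bi-gram and tri-gram step from the list above can be sketched as follows. Joining the words with an underscore mirrors the west_world_bad example; how the original code built its phrases is not shown in the write-up:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["west", "world", "bad"]
print(ngrams(tokens, 2))  # ['west_world', 'world_bad']
print(ngrams(tokens, 3))  # ['west_world_bad']
```

A sentence shorter than n tokens simply yields an empty list.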
  5. Vectorization
  • word2index vocabulary representation was used. The vocabulary contains 238k unique words and phrases
  • After tokenisation the tf-idf weight matrix was calculated
  • Latent Semantic Analysis (LSA) was performed to extract 600 vectors from the 238k tf-idf feature values
  6. Length of sentence

The length of the sentence (number of words after stop words were removed) was used as a feature.

Modeling

The team chose a kernel SVM classifier with parameters selected by grid search with 5-fold cross-validation.

The model was built on a train/test split with proportion 0.7/0.3.

On the train sample we used synthetic oversampling so that the proportion of propagandistic sentences is similar to that in the test sample. This is expected to boost the model performance.

 

 

Component 4 – Random Forest

Data Prep

The data prep step was conducted similarly to the XGBoost data prep in Component 1.

 

Model

A Random Forest model is used for the modelling part. The inputs are the average vector representations of the sentences. Random search is used for hyper-parameter optimization.

 

 

Component 5

The model implemented for task 3 was also used as Component 5 in task 2. For details please see the description below.

 

 

 

Task 3 – technical documentation can be found in Task3.zip file.

Task3

 

  1. High Level Idea

 

We use a Deep Learning approach to classify text fragments as propaganda. The general approach is as follows:

  1. Split each article into sentences
  2. Tokenize each word in the sentences
  3. Feed the tokens into a sequential model that uses a pretrained word embedding model and get a prediction for each token
  4. Consolidate the results on article level

 

The advantage of this approach is that there are many standard LSTM and CNN models for sentiment analysis of text.

The main challenges come from the data prep. In the pre-modeling prep, the text has to be converted to tokens while keeping their relative positions and outcomes. This is crucial for cases when the tokenizer drops words from the sentence. In the post-modeling prep, the results have to be converted back to text positions.
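One way to handle both conversions is to keep the character offsets of every token. The sketch below uses a simple regular-expression tokenizer as a stand-in for the tokenizer actually used, and merges consecutive flagged tokens back into character spans:

```python
import re


def tokenize_with_offsets(text):
    """Return (token, start, end) triples so positions survive tokenization."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def spans_from_predictions(tokens, flags):
    """Merge runs of consecutive flagged tokens into (start, end) character spans."""
    spans, current = [], None
    for (_, start, end), flag in zip(tokens, flags):
        if flag:
            # Extend the open span, or open a new one at this token.
            current = (current[0], end) if current else (start, end)
        elif current:
            spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans


text = "They are evil liars, period."
tokens = tokenize_with_offsets(text)
flags = [0, 0, 1, 1, 0]  # hypothetical per-token model output
for start, end in spans_from_predictions(tokens, flags):
    print(start, end, text[start:end])  # 9 19 evil liars
```

Because the offsets refer to the original string, words dropped by the tokenizer do not shift the positions of the remaining tokens.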

 

  2. Considered scenarios

 

We considered two scenarios for addressing the approach described above.

 

  3. End-To-End modelling

 

There are 18 kinds of propaganda, as follows:

  • Appeal_to_Authority
  • Appeal_to_fear-prejudice
  • Bandwagon
  • Black-and-White_Fallacy
  • Causal_Oversimplification
  • Doubt
  • Exaggeration,Minimisation
  • Flag-Waving
  • Loaded_Language
  • Name_Calling,Labeling
  • Obfuscation,Intentional_Vagueness,Confusion
  • Red_Herring
  • Reductio_ad_hitlerum
  • Repetition
  • Slogans
  • Straw_Men
  • Thought-terminating_Cliches
  • Whataboutism

 

One idea is to use multi-class classification for each token. There are 19 categories: the 18 propaganda types and non-propaganda. Unfortunately, the propaganda fragments overlap, which means that some tokens can belong to several categories simultaneously.
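The overlap problem is easy to see once the annotations are turned into per-token labels. In the sketch below, the token offsets and annotated spans are invented for illustration; each token gets one binary flag per technique:

```python
def token_labels(token_offsets, annotations, techniques):
    """Build one 0/1 flag per technique for every token.

    token_offsets: list of (start, end) character offsets
    annotations:   list of (technique, start, end) annotated fragments
    """
    labels = []
    for tok_start, tok_end in token_offsets:
        row = [0] * len(techniques)
        for technique, span_start, span_end in annotations:
            if tok_start < span_end and span_start < tok_end:  # the ranges overlap
                row[techniques.index(technique)] = 1
        labels.append(row)
    return labels


techniques = ["Loaded_Language", "Name_Calling,Labeling"]
token_offsets = [(0, 4), (5, 8), (9, 13), (14, 19)]
annotations = [("Loaded_Language", 9, 19), ("Name_Calling,Labeling", 14, 19)]
print(token_labels(token_offsets, annotations, techniques))
# [[0, 0], [0, 0], [1, 0], [1, 1]]
```

The last token carries two positive flags at once, which a single 19-way class label per token cannot express.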

 

Therefore, we decided to run a separate classification for each kind of propaganda. Unfortunately, as can be seen from the table below, not all of the propaganda types are well populated.

Propaganda Type                              Count
Loaded_Language                              1627
Name_Calling,Labeling                        839
Repetition                                   427
Doubt                                        359
Exaggeration,Minimisation                    350
Flag-Waving                                  182
Appeal_to_fear-prejudice                     162
Causal_Oversimplification                    133
Slogans                                      110
Black-and-White_Fallacy                      83
Appeal_to_Authority                          81
Thought-terminating_Cliches                  57
Whataboutism                                 52
Reductio_ad_hitlerum                         35
Red_Herring                                  18
Obfuscation,Intentional_Vagueness,Confusion  9
Straw_Men                                    8
Bandwagon                                    7

 

In practice, we obtained good models only for the top 2 propaganda types.

 

  4. Two Stage modelling

 

In order to cover all kinds of propaganda, we split the modelling task into 2 phases.

Phase 1. Detect a propagandistic phrase.

Phase 2. Classify a propagandistic phrase.

 

4.1 Word embedding

 

We used a GloVe model pretrained on a Wikipedia corpus, with 400k words in it. We tested different vector sizes (50, 100, 200 and 300). Based on the tests we selected word embeddings of size 200.
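A minimal sketch of turning a GloVe text file into an embedding matrix. GloVe files store one word per line followed by its vector components, separated by spaces. The in-memory sample, the toy 2-dimensional vectors and the zero row for unknown words are illustrative assumptions:

```python
import io


def load_glove(handle):
    """Parse GloVe text format: 'word v1 v2 ... vN' on each line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 is kept for padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]  # words without a vector stay all-zero
    return matrix


sample = io.StringIO("the 0.1 0.2\nnews 0.3 0.4\n")  # stands in for the real file
vectors = load_glove(sample)
print(build_embedding_matrix({"news": 1, "hoax": 2}, vectors, dim=2))
# [[0.0, 0.0], [0.3, 0.4], [0.0, 0.0]]
```

With the real data, `dim` would be 200, and the matrix would typically be converted to a NumPy array before being passed to an Embedding layer.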

 

 

4.2 Final model used

 

On the training side we considered two approaches: Bidirectional LSTM models and 1D CNN models. We selected CNN models because LSTM models are much slower on the hardware we used, so selecting the appropriate network architecture and hyper-parameters took much more time.

 

4.2.1 Propaganda identification

 

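from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, TimeDistributed, Dense
from keras.optimizers import Adam

# EMBEDDING_DIM = 200 and LATENT_DIM = 32, matching the model summary;
# num_words and embedding_matrix come from the tokenizer and the GloVe lookup.
model = Sequential()
# Frozen pretrained GloVe embeddings
model.add(Embedding(num_words, EMBEDDING_DIM, weights=[embedding_matrix], trainable=False))
# Three convolution blocks; padding="same" with strides=1 keeps one output per token
model.add(Conv1D(filters=LATENT_DIM, kernel_size=5, padding="same"))
model.add(MaxPooling1D(pool_size=3, strides=1, padding="same"))
model.add(Conv1D(filters=LATENT_DIM, kernel_size=4, padding="same"))
model.add(MaxPooling1D(pool_size=4, strides=1, padding="same"))
model.add(Conv1D(filters=LATENT_DIM, kernel_size=3, padding="same"))
model.add(MaxPooling1D(pool_size=5, strides=1, padding="same"))
# Per-token propaganda probability
model.add(TimeDistributed(Dense(20, activation="relu")))
model.add(Dense(1, activation="sigmoid"))
model.compile(
    loss="binary_crossentropy",
    optimizer=Adam(lr=0.01),  # the argument is named learning_rate in newer Keras releases
    metrics=["accuracy"],
)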

 

Layer (type)                 Output Shape              Param #

=================================================================

embedding_12 (Embedding)     (None, None, 200)         3786200

_________________________________________________________________

conv1d_19 (Conv1D)           (None, None, 32)          32032

_________________________________________________________________

max_pooling1d_15 (MaxPooling (None, None, 32)          0

_________________________________________________________________

conv1d_20 (Conv1D)           (None, None, 32)          4128

_________________________________________________________________

max_pooling1d_16 (MaxPooling (None, None, 32)          0

_________________________________________________________________

conv1d_21 (Conv1D)           (None, None, 32)          3104

_________________________________________________________________

max_pooling1d_17 (MaxPooling (None, None, 32)          0

_________________________________________________________________

time_distributed_10 (TimeDis (None, None, 20)          660

_________________________________________________________________

dense_20 (Dense)             (None, None, 1)           21

=================================================================

Total params: 3,826,145

Trainable params: 39,945

Non-trainable params: 3,786,200

 

F1 score on the train-dev set is 0.25

 

 

4.2.2 Propaganda classification

 

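from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPool1D, Dropout, Dense
from keras.optimizers import Adam

# EMBEDDING_DIM = 200 and LATENT_DIM = 32, matching the model summary;
# num_words and embedding_matrix come from the tokenizer and the GloVe lookup.
model = Sequential()
# Frozen pretrained GloVe embeddings
model.add(Embedding(num_words, EMBEDDING_DIM, weights=[embedding_matrix], trainable=False))
model.add(Conv1D(filters=LATENT_DIM, kernel_size=5, padding="same"))
model.add(MaxPooling1D(pool_size=3, strides=1, padding="same"))
model.add(Conv1D(filters=LATENT_DIM, kernel_size=4, padding="same"))
model.add(MaxPooling1D(pool_size=4, strides=1, padding="same"))
model.add(Conv1D(filters=LATENT_DIM, kernel_size=3, padding="same"))
# Collapse the token dimension: one feature vector per propagandistic phrase
model.add(GlobalMaxPool1D())
model.add(Dropout(0.2))
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.2))
# One output per propaganda technique
model.add(Dense(18, activation="softmax"))
model.compile(
    # binary_crossentropy is kept as in the original run;
    # categorical_crossentropy is the usual pairing with a softmax output
    loss="binary_crossentropy",
    optimizer=Adam(lr=0.01),  # the argument is named learning_rate in newer Keras releases
    metrics=["accuracy"],
)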

 

Layer (type)                 Output Shape              Param #

=================================================================

embedding_15 (Embedding)     (None, None, 200)         1359600

_________________________________________________________________

conv1d_7 (Conv1D)            (None, None, 32)          32032

_________________________________________________________________

max_pooling1d_3 (MaxPooling1 (None, None, 32)          0

_________________________________________________________________

conv1d_8 (Conv1D)            (None, None, 32)          4128

_________________________________________________________________

max_pooling1d_4 (MaxPooling1 (None, None, 32)          0

_________________________________________________________________

conv1d_9 (Conv1D)            (None, None, 32)          3104

_________________________________________________________________

global_max_pooling1d_10 (Glo (None, 32)                0

_________________________________________________________________

dropout_19 (Dropout)         (None, 32)                0

_________________________________________________________________

dense_19 (Dense)             (None, 128)               4224

_________________________________________________________________

dropout_20 (Dropout)         (None, 128)               0

_________________________________________________________________

dense_20 (Dense)             (None, 18)                2322

=================================================================

Total params: 1,405,410

Trainable params: 45,810

Non-trainable params: 1,359,600

 

F1 score on the train-dev set is 0.35

 

  5. Options for further research

 

During the brainstorming sessions, there were 2 additional ideas which we did not have time to work on.

  1. Base Task 3 on the results of Task 2. As a result, all non-propagandistic sentences could be filtered out and Task 3 would focus just on finding the propagandistic phrase in a sentence that is propaganda.
  2. Replicate the YOLO object detection model for this task.


10 thoughts on “Datathon – HackNews – Solution – FlipFlops”

  1. 1
    votes

    Very thorough analysis, I like that you looked into all three tasks. The approaches are very reasonable, and I wonder if you have any explanation why your Task1 and Task2 models were not among the top scoring.

    1. 0
      votes

      Hi Laura,

      Thank you for your comment. It’s great to see that you have found time to review it.

      Unfortunately regarding Task 2 we did not manage to apply ‘Component 3’ part due to data preparation issue on the test set. In the paper we described all the models we had built and trained, but unfortunately we didn’t apply all of them on the test sample. To our knowledge, this could lead to the poor performance on task 2.

      Thank you for participating in the event as a mentor and expert 🙂

      For us as a team it will be very beneficial and highly appreciated if you share your thoughts for improvement and what you would do differently.

      Thanks,
      Ognyan

    2. 0
      votes

      Hi! Thanks for the kind words!
      I believe that the reason our model under-performed on Task 2 is that we ran out of time and couldn’t make the standard checks for over-fitting. By the time we had the final ensemble, the DEV set was already offline, so we couldn’t really see its performance on an out-of-sample data. This meant that we either had to go with the best single component (evaluated on DEV) or take a leap of faith and submit the ensemble on TEST.
      We decided that whether we win or lose, we’ll do it as a team, so we went with the combined model. Sadly, the ship sank with everyone on board 🙂

  2. 0
    votes

    Hi Laura,

    Thank you for the good words. I am happy to understand that our work is appreciated.

    For Task 1, I am thinking about the following things, that could lead to better performing model.
    We didn’t play enough with the hyperparameters and the architecture of the neural network.
    For example, the text is padded to the mean number of tokens in article, which means that for half of the articles, some information is dropped.
    If the padding parameter is adjusted, we can feed more information to the neural network.
    On the other hand, the LSTMs are not working very well with long sequences, so experimenting with different layers here could be beneficial.
    For example, an attention mechanism added to the LSTM layer could be tested here.

    Using different types of text representations, models and feature engineering, should explain and catch different connections in the texts.
    Similar to the approach we have in Task 2, creating a stacked ensemble should give a better performing final model.

  3. 1
    votes

    Really great article, great analysis, and great in modeling the task in a way that makes a lot of sense. The features tried should help further research on the problem.

    1. 0
      votes

      Thank you very much for your comment. It’s great to understand that our work could help further the research.
      Regarding the features, we were influenced by your presentation in front of the DSS community half a year ago. So far, in every datathon we participate in, we try to keep a good balance between performance and interpretability of the models.

  4. 0
    votes

    Hi guys. Good work and nice article. I have some questions for you, all regarding task 3:

    1. You mention that, due to overlapping, you opted for running multiple binary classifiers. Did you consider to try multi-task learning? Any idea what would have been the outcome?
    2. You say that you only obtained good models for 2 techniques. May I ask for which ones?
    3. You report F-measures of 0.25 and 0.35. May I assume this is for the singleton tasks of spotting a propagandistic event and then classifying it with one of the techniques? Otherwise, any justification for the huge drop wrt the test set?

    1. 0
      votes

      Hi Alberto,

      Thank you for your questions. Our answers are as follows:
      1. We did not consider it. Not sure why; maybe all of the chaos and stress of organizing the tasks and quickly starting to produce output blinded us to this option.
      2. They are the 2 most populated – Loaded_Language and Name_Calling,Labeling
      3. You are correct, they are on the singleton tasks. When the models are applied one after the other, the performance drops significantly (pure multiplication). Our test set score was close to the dev and train-dev set (just a 0.005 points drop). Something we have tried is to combine the 2 approaches. We used the individual models for the top 2 propaganda types and combined them with the joint model of the other 16 techniques. Although we got better F1 scores for each propaganda type, the overall score was lower.
