Dina Zaychik, dzay, [email protected]
Sergey Sedov, Sianur, [email protected]
Task 1.
The hypothesis is that propaganda vs. non-propaganda at the article level can be detected using distributional-semantics features.
To that end, we performed thorough preprocessing: removing URLs, hashtags, unusual symbols, unusual article beginnings, non-English first paragraphs (using the open-source langid package), and short texts.
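A minimal sketch of the language/length filter follows; first_paragraph_is_english and the length threshold are illustrative names and values rather than the exact ones from our pipeline (langid.classify returns a (language_code, score) pair):

import langid

def first_paragraph_is_english(text, min_length=200):
    # drop very short texts (the threshold is an assumed example value)
    if len(text) < min_length:
        return False
    # classify only the first paragraph, where non-English content appeared
    first_paragraph = text.split('\n')[0]
    lang, _score = langid.classify(first_paragraph)
    return lang == 'en'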
After that we trained a supervised fastText model (the best one used 150 epochs, model #8).
The hyperparameters we tried are listed in the following excerpt from our code (with our internal 5-fold cross-validation results).
from fasttext import train_supervised  # fastText Python bindings

params = [
    dict(input=train_path, epoch=40, thread=10, lr=0.1, ws=5, loss='hs', minCount=5, dim=80),
    # 0 Precision=0.854086 Recall=0.795290 F1=0.823640
    dict(input=train_path, epoch=40, thread=10, lr=0.1, ws=5, loss='hs', minCount=5, dim=40),
    dict(input=train_path, epoch=40, thread=10, lr=0.1, ws=5, loss='softmax', minCount=5, dim=60),
    # 2 Precision=0.840996 Recall=0.795290 F1=0.817505
    dict(input=train_path, epoch=40, thread=10, lr=0.1, ws=5, loss='hs', minCount=5, dim=60),
    dict(input=train_path, epoch=60, thread=10, lr=0.1, ws=5, loss='hs', minCount=5, dim=80),
    # 4 Precision=0.821494 Recall=0.817029 F1=0.819255
    dict(input=train_path, epoch=60, thread=10, lr=0.05, ws=5, loss='hs', minCount=5, dim=80),
    # 5 Precision=0.861386 Recall=0.788043 F1=0.823084
    dict(input=train_path, epoch=100, thread=10, lr=0.05, ws=5, loss='hs', minCount=5, dim=80),
    # 6 Precision=0.845283 Recall=0.811594 F1=0.828096
    dict(input=train_path, epoch=100, thread=10, lr=0.05, ws=10, loss='hs', minCount=5, dim=80),
    # 7 Precision=0.845149 Recall=0.820652 F1=0.832721
    dict(input=train_path, epoch=150, thread=10, lr=0.05, ws=10, loss='hs', minCount=5, dim=80),
    # 8 Precision=0.849624 Recall=0.818841 F1=0.833948
    dict(input=train_path, epoch=150, thread=10, lr=0.05, ws=20, loss='hs', minCount=5, dim=80),
    # 9 Precision=0.856322 Recall=0.809783 F1=0.832402
    dict(input=train_path, epoch=150, thread=10, lr=0.05, ws=20, loss='hs', minCount=5, dim=120),
    # 10 Precision=0.839779 Recall=0.826087 F1=0.832877
    dict(input=train_path, epoch=300, thread=10, lr=0.05, ws=20, loss='hs', minCount=5, dim=120),
    # 11 Precision=0.846442 Recall=0.818841 F1=0.832413
    dict(input=train_path, epoch=300, thread=10, lr=0.05, ws=15, loss='hs', minCount=5, dim=120),
    # 12 Precision=0.845438 Recall=0.822464 F1=0.833792
    dict(input=train_path, epoch=300, thread=10, lr=0.05, ws=15, loss='hs', minCount=5, dim=80),
    # 13 Precision=0.844156 Recall=0.824275 F1=0.834097
    dict(input=train_path, epoch=300, thread=12, lr=0.05, ws=15, loss='hs', minCount=5, dim=60),
]
model_num = len(params) - 1
print(model_num)
# fastText supervised training
model = train_supervised(**params[model_num])
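For reference, applying the trained classifier to a new article might look like the following; article_text is a hypothetical variable holding the preprocessed article, and the exact label string depends on how the fastText training file was written (fastText expects lines prefixed with __label__<class>):

# predict the label of one article (fastText requires newline-free input strings)
labels, probs = model.predict(article_text.replace('\n', ' '))
print(labels[0], probs[0])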
Task 2.
The hypothesis is that this is a fine-grained task, so hand-crafted, manually selected features could be useful here.
Looking at the texts manually, we did not notice non-English content, so we did not apply the same preprocessing as in Task 1.
We used five groups of features:
1) POS tag features (using most of the English-specific POS tags from the spaCy package)
Example of code for extracting them:
english_tags_list = ['FW', 'JJ', 'JJR', 'JJS', 'MD', 'PDT', 'PRP', 'PRP$', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP',
                     'VBZ', 'WDT', 'WP', 'WP$', 'PUNCT', 'CD', 'NN', 'NNP', 'NNPS', 'NNS', 'POS', 'RB', 'RP', 'UH', 'WRB',
                     'DT']

# Get POS tags features for the train set
train_dict = []
for i, row in df.iterrows():
    annotated_text = spacyparser(row['text'])
    tokens_number = len([word.text for word in annotated_text])
    pos_tags_dict = {}
    english_pos_tags = [word.tag_ for word in annotated_text]
    if tokens_number == 0:
        for english_tag in english_tags_list:
            pos_tags_dict[english_tag] = 0
    else:
        for english_tag in english_tags_list:
            pos_tags_dict[english_tag] = english_pos_tags.count(english_tag) / tokens_number
    train_dict.append(pos_tags_dict)
2) readability index features (popular readability indices, such as the Automated Readability Index); a minimal sketch of groups 2-4 appears after this list
3) named entity features ('PERSON', 'NORP', 'ORG' in spaCy annotations)
4) simple text features, such as text length or the frequency of punctuation marks per text
5) features from publicly available lexicons of affective words, biased words, etc.: frequencies of words from such lexicons.
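The sketch of feature groups 2-4 below is illustrative; it assumes spacyparser is the loaded spaCy model from the snippet above and computes the Automated Readability Index directly from token counts (our full feature set was richer):

import string

def extra_features(text):
    doc = spacyparser(text)
    tokens = [t for t in doc if not t.is_space]
    n_tokens = max(len(tokens), 1)
    n_sents = max(len(list(doc.sents)), 1)
    n_chars = sum(len(t.text) for t in tokens if t.is_alpha or t.is_digit)
    features = {}
    # 2) readability: Automated Readability Index
    features['ari'] = 4.71 * n_chars / n_tokens + 0.5 * n_tokens / n_sents - 21.43
    # 3) named entity frequencies
    for label in ['PERSON', 'NORP', 'ORG']:
        features['ner_' + label] = sum(ent.label_ == label for ent in doc.ents) / n_tokens
    # 4) simple text features: length and punctuation frequency
    features['text_length'] = len(text)
    features['punct_freq'] = sum(t.text in string.punctuation for t in tokens) / n_tokens
    return features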
After inspecting feature importances in the model, we performed feature selection and kept only the most important features.
In the end we had 57 features.
As the features are on very different scales (for example, the Flesch-Kincaid readability index can be 9.65 while the frequency of biased words can be 0.03), we applied scaling.
We tried different hyperparameters for a range of baseline sklearn models (SVM with linear/RBF kernels, Logistic Regression with different C values, Multinomial Naive Bayes).
Finally, we chose a Logistic Regression model (C=0.1, class_weight='balanced', solver='lbfgs', penalty='l2').
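A hedged sketch of the final setup (assuming a standard scaler for the scaling step; X_train and y_train stand for the 57-dimensional feature matrix and the labels, which are built elsewhere in the notebook):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# scale the heterogeneous features, then fit the chosen logistic regression
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.1, class_weight='balanced', solver='lbfgs', penalty='l2'),
)
clf.fit(X_train, y_train)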
The most important features for the model are the frequencies of the spaCy POS tags MD, NN, NNS, JJS, PRP$ and VBP; the frequency of organization names and of unique organization names in a sentence; and the number of factive phrases in a sentence.
We can send the clean notebook with all features later, if needed.
The work was inspired by the following research:
https://link.springer.com/chapter/10.1007%2F978-3-319-44748-3_17 (possible features)
https://aclanthology.info/papers/R17-1045/r17-1045 (possible features)
http://aclweb.org/anthology/P18-1022 (possible features)
https://aclanthology.info/papers/C18-1287/c18-1287 (possible features)
http://resources.mpi-inf.mpg.de/impact/subho-thesis/credibility-analysis.pdf (lexicons)
https://aclanthology.info/papers/D18-1389/d18-1389 (possible features)
https://aclanthology.info/papers/D17-1317/d17-1317 (lexicons)
Task 3.
As the task looks similar to sequence-labelling tasks such as named entity recognition or aspect-based sentiment detection, we used the classic approach:
the conditional random fields (CRF) algorithm (sklearn-crfsuite implementation).
Much attention was paid to preprocessing: converting non-ASCII characters to ASCII, handling empty lines, and handling cases where a span, defined by character offsets, ended after the end of the sentence.
All gold-label examples from the train set were transformed into the basic IOB annotation scheme, as sketched below.
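A minimal sketch of this conversion (assuming gold spans come as (start_char, end_char) pairs per text and that spacyparser is the loaded spaCy model; the exact input file format is omitted):

def to_iob(text, spans):
    # convert character-offset spans into token-level IOB tags
    doc = spacyparser(text)
    tokens, tags = [], []
    for token in doc:
        # clip spans that run past the end of the text
        inside = any(start <= token.idx < min(end, len(text)) for start, end in spans)
        if not inside:
            tags.append('O')
        elif not tags or tags[-1] == 'O':
            tags.append('B')  # first token of a propaganda span
        else:
            tags.append('I')  # continuation of the span
        tokens.append(token.text)
    return tokens, tags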
For each token, the features were: lemma, word form, POS tag (based on spaCy annotations, English-specific POS tags), and whether the first letter is uppercase.
We also added the same features for the previous token and for the next token.
The same transformation was applied to the development set and to the test set to obtain the model's predictions.
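A hedged sketch of the token features and the CRF itself (sklearn-crfsuite); the hyperparameter values and the train_docs / y_train names are illustrative rather than taken from our notebook:

import sklearn_crfsuite

def base_features(token):
    return {
        'lemma': token.lemma_,
        'word': token.text,
        'pos': token.tag_,
        'starts_with_uppercase': token.text[:1].isupper(),
    }

def token_features(doc, i):
    feats = base_features(doc[i])
    if i > 0:              # same features for the previous token
        feats.update({'prev_' + k: v for k, v in base_features(doc[i - 1]).items()})
    if i < len(doc) - 1:   # ... and for the next token
        feats.update({'next_' + k: v for k, v in base_features(doc[i + 1]).items()})
    return feats

# X: one list of feature dicts per sentence; y: the IOB tag sequences from above
X_train = [[token_features(doc, i) for i in range(len(doc))] for doc in train_docs]
crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)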
Unfortunately, we did not have time to complete this model and to look at the results.