Dear Ramybaly and Preslav,
Thank you for your remarks and questions. Regarding your questions:
1. Yes, indeed I used the 3b fine-tuning schema. This is the one used for single-sentence classification tasks (any chunk of text, in this case).
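For readers unfamiliar with that schema: in the 3b setup, the final hidden vector of the [CLS] token is fed into a single softmax classification layer. A minimal NumPy sketch of just that head (the weights and the random [CLS] vector here are illustrative stand-ins, not our actual model):

```python
import numpy as np

def classify_cls(cls_hidden, W, b):
    """Single-sentence classification head (BERT fine-tuning schema 3b):
    project the [CLS] hidden vector to label logits, then softmax."""
    logits = cls_hidden @ W + b          # shape: (num_labels,)
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

hidden_size, num_labels = 768, 2         # BERT-base hidden size, binary task
rng = np.random.default_rng(0)
cls_hidden = rng.standard_normal(hidden_size)            # stand-in for BERT's [CLS] output
W = rng.standard_normal((hidden_size, num_labels)) * 0.02  # classifier weights
b = np.zeros(num_labels)
probs = classify_cls(cls_hidden, W, b)
print(probs)  # a probability distribution over the two labels
```

During fine-tuning, this classification layer is trained jointly with all of BERT's parameters, not on its own.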
2. Since this is not a NER task (or any other task that needs case information), we assume a cased model would not contribute to the outcome; in fact, it might even harm it.
3. As the authors of BERT showed in their experiments, BERT-large indeed achieves higher scores than BERT-base. However, the gain is not significant in our opinion, and at the moment it is impossible for us to use BERT-large without a TPU (or a large number of GPUs), since the model is huge. We also needed to consider the time restriction.
4. The pretrained BERT model is trained only on the Masked LM and Next Sentence Prediction tasks, which are unsupervised. Therefore, out of the box, it is not suitable for classification, or for any other downstream task for that matter.
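To illustrate why: the pretraining objectives give BERT no notion of task labels. Masked LM, for instance, just hides a fraction of input tokens and asks the model to recover them. A simplified sketch of that input corruption (real BERT masks 15% of tokens and additionally sometimes substitutes a random word or keeps the original; those refinements are omitted here):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Simplified Masked LM corruption: hide a fraction of tokens and
    record the originals the model would have to predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # model must recover the original token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

tokens = "the protest was held in the city centre yesterday".split()
masked, targets = mask_tokens(tokens)
print(masked, targets)
```

Nothing in this objective produces a document-level label, which is why a fine-tuning step with labeled data is required before BERT can classify anything.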
5. We have also been using BERT for other tasks: for our submissions to Hyperpartisan News Detection, a SemEval 2019 shared task, and in preliminary experiments for our own CLEF 2019 lab https://emw.ku.edu.tr/clef-protestnews-2019/ . The document type in both of those tasks is news articles, as in this datathon. In those earlier experiments we tried 128, 256 and 512 as the maximum sequence length and found that 256 gives us the best results, hence we use it here. This may be because using the lead sentences of a news article has been found more effective for some NLP tasks [1,2]. From our experiments, we can argue that for news articles the first 128 tokens do not carry enough information, while the first 512 tokens include too much irrelevant information.
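Applying the length limit itself is straightforward: after WordPiece tokenization, the input is truncated so that it plus the [CLS] and [SEP] special tokens fits in the chosen maximum. A rough sketch of the single-sentence case (the real logic lives in run_classifier.py in Google's BERT repository):

```python
def build_input(tokens, max_seq_length=256):
    """Truncate a single-sentence input so that, together with the
    [CLS] and [SEP] special tokens, it fits in max_seq_length."""
    tokens = tokens[: max_seq_length - 2]   # reserve 2 slots for specials
    return ["[CLS]"] + tokens + ["[SEP]"]

article = ["tok%d" % i for i in range(600)]   # a long news article
seq = build_input(article, max_seq_length=256)
print(len(seq))  # 256
```

So a 600-token article is reduced to its first 254 WordPiece tokens, which is exactly the "lead text" effect discussed above.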
Hope my answers were satisfactory. Please let me know if you have any further questions.
[1] Brandow, R., Mitze, K., & Rau, L. F. (1995). Automatic condensation of electronic publications by sentence selection. Information Processing & Management, 31(5), 675-685.
[2] Wasson, M. (1998). Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2 (pp. 1364-1368). Association for Computational Linguistics.
For your first question, you can refer to my answer to the 5th question in Preslav's comment.