Using of n-grams from morphological tags for fake news classification

Authors
Kapusta J. [1]
Drlik M. [1]
Munk M. [1, 2]
Affiliations
[1] Department of Informatics, Constantine the Philosopher University in Nitra, Nitra
[2] Science and Research Centre, University of Pardubice, Pardubice
Keywords
Computational Linguistics; Data Mining and Machine Learning; Fake news identification; Morphological analysis; Natural Language and Speech; Natural language processing; POS tagging; Text mining
DOI
10.7717/PEERJ-CS.624
Abstract
Research into techniques for effective fake news detection has become both necessary and attractive. These techniques draw on many research disciplines, including morphological analysis. Several researchers have stated that simple content-related n-grams and POS tagging have proven insufficient for fake news classification. However, over the last decade they have not published empirical results that would confirm these statements experimentally. Given this contradiction, the main aim of the paper is to evaluate experimentally the potential of the combined use of n-grams and POS tags for the correct classification of fake and true news. A dataset of published fake and real news about the Covid-19 pandemic was pre-processed using morphological analysis. As a result, n-grams of POS tags were prepared and further analysed. Three techniques based on POS tags were proposed and applied to different groups of n-grams in the pre-processing phase of fake news detection. First, the n-gram size was examined. Subsequently, the most suitable depth of the decision trees for sufficient generalization was determined. Finally, the performance measures of models based on the proposed techniques were compared with the standard reference TF-IDF technique. Model performance was assessed by accuracy, precision, recall and f1-score, using 10-fold cross-validation. At the same time, the question of whether the TF-IDF technique can be improved using POS tags was examined in detail. The results showed that the newly proposed techniques are comparable with the traditional TF-IDF technique. Moreover, morphological analysis can improve the baseline TF-IDF technique: precision for fake news and recall for real news were improved with statistical significance. © 2021 Kapusta et al. All Rights Reserved.
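The abstract does not give the exact feature pipeline, so the following is only a minimal sketch of the general idea it describes: turning a sequence of POS tags into n-grams and weighting them with TF-IDF. The tag names (Penn Treebank style), the sample sequences, the function names `pos_ngrams` and `tfidf`, and the unsmoothed `log(N / df)` form of the inverse document frequency are all assumptions made for illustration, not details taken from the paper. The POS tagger itself and the decision tree classifier are left out.

```python
import math
from collections import Counter


def pos_ngrams(tags, n):
    """Return all contiguous n-grams (as tuples) over a list of POS tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def tfidf(docs_ngrams):
    """Weight each n-gram per document as tf * idf.

    tf  = count of the n-gram in the document / number of n-grams in it
    idf = log(N / df), unsmoothed (an assumption; libraries often smooth it)
    """
    n_docs = len(docs_ngrams)
    # Document frequency: in how many documents does each n-gram occur?
    df = Counter()
    for grams in docs_ngrams:
        df.update(set(grams))

    weights = []
    for grams in docs_ngrams:
        counts = Counter(grams)
        total = len(grams)
        weights.append({
            gram: (count / total) * math.log(n_docs / df[gram])
            for gram, count in counts.items()
        })
    return weights


# Hypothetical, hand-written tag sequences standing in for tagged news items.
tagged_docs = [
    ["DT", "NN", "VBZ", "JJ"],
    ["DT", "NN", "VBD", "DT", "NN"],
    ["PRP", "VBZ", "JJ"],
]

bigrams = [pos_ngrams(tags, 2) for tags in tagged_docs]
weights = tfidf(bigrams)
```

Each dictionary in `weights` could then be converted into a feature vector for a classifier. Changing the second argument of `pos_ngrams` is one way to experiment with the n-gram size, which the abstract names as the first parameter examined.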
Pages: 1 / 27
Page count: 26