The Multiclass Classification of Newspaper Articles with Machine Learning: The Hybrid Binary Snowball Approach

Cited by: 16
Authors
Sebok, Miklos [1]
Kacsuk, Zoltan [1, 2]
Affiliations
[1] Hungarian Acad Sci, Ctr Social Sci, Budapest, Hungary
[2] Hsch Medien, Stuttgart, Germany
Keywords
machine learning; statistical analysis of texts; Comparative Agendas Project; multiclass classification; automated content analysis
DOI
10.1017/pan.2020.27
Chinese Library Classification number
D0 [Political science, political theory]
Discipline classification code
0302; 030201
Abstract
In this article, we present a machine learning-based solution for matching the performance of the gold standard of double-blind human coding when it comes to content analysis in comparative politics. We combine a quantitative text analysis approach with supervised learning and limited human resources in order to classify the front-page articles of a leading Hungarian daily newspaper based on their full text. Our goal was to assign items in our dataset to one of 21 policy topics based on the codebook of the Comparative Agendas Project. The classification of the imbalanced classes of topics was handled by a hybrid binary snowball workflow. This relies on limited human resources as well as supervised learning; it simplifies the multiclass problem to one of binary choice; and it is based on a snowball approach as we augment the training set with machine-classified observations after each successful round and also between corpora. Our results show that our approach provided better precision results (of over 80% for most topic codes) than what is customary for human coders and most computer-assisted coding projects. Nevertheless, this high precision came at the expense of a relatively low, below 60%, share of labeled articles.
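The binary snowball idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid scorer, the `threshold` value, and the function names are assumptions made for illustration, and the human validation step of the hybrid workflow is left out. The sketch treats one topic as the positive class, accepts only unlabeled items that clear a confidence margin, and adds the accepted items to the training set before the next round.

```python
from collections import Counter
import math


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(texts):
    """Summed term counts of a set of texts (a stand-in for a trained model)."""
    total = Counter()
    for text in texts:
        total.update(vectorize(text))
    return total


def binary_snowball(labeled, unlabeled, topic, threshold=0.2, max_rounds=5):
    """One-topic-vs-rest snowball labeling (illustrative assumption, not the paper's code).

    labeled:   list of (text, topic_code) pairs coded by humans
    unlabeled: list of texts without a code
    Returns (texts accepted for `topic`, texts left unlabeled).
    """
    positives = [text for text, code in labeled if code == topic]
    negatives = [text for text, code in labeled if code != topic]
    pool = list(unlabeled)
    accepted = []
    for _ in range(max_rounds):
        pos_c, neg_c = centroid(positives), centroid(negatives)
        # Binary choice: accept only items that are clearly closer to the topic.
        newly = [
            text for text in pool
            if cosine(vectorize(text), pos_c) - cosine(vectorize(text), neg_c) >= threshold
        ]
        if not newly:
            break  # the snowball has stopped growing
        positives.extend(newly)  # augment the training set with machine-labeled items
        accepted.extend(newly)
        pool = [text for text in pool if text not in newly]
    return accepted, pool
```

Items that never clear the threshold stay unlabeled, which mirrors the trade-off the abstract reports: high precision at the cost of a lower share of labeled articles. Running the function once per topic code would cover the multiclass case under the same assumptions.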
Pages: 236-249
Page count: 14
Related papers
50 in total
  • [31] A hybrid machine learning approach for imbalanced irrigation water quality classification
    Mustapha, Musa
    Zineddine, Mhamed
    Kaufman, Eran
    Friedman, Liron
    Gmira, Maha
    Majikumna, Kaloma Usman
    Alaoui, Ahmed El Hilali
    DESALINATION AND WATER TREATMENT, 2025, 321
  • [32] Machine learning approach for the classification of corn seed using hybrid features
    Ali, Aqib
    Qadri, Salman
    Mashwani, Wali Khan
    Belhaouari, Samir Brahim
    Naeem, Samreen
    Rafique, Sidra
    Jamal, Farrukh
    Chesneau, Christophe
    Anam, Sania
    INTERNATIONAL JOURNAL OF FOOD PROPERTIES, 2020, 23 (01) : 1110 - 1124
  • [33] Hybrid Feature Extraction and Machine Learning Approach for Fruits and Vegetable Classification
    Bahia, Nimratveer Kaur
    Rani, Rajneesh
    Kamboj, Aman
    Kakkar, Deepti
    PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY, 2019, 27 (04) : 1693 - 1708
  • [34] A Novel Hybrid Machine Learning Approach for Classification of Brain Tumor Images
    Asiri, Abdullah A.
    Iqbal, Amna
    Ferzund, Javed
    Ali, Tariq
    Aamir, Muhammad
    Alshamrani, Khalaf A.
    Alshamrani, Hassan A.
    Alqahtani, Fawaz F.
    Irfan, Muhammad
    Alshehri, Ali H. D.
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 73 (01) : 641 - 655
  • [35] Adversarial Machine Learning Attacks on Multiclass Classification of IoT Network Traffic
    Pantelakis, Vasileios
    Bountakas, Panagiotis
    Farao, Aristeidis
    Xenakis, Christos
    18TH INTERNATIONAL CONFERENCE ON AVAILABILITY, RELIABILITY & SECURITY, ARES 2023, 2023
  • [36] Liver Cirrhosis Stage Prediction Using Machine Learning: Multiclass Classification
    Sidana, Tejasv Singh
    Singhal, Saransh
    Gupta, Shruti
    Goel, Ruchi
    INTERNATIONAL CONFERENCE ON INNOVATIVE COMPUTING AND COMMUNICATIONS, ICICC 2022, VOL 3, 2023, 492 : 109 - 129
  • [37] Multiclass Classification of Dry Bean Grains Using Machine Learning Techniques
    Coronel-Reyes, Julian
    Delgado-Vera, Carlota
    Chavez-Urbina, Jenny
    Sinche-Guzman, Andrea
    TECHNOLOGIES AND INNOVATION, CITI 2024, 2025, 2276 : 16 - 27
  • [38] A Hybrid Deep Learning and Optimized Machine Learning Approach for Rose Leaf Disease Classification
    Nuanmeesri, Sumitra
    ENGINEERING TECHNOLOGY & APPLIED SCIENCE RESEARCH, 2021, 11 (05) : 7678 - 7683
  • [39] Classification of Firewall Log Data Using Multiclass Machine Learning Models
    Aljabri, Malak
    Alahmadi, Amal A.
    Mohammad, Rami Mustafa A.
    Aboulnour, Menna
    Alomari, Dorieh M.
    Almotiri, Sultan H.
    ELECTRONICS, 2022, 11 (12)
  • [40] Machine learning classification of binary semiconductor heterostructures
    Rom, Samir
    Ghosh, Aishwaryo
    Halder, Anita
    Dasgupta, Tanusri Saha
    PHYSICAL REVIEW MATERIALS, 2021, 5 (04)