Semi-Automatic Creation of a Reference News Corpus for Fine-Grained Multi-Label Scenarios

Cited by: 0
Authors
Teixeira, Jorge [1]
Sarmento, Luis [1]
Oliveira, Eugenio [2]
Affiliations
[1] Labs SAPO UP, FEUP LIACC, Rua Dr Roberto Frias S-N, P-4200465 Oporto, Portugal
[2] FEUP LIACC, P-4200465 Oporto, Portugal
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
In this paper we tackle the problem of creating a reference corpus for the classification of news items in fine-grained multi-label scenarios. These scenarios are particularly challenging for text classification techniques, and the availability of reference corpora is an important bottleneck for developing and testing new classification strategies. We propose a semi-automatic approach for creating a reference corpus that uses three auxiliary classification methods (one based on Support Vector Machines, one based on Nearest Neighbor classifiers, and one based on a dictionary-based classification heuristic) to suggest to human annotators topic-related labels that describe different facets of the news item being annotated. Using this approach, we semi-automatically produce a corpus of 1,600 news items with 865 different labels and an average of 3.63 labels per news item. We evaluate the contribution of each auxiliary classification method to the annotation process and conclude that: (i) none of the methods alone is capable of suggesting all relevant labels, (ii) the dictionary-based classification heuristic contributes significantly, and (iii) the Nearest Neighbor classifier performs very efficiently in the most extreme multi-label part of the problem and is robust to the very unbalanced item-to-class distribution.
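The suggestion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword dictionary, the label names, and the two stand-in suggesters for the SVM and Nearest Neighbor methods are hypothetical and exist only to show how label suggestions from several methods might be pooled for an annotator while tracking which method proposed each label.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# Hypothetical keyword dictionary (label -> trigger terms); illustrative only.
LABEL_TERMS: Dict[str, Set[str]] = {
    "football": {"goal", "league", "striker"},
    "economy": {"inflation", "budget", "bank"},
}


def dictionary_suggester(text: str) -> Set[str]:
    """Suggest every label whose trigger terms occur in the text."""
    tokens = set(text.lower().split())
    return {label for label, terms in LABEL_TERMS.items() if tokens & terms}


def merge_suggestions(
    text: str, suggesters: Dict[str, Callable[[str], Set[str]]]
) -> Dict[str, Set[str]]:
    """Pool labels from all suggesters, recording which methods proposed each."""
    provenance: Dict[str, Set[str]] = defaultdict(set)
    for name, suggest in suggesters.items():
        for label in suggest(text):
            provenance[label].add(name)
    return dict(provenance)


def mean_labels_per_item(annotations: List[Set[str]]) -> float:
    """Average number of labels per annotated item."""
    return sum(len(labels) for labels in annotations) / len(annotations)


# Stand-ins for trained classifiers (assumptions, not real models).
suggesters = {
    "svm": lambda text: {"economy"},
    "knn": lambda text: {"economy", "portugal"},
    "dictionary": dictionary_suggester,
}

item = "The striker scored a late goal as the bank cut its budget"
pooled = merge_suggestions(item, suggesters)
for label in sorted(pooled):
    print(label, sorted(pooled[label]))
```

In this toy run, the label `football` is proposed only by the dictionary heuristic and `portugal` only by the Nearest Neighbor stand-in, which mirrors the abstract's observation that no single method suggests all relevant labels. The provenance map is one possible way to measure each method's contribution once the annotator has accepted or rejected the pooled suggestions.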
Pages: 749 / +
Number of pages: 2