Semi-Automatic Creation of a Reference News Corpus for Fine-Grained Multi-Label Scenarios

Authors
Teixeira, Jorge [1]
Sarmento, Luis [1]
Oliveira, Eugenio [2]
Affiliations
[1] Labs SAPO UP, FEUP LIACC, Rua Dr Roberto Frias S-N, P-4200465 Oporto, Portugal
[2] FEUP LIACC, P-4200465 Oporto, Portugal
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
In this paper we tackle the problem of creating a reference corpus for the classification of news items in fine-grained multi-label scenarios. These scenarios are particularly challenging for text classification techniques, and the limited availability of reference corpora is an important bottleneck for developing and testing new classification strategies. We propose a semi-automatic approach for creating a reference corpus that uses three auxiliary classification methods (one based on Support Vector Machines, one based on Nearest Neighbor classifiers, and one based on a dictionary-based classification heuristic) to suggest to human annotators topic-related labels that describe different facets of the news item being annotated. Using this approach, we semi-automatically produce a corpus of 1,600 news items with 865 different labels and an average of 3.63 labels per news item. We evaluate the contribution of each auxiliary classification method to the annotation process and conclude that: (i) none of the methods alone is capable of suggesting all relevant labels, (ii) the dictionary-based classification heuristic contributes significantly, and (iii) the Nearest Neighbor classifier performs very efficiently in the most extreme multi-label part of the problem and is robust to the very unbalanced item-to-class distribution.
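The suggestion step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the Jaccard-based nearest-neighbor lookup, the toy lexicon, the example items, and all function names are assumptions made for illustration, and the SVM output is a hard-coded placeholder rather than a trained model. The sketch only shows how label suggestions from several methods might be pooled, with their provenance, before being shown to a human annotator.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return set(text.lower().split())


def dictionary_suggest(text, lexicon):
    """Suggest every label whose trigger terms overlap the item's tokens."""
    tokens = tokenize(text)
    return sorted(label for label, terms in lexicon.items() if tokens & terms)


def nearest_neighbor_suggest(text, annotated, k=1):
    """Suggest the labels of the k annotated items most similar to the text.

    Similarity is Jaccard overlap of token sets (an assumption, chosen
    only to keep the sketch dependency-free).
    """
    tokens = tokenize(text)

    def jaccard(other_text):
        other = tokenize(other_text)
        union = tokens | other
        return len(tokens & other) / len(union) if union else 0.0

    ranked = sorted(annotated, key=lambda pair: jaccard(pair[0]), reverse=True)
    labels = set()
    for _, item_labels in ranked[:k]:
        labels.update(item_labels)
    return sorted(labels)


def pool_suggestions(suggestions_by_method):
    """Merge per-method suggestions into {label: [methods that proposed it]}."""
    pooled = defaultdict(set)
    for method, labels in suggestions_by_method.items():
        for label in labels:
            pooled[label].add(method)
    return {label: sorted(methods) for label, methods in sorted(pooled.items())}


# Hypothetical toy data, not taken from the corpus described above.
lexicon = {"football": {"goal", "striker"}, "economy": {"inflation", "budget"}}
annotated = [
    ("striker scores late goal in derby", ["football", "fc porto"]),
    ("parliament approves budget", ["economy", "politics"]),
]
item = "Porto striker scores twice as club budget grows"

pooled = pool_suggestions({
    "dictionary": dictionary_suggest(item, lexicon),
    "nearest_neighbor": nearest_neighbor_suggest(item, annotated, k=1),
    "svm": ["football"],  # placeholder standing in for a trained classifier
})

for label, methods in pooled.items():
    print(label, "<-", ", ".join(methods))
```

In this toy run each method contributes at least one label that the others miss or confirm, which mirrors conclusion (i) only in spirit; the annotator would still accept or reject each pooled suggestion.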
Pages: 749 / +
Page count: 2