Semi-Automatic Creation of a Reference News Corpus for Fine-Grained Multi-Label Scenarios

Cited by: 0

Authors:

Teixeira, Jorge [1]

Sarmento, Luis [1]

Oliveira, Eugenio [2]

Affiliations:

[1] Labs SAPO UP, FEUP LIACC, Rua Dr Roberto Frias S-N, P-4200465 Oporto, Portugal

[2] FEUP LIACC, P-4200465 Oporto, Portugal

Source:

SISTEMAS E TECNOLOGIAS DE INFORMACAO, VOL I | 2011

Keywords:

DOI:

Not available

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

In this paper we tackle the problem of creating a reference corpus for the classification of news items in fine-grained multi-label scenarios. These scenarios are particularly challenging for text classification techniques, and the lack of reference corpora is an important bottleneck for developing and testing new classification strategies. We propose a semi-automatic approach for creating a reference corpus that uses three auxiliary classification methods (one based on Support Vector Machines, one based on Nearest Neighbor classifiers, and one based on a dictionary-based classification heuristic) to suggest to human annotators topic-related labels that describe different facets of the news item being annotated. Using this approach, we semi-automatically produce a corpus of 1,600 news items with 865 different labels and an average of 3.63 labels per news item. We evaluate the contribution of each auxiliary classification method to the annotation process and conclude that: (i) none of the methods alone is capable of suggesting all relevant labels, (ii) the dictionary-based classification heuristic contributes significantly, and (iii) the Nearest Neighbor classifier performs very efficiently in the most extreme multi-label part of the problem and is robust to the very unbalanced item-to-class distribution.
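The suggestion-and-review workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the method names, the toy labels, and the helper functions `merge_suggestions` and `method_contribution` are all hypothetical, and the contribution measure used here (share of annotator-accepted labels that a method proposed) is an assumption, not the metric reported in the paper.

```python
from collections import defaultdict


def merge_suggestions(suggestions_by_method):
    """Union the label suggestions, remembering which methods proposed each label."""
    sources = defaultdict(set)
    for method, labels in suggestions_by_method.items():
        for label in labels:
            sources[label].add(method)
    return dict(sources)


def method_contribution(sources, accepted):
    """Fraction of annotator-accepted labels that each method had suggested."""
    counts = defaultdict(int)
    for label in accepted:
        for method in sources.get(label, ()):
            counts[method] += 1
    total = len(accepted)
    if total == 0:
        return {}
    return {method: counts[method] / total for method in counts}


# Hypothetical suggestions for one news item from three auxiliary methods.
suggestions = {
    "svm": {"politics", "economy"},
    "knn": {"economy", "budget"},
    "dictionary": {"parliament"},
}

# The annotator sees the merged pool and keeps only the relevant labels.
sources = merge_suggestions(suggestions)
accepted = {"economy", "budget", "parliament"}

contribution = method_contribution(sources, accepted)
for method in sorted(contribution):
    print(f"{method}: {contribution[method]:.2f}")
```

In this toy item no single method proposes every accepted label, which mirrors conclusion (i) of the abstract; the numbers themselves are illustrative only.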


Pages: 749 / +

Page count: 2

50 items in total

[21] An Extension of the Slovak Broadcast News Corpus based on Semi-Automatic Annotation
Viszlay, Peter
Stas, Jan
Koctur, Tomas
Lojka, Martin
Juhar, Jozef
LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : CP1 - CP99
[22] Multi-label Fine-Grained Entity Typing for Baidu Wikipedia Based on Pre-trained Model
Pu, Keyu
Liu, Hongyi
Yang, Yixiao
Lv, Wenyi
Li, Jinlong
CCKS 2021 - EVALUATION TRACK, 2022, 1553 : 114 - 123
[23] Towards ultrasonic guided wave fine-grained damage detection on hierarchical multi-label classification network
Guo, Ziye
Zhou, Ruohua
Gao, Yan
Fu, Wei
Yu, Qiuyu
MECHANICAL SYSTEMS AND SIGNAL PROCESSING, 2024, 218
[24] Semi-Automatic Creation of Youth Slang Corpus and Its Application to Affective Computing
Ren, Fuji
Matsumoto, Kazuyuki
IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2016, 7 (02) : 176 - 189
[25] Semi-Automatic Classification and Duplicate Detection From Human Loss News Corpus
Abid, Adnan
Ali, Waqas
Farooq, Muhammad Shoaib
Farooq, Uzma
Khan, Nabeel Sabir
Abid, Kamran
IEEE ACCESS, 2020, 8 (08) : 97737 - 97747
[26] Automatic Identification Methods on a Corpus of Twenty Five Fine-Grained Arabic Dialects
Harrat, Salima
Meftouh, Karima
Abidi, Karima
Smaili, Kamel
ARABIC LANGUAGE PROCESSING: FROM THEORY TO PRACTICE, ICALP 2019, 2019, 1108 : 79 - 92
[27] Refining semi-automatic parallel corpus creation for Zulu to English statistical machine translation
Kotze, Gideon
2016 PATTERN RECOGNITION ASSOCIATION OF SOUTH AFRICA AND ROBOTICS AND MECHATRONICS INTERNATIONAL CONFERENCE (PRASA-ROBMECH), 2016,
[28] Multi-label multi-class COVID-19 Arabic Twitter dataset with fine-grained misinformation and situational information annotations
Obeidat, Rasha
Gharaibeh, Maram
Abdullah, Malak
Alharahsheh, Yara
PEERJ COMPUTER SCIENCE, 2022, 8
[29] Multi-label multi-class COVID-19 Arabic Twitter dataset with fine-grained misinformation and situational information annotations
Obeidat R.
Gharaibeh M.
Abdullah M.
Alharahsheh Y.
PeerJ Computer Science, 2022, 8
[30] The USAGE review corpus for fine-grained, multi-lingual opinion analysis
Klinger, Roman
Cimiano, Philipp
LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 2211 - 2218
