Automatic label curation from large-scale text corpus

Times cited: 0
Authors
Avasthi, Sandhya [1 ]
Chauhan, Ritu [2 ]
Affiliations
[1] ABES Engn Coll, Dept CSE, Ghaziabad, India
[2] Amity Univ, Ctr Computat Biol & Bioinformat, AI & IoT Lab, Noida, Uttar Pradesh, India
Source
ENGINEERING RESEARCH EXPRESS | 2024, Vol. 6, Issue 1
Keywords
automatic labeling; contextual word embedding; latent Dirichlet allocation; topic modeling; topic coherence; topic label
DOI
10.1088/2631-8695/ad299e
Chinese Library Classification
T [Industrial Technology]
Subject classification code
08
Abstract
The topic modeling technique extracts themes, with probabilistic weights, from any large-scale text collection. Although topic modeling pulls out the most important phrases describing the latent themes in a text collection, it does not provide a suitable label for each topic. Interpreting the topics extracted from a text corpus and automatically identifying a suitable label reduces the cognitive load on the analyst. Extractive methods are typically used to select a label from a given candidate set, based on probability metrics computed for each candidate. Some existing approaches generate labels from phrases, words, and images using frequency counts of words in the text. This paper proposes a method that automatically generates a representative label for each topic by first filtering candidate labels through a labeling strategy and then applying sequence-to-sequence labelers. The objective of the method is to obtain a meaningful label for each topic produced by the Latent Dirichlet Allocation (LDA) algorithm. The BERTScore metric is used to evaluate the effectiveness of the proposed method, which generates more interpretable labels for topic words or terms than the baseline models. A comparison with labels generated through the ChatGPT API, in experiments on four datasets (NIPS, Kindle, PubMed, and CORD-19), demonstrates the quality of the generated labels.
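To make the described pipeline concrete, the following is a minimal sketch in Python, assuming gensim's LdaModel for topic extraction and the bert_score package for BERTScore evaluation. The toy documents, the candidate labels, and the heuristic of scoring each candidate against a topic's top terms are illustrative assumptions; the paper's actual method filters candidates and then applies a sequence-to-sequence labeler, which is not reproduced here.

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from bert_score import score

# Toy tokenized documents (hypothetical; the paper uses the NIPS,
# Kindle, PubMed, and CORD-19 datasets).
docs = [
    ["topic", "model", "latent", "theme", "corpus"],
    ["word", "embedding", "contextual", "representation", "corpus"],
    ["topic", "label", "coherence", "interpretation", "theme"],
]

dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit LDA and inspect each topic's top terms.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=2, passes=20, random_state=0)

# Hypothetical candidate labels; in the paper these come from a
# candidate-filtering strategy followed by sequence-to-sequence labelers.
candidates = ["topic modeling", "contextual word embeddings"]

for topic_id in range(lda.num_topics):
    top_terms = [term for term, _ in lda.show_topic(topic_id, topn=5)]
    reference = " ".join(top_terms)
    # BERTScore F1 between each candidate label and the topic's top
    # terms (used here as the reference text).
    _, _, f1 = score(candidates, [reference] * len(candidates), lang="en")
    label, best_f1 = max(zip(candidates, f1.tolist()), key=lambda p: p[1])
    print(f"Topic {topic_id}: terms={top_terms}, "
          f"label={label!r} (BERTScore F1={best_f1:.3f})")

Under these assumptions, the highest-scoring candidate would be chosen as the topic label; the paper instead uses BERTScore to evaluate the labels its method generates.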
Pages: 14
Related papers
50 records in total
  • [21] Learning metric space with distillation for large-scale multi-label text classification
    Qin, Shaowei
    Wu, Hao
    Zhou, Lihua
    Li, Jiahui
    Du, Guowang
    NEURAL COMPUTING & APPLICATIONS, 2023, 35 (15): 11445-11458
  • [22] Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders
    Ribadas-Pena, Francisco J.
    Cao, Shuyuan
    Darriba Bilbao, Victor M.
    MATHEMATICS, 2022, 10 (16)
  • [23] Guest editorial: large-scale data curation and metadata management
    Eltabakh, Mohamed
    Glavic, Boris
    DISTRIBUTED AND PARALLEL DATABASES, 2018, 36 (01): 5-8
  • [25] Mining Large-scale Event Knowledge from Web Text
    Cao, Ya-nan
    Zhang, Peng
    Guo, Jing
    Guo, Li
    2014 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE, 2014, 29: 478-487
  • [26] Simple Large-scale Relation Extraction from Unstructured Text
    Christodoulopoulos, Christos
    Mittal, Arpit
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018: 215-222
  • [27] Large-Scale Multimodal Movie Dialogue Corpus
    Yasuhara, Ryu
    Inoue, Masashi
    Suga, Ikuya
    Kosaka, Tetsuo
    ICMI'16: PROCEEDINGS OF THE 18TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2016: 414-415
  • [28] Vocal development in a large-scale crosslinguistic corpus
    Cychosz, Margaret
    Cristia, Alejandrina
    Bergelson, Elika
    Casillas, Marisa
    Baudet, Gladys
    Warlaumont, Anne S.
    Scaff, Camila
    Yankowitz, Lisa
    Seidl, Amanda
    DEVELOPMENTAL SCIENCE, 2021, 24 (05)
  • [29] A Phrase Topic Model for Large-scale Corpus
    Li, Baoji
    Xu, Wenhua
    Tian, Yuhui
    Chen, Juan
    2019 IEEE 4TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND BIG DATA ANALYSIS (ICCCBDA), 2019: 634-639
  • [30] An automatic image-text alignment method for large-scale web image retrieval
    Zhang, Baopeng
    Qu, Yanyun
    Peng, Jinye
    Fan, Jianping
    MULTIMEDIA TOOLS AND APPLICATIONS, 2017, 76 (20): 21401-21421