Automatic label curation from large-scale text corpus

Times cited: 0
Authors
Avasthi, Sandhya [1 ]
Chauhan, Ritu [2 ]
Affiliations
[1] ABES Engn Coll, Dept CSE, Ghaziabad, India
[2] Amity Univ, Ctr Computat Biol & Bioinformat, AI & IoT Lab, Noida, Uttar Pradesh, India
Source
ENGINEERING RESEARCH EXPRESS | 2024, Vol. 6, Issue 1
Keywords
automatic labeling; contextual word embedding; latent Dirichlet allocation; topic modeling; topic coherence; topic label
DOI
10.1088/2631-8695/ad299e
Chinese Library Classification
T [Industrial Technology]
Subject classification code
08
Abstract
The topic modeling technique extracts themes, with probabilistic weights, from any large-scale text collection. Although topic modeling pulls out the most important phrases describing the latent themes in a text collection, it does not provide a suitable label for each topic. Interpreting the topics extracted from a text corpus and automatically identifying a suitable label reduces the cognitive load on the analyst. Extractive methods are typically used to select a label from a given candidate set, based on probability metrics computed for each candidate. Some existing approaches generate labels from phrases, words, and images using frequency counts of words in the text. This paper proposes a method that automatically generates a representative label for each topic by first filtering candidate labels through a labeling strategy and then applying sequence-to-sequence labelers. The objective of the method is to obtain a meaningful label for each topic produced by the Latent Dirichlet Allocation (LDA) algorithm. The BERTScore metric is used to evaluate the effectiveness of the proposed method, which generates more interpretable labels for topic words or terms than the baseline models. A comparison with labels generated through the ChatGPT API, in experiments on four datasets (NIPS, Kindle, PubMed, and CORD-19), demonstrates the quality of the generated labels.
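To make the described pipeline concrete, the following is a minimal sketch in Python, assuming gensim's LdaModel for topic extraction and the bert_score package for BERTScore evaluation. The toy documents, the candidate labels, and the heuristic of scoring each candidate against a topic's top terms are illustrative assumptions; the paper's actual method filters candidates and then applies a sequence-to-sequence labeler, which is not reproduced here.

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from bert_score import score

# Toy tokenized documents (hypothetical; the paper uses the NIPS,
# Kindle, PubMed, and CORD-19 datasets).
docs = [
    ["topic", "model", "latent", "theme", "corpus"],
    ["word", "embedding", "contextual", "representation", "corpus"],
    ["topic", "label", "coherence", "interpretation", "theme"],
]

dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit LDA and inspect each topic's top terms.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=2, passes=20, random_state=0)

# Hypothetical candidate labels; in the paper these come from a
# candidate-filtering strategy followed by sequence-to-sequence labelers.
candidates = ["topic modeling", "contextual word embeddings"]

for topic_id in range(lda.num_topics):
    top_terms = [term for term, _ in lda.show_topic(topic_id, topn=5)]
    reference = " ".join(top_terms)
    # BERTScore F1 between each candidate label and the topic's top
    # terms (used here as the reference text).
    _, _, f1 = score(candidates, [reference] * len(candidates), lang="en")
    label, best_f1 = max(zip(candidates, f1.tolist()), key=lambda p: p[1])
    print(f"Topic {topic_id}: terms={top_terms}, "
          f"label={label!r} (BERTScore F1={best_f1:.3f})")

Under these assumptions, the highest-scoring candidate would be chosen as the topic label; the paper instead uses BERTScore to evaluate the labels its method generates.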
Pages: 14
Related papers
50 records in total
  • [21] Learning metric space with distillation for large-scale multi-label text classification
    Qin, Shaowei
    Wu, Hao
    Zhou, Lihua
    Li, Jiahui
    Du, Guowang
    NEURAL COMPUTING & APPLICATIONS, 2023, 35 (15): 11445-11458
  • [22] Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders
    Ribadas-Pena, Francisco J.
    Cao, Shuyuan
    Darriba Bilbao, Victor M.
    MATHEMATICS, 2022, 10 (16)
  • [23] Guest editorial: large-scale data curation and metadata management
    Eltabakh, Mohamed
    Glavic, Boris
    DISTRIBUTED AND PARALLEL DATABASES, 2018, 36 (01): 5-8
  • [25] Mining Large-scale Event Knowledge from Web Text
    Cao, Ya-nan
    Zhang, Peng
    Guo, Jing
    Guo, Li
    2014 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE, 2014, 29: 478-487
  • [26] Simple Large-scale Relation Extraction from Unstructured Text
    Christodoulopoulos, Christos
    Mittal, Arpit
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018: 215-222
  • [27] Large-Scale Multimodal Movie Dialogue Corpus
    Yasuhara, Ryu
    Inoue, Masashi
    Suga, Ikuya
    Kosaka, Tetsuo
    ICMI'16: PROCEEDINGS OF THE 18TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2016: 414-415
  • [28] Vocal development in a large-scale crosslinguistic corpus
    Cychosz, Margaret
    Cristia, Alejandrina
    Bergelson, Elika
    Casillas, Marisa
    Baudet, Gladys
    Warlaumont, Anne S.
    Scaff, Camila
    Yankowitz, Lisa
    Seidl, Amanda
    DEVELOPMENTAL SCIENCE, 2021, 24 (05)
  • [29] A Phrase Topic Model for Large-scale Corpus
    Li, Baoji
    Xu, Wenhua
    Tian, Yuhui
    Chen, Juan
    2019 IEEE 4TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND BIG DATA ANALYSIS (ICCCBDA), 2019: 634-639
  • [30] An automatic image-text alignment method for large-scale web image retrieval
    Zhang, Baopeng
    Qu, Yanyun
    Peng, Jinye
    Fan, Jianping
    MULTIMEDIA TOOLS AND APPLICATIONS, 2017, 76 (20): 21401-21421