TC-DWA: Text Clustering with Dual Word-Level Augmentation

Cited by: 0
Authors
Cheng, Bo [1 ,4 ,5 ]
Li, Ximing [2 ,3 ]
Chang, Yi [1 ,4 ,5 ]
Affiliations
[1] Jilin Univ, Sch Artificial Intelligence, Jilin, Jilin, Peoples R China
[2] Jilin Univ, Coll Comp Sci & Technol, Jilin, Jilin, Peoples R China
[3] Jilin Univ, Key Lab Symbol Computat & Knowledge Engn, MOE, Jilin, Jilin, Peoples R China
[4] Jilin Univ, Int Ctr Future Sci, Jilin, Jilin, Peoples R China
[5] Minist Educ, Engn Res Ctr Knowledge Driven Human Machine Intel, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
DOI
Not available
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Pre-trained language models, e.g., ELMo and BERT, have recently achieved promising performance improvements across a wide range of NLP tasks because they output strong contextualized embedded features of words. Inspired by their great success, in this paper we aim to fine-tune them to effectively handle the text clustering task, a classic and fundamental challenge in machine learning. Accordingly, we propose a novel BERT-based method, namely Text Clustering with Dual Word-level Augmentation (TC-DWA). Specifically, we formulate a self-training objective and enhance it with a dual word-level augmentation technique. First, we suppose that each text contains several highly informative words, called anchor words, that support the full text semantics. We use the embedded features of anchor words, selected by ranking the norm-based attention weights of words, as augmented features. Second, we formulate an expectation form of word augmentation, which is equivalent to generating infinitely many augmented features, and further suggest a tractable Taylor-expansion approximation for efficient optimization. To evaluate the effectiveness of TC-DWA, we conduct extensive experiments on several benchmark text datasets. The results demonstrate that TC-DWA consistently outperforms the state-of-the-art baseline methods. Code available: https://github.com/BoCheng-96/TC-DWA.
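The authors' released code at the GitHub link above is authoritative; the abstract only summarizes the anchor-word selection step. As a minimal sketch, assuming "norm-based attention weights" follow the common attention-times-value-norm analysis (i.e., token j is scored by alpha_{i,j} * ||f(x_j)||, aggregated over heads and query positions), anchor-word selection could look like the following. The function name, the aggregation rule, and the value of k are illustrative assumptions, not the paper's specification.

```python
# Hedged sketch of anchor-word selection via norm-based attention weights.
# Assumption: score(j) = sum over heads h and queries i of alpha_{i,j} * ||f(x_j)||.
import torch

def select_anchor_words(attn: torch.Tensor, values: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Return indices of the k highest-scoring tokens (anchor-word candidates).

    attn:   (heads, seq, seq) attention probabilities from one layer.
    values: (heads, seq, head_dim) value vectors f(x_j) of the same layer.
    """
    value_norms = values.norm(dim=-1)                # (heads, seq)
    # Norm-based weight of key token j for query i: alpha_{i,j} * ||f(x_j)||.
    norm_weighted = attn * value_norms.unsqueeze(1)  # (heads, seq, seq)
    # Aggregate over heads and query positions into one score per token.
    scores = norm_weighted.sum(dim=(0, 1))           # (seq,)
    return scores.topk(min(k, scores.numel())).indices

# Toy usage with random tensors standing in for one BERT layer's internals.
heads, seq, head_dim = 12, 8, 64
attn = torch.softmax(torch.randn(heads, seq, seq), dim=-1)
values = torch.randn(heads, seq, head_dim)
print(select_anchor_words(attn, values, k=3))
```

The embedded features of the selected tokens would then serve as the augmented views of the text; the aggregation over heads and queries is one reasonable choice among several.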
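The abstract's second component, the expectation form of word augmentation with its Taylor approximation, is likewise only summarized. As a generic hedged sketch (the paper's exact derivation may differ), a second-order Taylor expansion of the expected loss around the unaugmented feature h, for zero-mean augmentation noise delta with covariance Sigma, gives:

```latex
% Hedged sketch: generic second-order Taylor expansion of the expected loss
% under zero-mean augmentation noise \delta with covariance \Sigma.
\[
\mathbb{E}_{\delta}\!\left[\ell(h+\delta)\right]
\approx \ell(h) + \nabla\ell(h)^{\top}\mathbb{E}[\delta]
+ \tfrac{1}{2}\,\mathbb{E}\!\left[\delta^{\top}\nabla^{2}\ell(h)\,\delta\right]
= \ell(h) + \tfrac{1}{2}\operatorname{tr}\!\left(\nabla^{2}\ell(h)\,\Sigma\right)
\]
```

Under this kind of approximation, averaging over infinitely many augmented features collapses to a single curvature term, which is what makes the expectation form tractable to optimize.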
Pages: 7113-7121
Page count: 9