TC-DWA: Text Clustering with Dual Word-Level Augmentation

Cited by: 0
Authors
Cheng, Bo [1 ,4 ,5 ]
Li, Ximing [2 ,3 ]
Chang, Yi [1 ,4 ,5 ]
Affiliations
[1] Jilin Univ, Sch Artificial Intelligence, Jilin, Jilin, Peoples R China
[2] Jilin Univ, Coll Comp Sci & Technol, Jilin, Jilin, Peoples R China
[3] Jilin Univ, Key Lab Symbol Computat & Knowledge Engn, MOE, Jilin, Jilin, Peoples R China
[4] Jilin Univ, Int Ctr Future Sci, Jilin, Jilin, Peoples R China
[5] Minist Educ, Engn Res Ctr Knowledge Driven Human Machine Intel, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Pre-trained language models, e.g., ELMo and BERT, have recently achieved promising performance improvements in a wide range of NLP tasks because they output strong contextualized embedded features of words. Inspired by their great success, in this paper we aim to fine-tune them to effectively handle the text clustering task, a classic and fundamental challenge in machine learning. Accordingly, we propose a novel BERT-based method, namely Text Clustering with Dual Word-level Augmentation (TC-DWA). Specifically, we formulate a self-training objective and enhance it with a dual word-level augmentation technique. First, we suppose that each text contains several highly informative words, called anchor words, that support the semantics of the full text. We use the embedded features of anchor words, selected by ranking the norm-based attention weights of words, as augmented features. Second, we formulate an expectation form of word augmentation, which is equivalent to generating infinitely many augmented features, and further suggest a tractable Taylor-expansion approximation for efficient optimization. To evaluate the effectiveness of TC-DWA, we conduct extensive experiments on several benchmark text datasets. The results demonstrate that TC-DWA consistently outperforms the state-of-the-art baseline methods. Code is available at: https://github.com/BoCheng-96/TC-DWA.
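The anchor-word selection can be illustrated with a short sketch. The Python/PyTorch snippet below is a hypothetical illustration, not the authors' implementation (see the linked repository for that): it scores each token by norm-based attention, i.e., attention weights scaled by the L2 norms of the transformed value vectors, and keeps the top-k tokens as anchor words. The function name, tensor shapes, and the averaging over heads and query positions are all assumptions.

import torch

def select_anchor_words(attn, value_norms, k=3):
    # attn:        (heads, seq, seq) attention weights of one layer
    # value_norms: (heads, seq) L2 norms of the transformed value vectors
    # Norm-based weight of token j: attention paid to j, scaled by the
    # norm of j's transformed value vector (cf. Kobayashi et al., 2020),
    # averaged over heads and query positions.
    scores = (attn * value_norms.unsqueeze(1)).mean(dim=(0, 1))  # (seq,)
    return torch.topk(scores, k).indices  # indices of the anchor words

# Toy usage: 4 heads, a sequence of 10 tokens.
attn = torch.softmax(torch.randn(4, 10, 10), dim=-1)
value_norms = torch.rand(4, 10)
print(select_anchor_words(attn, value_norms, k=3))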
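The expectation form of augmentation can also be sketched generically. Assuming, for illustration only, that an augmented feature is the mean feature plus zero-mean noise with diagonal variance, a second-order Taylor expansion reduces the expected loss over infinitely many augmentations to a closed form, E[L(mu + eps)] ≈ L(mu) + (1/2) * sum_i var_i * d²L/dmu_i²; the paper's exact objective and approximation may differ.

import torch

def expected_loss_taylor(loss_fn, mu, var):
    # Second-order Taylor approximation of E[loss(mu + eps)] with
    # eps ~ N(0, diag(var)): the loss at the mean plus a variance-weighted
    # penalty built from the diagonal of the Hessian.
    loss = loss_fn(mu)
    hess = torch.autograd.functional.hessian(loss_fn, mu)
    return loss + 0.5 * (var * torch.diagonal(hess)).sum()

# Toy usage: a quadratic "loss" over a 5-dimensional feature, for which
# the second-order approximation is exact.
mu = torch.randn(5)
var = 0.01 * torch.ones(5)
loss_fn = lambda z: (z ** 2).sum()
print(expected_loss_taylor(loss_fn, mu, var))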
Pages: 7113-7121
Page count: 9