Joint unsupervised contrastive learning and robust GMM for text clustering

Cited: 6
Authors
Hu, Chenxi [1 ]
Wu, Tao [1 ,2 ]
Liu, Shuaiqi [2 ]
Liu, Chunsheng [1 ]
Ma, Tao [1 ]
Yang, Fang [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Elect Engn, Hefei 230031, Peoples R China
[2] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
Keywords
Text clustering; Contrastive learning; Negative sampling; Gaussian mixture model; Expectation maximization;
DOI
10.1016/j.ipm.2023.103529
Chinese Library Classification (CLC): TP [Automation technology, computer technology];
Discipline classification code: 0812;
Abstract
Text clustering aims to organize a vast collection of documents into meaningful and coherent clusters, thereby facilitating the extraction of valuable insights. While current text clustering frameworks try to minimize the anisotropy of pre-trained language models through contrastive learning of text embeddings, treating in-batch samples as negatives is suboptimal. The K-means algorithm offers a way to sample both hard negatives and false negatives. However, relying solely on a single measure of semantic similarity between distributions and using coarse-grained weighting for negative pairs may limit performance. Furthermore, because text clusters have very similar distributions owing to their rich semantics, the Mahalanobis distance-based Gaussian Mixture Model (GMM) is prone to falling into local optima: a Gaussian component with a smaller weight may gradually merge into another during parameter estimation by the EM algorithm. To tackle these challenges, we propose a model named JourTC: Joint unsupervised contrastive learning and robust GMM for Text Clustering. In the contrastive learning phase, hard negatives, potential false negatives, and their corresponding global similarity-aware weights are determined through posterior probabilities derived from a Robust GMM (RGMM). The RGMM uses the entropy of each individual Gaussian component as a metric and adaptively adjusts the posterior probabilities of samples based on the components with maximum and minimum entropy, diminishing the influence of low-entropy components. Extensive experiments show that JourTC can be seamlessly integrated into existing text clustering frameworks, leading to a notable improvement in accuracy. Our code is publicly available.
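The abstract's core mechanism can be sketched in code: EM for a GMM where each component's posterior responsibilities are reweighted by that component's entropy, so that low-entropy components (those concentrating on few samples, and hence at risk of being absorbed by a dominant neighbor) exert less pull. This is a minimal illustrative sketch, not the paper's algorithm: the function name, the spherical-covariance simplification, and the exact entropy reweighting rule are all assumptions standing in for the RGMM's actual max/min-entropy posterior adjustment.

```python
import numpy as np

def fit_entropy_weighted_gmm(X, k, n_iter=50):
    """EM for a spherical GMM with an illustrative entropy-based
    reweighting of posteriors (hypothetical surrogate for the RGMM)."""
    n, d = X.shape
    # Deterministic farthest-point initialization of the means.
    idx = [0]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - X[idx][None]) ** 2).sum(-1).min(axis=1)
        idx.append(int(d2.argmax()))
    mu = X[idx].astype(float).copy()
    var = np.full(k, X.var() + 1e-6)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibilities under each spherical Gaussian.
        sq = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        log_p = -0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # Entropy of each component's column-normalized responsibility
        # profile: low entropy means the component concentrates on few
        # samples and risks merging into a dominant neighbor.
        w = r / (r.sum(axis=0, keepdims=True) + 1e-12)
        h = -(w * np.log(w + 1e-12)).sum(axis=0)
        # Down-weight low-entropy components, then renormalize rows
        # (illustrative stand-in for the paper's posterior adjustment).
        r *= (h / h.max())[None, :]
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means, and variances.
        nk = r.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r * ((X[:, None, :] - mu[None]) ** 2).sum(-1)).sum(axis=0) / (nk * d) + 1e-6
    return mu, pi, r
```

On well-separated synthetic clusters the entropy weights stay close to uniform and the procedure behaves like plain EM; the reweighting only bites when one component's responsibilities collapse onto a handful of points.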
Pages: 17
Related papers
50 records in total
  • [31] C2L: Causally Contrastive Learning for Robust Text Classification
    Choi, Seungtaek
    Jeong, Myeongho
    Han, Hojae
    Hwang, Seung-won
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 10526 - 10534
  • [32] Contrastive learning with text augmentation for text classification
    Jia, Ouyang
    Huang, Huimin
    Ren, Jiaxin
    Xie, Luodi
    Xiao, Yinyin
    APPLIED INTELLIGENCE, 2023, 53 (16) : 19522 - 19531
  • [33] Contrastive learning with text augmentation for text classification
    Ouyang Jia
    Huimin Huang
    Jiaxin Ren
    Luodi Xie
    Yinyin Xiao
    Applied Intelligence, 2023, 53 : 19522 - 19531
  • [34] Robust image clustering via context-aware contrastive graph learning
    Fang, Uno
    Li, Jianxin
    Lu, Xuequan
    Mian, Ajmal
    Gu, Zhaoquan
    PATTERN RECOGNITION, 2023, 138
  • [35] Pyramid contrastive learning for clustering
    Zhou, Zi-Feng
    Huang, Dong
    Wang, Chang-Dong
    NEURAL NETWORKS, 2025, 185
  • [36] Supporting Clustering with Contrastive Learning
    Zhang, Dejiao
    Nan, Feng
    Wei, Xiaokai
    Li, Shang-Wen
    Zhu, Henghui
    McKeown, Kathleen
    Nallapati, Ramesh
    Arnold, Andrew O.
    Xiang, Bing
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 5419 - 5430
  • [37] Contrastive author-aware text clustering
    Tang, Xudong
    Dong, Chao
    Zhang, Wei
    PATTERN RECOGNITION, 2022, 130
  • [38] Joint subspace learning and subspace clustering based unsupervised feature selection
    Xiao, Zijian
    Chen, Hongmei
    Mi, Yong
    Luo, Chuan
    Horng, Shi-Jinn
    Li, Tianrui
    NEUROCOMPUTING, 2025, 635
  • [39] Robust Federated Learning Based on Metrics Learning and Unsupervised Clustering for Malicious Data Detection
    Li, Jiaming
    Zhang, Xinyue
    Zhao, Liang
    ACMSE 2022: PROCEEDINGS OF THE 2022 ACM SOUTHEAST CONFERENCE, 2022, : 238 - 242
  • [40] Joint contrastive triple-learning for deep multi-view clustering
    Hu, Shizhe
    Zou, Guoliang
    Zhang, Chaoyang
    Lou, Zhengzheng
    Geng, Ruilin
    Ye, Yangdong
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (03)