Joint unsupervised contrastive learning and robust GMM for text clustering

Cited by: 6
Authors
Hu, Chenxi [1 ]
Wu, Tao [1 ,2 ]
Liu, Shuaiqi [2 ]
Liu, Chunsheng [1 ]
Ma, Tao [1 ]
Yang, Fang [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Elect Engn, Hefei 230031, Peoples R China
[2] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
Keywords
Text clustering; Contrastive learning; Negative sampling; Gaussian mixture model; Expectation maximization;
DOI
10.1016/j.ipm.2023.103529
CLC number
TP [Automation and computer technology];
Discipline code
0812;
Abstract
Text clustering aims to organize a vast collection of documents into meaningful and coherent clusters, thereby facilitating the extraction of valuable insights. While current frameworks for text clustering try to minimize the anisotropy of pre-trained language models through contrastive learning of text embeddings, the approach of treating in-batch samples as negatives is suboptimal. The K-means algorithm offers a way to sample both hard negatives and false negatives. However, relying solely on a single measure of semantic similarity between distributions and using coarse-grained weighting for negative pairs may limit performance. Furthermore, because text clusters have very similar distributions owing to their rich semantics, the Mahalanobis distance-based Gaussian Mixture Model (GMM) is prone to falling into local optima: a Gaussian component with a smaller weight may gradually merge into another during parameter estimation by the EM algorithm. To tackle these challenges, we propose a model named JourTC: Joint unsupervised contrastive learning and robust GMM for Text Clustering. In the contrastive learning phase, hard negatives, potential false negatives, and their corresponding global similarity-aware weights are determined through posterior probabilities derived from a Robust GMM (RGMM). This RGMM uses the entropy of each individual Gaussian component as a metric and adaptively adjusts the posterior probabilities of samples based on the components with maximum and minimum entropy, so as to diminish the influence of low-entropy components. Extensive experiments show that JourTC can be seamlessly integrated into existing text clustering frameworks, leading to a notable improvement in accuracy. Our code is publicly available.
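The entropy-guided posterior adjustment the abstract describes can be sketched in a few lines. The sketch below is illustrative only, not the paper's actual formulation: the differential-entropy formula for a diagonal-covariance Gaussian is standard, but the reallocation rule, the function names, and the damping factor `alpha` are hypothetical stand-ins for the paper's adaptive adjustment.

```python
import math

def gaussian_entropy(variances):
    """Differential entropy of a diagonal-covariance Gaussian:
    H = 0.5 * d * (1 + ln(2*pi)) + 0.5 * sum(ln(var_i)).
    A sharply concentrated component (small variances) has low entropy."""
    d = len(variances)
    return (0.5 * d * (1.0 + math.log(2.0 * math.pi))
            + 0.5 * sum(math.log(v) for v in variances))

def adjust_posteriors(posteriors, entropies, alpha=0.1):
    """Move a fraction `alpha` of a sample's posterior mass away from
    the minimum-entropy (overly concentrated) component toward the
    maximum-entropy one, so that a low-weight Gaussian is less likely
    to be absorbed by a neighbor during the EM updates.
    `alpha` is a hypothetical damping factor, not from the paper."""
    i_max = max(range(len(entropies)), key=entropies.__getitem__)
    i_min = min(range(len(entropies)), key=entropies.__getitem__)
    adjusted = list(posteriors)
    moved = alpha * adjusted[i_min]
    adjusted[i_min] -= moved
    adjusted[i_max] += moved
    total = sum(adjusted)  # renormalize to a valid distribution
    return [p / total for p in adjusted]
```

In a full EM loop this adjustment would sit between the E-step (computing posteriors) and the M-step (re-estimating means, covariances, and weights), which is where a low-entropy component would otherwise start pulling posterior mass away from its neighbors.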
Pages: 17
Related papers
50 records
  • [21] Unsupervised image clustering algorithm based on contrastive learning and K-nearest neighbors
    Zhang, Xiuling
    Wang, Shuo
    Wu, Ziyun
    Tan, Xiaofei
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2022, 13 (09) : 2415 - 2423
  • [22] A Simple and Effective Usage of Self-supervised Contrastive Learning for Text Clustering
    Shi, Haoxiang
    Wang, Cen
    Sakai, Tetsuya
    2021 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2021, : 315 - 320
  • [24] UniTRec: A Unified Text-to-Text Transformer and Joint Contrastive Learning Framework for Text-based Recommendation
    Mao, Zhiming
    Wang, Huimin
    Du, Yiming
    Wong, Kam-Fai
    61ST CONFERENCE OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023, : 1160 - 1170
  • [25] Inference for probabilistic unsupervised text clustering
    Rigouste, Lois
    Cappe, Olivier
    Yvon, Francois
    2005 IEEE/SP 13th Workshop on Statistical Signal Processing (SSP), Vols 1 and 2, 2005, : 351 - 356
  • [26] Kalman contrastive unsupervised representation learning
    Yekta, Mohammad Mahdi Jahani
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [27] Unsupervised Node Clustering via Contrastive Hard Sampling
    Cui, Hang
    Abdelzaher, Tarek
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PT VI, DASFAA 2024, 2024, 14855 : 285 - 300
  • [28] Oscar: Omni-scale robust contrastive learning for Text-VQA
    Yue, Jianyu
    Bi, Xiaojun
    Chen, Zheng
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 255
  • [29] Unsupervised deep clustering via adaptive GMM modeling and optimization
    Wang, Jinghua
    Jiang, Jianmin
    NEUROCOMPUTING, 2021, 433 : 199 - 211
  • [30] Robust multilayer bootstrap networks in ensemble for unsupervised representation learning and clustering
    Zhang, Xiao-Lei
    Li, Xuelong
    PATTERN RECOGNITION, 2024, 156