Joint unsupervised contrastive learning and robust GMM for text clustering

Cited by: 6
Authors
Hu, Chenxi [1]
Wu, Tao [1,2]
Liu, Shuaiqi [2]
Liu, Chunsheng [1]
Ma, Tao [1]
Yang, Fang [1]
Affiliations
[1] Natl Univ Def Technol, Coll Elect Engn, Hefei 230031, Peoples R China
[2] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
Keywords
Text clustering; Contrastive learning; Negative sampling; Gaussian mixture model; Expectation maximization;
DOI
10.1016/j.ipm.2023.103529
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Text clustering aims to organize a vast collection of documents into meaningful and coherent clusters, thereby facilitating the extraction of valuable insights. While current frameworks for text clustering try to reduce the anisotropy of pre-trained language models through contrastive learning of text embeddings, treating in-batch samples as negatives is suboptimal. The K-means algorithm offers a way to sample both hard negatives and false negatives. However, relying solely on a single measure of semantic similarity between distributions and using coarse-grained weighting for negative pairs may limit performance. Furthermore, because text clusters have very similar distributions due to rich semantics, the Mahalanobis distance-based Gaussian Mixture Model (GMM) is prone to falling into local optima: a Gaussian model with a smaller weight may gradually merge into another during parameter estimation by the EM algorithm. To tackle these challenges, we propose a model named JourTC: Joint unsupervised contrastive learning and robust GMM for Text Clustering. In the contrastive learning phase, hard negatives, potential false negatives, and their corresponding global similarity-aware weights are determined through posterior probabilities derived from a Robust GMM (RGMM). This RGMM uses the entropy of each individual Gaussian model as a metric and adaptively adjusts the posterior probabilities of samples based on the Gaussian models with maximum and minimum entropy, in order to diminish the influence of low-entropy Gaussian models. Extensive experiments show that JourTC can be seamlessly integrated into existing text clustering frameworks, leading to a notable improvement in accuracy. Our code is publicly available.
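The two quantities the abstract builds on, the entropy of each Gaussian model and the posterior probability of a sample under the mixture, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes diagonal covariances for brevity, all function names are hypothetical, and the entropy-based adjustment of posteriors is left out because the abstract does not specify it. The `negative_weight` rule is likewise only an assumed example of how posteriors could be turned into weights for negative pairs.

```python
import math

def gaussian_entropy(variances):
    """Differential entropy of a Gaussian with diagonal covariance (assumption)."""
    d = len(variances)
    log_det = sum(math.log(v) for v in variances)
    return 0.5 * (d * math.log(2 * math.pi * math.e) + log_det)

def log_density(x, mean, variances):
    """Log-density of x under a diagonal-covariance Gaussian."""
    d = len(x)
    log_det = sum(math.log(v) for v in variances)
    # Squared Mahalanobis distance for a diagonal covariance matrix.
    maha = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, variances))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + maha)

def posteriors(x, weights, means, variances):
    """Standard E-step responsibilities p(k | x), using log-sum-exp for stability."""
    logs = [math.log(w) + log_density(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    unnorm = [math.exp(value - top) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def negative_weight(post_anchor, post_candidate):
    """Hypothetical weight for a negative pair: one minus the probability that
    both samples belong to the same component. Values near 0 flag a potential
    false negative; values near 1 indicate a negative from another cluster."""
    same = sum(a * b for a, b in zip(post_anchor, post_candidate))
    return 1.0 - same

# Toy mixture: two well-separated components in 2-D (illustrative values only).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

anchor = posteriors([0.1, -0.2], weights, means, variances)
near = posteriors([0.0, 0.3], weights, means, variances)
far = posteriors([5.1, 4.9], weights, means, variances)

weight_near = negative_weight(anchor, near)
weight_far = negative_weight(anchor, far)
```

In this toy setting the candidate close to the anchor receives a weight near 0 and the distant one a weight near 1, which is the kind of posterior-derived signal a contrastive loss could use to down-weight likely false negatives.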
Pages: 17
Related papers
50 in total
  • [41] Unsupervised social event detection via hybrid graph contrastive learning and reinforced incremental clustering
    Guo, Yuanyuan
    Zang, Zehua
    Gao, Hang
    Xiao, Xu
    Wang, Rui
    Liu, Lixiang
    Li, Jiangmeng
    KNOWLEDGE-BASED SYSTEMS, 2024, 284
  • [42] Adaptive Working Condition Recognition With Clustering-Based Contrastive Learning for Unsupervised Anomaly Detection
    Xu, Qifa
    Xie, Tianming
    Jiang, Cuixia
    Cheng, Qiliang
    Wang, Xiangxiang
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2024, 20 (10) : 12103 - 12113
  • [43] Graph-Based Short Text Clustering via Contrastive Learning with Graph Embedding
    Wei, Yujie
    Zhou, Weidong
    Zhou, Jin
    Wang, Yingxu
    Han, Shiyuan
    Du, Tao
    Yang, Cheng
    Liu, Bowen
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, ICIC 2023, PT I, 2023, 14086 : 727 - 738
  • [44] Self-Supervised and Few-Shot Contrastive Learning Frameworks for Text Clustering
    Shi, Haoxiang
    Sakai, Tetsuya
    IEEE ACCESS, 2023, 11 : 84134 - 84143
  • [45] Learning to Perturb for Contrastive Learning of Unsupervised Sentence Representations
    Zhou, Kun
    Zhou, Yuanhang
    Zhao, Wayne Xin
    Wen, Ji-Rong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 3935 - 3944
  • [46] NaCL: noise-robust cross-domain contrastive learning for unsupervised domain adaptation
    Li, Jingzheng
    Sun, Hailong
    MACHINE LEARNING, 2023, 112 (09) : 3473 - 3496
  • [47] Robust Joint Graph Learning for Multi-View Clustering
    He, Yanfang
    Yusof, Umi Kalsom
    IEEE TRANSACTIONS ON BIG DATA, 2025, 11 (02) : 722 - 734
  • [48] NaCL: noise-robust cross-domain contrastive learning for unsupervised domain adaptation
    Jingzheng Li
    Hailong Sun
    Machine Learning, 2023, 112 : 3473 - 3496
  • [49] Structured GMM Based on Unsupervised Clustering for Recognizing Adult and Child Speech
    Gorin, Arseniy
    Jouvet, Denis
    STATISTICAL LANGUAGE AND SPEECH PROCESSING, SLSP 2014, 2014, 8791 : 108 - 119
  • [50] Clustering fMRI data with a robust unsupervised learning algorithm for neuroscience data mining
    Aljobouri, Hadeel K.
    Jaber, Hussain A.
    Kocak, Orhan M.
    Algin, Oktay
    Cankaya, Ilyas
    JOURNAL OF NEUROSCIENCE METHODS, 2018, 299 : 45 - 54