Joint unsupervised contrastive learning and robust GMM for text clustering

Cited by: 6
Authors
Hu, Chenxi [1 ]
Wu, Tao [1 ,2 ]
Liu, Shuaiqi [2 ]
Liu, Chunsheng [1 ]
Ma, Tao [1 ]
Yang, Fang [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Elect Engn, Hefei 230031, Peoples R China
[2] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
Keywords
Text clustering; Contrastive learning; Negative sampling; Gaussian mixture model; Expectation maximization;
DOI
10.1016/j.ipm.2023.103529
CLC number
TP [Automation and computer technology];
Discipline code
0812;
Abstract
Text clustering aims to organize a vast collection of documents into meaningful and coherent clusters, thereby facilitating the extraction of valuable insights. While current frameworks for text clustering try to minimize the anisotropy of pre-trained language models through contrastive learning of text embeddings, the approach of treating in-batch samples as negatives is suboptimal. The K-means algorithm offers a way to sample both hard negatives and false negatives. However, relying solely on a single measure of semantic similarity between distributions and using coarse-grained weighting for negative pairs may limit performance. Furthermore, because text clusters have very similar distributions owing to their rich semantics, the Mahalanobis distance-based Gaussian Mixture Model (GMM) is prone to falling into local optima: a Gaussian component with a smaller weight may gradually merge into another during parameter estimation by the EM algorithm. To tackle these challenges, we propose a model named JourTC: Joint unsupervised contrastive learning and robust GMM for Text Clustering. In the contrastive learning phase, hard negatives, potential false negatives, and their corresponding global similarity-aware weights are determined through posterior probabilities derived from a Robust GMM (RGMM). This RGMM uses the entropy of each individual Gaussian component as a metric and adaptively adjusts the posterior probabilities of samples based on the components with maximum and minimum entropy, so as to diminish the influence of low-entropy components. Extensive experiments show that JourTC can be seamlessly integrated into existing text clustering frameworks, leading to a notable improvement in accuracy. Our code is publicly available.
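The entropy-guided posterior adjustment the abstract describes can be sketched in a few lines. The sketch below is illustrative only, not the paper's actual formulation: the differential-entropy formula for a diagonal-covariance Gaussian is standard, but the reallocation rule, the function names, and the damping factor `alpha` are hypothetical stand-ins for the paper's adaptive adjustment.

```python
import math

def gaussian_entropy(variances):
    """Differential entropy of a diagonal-covariance Gaussian:
    H = 0.5 * d * (1 + ln(2*pi)) + 0.5 * sum(ln(var_i)).
    A sharply concentrated component (small variances) has low entropy."""
    d = len(variances)
    return (0.5 * d * (1.0 + math.log(2.0 * math.pi))
            + 0.5 * sum(math.log(v) for v in variances))

def adjust_posteriors(posteriors, entropies, alpha=0.1):
    """Move a fraction `alpha` of a sample's posterior mass away from
    the minimum-entropy (overly concentrated) component toward the
    maximum-entropy one, so that a low-weight Gaussian is less likely
    to be absorbed by a neighbor during the EM updates.
    `alpha` is a hypothetical damping factor, not from the paper."""
    i_max = max(range(len(entropies)), key=entropies.__getitem__)
    i_min = min(range(len(entropies)), key=entropies.__getitem__)
    adjusted = list(posteriors)
    moved = alpha * adjusted[i_min]
    adjusted[i_min] -= moved
    adjusted[i_max] += moved
    total = sum(adjusted)  # renormalize to a valid distribution
    return [p / total for p in adjusted]
```

In a full EM loop this adjustment would sit between the E-step (computing posteriors) and the M-step (re-estimating means, covariances, and weights), which is where a low-entropy component would otherwise start pulling posterior mass away from its neighbors.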
Pages: 17
Related papers
50 records
  • [21] Unsupervised image clustering algorithm based on contrastive learning and K-nearest neighbors
    Zhang, Xiuling
    Wang, Shuo
    Wu, Ziyun
    Tan, Xiaofei
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2022, 13 (09) : 2415 - 2423
  • [22] A Simple and Effective Usage of Self-supervised Contrastive Learning for Text Clustering
    Shi, Haoxiang
    Wang, Cen
    Sakai, Tetsuya
    2021 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2021, : 315 - 320
  • [24] UniTRec: A Unified Text-to-Text Transformer and Joint Contrastive Learning Framework for Text-based Recommendation
    Mao, Zhiming
    Wang, Huimin
    Du, Yiming
    Wong, Kam-Fai
    61ST CONFERENCE OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023, : 1160 - 1170
  • [25] Inference for probabilistic unsupervised text clustering
    Rigouste, Lois
    Cappe, Olivier
    Yvon, Francois
    2005 IEEE/SP 13th Workshop on Statistical Signal Processing (SSP), Vols 1 and 2, 2005, : 351 - 356
  • [26] Kalman contrastive unsupervised representation learning
    Yekta, Mohammad Mahdi Jahani
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [27] Unsupervised Node Clustering via Contrastive Hard Sampling
    Cui, Hang
    Abdelzaher, Tarek
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PT VI, DASFAA 2024, 2024, 14855 : 285 - 300
  • [28] Oscar: Omni-scale robust contrastive learning for Text-VQA
    Yue, Jianyu
    Bi, Xiaojun
    Chen, Zheng
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 255
  • [29] Unsupervised deep clustering via adaptive GMM modeling and optimization
    Wang, Jinghua
    Jiang, Jianmin
    NEUROCOMPUTING, 2021, 433 : 199 - 211
  • [30] Robust multilayer bootstrap networks in ensemble for unsupervised representation learning and clustering
    Zhang, Xiao-Lei
    Li, Xuelong
    PATTERN RECOGNITION, 2024, 156