Joint unsupervised contrastive learning and robust GMM for text clustering

Cited: 6
Authors
Hu, Chenxi [1 ]
Wu, Tao [1 ,2 ]
Liu, Shuaiqi [2 ]
Liu, Chunsheng [1 ]
Ma, Tao [1 ]
Yang, Fang [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Elect Engn, Hefei 230031, Peoples R China
[2] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
Keywords
Text clustering; Contrastive learning; Negative sampling; Gaussian mixture model; Expectation maximization;
DOI
10.1016/j.ipm.2023.103529
Chinese Library Classification (CLC): TP [Automation technology, computer technology];
Discipline classification code: 0812;
Abstract
Text clustering aims to organize a vast collection of documents into meaningful and coherent clusters, thereby facilitating the extraction of valuable insights. While current text clustering frameworks try to minimize the anisotropy of pre-trained language models through contrastive learning of text embeddings, treating in-batch samples as negatives is suboptimal. The K-means algorithm offers a way to sample both hard negatives and false negatives. However, relying solely on a single measure of semantic similarity between distributions and using coarse-grained weighting for negative pairs may limit performance. Furthermore, because text clusters have very similar distributions owing to their rich semantics, the Mahalanobis distance-based Gaussian Mixture Model (GMM) is prone to falling into local optima: a Gaussian component with a smaller weight may gradually merge into another during parameter estimation by the EM algorithm. To tackle these challenges, we propose a model named JourTC: Joint unsupervised contrastive learning and robust GMM for Text Clustering. In the contrastive learning phase, hard negatives, potential false negatives, and their corresponding global similarity-aware weights are determined through posterior probabilities derived from a Robust GMM (RGMM). The RGMM uses the entropy of each individual Gaussian component as a metric and adaptively adjusts the posterior probabilities of samples based on the components with maximum and minimum entropy, diminishing the influence of low-entropy components. Extensive experiments show that JourTC can be seamlessly integrated into existing text clustering frameworks, leading to a notable improvement in accuracy. Our code is publicly available.
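The abstract's core mechanism can be sketched in code: EM for a GMM where each component's posterior responsibilities are reweighted by that component's entropy, so that low-entropy components (those concentrating on few samples, and hence at risk of being absorbed by a dominant neighbor) exert less pull. This is a minimal illustrative sketch, not the paper's algorithm: the function name, the spherical-covariance simplification, and the exact entropy reweighting rule are all assumptions standing in for the RGMM's actual max/min-entropy posterior adjustment.

```python
import numpy as np

def fit_entropy_weighted_gmm(X, k, n_iter=50):
    """EM for a spherical GMM with an illustrative entropy-based
    reweighting of posteriors (hypothetical surrogate for the RGMM)."""
    n, d = X.shape
    # Deterministic farthest-point initialization of the means.
    idx = [0]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - X[idx][None]) ** 2).sum(-1).min(axis=1)
        idx.append(int(d2.argmax()))
    mu = X[idx].astype(float).copy()
    var = np.full(k, X.var() + 1e-6)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibilities under each spherical Gaussian.
        sq = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        log_p = -0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # Entropy of each component's column-normalized responsibility
        # profile: low entropy means the component concentrates on few
        # samples and risks merging into a dominant neighbor.
        w = r / (r.sum(axis=0, keepdims=True) + 1e-12)
        h = -(w * np.log(w + 1e-12)).sum(axis=0)
        # Down-weight low-entropy components, then renormalize rows
        # (illustrative stand-in for the paper's posterior adjustment).
        r *= (h / h.max())[None, :]
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means, and variances.
        nk = r.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r * ((X[:, None, :] - mu[None]) ** 2).sum(-1)).sum(axis=0) / (nk * d) + 1e-6
    return mu, pi, r
```

On well-separated synthetic clusters the entropy weights stay close to uniform and the procedure behaves like plain EM; the reweighting only bites when one component's responsibilities collapse onto a handful of points.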
Pages: 17
Related papers
50 records in total
  • [31] C2L: Causally Contrastive Learning for Robust Text Classification
    Choi, Seungtaek
    Jeong, Myeongho
    Han, Hojae
    Hwang, Seung-won
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 10526 - 10534
  • [32] Contrastive learning with text augmentation for text classification
    Jia, Ouyang
    Huang, Huimin
    Ren, Jiaxin
    Xie, Luodi
    Xiao, Yinyin
    APPLIED INTELLIGENCE, 2023, 53 (16) : 19522 - 19531
  • [33] Contrastive learning with text augmentation for text classification
    Ouyang Jia
    Huimin Huang
    Jiaxin Ren
    Luodi Xie
    Yinyin Xiao
    Applied Intelligence, 2023, 53 : 19522 - 19531
  • [34] Robust image clustering via context-aware contrastive graph learning
    Fang, Uno
    Li, Jianxin
    Lu, Xuequan
    Mian, Ajmal
    Gu, Zhaoquan
    PATTERN RECOGNITION, 2023, 138
  • [35] Pyramid contrastive learning for clustering
    Zhou, Zi-Feng
    Huang, Dong
    Wang, Chang-Dong
    NEURAL NETWORKS, 2025, 185
  • [36] Supporting Clustering with Contrastive Learning
    Zhang, Dejiao
    Nan, Feng
    Wei, Xiaokai
    Li, Shang-Wen
    Zhu, Henghui
    McKeown, Kathleen
    Nallapati, Ramesh
    Arnold, Andrew O.
    Xiang, Bing
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 5419 - 5430
  • [37] Contrastive author-aware text clustering
    Tang, Xudong
    Dong, Chao
    Zhang, Wei
    PATTERN RECOGNITION, 2022, 130
  • [38] Joint subspace learning and subspace clustering based unsupervised feature selection
    Xiao, Zijian
    Chen, Hongmei
    Mi, Yong
    Luo, Chuan
    Horng, Shi-Jinn
    Li, Tianrui
    NEUROCOMPUTING, 2025, 635
  • [39] Robust Federated Learning Based on Metrics Learning and Unsupervised Clustering for Malicious Data Detection
    Li, Jiaming
    Zhang, Xinyue
    Zhao, Liang
    ACMSE 2022: PROCEEDINGS OF THE 2022 ACM SOUTHEAST CONFERENCE, 2022, : 238 - 242
  • [40] Joint contrastive triple-learning for deep multi-view clustering
    Hu, Shizhe
    Zou, Guoliang
    Zhang, Chaoyang
    Lou, Zhengzheng
    Geng, Ruilin
    Ye, Yangdong
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (03)