Topic extraction from extremely short texts with variational manifold regularization

Cited by: 8
Authors
Li, Ximing [1, 2]
Wang, Yang [3, 4]
Ouyang, Jihong [1, 2]
Wang, Meng [3, 4]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Changchun, Peoples R China
[2] Minist Educ, Key Lab Symbol Computat & Knowledge Engn, Changchun, Peoples R China
[3] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei, Peoples R China
[4] Hefei Univ Technol, Intelligent Interconnected Syst Lab Anhui Prov, Hefei, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Topic modeling; Short text; Dirichlet multinomial mixture; Variational manifold regularization; Online inference; NETWORK; MODEL;
DOI
10.1007/s10994-021-05962-3
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the emergence of massive short texts, e.g., social media posts and question titles from Q&A systems, discovering valuable information in them is increasingly important for many real-world content analysis applications. Topic models can effectively explore the hidden structure of documents by assuming latent topics. However, because short texts are sparse, existing topic models such as latent Dirichlet allocation lose effectiveness on them. An effective solution, the Dirichlet multinomial mixture (DMM), assumes that each short text is associated with a single topic, which indirectly enriches document-level word co-occurrences. However, DMM is sensitive to noisy words and often learns inaccurate topic representations at the document level. To address this problem, we extend DMM to a novel Laplacian Dirichlet Multinomial Mixture (LapDMM) topic model for short texts. The basic idea of LapDMM is to preserve the local neighborhood structure of short texts, so that topical signals can spread among neighboring documents and correct inaccurate topic representations. This is achieved by incorporating variational manifold regularization into the variational objective of DMM, constraining nearby short texts to have similar variational topic representations. To find the nearest neighbors of short texts, we construct an offline document graph before model inference, in which the distances between short texts are computed with the word mover's distance. We further develop an online version of LapDMM, namely Online LapDMM, to speed up inference on massive short texts. To this end, we use stochastic optimization with mini-batches and an up-to-date document graph that efficiently finds approximate nearest neighbors instead of exact ones.
To evaluate our models, we compare against the state-of-the-art short text topic models on several traditional tasks, i.e., topic quality, document clustering and classification. The empirical results demonstrate that our models achieve very significant performance gains over the baseline models.
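The abstract does not state the exact form of the regularizer, so the following is only a minimal sketch of a generic graph-Laplacian penalty of the kind it describes: documents linked in a nearest-neighbor graph are penalized when their topic representations differ. The function names `knn_graph` and `manifold_penalty`, the toy distance matrix, and the toy topic vectors are all hypothetical. The distance matrix is assumed to be precomputed (the paper uses the word mover's distance), and the squared Euclidean penalty is an assumption, not necessarily the form used in LapDMM.

```python
import numpy as np

def knn_graph(dist, k):
    """Symmetric 0/1 adjacency linking each document to its k nearest neighbors.

    `dist` is an assumed precomputed (n, n) matrix of pairwise document distances.
    """
    n = dist.shape[0]
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)              # a document is not its own neighbor
    nearest = np.argsort(d, axis=1)[:, :k]   # indices of the k closest documents
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)            # keep an edge if either side chose it

def manifold_penalty(phi, adj):
    """Return 0.5 * sum_ij adj[i, j] * ||phi[i] - phi[j]||^2 via the graph Laplacian."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    return float(np.trace(phi.T @ laplacian @ phi))

# Toy example: documents 0 and 1 are close, documents 2 and 3 are close.
dist = np.array([[0, 1, 5, 6],
                 [1, 0, 5, 6],
                 [5, 5, 0, 1],
                 [6, 6, 1, 0]])
adj = knn_graph(dist, k=1)

# Rows are per-document topic distributions over two topics.
smooth = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
noisy = np.array([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.1, 0.9]])

print(manifold_penalty(smooth, adj))  # neighbors agree, so the penalty vanishes
print(manifold_penalty(noisy, adj))   # document 1 disagrees with its neighbor
```

Adding such a penalty to a model's objective pushes a document whose topic representation was distorted by noisy words toward the representations of its neighbors, which is the intuition the abstract gives for spreading topical signals.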
Pages: 1029-1066
Number of pages: 38
Related papers
50 in total
  • [1] Topic extraction from extremely short texts with variational manifold regularization
    Ximing Li
    Yang Wang
    Jihong Ouyang
    Meng Wang
    Machine Learning, 2021, 110 : 1029 - 1066
  • [2] Dirichlet Multinomial Mixture with Variational Manifold Regularization: Topic Modeling over Short Texts
    Li, Ximing
    Zhang, Jiaojiao
    Ouyang, Jihong
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 7884 - 7891
  • [3] Topic extraction by clustering word embeddings on short online texts
    Nabergoj, David
    D’Alconzo, Alessandro
    Valerio, Danilo
    Štrumbelj, Erik
    Elektrotehniski Vestnik/Electrotechnical Review, 2022, 89 (1-2): : 64 - 72
  • [4] Topic extraction by clustering word embeddings on short online texts
    Nabergoj, David
    D'Alconzo, Alessandro
    Valerio, Danilo
    Strumbelj, Erik
    ELEKTROTEHNISKI VESTNIK, 2022, 89 (1-2): : 64 - 72
  • [5] Hierarchical neural topic modeling with manifold regularization
    Ziye Chen
    Cheng Ding
    Yanghui Rao
    Haoran Xie
    Xiaohui Tao
    Gary Cheng
    Fu Lee Wang
    World Wide Web, 2021, 24 : 2139 - 2160
  • [6] Hierarchical neural topic modeling with manifold regularization
    Chen, Ziye
    Ding, Cheng
    Rao, Yanghui
    Xie, Haoran
    Tao, Xiaohui
    Cheng, Gary
    Wang, Fu Lee
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2021, 24 (06): : 2139 - 2160
  • [7] A Neural Topic Model Based on Variational Auto-Encoder for Aspect Extraction from Opinion Texts
    Cui, Peng
    Liu, Yuanchao
    Liu, Bingquan
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING (NLPCC 2019), PT I, 2019, 11838 : 660 - 671
  • [8] Mass of short texts clustering and topic extraction based on frequent itemsets
    Peng, Min
    Huang, Jiajia
    Zhu, Jiahui
    Huang, Jimin
    Liu, Jiping
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2015, 52 (09): : 1941 - 1953
  • [9] Topic segmentation for short texts
    Chang, TH
    Lee, CH
    PACLIC 17: LANGUAGE, INFORMATION AND COMPUTATION, PROCEEDINGS, 2003, : 159 - 165
  • [10] Online Topic Modeling for Short Texts
    Roy, Suman
    Malladi, Vijay Varma
    Sengupta, Ayan
    Das, Souparna
    SERVICE-ORIENTED COMPUTING (ICSOC 2020), 2020, 12571 : 563 - 579