Topic extraction from extremely short texts with variational manifold regularization

Cited by: 8
Authors
Li, Ximing [1 ,2 ]
Wang, Yang [3 ,4 ]
Ouyang, Jihong [1 ,2 ]
Wang, Meng [3 ,4 ]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Changchun, Peoples R China
[2] Minist Educ, Key Lab Symbol Computat & Knowledge Engn, Changchun, Peoples R China
[3] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei, Peoples R China
[4] Hefei Univ Technol, Intelligent Interconnected Syst Lab Anhui Prov, Hefei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Topic modeling; Short text; Dirichlet multinomial mixture; Variational manifold regularization; Online inference; NETWORK; MODEL;
DOI
10.1007/s10994-021-05962-3
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the emergence of massive volumes of short texts, e.g., social media posts and question titles from Q&A systems, discovering valuable information from them has become increasingly important for many real-world applications of content analysis. The family of topic models can effectively explore the hidden structure of documents through assumptions about latent topics. However, due to the sparseness of short texts, existing topic models, e.g., latent Dirichlet allocation, lose effectiveness on them. An effective solution, namely the Dirichlet multinomial mixture (DMM), supposes that each short text is associated with only a single topic, which indirectly enriches document-level word co-occurrences. However, DMM is sensitive to noisy words, and it often learns inaccurate topic representations at the document level. To address this problem, we extend DMM to a novel Laplacian Dirichlet multinomial mixture (LapDMM) topic model for short texts. The basic idea of LapDMM is to preserve the local neighborhood structure of short texts, enabling topical signals to spread among neighboring documents so as to correct inaccurate topic representations. This is achieved by incorporating variational manifold regularization into the variational objective of DMM, constraining close short texts to have similar variational topic representations. To find the nearest neighbors of short texts, we construct an offline document graph before model inference, where the distances between short texts are computed by the word mover's distance. We further develop an online version of LapDMM, namely Online LapDMM, to achieve inference speedup on massive short texts. To this end, we exploit the spirit of stochastic optimization with mini-batches and an up-to-date document graph that efficiently finds approximate nearest neighbors instead.
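The manifold-regularization idea described above can be sketched in a few lines: given each document's variational topic distribution and a neighbor graph (e.g., k-nearest neighbors under the word mover's distance), the penalty sums the discrepancies between the topic representations of neighboring documents. This is an illustrative sketch, not the paper's exact formulation; the choice of squared Euclidean distance between topic distributions and the function name `manifold_penalty` are assumptions for exposition.

```python
import numpy as np

def manifold_penalty(phi, W):
    """Illustrative variational manifold regularization penalty.

    phi : (D, K) array; row d is the variational topic distribution
          of short text d (rows sum to 1).
    W   : (D, D) symmetric 0/1 adjacency matrix of the document graph,
          e.g., k-nearest neighbors under the word mover's distance.

    Returns the sum over neighboring pairs of the squared Euclidean
    distance between their topic distributions. Adding lambda * penalty
    to the variational objective pushes neighboring short texts toward
    similar topic representations.
    """
    D = phi.shape[0]
    penalty = 0.0
    for i in range(D):
        for j in range(i + 1, D):  # each undirected edge counted once
            if W[i, j]:
                diff = phi[i] - phi[j]
                penalty += float(diff @ diff)
    return penalty
```

In the online variant described above, the same penalty would be evaluated only over a mini-batch and an up-to-date approximate-neighbor graph rather than the full offline graph.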
To evaluate our models, we compare against state-of-the-art short text topic models on several traditional tasks, namely topic quality, document clustering, and classification. The empirical results demonstrate that our models achieve significant performance gains over the baseline models.
Pages: 1029-1066
Page count: 38
Related papers
50 records in total
  • [31] Modeling Topic Evolution in Social Media Short Texts
    Zhang, Yuhao
    Mao, Wenji
    Lin, Junjie
    2017 IEEE INTERNATIONAL CONFERENCE ON BIG KNOWLEDGE (IEEE ICBK 2017), 2017, : 315 - 319
  • [32] Additive Regularization for Topic Modeling in Sociological Studies of User-Generated Texts
    Apishev, Murat
    Koltcov, Sergei
    Koltsova, Olessia
    Nikolenko, Sergey
    Vorontsov, Konstantin
    ADVANCES IN COMPUTATIONAL INTELLIGENCE, MICAI 2016, PT I, 2017, 10061 : 169 - 184
  • [33] Semi-supervised Max-margin Topic Model with Manifold Posterior Regularization
    Hu, Wenbo
    Zhu, Jun
    Su, Hang
    Zhuo, Jingwei
    Zhang, Bo
    PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 1865 - 1871
  • [34] An Intention-Topic Model Based on Verbs Clustering and Short Texts Topic Mining
    Lu, Tingting
    Hou, Shifeng
    Chen, Zhenxiang
    Cui, Lizhen
    Zhang, Lei
    CIT/IUCC/DASC/PICOM 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY - UBIQUITOUS COMPUTING AND COMMUNICATIONS - DEPENDABLE, AUTONOMIC AND SECURE COMPUTING - PERVASIVE INTELLIGENCE AND COMPUTING, 2015, : 837 - 842
  • [35] Topic Discovery from Heterogeneous Texts
    Qiang, Jipeng
    Chen, Ping
    Ding, Wei
    Wang, Tong
    Xie, Fei
    Wu, Xindong
    2016 IEEE 28TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2016), 2016, : 196 - 203
  • [36] Topic Modeling over Short Texts by Incorporating Word Embeddings
    Qiang, Jipeng
    Chen, Ping
    Wang, Tong
    Wu, Xindong
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2017, PT II, 2017, 10235 : 363 - 374
  • [37] A Multilevel Clustering Model for Coherent Topic Discovery in Short Texts
    Maithya, Emmanuel Muthoka
    Nderu, Lawrence
    Njagi, Dennis
    2022 IST-AFRICA CONFERENCE, 2022,
  • [38] Robust Word-Network Topic Model for Short Texts
    Wang, Fei
    Liu, Rui
    Zuo, Yuan
    Zhang, Hui
    Zhang, He
    Wu, Junjie
    2016 IEEE 28TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2016), 2016, : 852 - 856
  • [39] Incorporating Biterm Correlation Knowledge into Topic Modeling for Short Texts
    Zhang, Kai
    Zhou, Yuan
    Chen, Zheng
    Liu, Yufei
    Tang, Zhuo
    Yin, Li
    Chen, Jihong
    COMPUTER JOURNAL, 2022, 65 (03): : 537 - 553
  • [40] A New Sentiment and Topic Model for Short Texts on Social Media
    Xu, Kang
    Huang, Junheng
    Qi, Guilin
    SEMANTIC TECHNOLOGY, JIST 2017, 2017, 10675 : 183 - 198