Scalable Training of Hierarchical Topic Models

Cited by: 11
Authors
Chen, Jianfei [1 ]
Zhu, Jun [1 ]
Lu, Jie [2 ]
Liu, Shixia [2 ]
Affiliations
[1] Tsinghua Univ, BNRist Ctr, State Key Lab Intell Tech & Sys, Dept Comp Sci & Tech, Beijing 100084, Peoples R China
[2] Tsinghua Univ, BNRist Ctr, State Key Lab Intell Tech & Sys, Sch Software, Beijing 100084, Peoples R China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2018, Vol. 11, No. 7
Funding
Beijing Natural Science Foundation;
Keywords
DIRICHLET; INFERENCE;
DOI
10.14778/3192965.3192972
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Large-scale topic models serve as basic tools for feature extraction and dimensionality reduction in many practical applications. As a natural extension of flat topic models, hierarchical topic models (HTMs) are able to learn topics at different levels of abstraction, leading to deeper understanding and better generalization than their flat counterparts. However, existing scalable systems for flat topic models cannot handle HTMs, due to their complicated data structures, such as trees and concurrently growing dynamic matrices, as well as their susceptibility to local optima. In this paper, we study the hierarchical latent Dirichlet allocation (hLDA) model, a powerful nonparametric Bayesian HTM. We propose an efficient partially collapsed Gibbs sampling algorithm for hLDA, along with an initialization strategy that copes with the local optima introduced by tree-structured models. We also identify new system challenges in building scalable systems for HTMs, and propose an efficient data layout for vectorizing HTM training as well as distributed data structures, including dynamic matrices and trees. Empirical studies show that our system is 87 times more efficient than the previous open-source implementation of hLDA and scales to thousands of CPU cores. We demonstrate this scalability on a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than previously used corpora. Our distributed implementation can extract 1,722 topics from this corpus on 50 machines in just 7 hours.
Pages: 826-839
Number of pages: 14
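
For readers unfamiliar with hLDA, here is a minimal, illustrative sketch of the per-word step inside a Gibbs sweep over one document. In hLDA, every document is assigned a root-to-leaf path through the topic tree, and each word then chooses one of the L topics on that path, much as in flat LDA restricted to those L topics; the paper's partially collapsed sampler additionally resamples the paths and maintains the counts in distributed, dynamically growing matrices. All names and signatures below are assumptions made for this sketch, not the paper's API, and a symmetric Dirichlet prior over levels stands in for the stick-breaking (GEM) prior that hLDA actually uses.

```python
import numpy as np

# Illustrative sketch only: resample the level (tree depth) assignment of
# every word in one document of an hLDA-style model, given the document's
# fixed root-to-leaf path. Counts are updated in place, as in collapsed
# Gibbs sampling for flat LDA.
def sample_levels(words, z, path, n_dl, n_tw, n_t, alpha, beta, V, rng):
    """words : word ids of the document
    z     : current level assignment of each word, values in 0..L-1
    path  : topic ids on the document's root-to-leaf path (length L)
    n_dl  : (L,)   per-level counts for this document
    n_tw  : (T, V) topic-word counts over all T topics in the tree
    n_t   : (T,)   per-topic total counts
    """
    path = np.asarray(path)
    L = len(path)
    for i, w in enumerate(words):
        t_old = path[z[i]]
        # Remove the word's current assignment from all counts.
        n_dl[z[i]] -= 1
        n_tw[t_old, w] -= 1
        n_t[t_old] -= 1
        # p(level = l | rest) ∝ (n_dl[l] + alpha)
        #     * (n_tw[path[l], w] + beta) / (n_t[path[l]] + V * beta)
        p = (n_dl + alpha) * (n_tw[path, w] + beta) / (n_t[path] + V * beta)
        z[i] = rng.choice(L, p=p / p.sum())
        # Add the new assignment back.
        t_new = path[z[i]]
        n_dl[z[i]] += 1
        n_tw[t_new, w] += 1
        n_t[t_new] += 1
```

A full sampler would alternate such level sweeps with resampling each document's path under the nested Chinese restaurant process prior, which is where the tree grows and shrinks dynamically and where the system challenges described in the abstract arise.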