A Scalable Asynchronous Distributed Algorithm for Topic Modeling

Cited by: 32
Authors
Yu, Hsiang-Fu [1]
Hsieh, Cho-Jui [1]
Yun, Hyokun [2]
Vishwanathan, S. V. N. [3]
Dhillon, Inderjit S. [1]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Amazon, Seattle, WA USA
[3] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
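Funding
U.S. National Science Foundation
Keywords
Topic Models; Scalability; Sampling
DOI
10.1145/2736277.2741682
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract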
Learning meaningful topic models from massive document collections, which contain millions of documents and billions of tokens, is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper, we present a novel algorithm, F+Nomad LDA, which tackles both of these problems simultaneously. To handle a large number of topics, we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over T items in O(log T) time. Moreover, when topic counts change, the data structure can be updated in O(log T) time. To distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of [25]. We show that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems that involve millions of documents, billions of words, and thousands of topics.
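The O(log T) sampling and update costs mentioned in the abstract can be illustrated with a small sum tree. The following Python code is a minimal sketch, not the authors' implementation: the class name `SumTree` and its methods are hypothetical, and it assumes the weights are non-negative and stored at the leaves of a complete binary tree whose internal nodes hold the sums of their children.

```python
import random


class SumTree:
    """Hypothetical sketch: complete binary tree over T non-negative weights.

    Leaves hold the weights; each internal node holds the sum of its two
    children. The tree is stored in a flat list with the root at index 1.
    """

    def __init__(self, weights):
        self.T = len(weights)
        size = 1
        while size < self.T:      # round the leaf count up to a power of two
            size *= 2
        self.size = size
        self.tree = [0.0] * (2 * size)
        for t, w in enumerate(weights):
            self.tree[size + t] = float(w)
        for i in range(size - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def total(self):
        """Sum of all weights (the normalizing constant)."""
        return self.tree[1]

    def update(self, t, delta):
        """Add delta to weight t and to every ancestor sum: O(log T)."""
        i = self.size + t
        while i >= 1:
            self.tree[i] += delta
            i //= 2

    def find(self, u):
        """Return the leaf whose cumulative-weight interval contains u: O(log T)."""
        i = 1
        while i < self.size:
            i *= 2                      # move to the left child
            if u >= self.tree[i]:       # u lies beyond the left subtree
                u -= self.tree[i]
                i += 1                  # move to the right child instead
        return i - self.size

    def sample(self, rng=random):
        """Draw index t with probability weight[t] / total()."""
        return self.find(rng.random() * self.total())
```

A caller would build the tree once from the per-topic weights, call `sample()` to draw a topic, and call `update(t, delta)` whenever a count changes, so that neither step requires a full O(T) pass over the topics.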
Pages: 1340-1350
Number of pages: 11