A Scalable Asynchronous Distributed Algorithm for Topic Modeling

Cited by: 32
Authors
Yu, Hsiang-Fu [1]
Hsieh, Cho-Jui [1]
Yun, Hyokun [2]
Vishwanathan, S. V. N. [3]
Dhillon, Inderjit S. [1]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Amazon, Seattle, WA USA
[3] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
Funding
National Science Foundation (USA);
Keywords
Topic Models; Scalability; Sampling;
DOI
10.1145/2736277.2741682
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Learning meaningful topic models from massive document collections that contain millions of documents and billions of tokens is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper, we present a novel algorithm, F+Nomad LDA, which simultaneously tackles both of these problems. To handle a large number of topics, we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over T items in O(log T) time. Moreover, when topic counts change, the data structure can be updated in O(log T) time. To distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of [25]. We show that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems that involve millions of documents, billions of words, and thousands of topics.
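The O(log T) sampling and update costs mentioned in the abstract can be illustrated with a minimal sketch. It uses a plain Fenwick (binary indexed) tree, not the modified variant the paper describes, and the class and method names (FenwickSampler, update, total, find, sample) are hypothetical, chosen only for this illustration.

```python
import random


class FenwickSampler:
    """Plain Fenwick tree over T non-negative weights (illustrative only)."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)  # 1-based internal array
        for i, w in enumerate(weights):
            self.update(i, w)
        # Largest power of two not exceeding n, used by the descent in find().
        self.top = 1
        while self.top * 2 <= self.n:
            self.top *= 2

    def update(self, i, delta):
        """Add delta to the weight of item i (0-based) in O(log T)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def total(self):
        """Sum of all weights, computed in O(log T)."""
        s, i = 0.0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def find(self, u):
        """Return the 0-based item whose cumulative interval contains u."""
        pos, step = 0, self.top
        while step > 0:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= u:
                pos = nxt
                u -= self.tree[nxt]
            step //= 2
        return min(pos, self.n - 1)  # guard against u == total from rounding

    def sample(self, rng=random):
        """Draw an item with probability proportional to its weight."""
        return self.find(rng.random() * self.total())
```

Under these assumptions, a sampler built from the weights [1, 2, 3, 4] draws item 3 with probability 4/10, and a call such as update(0, 10) changes the distribution without rebuilding the tree.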
Pages: 1340-1350
Page count: 11
Related papers
50 in total
  • [11] Global asynchronous distributed interactive genetic algorithm
    Miki, Mitsunori
    Yamamoto, Yuki
    Wake, Sanae
    Hiroyasu, Tomoyuki
    2006 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-6, PROCEEDINGS, 2006, : 3481 - +
  • [12] ON THE RATE OF CONVERGENCE OF A DISTRIBUTED ASYNCHRONOUS ROUTING ALGORITHM
    LUO, ZQ
    TSENG, P
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 1994, 39 (05) : 1123 - 1129
  • [13] Topic encapsulation for distributed tutoring and user modeling
    Murray, T
    ARTIFICIAL INTELLIGENCE IN EDUCATION: KNOWLEDGE AND MEDIA IN LEARNING SYSTEMS, 1997, 39 : 631 - 633
  • [14] Efficient Distributed Topic Modeling with Provable Guarantees
    Ding, Weicong
    Rohban, Mohammad H.
    Ishwar, Prakash
    Saligrama, Venkatesh
    ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 33, 2014, 33 : 167 - 175
  • [15] PTRebeca: Modeling and analysis of distributed and asynchronous systems
    Jafari, Ali
    Khamespanah, Ehsan
    Sirjani, Marjan
    Hermanns, Holger
    Cimini, Matteo
    SCIENCE OF COMPUTER PROGRAMMING, 2016, 128 : 22 - 50
  • [16] Asynchronous distributed calibration for scalable and reconfigurable multi-projector displays
    Hasker, Ezekiel S.
    Sinha, Pinaki
    Majumder, Aditi
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2006, 12 (05) : 1101 - 1108
  • [17] Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates
    Stich, Sebastian U.
    Mohtashami, Amirkeivan
    Jaggi, Martin
    24TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS (AISTATS), 2021, 130
  • [18] A Distributed Non-elitist Evolutionary Scalable Asynchronous Rebuilding Algorithm for Solving Pickup and Delivery Problems with Time Windows
    Khoo, Thau Soon
    Bonad, Mohammad Babrdel
    2021 INTERNATIONAL CONFERENCE ON DECISION AID SCIENCES AND APPLICATION (DASA), 2021,
  • [19] A Recommended Replacement Algorithm for the Scalable Asynchronous Cache Consistency Scheme
    Haraty, Ramzi A.
    Nahas, Lama Hasan
    IT CONVERGENCE AND SECURITY 2017, VOL 1, 2018, 449 : 88 - 96
  • [20] Scalable scheduling algorithm for distributed memory machines
    Darbha, S
    Agrawal, DP
    EIGHTH IEEE SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING, PROCEEDINGS, 1996, : 84 - 91