A Scalable Asynchronous Distributed Algorithm for Topic Modeling

Cited by: 32
Authors
Yu, Hsiang-Fu [1]
Hsieh, Cho-Jui [1]
Yun, Hyokun [2]
Vishwanathan, S. V. N. [3]
Dhillon, Inderjit S. [1]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Amazon, Seattle, WA USA
[3] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
Funding
National Science Foundation (USA);
Keywords
Topic Models; Scalability; Sampling;
DOI
10.1145/2736277.2741682
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Learning meaningful topic models from massive document collections that contain millions of documents and billions of tokens is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper, we present a novel algorithm, F+Nomad LDA, which tackles both problems simultaneously. To handle a large number of topics, we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over T items in O(log T) time. Moreover, when topic counts change, the data structure can be updated in O(log T) time. To distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of [25]. We show that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems that involve millions of documents, billions of words, and thousands of topics.
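The O(log T) sampling and update costs mentioned in the abstract can be illustrated with a small sum tree. The following is a minimal sketch, not the authors' implementation: it assumes a complete binary tree stored in an array, with the T unnormalized weights at the leaves and each internal node holding the sum of its two children. The class name `SumTree` and its methods are hypothetical names chosen for this example.

```python
import random


class SumTree:
    """Hypothetical sketch: complete binary tree whose internal nodes store subtree sums."""

    def __init__(self, weights):
        self.n = len(weights)
        # Pad the leaf count to a power of two so all leaves sit at the same depth.
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)
        for i, w in enumerate(weights):
            self.tree[self.size + i] = float(w)
        # Fill internal nodes bottom-up; node i has children 2i and 2i + 1.
        for i in range(self.size - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def total(self):
        """Sum of all weights (the normalization constant)."""
        return self.tree[1]

    def update(self, index, delta):
        """Add delta to one weight and to every ancestor sum: O(log T)."""
        i = self.size + index
        while i >= 1:
            self.tree[i] += delta
            i //= 2

    def sample(self, u=None):
        """Return an index drawn with probability proportional to its weight: O(log T).

        u is a point in [0, total()); it is drawn at random when omitted.
        """
        if u is None:
            u = random.random() * self.total()
        i = 1
        while i < self.size:
            left = 2 * i
            if u < self.tree[left]:
                i = left              # the target lies in the left subtree
            else:
                u -= self.tree[left]  # skip the mass of the left subtree
                i = left + 1
        # Guard against rounding pushing u into a zero-weight padding leaf.
        return min(i - self.size, self.n - 1)
```

For example, with weights `[1, 2, 3, 4]` a call to `sample()` returns index 3 with probability 4/10, and `update(0, 5)` raises the first weight to 6 by touching only the leaf and its ancestors. In a topic-model sampler the weights would be the per-topic unnormalized probabilities, which is an assumption about usage rather than something the abstract spells out.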
Pages: 1340-1350
Page count: 11