A Scalable Asynchronous Distributed Algorithm for Topic Modeling

Cited by: 32
Authors
Yu, Hsiang-Fu [1]
Hsieh, Cho-Jui [1]
Yun, Hyokun [2]
Vishwanathan, S. V. N. [3]
Dhillon, Inderjit S. [1]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Amazon, Seattle, WA USA
[3] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
Funding
US National Science Foundation
Keywords
Topic Models; Scalability; Sampling
DOI
10.1145/2736277.2741682
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Learning meaningful topic models from massive document collections, which contain millions of documents and billions of tokens, is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper, we present a novel algorithm, F+Nomad LDA, which tackles both problems simultaneously. To handle a large number of topics, we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over T items in O(log T) time. Moreover, when topic counts change, the data structure can be updated in O(log T) time. To distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of [25]. We show that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems which involve millions of documents, billions of words, and thousands of topics.
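The O(log T) sampling and update claims in the abstract can be illustrated with a minimal Python sketch. It assumes a standard Fenwick (binary indexed) tree over unnormalized topic weights; the paper's "appropriately modified" variant may differ in layout and detail. The class name `FenwickSampler` and its method names are hypothetical and chosen only for this illustration.

```python
import random


class FenwickSampler:
    """Standard Fenwick tree over T non-negative weights (illustrative only)."""

    def __init__(self, weights):
        if not weights:
            raise ValueError("weights must be non-empty")
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)  # 1-based internal indexing
        # Linear-time build: add each node's partial sum to its parent node.
        for i, w in enumerate(weights, start=1):
            self.tree[i] += w
            parent = i + (i & -i)
            if parent <= self.n:
                self.tree[parent] += self.tree[i]
        # Largest power of two not exceeding n; starting step for the descent.
        self.top = 1 << (self.n.bit_length() - 1)

    def update(self, index, delta):
        """Add delta to the weight at 0-based index in O(log T)."""
        i = index + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def total(self):
        """Sum of all weights, computed in O(log T)."""
        s, i = 0.0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def find(self, u):
        """Return the 0-based index k with prefix(k) <= u < prefix(k + 1)."""
        pos, step = 0, self.top
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= u:
                pos = nxt
                u -= self.tree[nxt]
            step >>= 1
        return min(pos, self.n - 1)  # guard against u == total from rounding

    def sample(self, rng=random):
        """Draw an index with probability proportional to its weight."""
        return self.find(rng.random() * self.total())


# Usage: four topics with unnormalized weights 1, 2, 3, 4.
sampler = FenwickSampler([1, 2, 3, 4])
topic = sampler.sample(random.Random(0))
sampler.update(topic, 1.0)  # e.g. a topic count increased by one
```

Both `find` and `update` touch at most about log2(T) tree nodes, which is where the logarithmic cost comes from; a plain cumulative-sum array would give the same O(log T) search but would need O(T) work per weight change.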
Pages: 1340-1350
Page count: 11
Related papers
50 in total
  • [41] Enabling Efficient and Scalable Service Search in IoT With Topic Modeling: An Evaluation
    Razzaque, Mohammad Abdur
    IEEE ACCESS, 2021, 9 : 53452 - 53465
  • [42] Scalable Distributed Diagnosis Algorithm for Wireless Sensor Networks
    Mahapatro, Arunanshu
    Khilar, Pabitra Mohan
    ADVANCES IN COMPUTING, COMMUNICATION AND CONTROL, 2011, 125 : 400 - 405
  • [43] DisTenC: A Distributed Algorithm for Scalable Tensor Completion on Spark
    Ge, Hancheng
    Zhang, Kai
    Alfifi, Majid
    Hu, Xia
    Caverlee, James
    2018 IEEE 34TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2018, : 137 - 148
  • [44] An efficient and scalable checkpointing and recovery algorithm for distributed systems
    Kumar, K. P. Krishna
    Hansdah, R. C.
    DISTRIBUTED COMPUTING AND NETWORKING, PROCEEDINGS, 2006, 4308 : 94 - 99
  • [45] Toward topic diversity in recommender systems: integrating topic modeling with a hashing algorithm
    Yang, Donghui
    Wang, Yan
    Shi, Zhaoyang
    Wang, Huimin
    ASLIB JOURNAL OF INFORMATION MANAGEMENT, 2025, 77 (01) : 47 - 69
  • [46] A scalable, distributed algorithm for allocating workers in embedded systems
    Agassounon, W
    Martinoli, A
    Goodman, R
    2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5: E-SYSTEMS AND E-MAN FOR CYBERNETICS IN CYBERSPACE, 2002, : 3367 - 3373
  • [47] pTrans: A Scalable Algorithm for Reservation Guarantees in Distributed Systems
    Peng, Yuhan
    Varman, Peter
    PROCEEDINGS OF THE 32ND ACM SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES (SPAA '20), 2020, : 441 - 452
  • [48] Scalable Distributed-FDPS Algorithm for QoS Provisioning
    Siew, Chee Kheong
    Peng, Shuai
    Luo, Wuqiong
    Tang, Peng
    Mo, Yanting
    PROCEEDINGS OF THE 8TH INTERNATIONAL NETWORK CONFERENCE (INC 2010), 2010, : 31 - 40
  • [49] Interactive Topic Modeling for Exploring Asynchronous Online Conversations: Design and Evaluation of ConVisIT
    Hoque, Enamul
    Carenini, Giuseppe
    ACM TRANSACTIONS ON INTERACTIVE INTELLIGENT SYSTEMS, 2016, 6 (01)
  • [50] An Interpolatory Algorithm for Distributed Set Membership Estimation in Asynchronous Networks
    Farina, Francesco
    Garulli, Andrea
    Giannitrapani, Antonio
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2022, 67 (10) : 5464 - 5470