Parallel Spectral Clustering in Distributed Systems

Cited by: 385
Authors
Chen, Wen-Yen [1]
Song, Yangqiu [2]
Bai, Hongjie [3]
Lin, Chih-Jen [4]
Chang, Edward Y. [5]
Affiliations
[1] Yahoo Inc, Sunnyvale, CA 94089 USA
[2] Microsoft Res Asia, Beijing 100193, Peoples R China
[3] Google Informat Technol China Co Ltd, Beijing 100084, Peoples R China
[4] Natl Taiwan Univ, Dept Comp Sci, Taipei 106, Taiwan
[5] Google Res, Palo Alto, CA 94306 USA
Funding
U.S. National Science Foundation
Keywords
Parallel spectral clustering; distributed computing; normalized cuts; nearest neighbors; Nystrom approximation; CUTS;
DOI
10.1109/TPAMI.2010.88
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nystrom method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863, we show that our parallel algorithm can effectively handle large problems.
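The sparsification strategy named in the abstract (retaining only nearest neighbors in the similarity matrix) can be illustrated with a small single-machine sketch. This is not the authors' parallel implementation: the function names `knn_sparse_similarity` and `two_way_spectral_cut`, the Gaussian similarity with bandwidth `sigma`, the neighbor count `t`, and the use of power iteration for a two-cluster split are all assumptions chosen to keep the example short and dependency-free.

```python
import math

def knn_sparse_similarity(points, t, sigma=1.0):
    """Gaussian similarities, keeping only each point's t nearest neighbors.

    Returns a list of dicts (row -> {column: weight}), i.e. a sparse matrix.
    """
    n = len(points)
    S = [dict() for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for dist2, j in dists[:t]:
            w = math.exp(-dist2 / (2.0 * sigma ** 2))
            S[i][j] = w
            S[j][i] = w  # symmetrize: keep an edge if either endpoint selects it
    return S

def two_way_spectral_cut(S, iters=200):
    """Split the graph in two using the second eigenvector of D^-1/2 S D^-1/2.

    Simplification (assumption): power iteration on (I + M) / 2 with the
    trivial top eigenvector deflated, then a sign threshold instead of k-means.
    """
    n = len(S)
    deg = [sum(row.values()) for row in S]
    top = [math.sqrt(d) for d in deg]          # trivial eigenvector D^1/2 * 1
    norm = math.sqrt(sum(x * x for x in top))
    top = [x / norm for x in top]
    v = [float(i) for i in range(n)]           # deterministic start vector
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]        # remove trivial component
        Mv = [
            sum(w * v[j] / math.sqrt(deg[i] * deg[j]) for j, w in S[i].items())
            for i in range(n)
        ]
        v = [0.5 * (a + b) for a, b in zip(v, Mv)]        # apply (I + M) / 2
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return [1 if x > 0 else 0 for x in v]

# Two well-separated toy groups of four 2-D points each (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
S = knn_sparse_similarity(points, t=2)
labels = two_way_spectral_cut(S)
```

Storing only `t` neighbors per row keeps memory roughly proportional to `n * t` rather than `n * n`, which is the motivation for sparsifying before any eigenvector computation; a practical version would replace the brute-force neighbor search and the power iteration with distributed counterparts.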
Pages: 568-586
Page count: 19