Local-Density Subspace Distributed Clustering for High-Dimensional Data

Cited by: 7
Authors
Geng, Yangli-ao [1]
Li, Qingyong [1]
Liang, Mingfei [2]
Chi, Chong-Yung [3]
Tan, Juan [4]
Huang, Heng [5, 6]
Affiliations
[1] Beijing Jiaotong Univ, Beijing Key Lab Transportat Data Anal & Min, Beijing 100044, Peoples R China
[2] Tencent Co Ltd, WeiXin Grp, Beijing 100044, Peoples R China
[3] Natl Tsing Hua Univ, Inst Commun Engn, Hsinchu 30013, Taiwan
[4] Beijing Technol & Business Univ, Dept Business Adm, Beijing 100048, Peoples R China
[5] Univ Pittsburgh, Dept Elect & Comp Engn, Pittsburgh, PA 15260 USA
[6] JD Finance Amer Corp, Mountain View, CA USA
Funding
Beijing Natural Science Foundation;
Keywords
Clustering algorithms; Distributed databases; Principal component analysis; Data models; Clustering methods; Big Data; Kernel; High-dimensional clustering; distributed clustering; density-based clustering; subspace Gaussian model; ALGORITHM;
DOI
10.1109/TPDS.2020.2975550
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Distributed clustering is emerging along with the advent of the era of big data. However, most established distributed clustering methods focus on problems caused by a large amount of data rather than by the high dimensionality of data. Consequently, they suffer from the "curse" of dimensionality (e.g., poor performance and heavy network overhead) when high-dimensional (HD) data are clustered. In this article, we propose a distributed algorithm, referred to as the Local Density Subspace Distributed Clustering (LDSDC) algorithm, to cluster large-scale HD data, motivated by the idea that a local dense region of an HD dataset is usually distributed in a low-dimensional (LD) subspace. LDSDC follows a local-global-local processing structure: grouping of local dense regions (atom clusters) followed by subspace Gaussian model (SGM) fitting (flexible and scalable to data dimension) at each sub-site, merging of atom clusters at the global site, and relabeling of atom clusters at every sub-site according to the merging result broadcast from the global site. Moreover, we propose a fast method to estimate the parameters of SGM for HD data, together with its convergence proof. We evaluate LDSDC on both synthetic and real datasets and compare it with four state-of-the-art methods. The experimental results demonstrate that the proposed LDSDC yields the best overall performance.
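The local-global-local structure described in the abstract can be illustrated with a toy sketch. This is not the LDSDC algorithm: the greedy radius-based grouping, the centroid-and-count summaries (a crude stand-in for SGM fitting), the distance-based merge rule, and all function and parameter names (`local_atoms`, `global_merge`, `relabel`, `radius`, `merge_dist`) are assumptions made purely for illustration.

```python
def euclid(p, q):
    """Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

def local_atoms(points, radius):
    """Local step (assumed rule): a point joins the first atom cluster whose
    running centroid lies within `radius`, otherwise it starts a new one."""
    sums, counts, labels = [], [], []
    for p in points:
        for k in range(len(sums)):
            centroid = tuple(s / counts[k] for s in sums[k])
            if euclid(p, centroid) <= radius:
                sums[k] = tuple(s + x for s, x in zip(sums[k], p))
                counts[k] += 1
                labels.append(k)
                break
        else:
            sums.append(tuple(p))
            counts.append(1)
            labels.append(len(sums) - 1)
    # Only these small summaries are sent to the global site, not raw points.
    summaries = [(tuple(s / n for s in sm), n) for sm, n in zip(sums, counts)]
    return labels, summaries

def global_merge(site_summaries, merge_dist):
    """Global step (assumed rule): union-find merge of atom clusters whose
    centroids are within `merge_dist`; returns {(site, atom): global label}."""
    keys = [(s, k) for s, summ in enumerate(site_summaries) for k in range(len(summ))]
    parent = {key: key for key in keys}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            mean_a = site_summaries[a[0]][a[1]][0]
            mean_b = site_summaries[b[0]][b[1]][0]
            if euclid(mean_a, mean_b) <= merge_dist:
                parent[find(a)] = find(b)
    roots = {}
    return {key: roots.setdefault(find(key), len(roots)) for key in keys}

def relabel(site, local_labels, mapping):
    """Final local step: apply the broadcast merging result at one sub-site."""
    return [mapping[(site, k)] for k in local_labels]

# Two hypothetical sub-sites, each holding part of the same two dense regions.
site_a = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]
site_b = [(0.1, 0.2), (5.1, 4.9), (5.2, 5.1)]
results = [local_atoms(pts, radius=1.0) for pts in (site_a, site_b)]
mapping = global_merge([summ for _, summ in results], merge_dist=1.0)
final = [relabel(s, labels, mapping) for s, (labels, _) in enumerate(results)]
print(final)  # [[0, 0, 1], [0, 1, 1]]
```

In this sketch the network traffic consists only of per-atom summaries and the final label mapping, which is the general motivation for summarizing locally before merging globally.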
Pages: 1799 - 1814
Page count: 16
Related papers
50 in total
  • [21] Subspace Clustering for High-Dimensional Data Using Cluster Structure Similarity
    Fatehi, Kavan
    Rezvani, Mohsen
    Fateh, Mansoor
    Pajoohan, Mohammad-Reza
    INTERNATIONAL JOURNAL OF INTELLIGENT INFORMATION TECHNOLOGIES, 2018, 14 (03) : 38 - 55
  • [22] Spectral Clustering by Subspace Randomization and Graph Fusion for High-Dimensional Data
    Cai, Xiaosha
    Huang, Dong
    Wang, Chang-Dong
    Kwoh, Chee-Keong
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2020, PT I, 2020, 12084 : 330 - 342
  • [23] Subspace Clustering in High-Dimensional Data Streams: A Systematic Literature Review
    Ghani, Nur Laila Ab
    Aziz, Izzatdin Abdul
    AbdulKadir, Said Jadid
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 75 (02) : 4649 - 4668
  • [24] Adaptive multi-view subspace clustering for high-dimensional data
    Yan, Fei
    Wang, Xiao-dong
    Zeng, Zhi-qiang
    Hong, Chao-qun
    PATTERN RECOGNITION LETTERS, 2020, 130 : 299 - 305
  • [25] A feature group weighting method for subspace clustering of high-dimensional data
    Chen, Xiaojun
    Ye, Yunming
    Xu, Xiaofei
    Huang, Joshua Zhexue
    PATTERN RECOGNITION, 2012, 45 (01) : 434 - 446
  • [26] Synchronization-based scalable subspace clustering of high-dimensional data
    Junming Shao
    Xinzuo Wang
    Qinli Yang
    Claudia Plant
    Christian Böhm
    Knowledge and Information Systems, 2017, 52 : 83 - 111
  • [27] EDSC: Efficient Document Subspace Clustering Technique for High-Dimensional Data
    Radhika, K. R.
    Pushpa, C. N.
    Thriveni, J.
    Venugopal, K. R.
    2016 INTERNATIONAL CONFERENCE ON COMPUTATIONAL TECHNIQUES IN INFORMATION AND COMMUNICATION TECHNOLOGIES (ICCTICT), 2016,
  • [28] Synchronization-based scalable subspace clustering of high-dimensional data
    Shao, Junming
    Wang, Xinzuo
    Yang, Qinli
    Plant, Claudia
    Boehm, Christian
    KNOWLEDGE AND INFORMATION SYSTEMS, 2017, 52 (01) : 83 - 111
  • [29] A novel algorithm for fast and scalable subspace clustering of high-dimensional data
    Kaur A.
    Datta A.
    Journal of Big Data, 2015, 2 (01)
  • [30] Robust Local Triangular Kernel Density-based Clustering for High-dimensional Data
    Musdholifah, Aina
    Hashim, Siti Zaiton Mohd
    2013 5TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND INFORMATION TECHNOLOGY (CSIT), 2013, : 24 - 32