A simple rapid sample-based clustering for large-scale data

Cited by: 1
Authors
Chen, Yewang [1]
Yang, Yuanyuan [1]
Pei, Songwen [2]
Chen, Yi [3, 4]
Du, Jixiang [1]
Affiliations
[1] Huaqiao Univ, Coll Comp Sci & Technol, Xiamen 362021, Peoples R China
[2] Univ Shanghai Sci & Technol, Shanghai Key Lab Modern Opt Syst, Shanghai, Peoples R China
[3] Beijing Technol & Business Univ, China Food Flavor & Nutr Hlth Innovat Ctr, Beijing 100048, Peoples R China
[4] Beijing Technol & Business Univ, Beijing Key Lab Big Data Technol Food Safety, Beijing 100048, Peoples R China
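Funding
National Natural Science Foundation of China
Keywords
Clustering; Sample-based clustering; Large-scale data; DBSCAN
DOI
10.1016/j.engappai.2024.108551
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract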
Large-scale data clustering is a crucial task in addressing big data challenges. However, existing approaches often struggle to cluster different types of big data both efficiently and effectively. In this paper, we propose a novel sample-based clustering algorithm that is very simple yet highly efficient: it runs in about O(n × r) expected time, where n is the size of the dataset and r is the number of categories. The method rests on two key assumptions: (1) the data in each sufficient sample should have a data distribution and a category distribution similar to those of the entire dataset; (2) the representatives of each category across all sufficient samples follow a Gaussian distribution. It processes data in two stages: the first classifies the data in each local sample independently, and the second classifies the data globally by assigning each point to the category of its nearest representative category center. The experimental results show that the proposed algorithm is effective and outperforms other current variants of clustering algorithms.
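The two-stage idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record gives no details of the local clustering step, so a small k-means with farthest-point initialization is used here purely as a stand-in, and the function names (`nearest`, `local_centers`, `sample_based_clustering`) and parameters (`n_samples`, `sample_size`, `seed`) are hypothetical. Averaging the per-sample representatives is likewise an assumption, chosen because the mean is the natural estimate if those representatives are roughly Gaussian. The final step compares each of the n points with r centers, which is where a cost of about O(n × r) would come from for a fixed dimension.

```python
import numpy as np


def nearest(points, centers):
    """Index of the nearest center for every point (n x r squared distances)."""
    diff = points[:, None, :] - centers[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)


def local_centers(sample, r, rng, iters=10):
    """Stage 1 stand-in (assumption): k-means on one sample, farthest-point init."""
    chosen = [sample[rng.integers(len(sample))]]
    for _ in range(r - 1):
        # Squared distance from each sample point to its closest chosen center.
        d = np.min([((sample - c) ** 2).sum(axis=1) for c in chosen], axis=0)
        chosen.append(sample[d.argmax()])
    centers = np.array(chosen, dtype=float)
    for _ in range(iters):
        labels = nearest(sample, centers)
        for k in range(r):
            members = sample[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers


def sample_based_clustering(X, r, n_samples=5, sample_size=200, seed=0):
    """Cluster each sample locally, merge representatives, assign all points."""
    rng = np.random.default_rng(seed)
    sample_size = min(sample_size, len(X))

    # Stage 1: cluster each random sample independently.
    all_centers = []
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        all_centers.append(local_centers(X[idx], r, rng))

    # Merge (assumption): group every local center with the closest center
    # of the first sample, then average each group.
    reference = all_centers[0]
    stacked = np.vstack(all_centers)
    owner = nearest(stacked, reference)
    centers = np.vstack([stacked[owner == k].mean(axis=0) for k in range(r)])

    # Stage 2: assign every point to its nearest representative center.
    labels = nearest(X, centers)
    return labels, centers
```

Calling `sample_based_clustering(X, r=3)` on an `(n, d)` NumPy array returns one label per point and the `r` merged centers. The sketch assumes `r` is known in advance and that every sample contains points from every category, which is a simplified reading of assumption (1).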
Pages: 12