A simple rapid sample-based clustering for large-scale data

Cited by: 1
Authors
Chen, Yewang [1]
Yang, Yuanyuan [1]
Pei, Songwen [2]
Chen, Yi [3, 4]
Du, Jixiang [1]
Affiliations
[1] Huaqiao Univ, Coll Comp Sci & Technol, Xiamen 362021, Peoples R China
[2] Univ Shanghai Sci & Technol, Shanghai Key Lab Modern Opt Syst, Shanghai, Peoples R China
[3] Beijing Technol & Business Univ, China Food Flavor & Nutr Hlth Innovat Ctr, Beijing 100048, Peoples R China
[4] Beijing Technol & Business Univ, Beijing Key Lab Big Data Technol Food Safety, Beijing 100048, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Clustering; Sample-based clustering; Large-scale data; DBSCAN
DOI
10.1016/j.engappai.2024.108551
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale data clustering is a crucial task in addressing big data challenges. However, existing approaches often struggle to identify different types of big data both efficiently and effectively. In this paper, we propose a novel sample-based clustering algorithm that is very simple yet highly efficient, running in about O(n × r) expected time, where n is the size of the dataset and r is the number of categories. The method rests on two key assumptions: (1) each sufficient sample should have a data distribution, as well as a category distribution, similar to that of the entire dataset; (2) the representatives of each category across all sufficient samples conform to a Gaussian distribution. It processes data in two stages: the first classifies the data in each local sample independently, and the second classifies the data globally by assigning each point to the category of its nearest representative category center. The experimental results show that the proposed algorithm is effective and outperforms other current variants of clustering algorithms.
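The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not say how each local sample is clustered or how representatives are selected, so the sketch substitutes a simple leader-style clustering with a distance threshold. The function names (`local_centers`, `sample_based_clustering`) and the parameters `radius`, `n_samples`, and `sample_size` are assumptions made for illustration only. The final step, assigning every point to its nearest representative center, is the part that costs about n × r distance computations.

```python
import math
import random


def local_centers(sample, radius):
    """Stand-in local clustering (an assumption, not the paper's method).

    Leader-style pass: a point joins the first cluster whose current mean
    lies within `radius`; otherwise it starts a new cluster.
    Returns the mean of each cluster as a tuple.
    """
    sums, counts = [], []
    for p in sample:
        for i in range(len(sums)):
            center = tuple(s / counts[i] for s in sums[i])
            if math.dist(tuple(p), center) <= radius:
                sums[i] = [s + x for s, x in zip(sums[i], p)]
                counts[i] += 1
                break
        else:
            # No existing cluster is close enough: open a new one.
            sums.append(list(p))
            counts.append(1)
    return [tuple(s / c for s in row) for row, c in zip(sums, counts)]


def sample_based_clustering(points, n_samples, sample_size, radius, seed=0):
    rng = random.Random(seed)
    # Stage 1: cluster each local sample independently.
    all_centers = []
    for _ in range(n_samples):
        sample = rng.sample(points, sample_size)
        all_centers.extend(local_centers(sample, radius))
    # Merge local centers into one representative per category
    # (hypothetical merge rule: reuse the same leader-style pass).
    reps = local_centers(all_centers, radius)
    # Stage 2: assign every point to its nearest representative,
    # which takes n * r distance computations.
    labels = [
        min(range(len(reps)), key=lambda j: math.dist(tuple(p), reps[j]))
        for p in points
    ]
    return reps, labels


# Synthetic demo data: two well-separated square blobs of 200 points each.
_rng = random.Random(42)
blob_a = [(_rng.uniform(-1, 1), _rng.uniform(-1, 1)) for _ in range(200)]
blob_b = [(10 + _rng.uniform(-1, 1), 10 + _rng.uniform(-1, 1)) for _ in range(200)]
points = blob_a + blob_b

reps, labels = sample_based_clustering(points, n_samples=5, sample_size=50, radius=4.0)
print(len(reps), "representative centers")
```

In this sketch only 5 × 50 sampled points go through the local clustering step, while the full dataset is touched once during the nearest-center assignment. The threshold `radius` plays the role that a density parameter would play if a method such as DBSCAN were used locally.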
Pages: 12