A simple rapid sample-based clustering for large-scale data

Cited by: 1
Authors
Chen, Yewang [1]
Yang, Yuanyuan [1]
Pei, Songwen [2]
Chen, Yi [3, 4]
Du, Jixiang [1]
Affiliations
[1] Huaqiao Univ, Coll Comp Sci & Technol, Xiamen 362021, Peoples R China
[2] Univ Shanghai Sci & Technol, Shanghai Key Lab Modern Opt Syst, Shanghai, Peoples R China
[3] Beijing Technol & Business Univ, China Food Flavor & Nutr Hlth Innovat Ctr, Beijing 100048, Peoples R China
[4] Beijing Technol & Business Univ, Beijing Key Lab Big Data Technol Food Safety, Beijing 100048, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Clustering; Sample-based clustering; Large-scale data; DBSCAN;
DOI
10.1016/j.engappai.2024.108551
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Large-scale data clustering is a crucial task in addressing big data challenges. However, existing approaches often struggle to identify different types of big data efficiently and effectively, which remains a significant challenge. In this paper, we propose a novel sample-based clustering algorithm that is very simple yet extremely efficient, running in about O(n × r) expected time, where n is the size of the dataset and r is the number of categories. The method rests on two key assumptions: (1) the data in each sufficient sample should have a data distribution, as well as a category distribution, similar to that of the entire dataset; (2) the representatives of each category across all sufficient samples conform to a Gaussian distribution. It processes data in two stages: first, it classifies the data in each local sample independently; second, it classifies the data globally by assigning each point to the category of its nearest representative category center. The experimental results show that the proposed algorithm is effective and outperforms other current variants of clustering algorithms.
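The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how each local sample is clustered, so a simple greedy "leader" clustering with a radius `eps` is used here purely as a stand-in, and every name (`local_centers`, `merge_centers`, `assign`, `sample_based_clustering`, `eps`, `n_samples`, `sample_size`) is hypothetical. Only the final step, assigning each of the n points to the nearest of r representative centers, mirrors the O(n × r) cost stated above.

```python
import math
import random


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def local_centers(sample, eps):
    """Stand-in local step (an assumption, not the paper's method):
    greedy leader clustering. A point joins the first group whose
    leader (first member) lies within eps; otherwise it starts a group."""
    groups = []
    for p in sample:
        for g in groups:
            if math.dist(p, g[0]) <= eps:
                g.append(p)
                break
        else:
            groups.append([p])
    return [mean(g) for g in groups]


def merge_centers(representatives, eps):
    """Merge representatives collected from all samples: those that lie
    close together are averaged into one category center."""
    return local_centers(representatives, eps)


def assign(data, centers):
    """Global step: label each point with the index of its nearest
    center. This costs O(n * r) distance evaluations."""
    return [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in data
    ]


def sample_based_clustering(data, n_samples, sample_size, eps, seed=0):
    """Stage 1: cluster several random samples independently.
    Stage 2: merge their representatives and label the whole dataset."""
    rng = random.Random(seed)
    representatives = []
    for _ in range(n_samples):
        sample = rng.sample(data, sample_size)
        representatives.extend(local_centers(sample, eps))
    centers = merge_centers(representatives, eps)
    return centers, assign(data, centers)
```

With two well-separated groups of 2-D points and an `eps` larger than the spread of each group but smaller than the gap between them, a call such as `sample_based_clustering(data, n_samples=5, sample_size=20, eps=3.0)` would be expected to return two centers and one label per point. The choice of `eps` and of the local clustering routine is where a real implementation would differ most from this sketch.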
Pages: 12
Related papers
50 in total
  • [31] Reduce Redundancies: Signal-based Clustering of Large-scale Fingerprint Data
    Mueller, Mathias
    Schmalzbauer, Martin
    Meyer, Steffen
    Nicklas, Daniela
    2018 IEEE 29TH ANNUAL INTERNATIONAL SYMPOSIUM ON PERSONAL, INDOOR AND MOBILE RADIO COMMUNICATIONS (PIMRC), 2018, : 842 - 848
  • [32] Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering
    Nie, Tiezheng
    Lee, Wang-chien
    Shen, Derong
    Yu, Ge
    Kou, Yue
    WEB-AGE INFORMATION MANAGEMENT, WAIM 2014, 2014, 8485 : 138 - 149
  • [33] Ten simple rules for large-scale data processing
    Fungtammasan, Arkarachai
    Lee, Alexandra
    Taroni, Jaclyn
    Wheeler, Kurt
    Chin, Chen-Shan
    Davis, Sean
    Greene, Casey
    PLOS COMPUTATIONAL BIOLOGY, 2022, 18 (02)
  • [34] Spectral clustering with linear embedding: A discrete clustering method for large-scale data
    Gao, Chenhui
    Chen, Wenzhi
    Nie, Feiping
    Yu, Weizhong
    Wang, Zonghui
    PATTERN RECOGNITION, 2024, 151
  • [35] Rapid Trend Prediction for Large-Scale Cloud Database KPIs by Clustering
    Wang, Xiaoling
    Li, Ning
    Zhang, Lijun
    Zhang, Xiaofang
    Zhao, Qiong
    2021 IEEE/ACM INTERNATIONAL WORKSHOP ON CLOUD INTELLIGENCE (CLOUDINTELLIGENCE 2021), 2021, : 1 - 6
  • [36] Large-Scale Spectral Clustering Based on Representative Points
    Yang, Libo
    Liu, Xuemei
    Nie, Feiping
    Liu, Mingtang
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2019, 2019
  • [37] Large-Scale Image Clustering Based on Camera Fingerprints
    Lin, Xufeng
    Li, Chang-Tsun
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2017, 12 (04) : 793 - 808
  • [38] Large-scale spectral clustering based on pairwise constraints
    Semertzidis, T.
    Rafailidis, D.
    Strintzis, M. G.
    Daras, P.
    INFORMATION PROCESSING & MANAGEMENT, 2015, 51 (05) : 616 - 624
  • [39] Penalized clustering of large-scale functional data with multiple covariates
    Ma, Ping
    Zhong, Wenxuan
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2008, 103 (482) : 625 - 636
  • [40] A Data Cleansing Method for Clustering Large-Scale Transaction Databases
    Loh, Woong-Kee
    Moon, Yang-Sae
    Kang, Jun-Gyu
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2010, E93D (11) : 3120 - 3123