A simple rapid sample-based clustering for large-scale data

Cited by: 1
Authors
Chen, Yewang [1]
Yang, Yuanyuan [1]
Pei, Songwen [2]
Chen, Yi [3, 4]
Du, Jixiang [1]
Affiliations
[1] Huaqiao Univ, Coll Comp Sci & Technol, Xiamen 362021, Peoples R China
[2] Univ Shanghai Sci & Technol, Shanghai Key Lab Modern Opt Syst, Shanghai, Peoples R China
[3] Beijing Technol & Business Univ, China Food Flavor & Nutr Hlth Innovat Ctr, Beijing 100048, Peoples R China
[4] Beijing Technol & Business Univ, Beijing Key Lab Big Data Technol Food Safety, Beijing 100048, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Clustering; Sample-based clustering; Large-scale data; DBSCAN
DOI
10.1016/j.engappai.2024.108551
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale data clustering is a crucial task in addressing big data challenges. However, existing approaches often struggle to cluster different types of big data efficiently and effectively, which remains a significant challenge. In this paper, we propose a novel sample-based clustering algorithm that is very simple yet extremely efficient, running in about O(n × r) expected time, where n is the size of the dataset and r is the number of categories. The method rests on two key assumptions: (1) each sufficient sample should have a data distribution, as well as a category distribution, similar to that of the entire dataset; (2) the representatives of each category across all sufficient samples conform to a Gaussian distribution. It processes data in two stages: the first classifies the data in each local sample independently, and the second classifies the data globally by assigning each point to the category of its nearest representative category center. Experimental results show that the proposed algorithm is effective and outperforms other current clustering algorithm variants.
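The global stage described in the abstract, assigning each point to the category of its nearest representative center, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `assign_to_nearest_center`, the example centers, and the example points are all hypothetical, and the centers are simply assumed to have been produced by the local stage, which is not shown. The sketch only illustrates why this step costs on the order of n × r distance computations.

```python
import math


def assign_to_nearest_center(points, centers):
    """Return, for each point, the index of its nearest center.

    Every point is compared against every center, so the loop performs
    len(points) * len(centers) distance computations: n x r work for
    n points and r category centers.
    """
    labels = []
    for p in points:
        # Index of the center with the smallest Euclidean distance to p.
        nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        labels.append(nearest)
    return labels


# Hypothetical representative centers (assumed output of the local stage).
centers = [(0.0, 0.0), (10.0, 10.0)]
# Hypothetical data points to classify globally.
points = [(0.5, -0.2), (9.1, 10.4), (1.2, 0.8), (11.0, 9.5)]

labels = assign_to_nearest_center(points, centers)
print(labels)
```

A brute-force scan over the centers is used here for clarity; when r is small relative to n, it keeps the per-point cost proportional to r.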
Pages: 12