A simple rapid sample-based clustering for large-scale data

Cited by: 1

Authors:

Chen, Yewang [1]

Yang, Yuanyuan [1]

Pei, Songwen [2]

Chen, Yi [3,4]

Du, Jixiang [1]

Affiliations:

[1] Huaqiao Univ, Coll Comp Sci & Technol, Xiamen 362021, Peoples R China

[2] Univ Shanghai Sci & Technol, Shanghai Key Lab Modern Opt Syst, Shanghai, Peoples R China

[3] Beijing Technol & Business Univ, China Food Flavor & Nutr Hlth Innovat Ctr, Beijing 100048, Peoples R China

[4] Beijing Technol & Business Univ, Beijing Key Lab Big Data Technol Food Safety, Beijing 100048, Peoples R China

Source:

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE | 2024 / Vol. 133

Funding:

National Natural Science Foundation of China

Keywords:

Clustering; Sample-based clustering; Large-scale data; DBSCAN

DOI:

10.1016/j.engappai.2024.108551

Chinese Library Classification:

TP [Automation technology, computer technology]

Subject classification code:

0812

Abstract:

Large-scale data clustering is a crucial task in addressing big data challenges. However, existing approaches often struggle to handle different types of big data both efficiently and effectively. In this paper, we propose a novel sample-based clustering algorithm that is very simple but extremely efficient: it runs in about O(n × r) expected time, where n is the size of the dataset and r is the number of categories. The method rests on two key assumptions: (1) the data in each sufficient sample should have a data distribution, and a category distribution, similar to those of the entire dataset; (2) the representatives of each category across all sufficient samples conform to a Gaussian distribution. It processes data in two stages: first, it classifies the data in each local sample independently; second, it classifies the data globally by assigning each point to the category of its nearest representative category center. The experimental results show that the proposed algorithm is effective and outperforms other current clustering algorithm variants.
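The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how each local sample is clustered or how representatives are combined across samples, so the leader-style `local_cluster` stand-in, the averaging of nearby per-sample centers, the restriction to 2-D points, and every function and parameter name (`eps`, `n_samples`, `sample_size`) are assumptions made only for illustration.

```python
import math
import random


def local_cluster(points, eps):
    """Stand-in local clustering (leader-style), NOT the paper's method.

    A point joins the first cluster whose running mean is within eps,
    otherwise it starts a new cluster. Returns the cluster means.
    """
    clusters = []  # each entry: [sum_x, sum_y, count]
    for x, y in points:
        for c in clusters:
            center = (c[0] / c[2], c[1] / c[2])
            if math.dist((x, y), center) <= eps:
                c[0] += x
                c[1] += y
                c[2] += 1
                break
        else:
            clusters.append([x, y, 1])
    return [(sx / k, sy / k) for sx, sy, k in clusters]


def nearest_center(point, centers):
    """Index of the center closest to point (r distance computations)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def sample_based_clustering(data, n_samples, sample_size, eps, seed=0):
    rng = random.Random(seed)

    # Stage 1: cluster each random sample independently.
    per_sample_centers = []
    for _ in range(n_samples):
        sample = rng.sample(data, sample_size)
        per_sample_centers.extend(local_cluster(sample, eps))

    # Assumed combination step: average per-sample centers that lie close
    # together to get one representative center per category.
    centers = local_cluster(per_sample_centers, eps)

    # Stage 2: assign every point to its nearest representative center.
    labels = [nearest_center(p, centers) for p in data]
    return centers, labels
```

In this sketch, stage 2 performs one distance computation per point and per representative center, i.e. n × r in total, which is in line with the O(n × r) expected time quoted in the abstract as long as the samples are small relative to the full dataset.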


Pages: 12

