A simple rapid sample-based clustering for large-scale data

Cited by: 1

Authors:

Chen, Yewang [1]

Yang, Yuanyuan [1]

Pei, Songwen [2]

Chen, Yi [3,4]

Du, Jixiang [1]

Affiliations:

[1] Huaqiao Univ, Coll Comp Sci & Technol, Xiamen 362021, Peoples R China

[2] Univ Shanghai Sci & Technol, Shanghai Key Lab Modern Opt Syst, Shanghai, Peoples R China

[3] Beijing Technol & Business Univ, China Food Flavor & Nutr Hlth Innovat Ctr, Beijing 100048, Peoples R China

[4] Beijing Technol & Business Univ, Beijing Key Lab Big Data Technol Food Safety, Beijing 100048, Peoples R China

Source:

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE | 2024 / Vol. 133

Funding:

National Natural Science Foundation of China;

Keywords:

Clustering; Sample-based clustering; Large-scale data; DBSCAN;

DOI:

10.1016/j.engappai.2024.108551

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Large-scale data clustering is a crucial task in addressing big data challenges. However, existing approaches often struggle to identify different types of big data efficiently and effectively. In this paper, we propose a novel sample-based clustering algorithm that is very simple yet extremely efficient, running in about O(n × r) expected time, where n is the size of the dataset and r is the number of categories. The method rests on two key assumptions: (1) the data in each sufficient sample should have a data distribution, as well as a category distribution, similar to that of the entire dataset; (2) the representatives of each category across all sufficient samples conform to a Gaussian distribution. It processes data in two stages: the first classifies the data in each local sample independently, and the second classifies the data globally by assigning each point to the category of its nearest representative category center. Experimental results show that the proposed algorithm is effective and outperforms other current variants of clustering algorithms.
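
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the local clustering routine or how representatives from different samples are combined, so the helper names (`eps_components`, `centers_of`, `sample_based_clustering`), the `eps` radius, and the merge-by-averaging step are all assumptions made for illustration. The local step here is a simple distance-threshold grouping that behaves like DBSCAN with a minimum of one point per cluster.

```python
import math
import random


def eps_components(points, eps):
    """Label points by connected components of the "within eps" graph.

    Hypothetical stand-in for the per-sample clustering step; it is
    quadratic in len(points), so it should only see small samples.
    """
    labels = [-1] * len(points)
    current = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            j = stack.pop()
            for k in range(len(points)):
                if labels[k] == -1 and math.dist(points[j], points[k]) <= eps:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels


def centers_of(points, labels):
    """Return the mean point of each label group, ordered by label."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return [
        tuple(sum(coord) / len(groups[label]) for coord in zip(*groups[label]))
        for label in sorted(groups)
    ]


def sample_based_clustering(data, sample_size, n_samples, eps, seed=0):
    """Two-stage sketch: local clustering on samples, then global assignment."""
    rng = random.Random(seed)

    # Stage 1: cluster each random sample independently, keep its centers.
    pooled = []
    for _ in range(n_samples):
        sample = rng.sample(data, sample_size)
        pooled.extend(centers_of(sample, eps_components(sample, eps)))

    # Assumed merge step (not taken from the abstract): centers from
    # different samples that lie within eps of each other are averaged
    # into one representative per category.
    representatives = centers_of(pooled, eps_components(pooled, eps))

    # Stage 2: assign every point to its nearest representative.
    # This loop performs n * r distance evaluations.
    labels = [
        min(range(len(representatives)),
            key=lambda c: math.dist(point, representatives[c]))
        for point in data
    ]
    return labels, representatives
```

In this sketch the global assignment dominates when the samples are small relative to the dataset: it evaluates one distance per point per representative, which is the n × r term quoted in the abstract. The data is expected as a list of equal-length coordinate tuples.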

Pages: 12

50 items in total

[12] Parallel gravitational clustering based on grid partitioning for large-scale data
Chen, Lei
Chen, Fadong
Liu, Zhaohua
Lv, Mingyang
He, Tingqin
Zhang, Shiwen
APPLIED INTELLIGENCE, 2023, 53 (03) : 2506 - 2526
[13] Fuzzy clustering algorithm based on multiple medoids for large-scale data
Chen A.-G.
Wang S.-T.
Kongzhi yu Juece/Control and Decision, 2016, 31 (12) : 2122 - 2130
[14] CLUSTERING LARGE-SCALE DATA BASED ON MODIFIED AFFINITY PROPAGATION ALGORITHM
Serdah, Ahmed M.
Ashour, Wesam M.
JOURNAL OF ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING RESEARCH, 2016, 6 (01) : 23 - 33
[15] A Simple and Robust Clustering Scheme for Large-Scale and Dynamic VANETs
Banikhalaf, Mustafa
Khder, Moaiad Ahmad
IEEE ACCESS, 2020, 8 : 103565 - 103575
[16] Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection
Li, Zhihan
Zhao, Youjian
Liu, Rong
Pei, Dan
2018 IEEE/ACM 26TH INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE (IWQOS), 2018,
[17] Robust large-scale clustering based on correntropy
Jin, Guodong
Gao, Jing
Tan, Lining
PLOS ONE, 2022, 17 (11):
[18] A Novel Clustering Algorithm on Large-Scale Graph Data
Zhang, Hao
Zhou, Wei
Wan, Xiaoyu
Fu, Ge
Xu, Zhiyong
Han, Jizhong
2014 INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND BIG DATA (CCBD), 2014, : 47 - 54
[19] Large-scale clustering of CAGE tag expression data
Shimokawa, Kazuro
Okamura-Oho, Yuko
Kurita, Takio
Frith, Martin C.
Kawai, Jun
Carninci, Piero
Hayashizaki, Yoshihide
BMC BIOINFORMATICS, 2007, 8 (1)
[20] Large-scale clustering of cDNA-fingerprinting data
Herwig, R
Poustka, AJ
Müller, C
Bull, C
Lehrach, H
O'Brien, J
GENOME RESEARCH, 1999, 9 (11) : 1093 - 1105
