Euclidean distance stratified random sampling based clustering model for big data mining

Cited by: 1

Authors:

Pandey, Kamlesh Kumar [1]

Shukla, Diwakar [1]

Affiliations:

[1] Dr Hari Singh Gour Vishwavidyalaya, Dept Comp Sci & Applicat, Sagar, Madhya Pradesh, India

Source:

COMPUTATIONAL AND MATHEMATICAL METHODS | 2021 / Vol. 3 / Issue 06

Keywords:

big data mining; big data sampling; big data clustering; Euclidean distance based stratum; random sampling; sample extension; SSK-Means; stratified sampling; FRAMEWORK; ALGORITHM;

DOI:

10.1002/cmm4.1206

Chinese Library Classification (CLC) number:

O29 [Applied Mathematics];

Discipline classification code:

070104;

Abstract:

Big data mining involves large-scale data analysis and faces computational cost challenges driven by the exponential growth of digital technologies. In this setting, classical data mining algorithms suffer from challenges related to computational deficiency, memory utilization, resource optimization, scale-up, and speed-up. Sampling is one of the most effective data reduction techniques: it reduces computational cost and improves the scalability and computational speed of any data mining algorithm in both single-machine and multiple-machine execution environments. This study proposes a Euclidean distance-based method for stratum creation and a stratified random sampling-based big data mining model that uses the K-Means clustering algorithm (SSK-Means) in a single-machine execution environment. The SSK-Means algorithm achieved better cluster quality, speed-up, scale-up, and memory utilization than the random sampling-based K-Means and classical K-Means algorithms, as evaluated using the silhouette coefficient, Davies-Bouldin index, Calinski-Harabasz index, execution time, and speedup ratio measures.
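
The pipeline named in the abstract (distance-based strata, stratified random sampling, K-Means on the sample, then extension to the full data) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' SSK-Means: the abstract does not say how strata are formed, so the sketch assumes equal-frequency bands of Euclidean distance from the data mean, proportional allocation per stratum, a plain Lloyd-style K-Means, and nearest-centroid assignment as the "sample extension" step. All function names and parameters (`euclidean_strata`, `n_strata`, `fraction`, and so on) are hypothetical.

```python
import numpy as np

def euclidean_strata(X, n_strata):
    """Label each row with a stratum based on its Euclidean distance from the
    data mean. Assumption: equal-frequency distance bands (not from the paper)."""
    dist = np.linalg.norm(X - X.mean(axis=0), axis=1)
    order = np.argsort(dist)
    labels = np.empty(len(X), dtype=int)
    for s, chunk in enumerate(np.array_split(order, n_strata)):
        labels[chunk] = s
    return labels

def stratified_sample(labels, fraction, rng):
    """Draw a simple random sample without replacement from every stratum,
    proportional to stratum size (at least one point per stratum)."""
    picked = []
    for s in np.unique(labels):
        members = np.flatnonzero(labels == s)
        k = max(1, int(round(fraction * len(members))))
        picked.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(picked)

def nearest_centroid(X, centroids):
    """Index of the closest centroid (Euclidean distance) for each row of X."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd-style K-Means with random initial centroids drawn from X."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        assign = nearest_centroid(X, centroids)
        updated = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids

def ssk_means_sketch(X, k, n_strata=5, fraction=0.1, seed=0):
    """Cluster a stratified sample, then extend the result to all rows."""
    rng = np.random.default_rng(seed)
    strata = euclidean_strata(X, n_strata)
    sample_idx = stratified_sample(strata, fraction, rng)
    centroids = kmeans(X[sample_idx], k, rng)
    labels = nearest_centroid(X, centroids)  # assumed "sample extension" step
    return centroids, labels, sample_idx

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(10, 1, (500, 2))])
    centroids, labels, sample_idx = ssk_means_sketch(X, k=2)
    print(centroids.shape, labels.shape, len(sample_idx))
```

Only the sampled rows enter the iterative K-Means loop; the full data set is touched once for stratification and once for the final assignment, which is where a sampling-based approach of this kind would be expected to save time.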

Pages: 14

50 records in total

[41] Distributed Spectral Clustering based on Euclidean Distance Matrix Completion
Scardapane, Simone
Altilio, Rosa
Panella, Massimo
Uncini, Aurelio
2016 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2016, : 3093 - 3100
[42] Random Sample Partition-Based Clustering Ensemble Algorithm for Big Data
Du, Xueqin
He, Yulin
Huang, Joshua Zhexue
2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 5885 - 5887
[43] A Survey of Distance Metrics in Clustering Data Mining Techniques
Mercioni, Marina Adriana
Holban, Stefan
ICGSP '19 - PROCEEDINGS OF THE 2019 3RD INTERNATIONAL CONFERENCE ON GRAPHICS AND SIGNAL PROCESSING, 2019, : 44 - 47
[44] Design of customer marketing big data processing system based on data mining clustering technology
Wang, Jingzhe
PROCEEDINGS OF THE 2ND INTERNATIONAL CONFERENCE ON ADVANCES IN MECHANICAL ENGINEERING AND INDUSTRIAL INFORMATICS (AMEII 2016), 2016, 73 : 100 - 104
[45] Model-Based Clustering of Categorical Data Based on the Hamming Distance
Argiento, Raffaele
Filippi-Mazzola, Edoardo
Paci, Lucia
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2024,
[46] Big Data Clustering via Random Sketching and Validation
Traganitis, Panagiotis A.
Slavakis, Konstantinos
Giannakis, Georgios B.
CONFERENCE RECORD OF THE 2014 FORTY-EIGHTH ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS & COMPUTERS, 2014, : 1046 - 1050
[47] RANDOM WALK SAMPLING FOR BIG DATA OVER NETWORKS
Basirian, Saeed
Jung, Alexander
2017 INTERNATIONAL CONFERENCE ON SAMPLING THEORY AND APPLICATIONS (SAMPTA), 2017, : 427 - 431
[48] Data Mining Techniques for Producing Clustering in Big Data with MapReduce Function
Presskila, X. Arogya
Robinson, Y. Harold
Studies in Big Data, 2021, 93 : 195 - 203
[49] Stemflow estimation in a redwood forest using model-based stratified random sampling
Lewis, J
ENVIRONMETRICS, 2003, 14 (06) : 559 - 571
[50] An efficient sampling-based visualization technique for big data clustering with crisp partitions
Rajendra Prasad, K.
Mohammed, Moulana
Narasimha Prasad, L. V.
Anguraj, Dinesh Kumar
DISTRIBUTED AND PARALLEL DATABASES, 2021, 39 (03) : 813 - 832
