Euclidean distance stratified random sampling based clustering model for big data mining

Cited by: 1
Authors
Pandey, Kamlesh Kumar [1]
Shukla, Diwakar [1]
Affiliations
[1] Dr Hari Singh Gour Vishwavidyalaya, Dept Comp Sci & Applicat, Sagar, Madhya Pradesh, India
Keywords
big data mining; big data sampling; big data clustering; Euclidean distance based stratum; random sampling; sample extension; SSK-Means; stratified sampling; FRAMEWORK; ALGORITHM
DOI
10.1002/cmm4.1206
Chinese Library Classification number
O29 [Applied Mathematics]
Discipline classification code
070104
Abstract
Big data mining concerns large-scale data analysis and faces computational cost challenges driven by the exponential growth of digital technologies. In this setting, classical data mining algorithms face challenges in computational efficiency, memory utilization, resource optimization, scale-up, and speed-up. Sampling is one of the most effective data reduction techniques: it reduces computational cost and improves the scalability and computational speed of any data mining algorithm in both single-machine and multi-machine execution environments. This study proposes a Euclidean distance-based method for stratum creation and a stratified random sampling-based big data mining model that uses the K-Means clustering algorithm (SSK-Means) in a single-machine execution environment. SSK-Means achieved better cluster quality, speed-up, scale-up, and memory utilization than the random sampling-based K-Means and classical K-Means algorithms, as measured by the silhouette coefficient, Davies-Bouldin index, and Calinski-Harabasz index (internal validity measures), together with execution time and speed-up ratio.
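The pipeline named in the abstract (distance-based strata, stratified random sampling, K-Means on the sample, then extension of the result to the full data set) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the strata are formed, so the sketch assumes points are ranked by Euclidean distance to the global mean and cut into equal-size strata, that each stratum is sampled with the same fraction, and that sample extension means labeling every point by its nearest sample-derived centroid. All function names and parameters (`make_strata`, `n_strata`, `fraction`, and so on) are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def make_strata(data, n_strata):
    """ASSUMED rule: rank indices by distance to the global mean,
    then cut the ranking into contiguous, roughly equal strata."""
    center = mean_point(data)
    order = sorted(range(len(data)), key=lambda i: euclidean(data[i], center))
    size = math.ceil(len(order) / n_strata)
    return [order[i:i + size] for i in range(0, len(order), size)]


def stratified_sample(strata, fraction, rng):
    """Simple random sample of the same fraction from every stratum
    (proportional allocation, also an assumption)."""
    chosen = []
    for stratum in strata:
        take = max(1, round(fraction * len(stratum)))
        chosen.extend(rng.sample(stratum, take))
    return chosen


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda j: euclidean(point, centroids[j]))


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's K-Means; returns the final centroids."""
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean_point(g) if g else centroids[j] for j, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def ssk_means_sketch(data, k, n_strata=4, fraction=0.2, seed=0):
    """Cluster a stratified sample, then extend the labels to all points."""
    rng = random.Random(seed)
    strata = make_strata(data, n_strata)
    sample_idx = stratified_sample(strata, fraction, rng)
    centroids = kmeans([data[i] for i in sample_idx], k, rng)
    # ASSUMED sample extension: nearest sample-derived centroid per point.
    labels = [nearest(p, centroids) for p in data]
    return centroids, labels, sample_idx


if __name__ == "__main__":
    gen = random.Random(1)
    blob_a = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)]
    blob_b = [(gen.gauss(10, 1), gen.gauss(10, 1)) for _ in range(200)]
    centroids, labels, sample_idx = ssk_means_sketch(blob_a + blob_b, k=2)
    print(len(sample_idx), "of", len(labels), "points were clustered directly")
```

The cost saving comes from running the iterative K-Means loop only on the sample; the full data set is touched once to build the strata and once for the final labeling pass.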
Pages: 14