Distributed K-Means algorithm based on a Spark optimization sample

Cited by: 0
Authors
Feng, Yongan [1]
Zou, Jiapeng [1]
Liu, Wanjun [1]
Lv, Fu [1]
Affiliations
[1] Liaoning Tech Univ, Huludao, Peoples R China
Source
PLOS ONE | 2024 / Vol. 19 / Issue 12
Keywords
DOI
10.1371/journal.pone.0308993
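Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09
Abstract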
To address the instability and performance issues of the classical K-Means algorithm on massive datasets, we propose SOSK-Means, an improved K-Means algorithm based on Spark optimization. SOSK-Means modifies the clustering process in several ways. First, a weighted jump-bank approach is introduced to enable efficient random sampling and preclustering. By incorporating weights and jump pointers, this approach improves the quality of the initial centers and reduces sensitivity to their selection. Second, distances are calculated with a weighted max-min distance with variance, which takes both weight and variance information into account. This enables SOSK-Means to identify clusters that are farther apart and denser, enhancing clustering accuracy. The best initial centers are then selected using the mean square error criterion, so that they better represent the distribution and structure of the dataset and lead to improved clustering performance. During iteration, a novel distance comparison method is employed to reduce computation time. In addition, SOSK-Means uses a Directed Acyclic Graph (DAG) to optimize performance through distributed strategies, leveraging the capabilities of the Spark framework. Experimental results show that SOSK-Means significantly improves computational speed while maintaining high computational accuracy.
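The seeding idea in the abstract (sample, pick far-apart centers, keep the candidate set with the lowest mean square error) can be illustrated with a minimal single-machine sketch. This is not the authors' SOSK-Means: it uses plain max-min distance without the weights, variance term, or jump-bank sampling described above, and it does not use Spark. The function names and the parameters `trials`, `sample_size`, and `seed` are hypothetical, chosen only for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def max_min_centers(points, k):
    """Plain max-min seeding (assumption: no weights or variance term).

    Starts from the first point, then repeatedly adds the point whose
    distance to its nearest already-chosen center is largest.
    """
    centers = [points[0]]
    # nearest[i] = squared distance from points[i] to its closest chosen center
    nearest = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(d, sq_dist(p, points[idx])) for d, p in zip(nearest, points)]
    return centers


def mean_squared_error(points, centers):
    """Mean squared distance from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


def best_initial_centers(points, k, trials=5, sample_size=50, seed=0):
    """Draw several random samples, seed each with max-min distance,
    and keep the candidate centers with the lowest MSE on the full data."""
    rng = random.Random(seed)
    best_centers, best_mse = None, float("inf")
    for _ in range(trials):
        sample = rng.sample(points, min(sample_size, len(points)))
        candidate = max_min_centers(sample, k)
        mse = mean_squared_error(points, candidate)
        if mse < best_mse:
            best_centers, best_mse = candidate, mse
    return best_centers, best_mse
```

The returned centers would then be passed to an ordinary K-Means iteration as its starting point. In a distributed setting the sampling and the MSE evaluation are the steps that would be spread across partitions, but how SOSK-Means does this is not specified in the abstract.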
Pages: 21