Distributed K-Means algorithm based on a Spark optimization sample

Cited by: 0
Authors
Feng, Yongan [1]
Zou, Jiapeng [1]
Liu, Wanjun [1]
Lv, Fu [1]
Affiliations
[1] Liaoning Tech Univ, Huludao, Peoples R China
Source
PLOS ONE | 2024 / Vol. 19 / Issue 12
DOI
10.1371/journal.pone.0308993
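Chinese Library Classification (CLC) codes
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract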
To address the instability and performance issues of the classical K-Means algorithm when dealing with massive datasets, we propose SOSK-Means, an improved K-Means algorithm based on Spark optimization. SOSK-Means incorporates several key modifications to enhance the clustering process. First, a weighted jump-bank approach is introduced to enable efficient random sampling and preclustering. By incorporating weights and jump pointers, this approach improves the quality of the initial centers and reduces sensitivity to their selection. Second, distances are calculated with a weighted max-min distance with variance, which takes both weight and variance information into account. This enables SOSK-Means to identify clusters that are farther apart and denser, enhancing clustering accuracy. The best initial centers are then selected using the mean square error criterion, so that they better represent the distribution and structure of the dataset and lead to improved clustering performance. During iteration, a novel distance comparison method is employed to reduce computation time, optimizing the overall efficiency of the algorithm. Additionally, SOSK-Means uses a Directed Acyclic Graph (DAG) to optimize performance through distributed strategies, leveraging the capabilities of the Spark framework. Experimental results show that SOSK-Means significantly improves computational speed while maintaining high computational accuracy.
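As a rough illustration of two ideas named in the abstract (max-min distance seeding and choosing among candidate initial centers by mean square error), the following is a minimal single-machine sketch in plain Python. It is not the SOSK-Means procedure: the weights, variance term, jump-bank sampling, and Spark distribution described by the authors are omitted, and the function names `max_min_centers`, `mse`, and `best_initial_centers` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def max_min_centers(points, k, first=0):
    """Plain (unweighted) max-min seeding: start from points[first], then
    repeatedly add the point farthest from its nearest chosen center."""
    centers = [points[first]]
    nearest = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        # Keep, for every point, the distance to its closest center so far.
        nearest = [min(d, sq_dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centers


def mse(points, centers):
    """Mean squared distance from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


def best_initial_centers(points, k, trials=5, seed=0):
    """Assumed selection rule: build several candidate center sets from
    random starting points and keep the one with the lowest MSE."""
    rng = random.Random(seed)
    starts = rng.sample(range(len(points)), min(trials, len(points)))
    candidates = [max_min_centers(points, k, s) for s in starts]
    return min(candidates, key=lambda c: mse(points, c))


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
    print(best_initial_centers(data, k=3))
```

On well-separated data such as the small example in the `__main__` block, each seeded center falls in a different group, which is the behavior the abstract attributes to max-min style initialization.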
Pages: 21
Related papers
50 in total
  • [1] Optimization of the Distributed K-means Clustering Algorithm Based on Set Pair Analysis
    Ling, Song
    Qi Yunfeng
    2015 8th International Congress on Image and Signal Processing (CISP), 2015: 1593-1598
  • [2] K-means Clustering Optimization Algorithm Based on MapReduce
    Li, Zhihua
    Song, Xudong
    Zhu, Wenhui
    Chen, Yanxia
    Proceedings of the 2015 International Symposium on Computers & Informatics, 2015, 13: 198-203
  • [3] Optimization and improvement based on K-Means Cluster algorithm
    Wu, Jieming
    Yu, Wenhu
    2009 Second International Symposium on Knowledge Acquisition and Modeling: KAM 2009, Vol 3, 2009: 335-339
  • [4] The Parallelization and Optimization of K-means Algorithm Based on MGPUSim
    Mo, Zhangbin
    Wang, Yaobin
    Zhang, Qingming
    Zhang, Guangbing
    Guo, Mingfeng
    Zhang, Yaqing
    Shen, Chao
    Artificial Neural Networks and Machine Learning - ICANN 2022, Pt IV, 2022, 13532: 309-320
  • [5] Heterogeneous Parallel and Distributed Optimization of K-means Algorithm on Sunway Supercomputer
    Chen, Jiawei
    Tan, Rong
    Zhang, Yiwen
    2017 15th IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 16th IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC 2017), 2017: 931-937
  • [6] Optimisation of K-means algorithm based on sample density canopy
    Shen, Guo-xin
    Jiang, Zhong-yun
    International Journal of Ad Hoc and Ubiquitous Computing, 2021, 38 (1-3): 62-69
  • [7] An Optimal Distributed K-Means Clustering Algorithm Based on CloudStack
    Mao, Yingchi
    Xu, Ziyang
    Li, Xiaofang
    Ping, Ping
    2015 IEEE International Conference on Information and Automation, 2015: 3149-3156
  • [8] An Optimal Distributed K-Means Clustering Algorithm Based on CloudStack
    Mao, Yingchi
    Xu, Ziyang
    Ping, Ping
    Wang, Longbao
    2015 Ninth International Conference on Frontier of Computer Science and Technology FCST 2015, 2015: 386-391
  • [9] Interpretation and optimization of the k-means algorithm
    Sabo, Kristian
    Scitovski, Rudolf
    Applications of Mathematics, 2014, 59: 391-406
  • [10] Interpretation and optimization of the k-means algorithm
    Sabo, Kristian
    Scitovski, Rudolf
    Applications of Mathematics, 2014, 59 (04): 391-406