Distributed K-Means algorithm based on a Spark optimization sample

Authors
Feng, Yongan [1]
Zou, Jiapeng [1]
Liu, Wanjun [1]
Lv, Fu [1]
Affiliations
[1] Liaoning Tech Univ, Huludao, Peoples R China
Source
PLOS ONE | 2024 / Vol. 19 / Issue 12
DOI
10.1371/journal.pone.0308993
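Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code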
07 ; 0710 ; 09 ;
Abstract
To address the instability and performance issues of the classical K-Means algorithm on massive datasets, we propose SOSK-Means, an improved K-Means algorithm based on Spark optimization. SOSK-Means makes several key modifications to the clustering process. First, a weighted jump-bank approach is introduced to enable efficient random sampling and preclustering. By incorporating weights and jump pointers, this approach improves the quality of the initial centers and reduces sensitivity to their selection. Second, distances are calculated with a weighted max-min distance with variance, which takes both weight and variance information into account. This enables SOSK-Means to identify clusters that are farther apart and denser, enhancing clustering accuracy. The best initial centers are then selected using the mean square error criterion, so that they better represent the distribution and structure of the dataset and lead to improved clustering performance. During iteration, a novel distance comparison method is employed to reduce computation time. Additionally, SOSK-Means uses a Directed Acyclic Graph (DAG) to optimize performance through distributed strategies, leveraging the capabilities of the Spark framework. Experimental results show that SOSK-Means significantly improves computational speed while maintaining high accuracy.
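The seeding idea in the abstract (sample the data, pick well-separated candidate centers, keep the candidate set with the lowest mean square error) can be illustrated with a minimal single-machine sketch. The abstract does not specify the weighted, variance-aware distance or the jump-bank sampler, so this sketch assumes the classic unweighted max-min seeding and plain uniform sampling instead; all function names and parameters (`max_min_centers`, `best_initial_centers`, `n_samples`, `sample_size`) are hypothetical and not taken from the paper.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def max_min_centers(points, k, first_index=0):
    """Classic (unweighted) max-min seeding, an assumption standing in for
    the paper's weighted variant: each new center is the point whose
    distance to its nearest already-chosen center is largest."""
    centers = [points[first_index]]
    # nearest[i] holds the squared distance from points[i] to its closest center.
    nearest = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(d, sq_dist(p, points[idx])) for d, p in zip(nearest, points)]
    return centers


def mean_squared_error(points, centers):
    """Mean squared distance from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


def best_initial_centers(points, k, n_samples=5, sample_size=20, seed=0):
    """Draw several uniform random samples, seed each with max-min, and keep
    the candidate set with the lowest MSE over the full dataset."""
    rng = random.Random(seed)
    best_mse, best_centers = None, None
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        candidate = max_min_centers(sample, k)
        mse = mean_squared_error(points, candidate)
        if best_mse is None or mse < best_mse:
            best_mse, best_centers = mse, candidate
    return best_centers, best_mse


if __name__ == "__main__":
    # Three well-separated 3x3 grids of points as toy data.
    data = [(x + dx, y + dy)
            for (x, y) in [(0, 0), (10, 10), (20, 0)]
            for dx in (0, 0.5, 1)
            for dy in (0, 0.5, 1)]
    centers, mse = best_initial_centers(data, k=3)
    print(centers, mse)
```

The selected centers would then be passed to an ordinary K-Means iteration. In a Spark setting, the per-sample seeding and the MSE evaluation are the natural candidates for distribution across partitions, although the abstract does not describe how SOSK-Means arranges this.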
Pages: 21