Distributed K-Means algorithm based on a Spark optimization sample

Cited by: 0
Authors
Feng, Yongan [1 ]
Zou, Jiapeng [1 ]
Liu, Wanjun [1 ]
Lv, Fu [1 ]
Affiliations
[1] Liaoning Tech Univ, Huludao, Peoples R China
Source
PLOS ONE | 2024 / Vol. 19 / Issue 12
DOI
10.1371/journal.pone.0308993
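Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];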
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
To address the instability and performance issues of the classical K-Means algorithm when dealing with massive datasets, we propose SOSK-Means, an improved K-Means algorithm based on Spark optimization. SOSK-Means incorporates several key modifications to enhance the clustering process. First, a weighted jump-bank approach is introduced to enable efficient random sampling and preclustering. By incorporating weights and jump pointers, this approach improves the quality of the initial centers and reduces sensitivity to their selection. Second, a weighted max-min distance with variance is used to calculate distances, taking both weight and variance information into account. This enables SOSK-Means to identify clusters that are farther apart and denser, enhancing clustering accuracy. The best initial centers are then selected using the mean square error criterion, so that they better represent the distribution and structure of the dataset, leading to improved clustering performance. During iteration, a novel distance comparison method is employed to reduce computation time. Additionally, SOSK-Means uses a Directed Acyclic Graph (DAG) to optimize performance through distributed strategies, leveraging the capabilities of the Spark framework. Experimental results show that SOSK-Means significantly improves computational speed while maintaining high computational accuracy.
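The abstract does not give the formulas behind the weighted max-min distance or the sampling scheme, so the following is only a rough single-machine sketch of the two general ideas it names: max-min distance seeding and choosing among candidate center sets by mean squared error. It uses the classical, unweighted max-min rule rather than the paper's weighted, variance-aware variant, plain uniform sampling rather than the weighted jump-bank approach, and no Spark. All function names and parameters (`max_min_centers`, `best_initial_centers`, `n_samples`, `sample_size`) are hypothetical and chosen for illustration.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def max_min_centers(points, k, first_index=0):
    """Classical (unweighted) max-min seeding, NOT the paper's weighted variant.

    Each new center is the point whose distance to its nearest
    already-chosen center is largest. Assumes 1 <= k <= len(points).
    """
    centers = [points[first_index]]
    # nearest[i] = squared distance from points[i] to its closest center so far
    nearest = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(d, sq_dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centers


def mean_squared_error(points, centers):
    """Mean squared distance from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


def best_initial_centers(points, k, n_samples=5, sample_size=20, seed=0):
    """Seed on several uniform random samples and keep the lowest-MSE set.

    Uniform sampling is an assumption standing in for the paper's sampling step.
    """
    rng = random.Random(seed)
    best, best_mse = None, math.inf
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        centers = max_min_centers(sample, k)
        mse = mean_squared_error(points, centers)  # scored on the full dataset
        if mse < best_mse:
            best, best_mse = centers, mse
    return best, best_mse


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
            (20, 0), (20, 1), (21, 0)]
    centers, mse = best_initial_centers(data, k=3, n_samples=4, sample_size=6)
    print(centers, mse)
```

Scoring each candidate set against the full dataset rather than against its own sample keeps the comparison between candidates consistent, at the cost of one extra pass over the data per candidate.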
Pages: 21
Related papers
50 in total
  • [21] Cooperative Clustering Algorithm Based on Brain Storm Optimization and K-Means
    Tuba, Eva
    Strumberger, Ivana
    Bacanin, Nebojsa
    Zivkovic, Dejan
    Tuba, Milan
    2018 28TH INTERNATIONAL CONFERENCE RADIOELEKTRONIKA (RADIOELEKTRONIKA), 2018,
  • [22] Sampling fuzzy k-means clustering algorithm based on clonal optimization
    Yu, Haiqing
    Li, Ping
    Fan, Yugang
    WCICA 2006: SIXTH WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION, VOLS 1-12, CONFERENCE PROCEEDINGS, 2006, : 6102 - +
  • [23] A Novel Sample Weighting K-Means Clustering Algorithm based on Angles Information
    Gu, Lei
    2016 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2016, : 3697 - 3702
  • [24] Distributed Clustering Based on K-means and CPGA
    Zhou, Jun
    Liu, Zhijing
    FIFTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 2, PROCEEDINGS, 2008, : 444 - 447
  • [25] An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm
    Sardar T.H.
    Ansari Z.
    Ansari, Zahid (zahid_cs@pace.edu.in), 1600, Springer (101): : 641 - 650
  • [26] Privacy Preserving Distributed Cell-based K-means Clustering Algorithm
    Su, Fang
    Zu, Yun-xiao
    Li, Wei-hai
    INTERNATIONAL CONFERENCE ON MATHEMATICS, MODELLING AND SIMULATION TECHNOLOGIES AND APPLICATIONS (MMSTA 2017), 2017, 215 : 377 - 383
  • [27] k-Means Clustering Algorithm and Its Simulation Based on Distributed Computing Platform
    Wu, Chunqiong
    Yan, Bingwen
    Yu, Rongrui
    Yu, Baoqin
    Zhou, Xiukao
    Yu, Yanliang
    Chen, Na
    COMPLEXITY, 2021, 2021
  • [28] A distributed load clustering algorithm based on quantile radius dynamic K-means
    Liu J.
    Liu Y.
    Cheng M.
    Yu L.
    Dianli Xitong Baohu yu Kongzhi/Power System Protection and Control, 2019, 47 (24): : 15 - 22
  • [29] K-means algorithm based on particle swarm optimization algorithm for anomaly intrusion detection
    Xiao, Lizhong
    Shao, Zhiqing
    Liu, Gang
    WCICA 2006: SIXTH WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION, VOLS 1-12, CONFERENCE PROCEEDINGS, 2006, : 5854 - +
  • [30] Performance Analysis of Parallel K-Means with Optimization Algorithms for Clustering on Spark
    Santhi, V.
    Jose, Rini
    DISTRIBUTED COMPUTING AND INTERNET TECHNOLOGY (ICDCIT 2018), 2018, 10722 : 158 - 162