A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

Authors
Ankita Sinha
Prasanta K. Jana
Affiliation
[1] Department of Computer Science and Engineering, IIT (ISM), Dhanbad
Keywords
Mahalanobis distance; Apache Hadoop; k-means++ initialization; Genetic algorithm
DOI
Not available
Abstract
Clustering a large volume of data in a distributed environment is challenging: the data stored across multiple machines are huge, and the solution space is large. Genetic algorithms deal effectively with large solution spaces and provide better solutions. In this paper, we propose a novel clustering algorithm for distributed datasets that combines a genetic algorithm (GA) using Mahalanobis distance with the k-means clustering algorithm. The proposed algorithm has two phases. In phase 1, GA is applied in parallel on data chunks located across different machines. Mahalanobis distance is used as the fitness value in GA; it accounts for the covariance between data points and thus provides a better representation of the initial data. In phase 2, k-means with k-means++ initialization is applied to the intermediate output to obtain the final result. The proposed algorithm is implemented on the Hadoop framework, which is inherently designed to handle distributed datasets in a fault-tolerant manner. Extensive experiments were conducted on multiple real-life and synthetic datasets to measure the performance of the proposed algorithm. Results were compared with the MapReduce-based algorithms mrk-means, parallel k-means and scaling GA.
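The two building blocks named in the abstract, Mahalanobis distance and k-means++ initialization, can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the abstract gives no details of the GA encoding, the fitness formula, or the MapReduce jobs, so none of that is shown. The function names `mahalanobis_2d` and `kmeanspp_init` are hypothetical, the distance is restricted to 2-D points so that the covariance inverse can be written out by hand, and the seeding uses squared Euclidean distance as in the standard k-means++ procedure.

```python
import random


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point x from mean, given a 2x2 covariance.

    Assumption for this sketch: 2-D data only, so the inverse is explicit.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    # Quadratic form dx^T * inv * dx.
    t0 = inv[0][0] * dx[0] + inv[0][1] * dx[1]
    t1 = inv[1][0] * dx[0] + inv[1][1] * dx[1]
    return (dx[0] * t0 + dx[1] * t1) ** 0.5


def kmeanspp_init(points, k, rng):
    """k-means++ seeding: first centre uniform, the rest with probability
    proportional to the squared distance to the nearest chosen centre."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [
            min(sum((p - q) ** 2 for p, q in zip(pt, ctr)) for ctr in centers)
            for pt in points
        ]
        if sum(d2) == 0:
            # Every point coincides with a chosen centre; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


if __name__ == "__main__":
    # With identity covariance the Mahalanobis distance equals the Euclidean one.
    print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))))  # 5.0
    pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
    print(kmeanspp_init(pts, 2, random.Random(0)))
```

Because a point that is already a centre has weight zero, the seeding never picks the same point twice while distinct points remain, which is what spreads the initial centres apart.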
Pages: 1562-1579
Page count: 17