A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

被引:0
|
作者
Ankita Sinha
Prasanta K. Jana
机构
[1] IIT (ISM),Department of Computer Science and Engineering
[2] Dhanbad,undefined
来源
关键词
Mahalanobis distance; Apache Hadoop; -means++ initialization; Genetic algorithm;
D O I
暂无
中图分类号
学科分类号
摘要
Clustering a large volume of data in a distributed environment is a challenging issue. Data stored across multiple machines are huge in size, and solution space is large. Genetic algorithm deals effectively with larger solution space and provides better solution. In this paper, we proposed a novel clustering algorithm for distributed datasets, using combination of genetic algorithm (GA) with Mahalanobis distance and k-means clustering algorithm. The proposed algorithm is two phased; in phase 1, GA is applied in parallel on data chunks located across different machines. Mahalanobis distance is used as fitness value in GA, which considers covariance between the data points and thus provides a better representation of initial data. K-means with K-means++\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ ++ $$\end{document} initialization is applied in phase 2 on intermediate output to get final result. The proposed algorithm is implemented on Hadoop framework, which is inherently designed to deal with distributed datasets in a fault-tolerant manner. Extensive experiments were conducted for multiple real-life and synthetic datasets to measure performance of our proposed algorithm. Results were compared with MapReduce-based algorithms, mrk-means, parallel k-means and scaling GA.
引用
收藏
页码:1562 / 1579
页数:17
相关论文
共 50 条
  • [21] A Novel MapReduce Based k-Means Clustering
    Sinha, Ankita
    Jana, Prasanta K.
    PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING AND COMMUNICATION, 2017, 458 : 247 - 255
  • [22] A MapReduce-based parallel K-means clustering for large-scale CIM data verification
    Deng, Chuang
    Liu, Yang
    Xu, Lixiong
    Yang, Jie
    Liu, Junyong
    Li, Siguang
    Li, Maozhen
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2016, 28 (11): : 3096 - 3114
  • [23] Parallel K-Means Clustering Based on MapReduce
    Zhao, Weizhong
    Ma, Huifang
    He, Qing
    CLOUD COMPUTING, PROCEEDINGS, 2009, 5931 : 674 - 679
  • [24] Bearing Fault Diagnosis using Hybrid Genetic Algorithm K-means Clustering
    Ettefagh, M. M.
    Ghaemi, M.
    Asr, M. Yazdanian
    2014 IEEE INTERNATIONAL SYMPOSIUM ON INNOVATIONS IN INTELLIGENT SYSTEMS AND APPLICATIONS (INISTA 2014), 2014, : 84 - 89
  • [25] Optimization of K-Means clustering Using Genetic Algorithm
    Irfan, Shadab
    Dwivedi, Gaurav
    Ghosh, Subhajit
    2017 INTERNATIONAL CONFERENCE ON COMPUTING AND COMMUNICATION TECHNOLOGIES FOR SMART NATION (IC3TSN), 2017, : 157 - 162
  • [26] A K-means Based Genetic Algorithm for Data Clustering
    Pizzuti, Clara
    Procopio, Nicola
    INTERNATIONAL JOINT CONFERENCE SOCO'16- CISIS'16-ICEUTE'16, 2017, 527 : 211 - 222
  • [27] Genetic Algorithm Based Parallel K-Means Data Clustering Algorithm Using MapReduce Programming Paradigm on Hadoop Environment (GAPKCA)
    Alshammari, Sayer
    Zolkepli, Maslina Binti
    Abdullah, Rusli Bin
    RECENT ADVANCES ON SOFT COMPUTING AND DATA MINING (SCDM 2020), 2020, 978 : 98 - 108
  • [28] NEW ALGORITHM FOR CLUSTERING DISTRIBUTED DATA USING K-MEANS
    Khedr, Ahmed M.
    Bhatnagar, Raj K.
    COMPUTING AND INFORMATICS, 2014, 33 (04) : 943 - 964
  • [29] An Analytic Survey on MapReduce based K-Means and its Hybrid Clustering Algorithms
    Bagde, Utkarsha
    Tripathi, Priyanka
    PROCEEDINGS OF THE 2ND INTERNATIONAL CONFERENCE ON COMPUTING METHODOLOGIES AND COMMUNICATION (ICCMC 2018), 2018, : 32 - 36
  • [30] An Optimal Distributed K-Means Clustering Algorithm Based on CloudStack
    Mao, Yingchi
    Xu, Ziyang
    Li, Xiaofang
    Ping, Ping
    2015 IEEE INTERNATIONAL CONFERENCE ON INFORMATION AND AUTOMATION, 2015, : 3149 - 3156