A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

被引:0
|
作者
Ankita Sinha
Prasanta K. Jana
机构
[1] IIT (ISM),Department of Computer Science and Engineering
[2] Dhanbad,undefined
来源
关键词
Mahalanobis distance; Apache Hadoop; -means++ initialization; Genetic algorithm;
D O I
暂无
中图分类号
学科分类号
摘要
Clustering a large volume of data in a distributed environment is a challenging issue. Data stored across multiple machines are huge in size, and solution space is large. Genetic algorithm deals effectively with larger solution space and provides better solution. In this paper, we proposed a novel clustering algorithm for distributed datasets, using combination of genetic algorithm (GA) with Mahalanobis distance and k-means clustering algorithm. The proposed algorithm is two phased; in phase 1, GA is applied in parallel on data chunks located across different machines. Mahalanobis distance is used as fitness value in GA, which considers covariance between the data points and thus provides a better representation of initial data. K-means with K-means++\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ ++ $$\end{document} initialization is applied in phase 2 on intermediate output to get final result. The proposed algorithm is implemented on Hadoop framework, which is inherently designed to deal with distributed datasets in a fault-tolerant manner. Extensive experiments were conducted for multiple real-life and synthetic datasets to measure performance of our proposed algorithm. Results were compared with MapReduce-based algorithms, mrk-means, parallel k-means and scaling GA.
引用
收藏
页码:1562 / 1579
页数:17
相关论文
共 50 条
  • [31] An Optimal Distributed K-Means Clustering Algorithm Based on CloudStack
    Mao, Yingchi
    Xu, Ziyang
    Ping, Ping
    Wang, Longbao
    2015 NINTH INTERNATIONAL CONFERENCE ON FRONTIER OF COMPUTER SCIENCE AND TECHNOLOGY FCST 2015, 2015, : 386 - 391
  • [32] An Effective and Efficient Clustering Based on K-Means Using MapReduce and TLBO
    Pedireddla, Praveen Kumar
    Yadwad, Sunita A.
    PROCEEDINGS OF THE SECOND INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION TECHNOLOGIES, IC3T 2015, VOL 3, 2016, 381 : 619 - 628
  • [33] Genetic TKM: A Hybrid Clustering Method Based on Genetic Algorithm, Tabu Search and K-Means
    Yaghini, Masoud
    Gereilinia, Nasim
    INTERNATIONAL JOURNAL OF APPLIED METAHEURISTIC COMPUTING, 2013, 4 (01) : 67 - 77
  • [34] A hybrid clustering technique combining a novel genetic algorithm with K-Means
    Rahman, Md Anisur
    Islam, Md Zahidul
    KNOWLEDGE-BASED SYSTEMS, 2014, 71 : 345 - 365
  • [35] A K-means Optimized Clustering Algorithm Based on Improved Genetic Algorithm
    Pu, Qiu-Mei
    Wu, Qiong
    Li, Qian
    Lecture Notes in Electrical Engineering, 2022, 801 LNEE : 133 - 140
  • [36] K-Means and Fuzzy based Hybrid Clustering Algorithm for WSN
    Angadi, Basavaraj M.
    Kakkasageri, Mahabaleshwar S.
    INTERNATIONAL JOURNAL OF ELECTRONICS AND TELECOMMUNICATIONS, 2023, 69 (04) : 793 - 801
  • [37] On K-means Data Clustering Algorithm with Genetic Algorithm
    Kapil, Shruti
    Chawla, Meenu
    Ansari, Mohd Dilshad
    2016 FOURTH INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND GRID COMPUTING (PDGC), 2016, : 202 - 206
  • [38] Clustering with Niching Genetic K-means algorithm
    Sheng, WG
    Tucker, A
    Liu, XH
    GENETIC AND EVOLUTIONARY COMPUTATION GECCO 2004 , PT 2, PROCEEDINGS, 2004, 3103 : 162 - 173
  • [39] Modified K-Means Algorithm for Genetic Clustering
    Bonab, Mohammad Babrdel
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2011, 11 (09): : 24 - 28
  • [40] A k-means based clustering algorithm
    Bloisi, Domenico Daniele
    Locchi, Luca
    COMPUTER VISION SYSTEMS, PROCEEDINGS, 2008, 5008 : 109 - 118