A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

Cited by: 0

Authors:

Ankita Sinha

Prasanta K. Jana

Affiliations:

[1] IIT (ISM), Department of Computer Science and Engineering

[2] Dhanbad

Source:

The Journal of Supercomputing | 2018 / Vol. 74

Keywords:

Mahalanobis distance; Apache Hadoop; k-means++ initialization; Genetic algorithm

DOI: Not available

Abstract:

Clustering a large volume of data in a distributed environment is challenging: the data stored across multiple machines are huge, and the solution space is large. Genetic algorithms deal effectively with large solution spaces and can provide better solutions. In this paper, we propose a novel clustering algorithm for distributed datasets that combines a genetic algorithm (GA) using Mahalanobis distance with the k-means clustering algorithm. The proposed algorithm has two phases. In phase 1, the GA is applied in parallel to data chunks located on different machines. Mahalanobis distance is used as the fitness value in the GA; because it accounts for the covariance between data points, it provides a better representation of the initial data. In phase 2, k-means with k-means++ initialization is applied to the intermediate output to obtain the final result. The proposed algorithm is implemented on the Hadoop framework, which is inherently designed to handle distributed datasets in a fault-tolerant manner. Extensive experiments were conducted on multiple real-life and synthetic datasets to measure the performance of the proposed algorithm. The results were compared with those of the MapReduce-based algorithms mrk-means, parallel k-means and scaling GA.
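The two building blocks named in the abstract, Mahalanobis distance and k-means++ initialization, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `mahalanobis_2d` and `kmeans_pp_init` are hypothetical, the distance is restricted to two dimensions so that the covariance matrix can be inverted by hand, and the genetic algorithm and the MapReduce/Hadoop layer are omitted because the abstract does not specify them.

```python
import math
import random


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points x and y.

    cov is a 2x2 covariance matrix given as ((a, b), (c, d)).
    The 2-D restriction is an assumption made to keep the sketch short.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form (x - y)^T * inv(cov) * (x - y).
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(q)


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding over a list of point tuples.

    The first centre is drawn uniformly; each further centre is drawn with
    probability proportional to its squared Euclidean distance from the
    nearest centre chosen so far.
    """
    centres = [rng.choice(points)]
    while len(centres) < k:
        # Squared distance from every point to its nearest chosen centre.
        d2 = [
            min(sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centres)
            for p in points
        ]
        if sum(d2) == 0:
            # All points coincide with chosen centres; fall back to uniform.
            centres.append(rng.choice(points))
        else:
            centres.append(rng.choices(points, weights=d2, k=1)[0])
    return centres


if __name__ == "__main__":
    identity = ((1.0, 0.0), (0.0, 1.0))
    # With an identity covariance the result equals the Euclidean distance.
    print(mahalanobis_2d((0.0, 0.0), (3.0, 4.0), identity))
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans_pp_init(data, 2, random.Random(0)))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance; a larger variance along one axis shrinks distances measured along that axis, which is how covariance enters the comparison between points.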

Citation

Pages: 1562-1579

Page count: 17
