A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

Cited by: 0

Authors:

Ankita Sinha

Prasanta K. Jana

Affiliations:

[1] Department of Computer Science and Engineering, IIT (ISM), Dhanbad

Source:

The Journal of Supercomputing | 2018 / Vol. 74

Keywords:

Mahalanobis distance; Apache Hadoop; K-means++ initialization; Genetic algorithm

DOI:

Not available

Abstract:

Clustering a large volume of data in a distributed environment is challenging: the data stored across multiple machines are huge, and the solution space is large. Genetic algorithms deal effectively with large solution spaces and provide better solutions. In this paper, we propose a novel clustering algorithm for distributed datasets that combines a genetic algorithm (GA) using Mahalanobis distance with the k-means clustering algorithm. The proposed algorithm has two phases. In phase 1, GA is applied in parallel to data chunks located on different machines. Mahalanobis distance is used as the fitness value in GA; it takes the covariance between data points into account and thus provides a better representation of the initial data. In phase 2, k-means with k-means++ initialization is applied to the intermediate output to obtain the final result. The proposed algorithm is implemented on the Hadoop framework, which is inherently designed to handle distributed datasets in a fault-tolerant manner. Extensive experiments were conducted on multiple real-life and synthetic datasets to measure the performance of the proposed algorithm. Results were compared with the MapReduce-based algorithms mrk-means, parallel k-means and scaling GA.
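The two building blocks named in the abstract, a Mahalanobis-distance fitness and k-means++ seeding, can be illustrated with a minimal single-machine sketch in Python with NumPy. This is not the authors' implementation: the abstract does not give the exact fitness formula, so the `fitness` function below (sum of each point's Mahalanobis distance to its nearest candidate center, lower is better) is an assumption, and the function names `mahalanobis`, `fitness` and `kmeans_pp_init` are hypothetical. The Hadoop/MapReduce distribution and the GA operators themselves are omitted.

```python
import numpy as np


def mahalanobis(x, center, cov_inv):
    """Mahalanobis distance between x and center for a given inverse covariance."""
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return float(np.sqrt(diff @ cov_inv @ diff))


def fitness(chunk, centers):
    """Assumed GA fitness (not taken from the paper): total Mahalanobis distance
    from every point in the chunk to its nearest candidate center."""
    cov = np.atleast_2d(np.cov(chunk, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    diffs = chunk[:, None, :] - centers[None, :, :]            # shape (n, k, d)
    d2 = np.einsum("nkd,de,nke->nk", diffs, cov_inv, diffs)    # squared distances
    return float(np.sqrt(np.maximum(d2, 0.0)).min(axis=1).sum())


def kmeans_pp_init(points, k, rng):
    """Standard k-means++ seeding: each new center is drawn with probability
    proportional to its squared Euclidean distance to the nearest chosen center."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        chosen = np.array(centers)
        d2 = ((points[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        total = d2.sum()
        # Fall back to uniform sampling if every point coincides with a center.
        probs = d2 / total if total > 0 else np.full(len(points), 1.0 / len(points))
        centers.append(points[rng.choice(len(points), p=probs)])
    return np.array(centers)


# Usage on synthetic data: two well-separated Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
seeds = kmeans_pp_init(data, 2, rng)
print("seed centers:", seeds)
print("fitness of seeds:", fitness(data, seeds))
```

In a MapReduce setting, a function like `fitness` would plausibly be evaluated per data chunk inside each mapper, with the seeding step run on the merged intermediate centers; that mapping is an illustration only and is not described in the abstract.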

Pages: 1562-1579

Number of pages: 17
