Data mining on vast data sets as a cluster system benchmark

Cited by: 3

Authors:

Heinecke, Alexander ^{[2]}

Karlstetter, Roman ^{[2]}

Pflueger, Dirk ^{[1]}

Bungartz, Hans-Joachim ^{[2]}

Affiliations:

[1] Univ Stuttgart, Inst Parallele & Verteilte Syst, D-70569 Stuttgart, Germany

[2] Tech Univ Munich, Inst Informat, D-85748 Garching, Germany

Source:

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2016 / Vol. 28 / Issue 07

Keywords:

data mining; CPU; GPU architectures; platform comparison; Intel Xeon Phi; NVIDIA Kepler;

DOI:

10.1002/cpe.3514

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Subject classification code:

081202 ; 0835 ;

Abstract:

Comparing different (accelerated) cluster architectures by a single application is a tough piece of work because this application has to be optimized with respect to platform-dependent features. In this work, we demonstrate such an optimization for a data mining algorithm which solves regression and classification problems on vast data sets. Our technique is based on least squares regression, and its major component is the iterative matrix-free solution of a linear system of equations. By processing data sets ranging from several hundred thousand instances to multi-million data points in strong-scaling and weak-scaling settings, we are able to estimate the amount of parallelism needed to unleash the performance of classic CPU-based machines and clusters employing Intel Xeon Phi coprocessors and NVIDIA Kepler GPUs. Only in strong-scaling experiments do GPUs and coprocessors suffer from the tremendous amount of parallelism they need, and they are outperformed by dual-socket Intel Sandy Bridge nodes at large scale (more than 64 nodes/accelerators). However, in weak-scaling scenarios, a speed-up larger than 2X over an entire CPU node can be achieved by a single accelerator. Copyright (c) 2015 John Wiley & Sons, Ltd.
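The abstract names the core computation as least squares regression solved by an iterative, matrix-free method. The following is a minimal sketch of that general idea, not the authors' implementation: it solves the regularized normal equations (B^T B + lam*I) alpha = B^T y with conjugate gradients, applying the operator point by point so the system matrix is never stored. The basis in `features` (a constant plus a linear term), the regularization parameter `lam`, and all function names are assumptions made for illustration; the record does not state which basis or solver the paper uses.

```python
def features(x):
    # Hypothetical basis for illustration: constant and linear term.
    return [1.0, x]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def apply_operator(xs, alpha, lam):
    """Return (B^T B + lam*I) @ alpha without forming B or B^T B.

    Row i of B is features(xs[i]); each data point contributes
    independently, which is where the data parallelism comes from.
    """
    result = [lam * a for a in alpha]
    for x in xs:
        phi = features(x)
        s = dot(phi, alpha)            # (B alpha)_i
        for j, p in enumerate(phi):    # accumulate B^T (B alpha)
            result[j] += p * s
    return result


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=100):
    """Plain conjugate gradients for a symmetric positive definite operator."""
    x = [0.0] * len(b)
    r = list(b)                        # residual for the zero initial guess
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        ap = apply_a(p)
        step = rs / dot(p, ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def fit(xs, ys, lam=0.0):
    """Least squares fit: solve (B^T B + lam*I) alpha = B^T y matrix-free."""
    n_basis = len(features(xs[0]))
    rhs = [0.0] * n_basis
    for x, y in zip(xs, ys):
        for j, p in enumerate(features(x)):
            rhs[j] += p * y            # B^T y
    return conjugate_gradient(lambda v: apply_operator(xs, v, lam), rhs)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]
alpha = fit(xs, ys)                    # coefficients of the fitted line
```

Each operator application costs one pass over the data, so the work per iteration grows with the number of data points while memory stays proportional to the number of basis functions. In this sketch that pass is a sequential loop; on a parallel platform it is the part that would be distributed.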


Pages: 2145-2165

Number of pages: 21

50 records in total

[11] From visualisation to data mining with large data sets
Adelmann, A
Ryne, RD
Shalf, JM
Siegerist, C
2005 IEEE PARTICLE ACCELERATOR CONFERENCE (PAC), VOLS 1-4, 2005, : 542 - 544
[12] Massive data sets, data mining, and decision support
Dalal, S
Dumais, S
Kettenring, J
Kurien, V
McIntosh, A
Maitra, R
MINING AND MODELING MASSIVE DATA SETS IN SCIENCE, ENGINEERING, AND BUSINESS WITH A SUBTHEME IN ENVIRONMENTAL STATISTICS, 1997, 29 (01): : 329 - 329
[13] Data mining from extreme data sets: Very large and/or very skewed data sets
Hall, LO
2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5: E-SYSTEMS AND E-MAN FOR CYBERNETICS IN CYBERSPACE, 2002, : 2555 - 2555
[14] MineBench: A benchmark suite for data mining workloads
Narayanan, Ramanathan
Ozisikyilmaz, Berkin
Zambreno, Joseph
Memik, Gokhan
Choudhary, Alok
PROCEEDINGS OF THE IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION, 2006, : 182 - +
[15] Cluster Analysis-Based Approaches for Geospatiotemporal Data Mining of Massive Data Sets for Identification of Forest Threats
Mills, Richard Tran
Hoffman, Forrest M.
Kumar, Jitendra
Hargrove, William W.
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS), 2011, 4 : 1612 - 1621
[16] Influence of the VM Manager on Private Cluster Data Mining System
Czerwinski, Dariusz
COMPUTER NETWORKS, CN 2014, 2014, 431 : 47 - 56
[17] Mining HTS data sets.
Engels, M
ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2001, 222 : U408 - U408
[18] Rough sets as a framework for data mining
Butalia, A. H.
Dhore, M. L.
IMECS 2007: INTERNATIONAL MULTICONFERENCE OF ENGINEERS AND COMPUTER SCIENTISTS, VOLS I AND II, 2007, : 728 - +
[19] An experiment with fuzzy sets in data mining
Olson, David L.
Moshkovich, Helen
Mechitov, Alexander
COMPUTATIONAL SCIENCE - ICCS 2007, PT 2, PROCEEDINGS, 2007, 4488 : 462 - +
[20] A method generating data sets to test data mining algorithms
School of Information Science and Engineering, Northeastern University, Shenyang 110004, China
Dongbei Daxue Xuebao, 2008, 3 (328-331):
