Data mining on vast data sets as a cluster system benchmark

Cited by: 3

Authors:

Heinecke, Alexander [2]

Karlstetter, Roman [2]

Pflueger, Dirk [1]

Bungartz, Hans-Joachim [2]

Affiliations:

[1] Univ Stuttgart, Inst Parallele & Verteilte Syst, D-70569 Stuttgart, Germany

[2] Tech Univ Munich, Inst Informat, D-85748 Garching, Germany

Source:

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2016 / Volume 28 / Issue 07

Keywords:

data mining; CPU; GPU architectures; platform comparison; Intel Xeon Phi; NVIDIA Kepler

DOI:

10.1002/cpe.3514

Chinese Library Classification (CLC) number:

TP31 [Computer software]

Subject classification codes:

081202; 0835

Abstract:

Comparing different (accelerated) cluster architectures by a single application is a tough piece of work because this application has to be optimized with respect to platform-dependent features. In this work, we demonstrate such an optimization for a data mining algorithm which solves regression and classification problems on vast data sets. Our technique is based on least squares regression, and its major component is the iterative matrix-free solution of a linear system of equations. By processing data sets ranging from several hundreds of thousands instances to multi-million data points in strong-scaling and weak-scaling settings, we are able to estimate the amount of parallelism needed to unleash the performance of classic CPU-based machines and clusters employing Intel Xeon Phi coprocessors and NVIDIA Kepler GPUs. Only in strong-scaling experiments, GPUs and coprocessors suffer from their tremendous amount of needed parallelism and get outperformed by dual socket Intel Sandy Bridge nodes at large scale (more than 64 nodes/accelerators). However, in weak-scaling scenarios, a speed-up larger than 2X over an entire CPU node can be achieved by a single accelerator. Copyright (c) 2015 John Wiley & Sons, Ltd.
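
The abstract names least squares regression with an iterative, matrix-free linear solve as the core of the method but gives no further detail. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes a plain linear model, an optional ridge parameter `lam`, and a conjugate gradient solver applied to the normal equations. All function names here are hypothetical. The point it illustrates is that the system matrix is never stored; only its action on a vector is computed, one data row at a time, which is the part that parallelizes over data points.

```python
# Illustrative sketch only: regularized least squares solved matrix-free
# with conjugate gradients (CG). Names and the linear model are assumptions.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def apply_X(X, v):
    """Compute X v, one data row at a time."""
    return [dot(row, v) for row in X]

def apply_Xt(X, r):
    """Compute X^T r by accumulating contributions of each data row."""
    out = [0.0] * len(X[0])
    for row, ri in zip(X, r):
        for j, xij in enumerate(row):
            out[j] += xij * ri
    return out

def apply_normal_operator(X, v, lam):
    """Compute (X^T X + lam * I) v without ever forming X^T X."""
    w = apply_Xt(X, apply_X(X, v))
    return [wi + lam * vi for wi, vi in zip(w, v)]

def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = apply_A(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def fit(X, y, lam=0.0):
    """Least squares fit: solve (X^T X + lam * I) w = X^T y with CG."""
    rhs = apply_Xt(X, y)
    return conjugate_gradient(lambda v: apply_normal_operator(X, v, lam), rhs)

# Toy data drawn from y = 1 + 2 t; the first column of X is the intercept.
ts = (0.0, 1.0, 2.0, 3.0)
X = [[1.0, t] for t in ts]
y = [1.0 + 2.0 * t for t in ts]
weights = fit(X, y)
```

On this toy data `weights` should come out close to the intercept 1 and slope 2 used to generate it. In a setting like the one the abstract describes, the two row-wise loops would be the natural place to distribute work across nodes or accelerator threads.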

Pages: 2145 - 2165

Page count: 21
