Data mining on vast data sets as a cluster system benchmark

Cited by: 3

Authors:

Heinecke, Alexander [2]

Karlstetter, Roman [2]

Pflueger, Dirk [1]

Bungartz, Hans-Joachim [2]

Affiliations:

[1] Univ Stuttgart, Inst Parallele & Verteilte Syst, D-70569 Stuttgart, Germany

[2] Tech Univ Munich, Inst Informat, D-85748 Garching, Germany

Source:

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2016 / Vol. 28 / Issue 07

Keywords:

data mining; CPU; GPU architectures; platform comparison; Intel Xeon Phi; NVIDIA Kepler

DOI:

10.1002/cpe.3514

Chinese Library Classification (CLC) number:

TP31 [Computer Software]

Discipline classification codes:

081202; 0835

Abstract:

Comparing different (accelerated) cluster architectures with a single application is difficult because the application has to be optimized for the platform-dependent features of each architecture. In this work, we demonstrate such an optimization for a data mining algorithm that solves regression and classification problems on vast data sets. Our technique is based on least squares regression, and its major component is the iterative matrix-free solution of a system of linear equations. By processing data sets ranging from several hundred thousand instances to multi-million data points in strong-scaling and weak-scaling settings, we estimate the amount of parallelism needed to unleash the performance of classic CPU-based machines and of clusters employing Intel Xeon Phi coprocessors and NVIDIA Kepler GPUs. Only in strong-scaling experiments do GPUs and coprocessors suffer from the tremendous amount of parallelism they need; there they are outperformed by dual-socket Intel Sandy Bridge nodes at large scale (more than 64 nodes/accelerators). In weak-scaling scenarios, however, a single accelerator can achieve a speed-up of more than 2x over an entire CPU node. Copyright (c) 2015 John Wiley & Sons, Ltd.
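The "iterative matrix-free solution" mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the basis functions or the solver, so the sketch assumes one-dimensional hat functions on a regular grid, a ridge-style regularization parameter `lam`, and a plain conjugate gradient iteration. All function names (`hat`, `apply_operator`, `fit`, `predict`) are hypothetical. The point it shows is that the system matrix is never stored; only its action on a vector is computed by streaming over the data points.

```python
def hat(j, n, x):
    """Assumed basis: hat function centred at grid node j/(n-1) on [0, 1]."""
    h = 1.0 / (n - 1)
    return max(0.0, 1.0 - abs(x - j * h) / h)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def apply_operator(alpha, xs, lam):
    """Return (B B^T / m + lam * I) alpha without ever storing B.

    B[j][i] = hat(j, n, xs[i]) is re-evaluated on the fly for each data point.
    """
    n, m = len(alpha), len(xs)
    out = [lam * a for a in alpha]
    for x in xs:
        phis = [hat(j, n, x) for j in range(n)]
        fx = dot(phis, alpha)              # model value at x, i.e. (B^T alpha)_i
        for j in range(n):
            out[j] += phis[j] * fx / m     # accumulate B (B^T alpha) / m
    return out


def right_hand_side(xs, ys, n):
    """Return B y / m, again by streaming over the data."""
    m = len(xs)
    rhs = [0.0] * n
    for x, y in zip(xs, ys):
        for j in range(n):
            rhs[j] += hat(j, n, x) * y / m
    return rhs


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)                            # residual b - A x for x = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        ap = apply_a(p)
        step = rs / dot(p, ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def fit(xs, ys, n=9, lam=1e-8):
    """Regularized least squares fit with n assumed hat basis functions."""
    b = right_hand_side(xs, ys, n)
    return conjugate_gradient(lambda v: apply_operator(v, xs, lam), b)


def predict(alpha, x):
    n = len(alpha)
    return sum(a * hat(j, n, x) for j, a in enumerate(alpha))


if __name__ == "__main__":
    xs = [i / 199 for i in range(200)]
    ys = [2.0 * x + 1.0 for x in xs]       # synthetic data, not from the paper
    alpha = fit(xs, ys)
    print(predict(alpha, 0.3))
```

In this sketch the cost of each operator application grows with the number of data points times the number of basis functions, and the loop over data points is independent per point. That is the kind of loop whose available parallelism would determine how well accelerators are utilized, though the mapping to the hardware in the paper is not described in the abstract.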


Pages: 2145-2165

Page count: 21
