Data mining on vast data sets as a cluster system benchmark

Cited by: 3

Authors:

Heinecke, Alexander [2]

Karlstetter, Roman [2]

Pflueger, Dirk [1]

Bungartz, Hans-Joachim [2]

Affiliations:

[1] Univ Stuttgart, Inst Parallele & Verteilte Syst, D-70569 Stuttgart, Germany

[2] Tech Univ Munich, Inst Informat, D-85748 Garching, Germany

Source:

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2016 / Vol. 28 / Issue 07

Keywords:

data mining; CPU; GPU architectures; platform comparison; Intel Xeon Phi; NVIDIA Kepler

DOI:

10.1002/cpe.3514

Chinese Library Classification (CLC) number:

TP31 [Computer software]

Subject classification codes:

081202; 0835

Abstract:

Comparing different (accelerated) cluster architectures by a single application is a tough piece of work because this application has to be optimized with respect to platform-dependent features. In this work, we demonstrate such an optimization for a data mining algorithm which solves regression and classification problems on vast data sets. Our technique is based on least squares regression, and its major component is the iterative matrix-free solution of a linear system of equations. By processing data sets ranging from several hundred thousand instances to multi-million data points in strong-scaling and weak-scaling settings, we are able to estimate the amount of parallelism needed to unleash the performance of classic CPU-based machines and clusters employing Intel Xeon Phi coprocessors and NVIDIA Kepler GPUs. Only in strong-scaling experiments, GPUs and coprocessors suffer from their tremendous amount of needed parallelism and get outperformed by dual socket Intel Sandy Bridge nodes at large scale (more than 64 nodes/accelerators). However, in weak-scaling scenarios, a speed-up larger than 2x over an entire CPU node can be achieved by a single accelerator. Copyright (c) 2015 John Wiley & Sons, Ltd.
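
The "iterative matrix-free solution of a linear system" named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a plain feature matrix `X`, a ridge-style regularization parameter `lam`, and the conjugate gradient method as the iterative solver; none of these are specified in the abstract, and the function names are hypothetical. The point it shows is that the system matrix X^T X / n + lam * I is never stored: only its action on a vector is computed, through two passes over the data.

```python
import math


def dot(a, b):
    """Plain dot product of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))


def apply_operator(X, v, lam):
    """Return (X^T X / n + lam * I) v without ever forming X^T X.

    Assumption: X is a list of n rows with d features each.
    """
    n = len(X)
    # First pass over the data: t = X v (one dot product per data point).
    t = [dot(row, v) for row in X]
    # Second pass: accumulate X^T t / n on top of the regularization term.
    out = [lam * v_j for v_j in v]
    for row, t_i in zip(X, t):
        for j, x_ij in enumerate(row):
            out[j] += x_ij * t_i / n
    return out


def solve_least_squares_cg(X, y, lam=1e-6, tol=1e-10, max_iter=1000):
    """Solve (X^T X / n + lam * I) w = X^T y / n with conjugate gradients."""
    n, d = len(X), len(X[0])
    # Right-hand side b = X^T y / n, again computed row by row.
    b = [0.0] * d
    for row, y_i in zip(X, y):
        for j, x_ij in enumerate(row):
            b[j] += x_ij * y_i / n

    w = [0.0] * d
    r = b[:]          # residual b - A w for the initial guess w = 0
    p = r[:]          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rs_old) < tol:
            break
        Ap = apply_operator(X, p, lam)
        alpha = rs_old / dot(p, Ap)
        w = [w_j + alpha * p_j for w_j, p_j in zip(w, p)]
        r = [r_j - alpha * Ap_j for r_j, Ap_j in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [r_j + beta * p_j for r_j, p_j in zip(r, p)]
        rs_old = rs_new
    return w
```

In this sketch each solver iteration costs two sweeps over all data points, and every row contributes independently to both sweeps, so the work per iteration grows with the data set size.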


Pages: 2145-2165

Page count: 21

