Data mining on vast data sets as a cluster system benchmark

Cited by: 3
Authors
Heinecke, Alexander [2]
Karlstetter, Roman [2]
Pflueger, Dirk [1]
Bungartz, Hans-Joachim [2]
Affiliations
[1] Univ Stuttgart, Inst Parallele & Verteilte Syst, D-70569 Stuttgart, Germany
[2] Tech Univ Munich, Inst Informat, D-85748 Garching, Germany
Keywords
data mining; CPU; GPU architectures; platform comparison; Intel Xeon Phi; NVIDIA Kepler
DOI
10.1002/cpe.3514
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Comparing different (accelerated) cluster architectures using a single application is difficult because the application has to be optimized for the platform-dependent features of each architecture. In this work, we demonstrate such an optimization for a data mining algorithm that solves regression and classification problems on vast data sets. Our technique is based on least squares regression, and its major component is the iterative, matrix-free solution of a linear system of equations. By processing data sets ranging from several hundred thousand instances to multi-million data points in strong-scaling and weak-scaling settings, we are able to estimate the amount of parallelism needed to unleash the performance of classic CPU-based machines and of clusters employing Intel Xeon Phi coprocessors and NVIDIA Kepler GPUs. Only in strong-scaling experiments do GPUs and coprocessors suffer from the tremendous amount of parallelism they require, and they are outperformed by dual-socket Intel Sandy Bridge nodes at large scale (more than 64 nodes/accelerators). In weak-scaling scenarios, however, a single accelerator can achieve a speed-up of more than 2x over an entire CPU node. Copyright (c) 2015 John Wiley & Sons, Ltd.
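To illustrate what an "iterative matrix-free solution" of a least squares system can look like, the following is a minimal sketch and not the authors' implementation. It solves the regularized normal equations (B^T B / m + lam * I) alpha = B^T y / m with conjugate gradients, re-evaluating the entries of B on the fly instead of storing the matrix. The one-dimensional hat-function basis, the regularization parameter `lam`, and all function names are assumptions introduced for illustration; the abstract does not specify them.

```python
import math


def hat(x, center, width):
    """Piecewise-linear hat basis function (hypothetical choice of basis)."""
    return max(0.0, 1.0 - abs(x - center) / width)


def apply_operator(points, centers, width, lam, v):
    """Return (B^T B / m + lam * I) v without ever storing B.

    B[i][j] = hat(points[i], centers[j], width) is recomputed per data point.
    """
    m = len(points)
    result = [lam * vj for vj in v]
    for x in points:
        row = [hat(x, c, width) for c in centers]
        s = sum(r * vj for r, vj in zip(row, v))   # (B v)_i
        for j, r in enumerate(row):
            result[j] += r * s / m                  # accumulate B^T (B v)
    return result


def right_hand_side(points, targets, centers, width):
    """Return B^T y / m, again evaluating B on the fly."""
    m = len(points)
    b = [0.0] * len(centers)
    for x, y in zip(points, targets):
        for j, c in enumerate(centers):
            b[j] += hat(x, c, width) * y / m
    return b


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Standard CG for a symmetric positive definite operator apply_a."""
    x = [0.0] * len(b)
    r = list(b)                 # residual for the zero initial guess
    p = list(r)
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if math.sqrt(rs_old) < tol:
            break
        ap = apply_a(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def fit(points, targets, centers, width, lam=1e-6):
    """Fit basis coefficients by matrix-free regularized least squares."""
    b = right_hand_side(points, targets, centers, width)
    return conjugate_gradient(
        lambda v: apply_operator(points, centers, width, lam, v), b
    )


def predict(x, centers, width, coeffs):
    """Evaluate the fitted function at x."""
    return sum(a * hat(x, c, width) for a, c in zip(coeffs, centers))
```

In this sketch the cost of each iteration is dominated by the loop over data points inside `apply_operator`, which is independent per point. That loop is the kind of work that would be distributed across CPU cores, coprocessors, or GPUs, and its size determines how much parallelism is available in strong-scaling versus weak-scaling runs.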
Pages: 2145-2165
Page count: 21