Data mining on vast data sets as a cluster system benchmark

Cited by: 3
Authors
Heinecke, Alexander [2 ]
Karlstetter, Roman [2 ]
Pflueger, Dirk [1 ]
Bungartz, Hans-Joachim [2 ]
Affiliations
[1] Univ Stuttgart, Inst Parallele & Verteilte Syst, D-70569 Stuttgart, Germany
[2] Tech Univ Munich, Inst Informat, D-85748 Garching, Germany
Source
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2016 / Vol. 28 / Issue 07
Keywords
data mining; CPU; GPU architectures; platform comparison; Intel Xeon Phi; NVIDIA Kepler;
DOI
10.1002/cpe.3514
Chinese Library Classification (CLC) number
TP31 [Computer software];
Discipline classification code
081202; 0835
Abstract
Comparing different (accelerated) cluster architectures using a single application is challenging because the application has to be optimized with respect to platform-dependent features. In this work, we demonstrate such an optimization for a data mining algorithm that solves regression and classification problems on vast data sets. Our technique is based on least squares regression, and its major component is the iterative matrix-free solution of a linear system of equations. By processing data sets ranging from several hundred thousand instances to multi-million data points in strong-scaling and weak-scaling settings, we are able to estimate the amount of parallelism needed to unleash the performance of classic CPU-based machines and clusters employing Intel Xeon Phi coprocessors and NVIDIA Kepler GPUs. Only in strong-scaling experiments do GPUs and coprocessors suffer from the tremendous amount of parallelism they need, and they are outperformed by dual-socket Intel Sandy Bridge nodes at large scale (more than 64 nodes/accelerators). However, in weak-scaling scenarios, a single accelerator can achieve a speed-up of more than 2x over an entire CPU node. Copyright (c) 2015 John Wiley & Sons, Ltd.
Pages: 2145 - 2165
Page count: 21
Related papers
50 in total
  • [21] AMBiDDS: A system for Automatic Mining of BIg Discrete Data-Sets
    Agren, Ola
    2015 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE AND COMPUTATIONAL INTELLIGENCE (CSCI), 2015, : 424 - 427
  • [22] Application of rough sets in power system control center data mining
    Lambert-Torres, G
    2002 IEEE POWER ENGINEERING SOCIETY WINTER MEETING, VOLS 1 AND 2, CONFERENCE PROCEEDINGS, 2002, : 627 - 631
  • [23] Data mining of large high throughput screening data sets
    Young, SS
    Rusinko, A
    DIMENSION REDUCTION, COMPUTATIONAL COMPLEXITY AND INFORMATION, 1998, 30 : 543 - 543
  • [24] Benchmark data sets on noncovalent interaction energies and their accuracy
    Rezac, Jan
    Riley, Kevin E.
    Hobza, Pavel
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2013, 245
  • [26] VariBench, new variation benchmark categories and data sets
    Shirvanizadeh, Niloofar
    Vihinen, Mauno
    FRONTIERS IN BIOINFORMATICS, 2023, 3
  • [27] A novel data structure for efficient representation of large data sets in data mining
    Pai, Radhika M.
    Ananthanarayana, V. S.
    2006 INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING AND COMMUNICATIONS, VOLS 1 AND 2, 2007, : 533 - 538
  • [28] Pattern and Cluster Mining on Text Data
    Agnihotri, Deepak
    Verma, Kesari
    Tripathi, Priyanka
    2014 FOURTH INTERNATIONAL CONFERENCE ON COMMUNICATION SYSTEMS AND NETWORK TECHNOLOGIES (CSNT), 2014, : 428 - 432
  • [29] Technique of Cluster analysis in Data mining
    Yu, WenYang
    Yang, YuBing
    Wu, XianWei
    PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON APPLIED SCIENCE AND ENGINEERING INNOVATION, 2015, 12 : 1089 - 1092
  • [30] Techniques of Cluster Algorithms in Data Mining
    Johannes Grabmeier
    Andreas Rudolph
    Data Mining and Knowledge Discovery, 2002, 6 : 303 - 360