Data mining on vast data sets as a cluster system benchmark

Cited by: 3
Authors
Heinecke, Alexander [2]
Karlstetter, Roman [2]
Pflueger, Dirk [1]
Bungartz, Hans-Joachim [2]
Affiliations
[1] Univ Stuttgart, Inst Parallele & Verteilte Syst, D-70569 Stuttgart, Germany
[2] Tech Univ Munich, Inst Informat, D-85748 Garching, Germany
Source
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2016 / Vol. 28 / Issue 07
Keywords
data mining; CPU; GPU architectures; platform comparison; Intel Xeon Phi; NVIDIA Kepler;
DOI
10.1002/cpe.3514
Chinese Library Classification number
TP31 [Computer Software];
Subject classification code
081202 ; 0835 ;
Abstract
Comparing different (accelerated) cluster architectures using a single application is difficult because the application has to be optimized for the platform-dependent features of each architecture. In this work, we demonstrate such an optimization for a data mining algorithm that solves regression and classification problems on vast data sets. Our technique is based on least squares regression, and its major component is the iterative matrix-free solution of a linear system of equations. By processing data sets ranging from several hundred thousand instances to multi-million data points in strong-scaling and weak-scaling settings, we are able to estimate the amount of parallelism needed to unleash the performance of classic CPU-based machines and of clusters employing Intel Xeon Phi coprocessors and NVIDIA Kepler GPUs. In strong-scaling experiments only, GPUs and coprocessors suffer from the tremendous amount of parallelism they need and are outperformed by dual-socket Intel Sandy Bridge nodes at large scale (more than 64 nodes/accelerators). In weak-scaling scenarios, however, a single accelerator can achieve a speed-up of more than 2x over an entire CPU node. Copyright (c) 2015 John Wiley & Sons, Ltd.
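To illustrate what "iterative matrix-free solution" means in a least squares setting, the following is a minimal sketch and not the authors' implementation. It assumes a plain linear model with an optional ridge term `lam` (the basis functions used in the paper are not stated in this record), and it solves the normal equations with a textbook conjugate gradient loop. The function names `matvec_normal`, `conjugate_gradient`, and `fit_least_squares` are hypothetical. The key point is that the system matrix is never stored; only its action on a vector is computed, by two passes over the data.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def matvec_normal(X, v, lam):
    """Apply (X^T X + lam * n * I) to v without forming X^T X.

    X is a list of n rows with d features each (assumed layout).
    """
    n, d = len(X), len(v)
    # First pass over the data: t = X v, one dot product per data point.
    t = [dot(row, v) for row in X]
    # Second pass: accumulate X^T t on top of the regularization term.
    out = [lam * n * v[j] for j in range(d)]
    for row, ti in zip(X, t):
        for j in range(d):
            out[j] += row[j] * ti
    return out


def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the start vector x = 0
    p = list(r)            # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = apply_A(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def fit_least_squares(X, y, lam=0.0):
    """Fit weights w minimizing ||X w - y||^2 + lam * n * ||w||^2."""
    d = len(X[0])
    # Right-hand side X^T y, again computed by streaming over the rows.
    b = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(d)]
    return conjugate_gradient(lambda v: matvec_normal(X, v, lam), b)


# Usage: points on the line y = 1 + 2 * x, with a constant feature for the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = fit_least_squares(X, y)
```

In this form each matrix-vector product is two sweeps over the data points, which is the part that would be distributed across cores, nodes, or accelerators in a parallel setting.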
Pages: 2145-2165
Number of pages: 21
Related papers
50 in total
  • [11] From visualisation to data mining with large data sets
    Adelmann, A
    Ryne, RD
    Shalf, JM
    Siegerist, C
    2005 IEEE PARTICLE ACCELERATOR CONFERENCE (PAC), VOLS 1-4, 2005, : 542 - 544
  • [12] Massive data sets, data mining, and decision support
    Dalal, S
    Dumais, S
    Kettenring, J
    Kurien, V
    McIntosh, A
    Maitra, R
    MINING AND MODELING MASSIVE DATA SETS IN SCIENCE, ENGINEERING, AND BUSINESS WITH A SUBTHEME IN ENVIRONMENTAL STATISTICS, 1997, 29 (01): : 329 - 329
  • [13] Data mining from extreme data sets: Very large and/or very skewed data sets
    Hall, LO
    2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5: E-SYSTEMS AND E-MAN FOR CYBERNETICS IN CYBERSPACE, 2002, : 2555 - 2555
  • [14] MineBench: A benchmark suite for data mining workloads
    Narayanan, Ramanathan
    Ozisikyilmaz, Berkin
    Zambreno, Joseph
    Memik, Gokhan
    Choudhary, Alok
    PROCEEDINGS OF THE IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION, 2006, : 182 - +
  • [15] Cluster Analysis-Based Approaches for Geospatiotemporal Data Mining of Massive Data Sets for Identification of Forest Threats
    Richard Tran Mills
    Hoffman, Forrest M.
    Kumar, Jitendra
    Hargrove, William W.
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS), 2011, 4 : 1612 - 1621
  • [16] Influence of the VM Manager on Private Cluster Data Mining System
    Czerwinski, Dariusz
    COMPUTER NETWORKS, CN 2014, 2014, 431 : 47 - 56
  • [17] Mining HTS data sets.
    Engels, M
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2001, 222 : U408 - U408
  • [18] Rough sets as a framework for data mining
    Butalia, A. H.
    Dhore, M. L.
    IMECS 2007: INTERNATIONAL MULTICONFERENCE OF ENGINEERS AND COMPUTER SCIENTISTS, VOLS I AND II, 2007, : 728 - +
  • [19] An experiment with fuzzy sets in data mining
    Olson, David L.
    Moshkovich, Helen
    Mechitov, Alexander
    COMPUTATIONAL SCIENCE - ICCS 2007, PT 2, PROCEEDINGS, 2007, 4488 : 462 - +
  • [20] A method generating data sets to test data mining algorithms
    School of Information Science and Engineering, Northeastern University, Shenyang 110004, China
    Dongbei Daxue Xuebao, 2008, 3 (328-331):