Data mining on vast data sets as a cluster system benchmark

Cited by: 3
Authors
Heinecke, Alexander [2 ]
Karlstetter, Roman [2 ]
Pflueger, Dirk [1 ]
Bungartz, Hans-Joachim [2 ]
Affiliations
[1] Univ Stuttgart, Inst Parallele & Verteilte Syst, D-70569 Stuttgart, Germany
[2] Tech Univ Munich, Inst Informat, D-85748 Garching, Germany
Source
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2016, Vol. 28, No. 07
Keywords
data mining; CPU; GPU architectures; platform comparison; Intel Xeon Phi; NVIDIA Kepler;
DOI
10.1002/cpe.3514
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Comparing different (accelerated) cluster architectures with a single application is difficult because the application has to be optimized with respect to platform-dependent features. In this work, we demonstrate such an optimization for a data mining algorithm that solves regression and classification problems on vast data sets. Our technique is based on least squares regression, and its major component is the iterative matrix-free solution of a linear system of equations. By processing data sets ranging from several hundred thousand instances to multi-million data points in strong-scaling and weak-scaling settings, we are able to estimate the amount of parallelism needed to unleash the performance of classic CPU-based machines and of clusters employing Intel Xeon Phi coprocessors and NVIDIA Kepler GPUs. Only in strong-scaling experiments do GPUs and coprocessors suffer from the tremendous amount of parallelism they need, and they are outperformed by dual-socket Intel Sandy Bridge nodes at large scale (more than 64 nodes/accelerators). In weak-scaling scenarios, however, a single accelerator can achieve a speed-up larger than 2x over an entire CPU node. Copyright (c) 2015 John Wiley & Sons, Ltd.
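The abstract's central idea, an iterative matrix-free solve of the least squares system, can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain linear features, a ridge-style regularization parameter `lam`, and a textbook conjugate gradient loop, and the names `matvec`, `conjugate_gradient`, and `fit` are hypothetical. The point it shows is that the system matrix is never stored; only its action on a vector is computed from the data in two passes.

```python
def matvec(X, v, lam):
    """Apply (X^T X / n + lam * I) to v without ever forming X^T X.

    Assumption: X is a list of feature rows (plain linear features),
    not the basis used in the paper.
    """
    n = len(X)
    # First pass: X v, one scalar per data point.
    Xv = [sum(xj * vj for xj, vj in zip(row, v)) for row in X]
    # Second pass: accumulate X^T (X v) / n on top of the regularization term.
    out = [lam * vj for vj in v]
    for row, s in zip(X, Xv):
        for j, xj in enumerate(row):
            out[j] += xj * s / n
    return out


def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=1000):
    """Textbook CG for a symmetric positive definite operator apply_A."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0
    p = list(r)          # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = apply_A(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def fit(X, y, lam=0.0):
    """Solve (X^T X / n + lam * I) w = X^T y / n with matrix-free CG."""
    n, d = len(X), len(X[0])
    b = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(d)]
    return conjugate_gradient(lambda v: matvec(X, v, lam), b)


# Usage: recover intercept and slope of y = 1 + 2 t from four points.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = fit(X, y)
```

Each call to `matvec` touches every data point exactly twice and the passes over the rows are independent, which is the kind of data parallelism the scaling experiments in the abstract depend on.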
Pages: 2145-2165
Page count: 21
Related papers
50 in total
  • [41] The Research of High Efficient Data Mining Algorithms for Massive Data Sets
    Tao Cuixia
    MECHATRONICS ENGINEERING, COMPUTING AND INFORMATION TECHNOLOGY, 2014, 556-562 : 3901 - 3904
  • [42] Handling of incomplete data sets using ICA and SOM in data mining
    Hongyi Peng
    Siming Zhu
    Neural Computing and Applications, 2007, 16 : 167 - 172
  • [43] Mining for empty rectangles in large data sets
    Edmonds, J
    Gryz, J
    Liang, DM
    Miller, RJ
    DATABASE THEORY - ICDT 2001, PROCEEDINGS, 2001, 1973 : 174 - 188
  • [44] Neighborhood Rough Sets for Dynamic Data Mining
    Zhang, Junbo
    Li, Tianrui
    Ruan, Da
    Liu, Dun
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2012, 27 (04) : 317 - 342
  • [45] Mining bi-sets in numerical data
    Besson, Jeremy
    Robardet, Celine
    De Raedt, Luc
    Boulicaut, Jean-Francois
    KNOWLEDGE DISCOVERY IN INDUCTIVE DATABASES, 2007, 4747 : 11 - +
  • [46] The research of data mining based on extension sets
    Lu, Q
    Yu, YQ
    Third International Conference on Information Technology and Applications, Vol 2, Proceedings, 2005, : 234 - 237
  • [47] On generalized quantifiers, finite sets and data mining
    Hájek, P
    INTELLIGENT INFORMATION PROCESSING AND WEB MINING, 2003, : 489 - 496
  • [48] Mining knowledge in astrophysical massive data sets
    Brescia, Massimo
    Longo, Giuseppe
    Pasian, Fabio
    NUCLEAR INSTRUMENTS & METHODS IN PHYSICS RESEARCH SECTION A-ACCELERATORS SPECTROMETERS DETECTORS AND ASSOCIATED EQUIPMENT, 2010, 623 (02): : 845 - 849
  • [49] Warehousing and mining massive RFID data sets
    Han, Jiawei
    Gonzalez, Hector
    Li, Xiaolei
    Klabjan, Diego
    ADVANCED DATA MINING AND APPLICATIONS, PROCEEDINGS, 2006, 4093 : 1 - 18
  • [50] Fuzzy sets for data mining and recommendation algorithms
    Man, Na
    Wang, Kechao
    Liu, Lin
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2020, 38 (04) : 3737 - 3745