Tiled-MapReduce: Optimizing Resource Usages of Data-parallel Applications on Multicore with Tiling

Cited by: 56
Authors
Chen, Rong [1 ]
Chen, Haibo [1 ]
Zang, Binyu [1 ]
Affiliations
[1] Fudan Univ, Parallel Proc Inst, Shanghai, Peoples R China
Keywords
MapReduce; Tiled-MapReduce; Tiling; Multicore; CORE;
DOI
10.1145/1854273.1854337
Chinese Library Classification
TP3 [Computing technology, computer technology]
Subject Classification Code
0812
Abstract
The prevalence of chip multiprocessors opens opportunities to run data-parallel applications, originally designed for clusters, on a single machine with many cores. MapReduce, a simple and elegant programming model for large-scale clusters, has recently been shown to be a promising alternative for harnessing the multicore platform. Differences between clusters and multicore platforms, such as the memory hierarchy and communication patterns, raise new challenges in designing and implementing an efficient MapReduce system on multicore. This paper argues that on shared-memory multicore platforms it is more efficient for MapReduce to iteratively process small chunks of data in turn than to process one large chunk of data at a time. Based on this argument, we extend the general MapReduce programming model with a "tiling strategy", called Tiled-MapReduce (TMR). TMR partitions a large MapReduce job into a number of small sub-jobs and iteratively processes one sub-job at a time with efficient use of resources; TMR finally merges the results of all sub-jobs for output. Based on Tiled-MapReduce, we design and implement several optimizing techniques targeting multicore, including the reuse of input and intermediate data structures among sub-jobs, a NUCA/NUMA-aware scheduler, and pipelining a sub-job's reduce phase with the successive sub-job's map phase, to optimize the memory, cache, and CPU resources accordingly. We have implemented a prototype of Tiled-MapReduce based on Phoenix, an already highly optimized MapReduce runtime for shared-memory multiprocessors. The prototype, named Ostrich, runs on an Intel machine with 16 cores. Experiments on four different types of benchmarks show that Ostrich saves up to 85% of memory, causes fewer cache misses, and makes more efficient use of CPU cores, resulting in a speedup ranging from 1.2X to 3.3X.
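The tiling strategy described in the abstract can be illustrated with a minimal Python sketch: instead of materializing the intermediate key/value pairs for the entire input, each small tile is mapped and reduced as a sub-job, and its partial result is folded into a running merged result, bounding intermediate-data memory to one tile. This is a hypothetical illustration of the general idea only, not the Ostrich implementation; the function names (`tiled_mapreduce`, `wc_map`) and the use of `Counter` for the reduce/merge steps are assumptions made for the example.

```python
from collections import Counter
from itertools import islice

def tiled_mapreduce(records, map_fn, tile_size):
    """Sketch of the tiling idea: process the input in small tiles
    (sub-jobs) and merge each sub-job's reduce output into a running
    result, instead of mapping the whole input at once."""
    result = Counter()          # merged output of all sub-jobs so far
    it = iter(records)
    while True:
        tile = list(islice(it, tile_size))   # one sub-job's input chunk
        if not tile:
            break
        partial = Counter()     # this sub-job's reduce phase
        for rec in tile:
            for key, value in map_fn(rec):   # map phase over the tile
                partial[key] += value
        result.update(partial)  # final-merge step for this sub-job
    return dict(result)

# Example: word count, the canonical MapReduce workload.
def wc_map(line):
    return [(w, 1) for w in line.split()]

lines = ["a b a", "b c", "a"]
print(tiled_mapreduce(lines, wc_map, tile_size=2))
# {'a': 3, 'b': 2, 'c': 1}
```

Because each tile's intermediate pairs are discarded after the merge, peak intermediate memory is proportional to the tile size rather than to the whole input, which is the resource saving the paper targets (the paper's runtime additionally reuses buffers across sub-jobs and pipelines a sub-job's reduce with the next sub-job's map).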
Pages: 523 - 534
Number of pages: 12