Task Parallel Framework and Its Application in Nested Parallel Algorithms on the SW26010 Many-core Platform

Cited by: 0
Authors
Sun Q. [1 ]
Li L.-S. [1 ]
Zhao H.-T. [1 ]
Zhao H. [1 ]
Wu C.-M. [1 ]
Affiliations
[1] Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing
Source
Wu, Chang-Mao (changmaowu@foxmail.com) | 2021 / Chinese Academy of Sciences / Vol. 32
Keywords
Nested parallel algorithm; Parallel computing; SW26010 many-core CPU; SWAN; Task parallel framework;
DOI
10.13328/j.cnki.jos.006007
Abstract
Task parallelism is one of the fundamental patterns for designing parallel algorithms. Due to algorithm complexity and distinctive hardware features, however, implementing algorithms with task parallelism often remains challenging. For the new SW26010 many-core CPU platform, this study proposes SWAN, a general runtime framework that supports nested task parallelism. SWAN provides high-level abstractions for implementing task parallelism, so that programmers can focus mainly on the algorithm itself and enjoy enhanced productivity. For performance, the shared resources and information managed by SWAN are partitioned in a fine-grained manner to avoid heavy contention among worker threads. SWAN's core data structures exploit the platform's high-bandwidth memory access mechanism, fast on-chip scratchpad memory, and atomic operations to reduce the framework's own overhead. In addition, SWAN provides dynamic load-balancing strategies at runtime to keep all threads fully occupied. In the experiments, a set of recursive, nested-parallel algorithms, including the N-queens problem, binary-tree traversal, quicksort, and convex hull, is implemented with SWAN on the target platform. The results show that each algorithm gains a significant speedup, from 4.5x to 32x, over its serial counterpart, indicating that SWAN offers both high usability and high performance. © Copyright 2021, Institute of Software, the Chinese Academy of Sciences. All rights reserved.
Pages: 2352-2364
Page count: 12