Multi-tasking Execution in PGAS Language XcalableMP and Communication Optimization on Many-core Clusters

Cited by: 4
Authors
Tsugane, Keisuke [1]
Lee, Jinpil [2]
Murai, Hitoshi [2]
Sato, Mitsuhisa [1, 2]
Affiliations
[1] Univ Tsukuba, Grad Sch Syst & Informat Engn, Ibaraki, Japan
[2] RIKEN, Adv Inst Computat Sci, Kobe, Hyogo, Japan
Keywords
Task Parallelism; Many-core cluster; PGAS; XcalableMP
DOI
10.1145/3149457.3154482
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Large-scale clusters based on many-core processors such as Intel Xeon Phi have recently been deployed. Multi-tasking execution using task dependencies in OpenMP 4.0 is a promising candidate for facilitating the parallelization of such many-core processors, because it enables users to avoid global synchronization through fine-grained task-to-task synchronization using user-specified data dependencies. Recently, the partitioned global address space (PGAS) model has emerged as a usable distributed-memory programming model. In this paper, we propose a multi-tasking execution model in the PGAS language XcalableMP (XMP) for many-core clusters. The model provides a method to describe interactions between tasks based on point-to-point communications on the global address space. A communication is executed non-collectively among nodes. We implemented the proposed execution model in XMP, and designed a simple code transformation algorithm to MPI and OpenMP. We implemented two benchmarks using our model for preliminary evaluation, namely blocked Cholesky factorization and the Laplace equation solver. Most of the implementations using our model outperform the conventional barrier-based data-parallel model. To improve the performance on many-core clusters, we propose a communication optimization method that dedicates a single thread to communications, to avoid performance problems related to the current multi-threaded MPI execution. As a result, the performance of blocked Cholesky factorization and the Laplace equation solver using this communication optimization improves to 138% and 119%, respectively, of that of the barrier-based implementation on Intel Xeon Phi KNL clusters. From the viewpoint of productivity, the program implemented by our model in XMP is almost the same as the implementation based on the OpenMP task depend clause, because XMP, like OpenMP, enables the parallelization of serial source code with additional directives and small changes.
Pages: 75-85
Page count: 11