Multi-tasking Execution in PGAS Language XcalableMP and Communication Optimization on Many-core Clusters

Cited by: 4
Authors
Tsugane, Keisuke [1]
Lee, Jinpil [2]
Murai, Hitoshi [2]
Sato, Mitsuhisa [1, 2]
Affiliations
[1] Univ Tsukuba, Grad Sch Syst & Informat Engn, Ibaraki, Japan
[2] RIKEN, Adv Inst Computat Sci, Kobe, Hyogo, Japan
Keywords
Task Parallelism; Many-core cluster; PGAS; XcalableMP
DOI
10.1145/3149457.3154482
Chinese Library Classification number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
Large-scale clusters based on many-core processors such as Intel Xeon Phi have recently been deployed. Multi-tasking execution using task dependencies in OpenMP 4.0 is a promising candidate for facilitating parallelization on such many-core processors, because it enables users to avoid global synchronization through fine-grained task-to-task synchronization using user-specified data dependencies. Recently, the partitioned global address space (PGAS) model has emerged as a usable distributed-memory programming model. In this paper, we propose a multi-tasking execution model in the PGAS language XcalableMP (XMP) for many-core clusters. The model provides a method to describe interactions between tasks based on point-to-point communications on the global address space. A communication is executed non-collectively among nodes. We implemented the proposed execution model in XMP and designed a simple code transformation algorithm to MPI and OpenMP. We implemented two benchmarks using our model for preliminary evaluation, namely blocked Cholesky factorization and the Laplace equation solver. Most of the implementations using our model outperform the conventional barrier-based data-parallel model. To improve performance on many-core clusters, we propose a communication optimization method that dedicates a single thread to communications, to avoid performance problems related to the current multi-threaded MPI execution. As a result, the performance of blocked Cholesky factorization and of the Laplace equation solver using this communication optimization is improved to 138% and 119%, respectively, compared with the barrier-based implementation on Intel Xeon Phi KNL clusters. From the viewpoint of productivity, the program implemented with our model in XMP is almost the same as the implementation based on the OpenMP task depend clause, because XMP, like OpenMP, enables the parallelization of serial source code with additional directives and small changes.
Pages: 75-85
Number of pages: 11