Fault-Aware Group-Collective Communication Creation and Repair in MPI

Authors
Rocco, Roberto [1]
Palermo, Gianluca [1]
Affiliations
[1] Politecn Milan, Dipartimento Elettron & Informaz, Milan, Italy
Keywords
Fault Management; MPI; ULFM; Design
DOI
10.1007/978-3-031-39698-4_4
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
As HPC systems grow, executions involve more nodes and processes, making faults more frequent. This issue is especially relevant because MPI, the de facto standard for inter-process communication, lacks proper fault management functionality. Past efforts produced extensions to the MPI standard that enable fault management, including ULFM. While it provides powerful tools for handling faults, ULFM still has limitations, such as the collectiveness of the repair procedure. In this paper, we overcome those limitations and achieve fault-aware group-collective communicator creation and repair. We integrate our solution into an existing fault-resiliency framework and measure the overhead in the application code. The experimental campaign shows that our solution is scalable and introduces limited overhead, and that group-collective repair is a viable option for ULFM-based applications.
Pages: 47-61
Page count: 15