POSTER: The Legio Fault Resilience Framework: Design and Rationale

Citations: 0
Authors
Rocco, Roberto [1]
Palermo, Gianluca [1]
Affiliation
[1] Politecnico di Milano, Milan, Italy
Keywords
HPC; MPI; ULFM; Fault Tolerance; Checkpoint/Restart
DOI
10.1145/3587135.3592180
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The increasing size of HPC clusters makes fault management mandatory. The current MPI standard does not specify the behaviour of an application after a fault occurs, which precludes any possible solution. In this work, we present Legio, a framework that leverages the functionality of the ULFM extension to introduce fault resilience properties into MPI applications.
Pages: 205-206
Page count: 2