POSTER: The Legio Fault Resilience Framework: Design and Rationale

Authors
Rocco, Roberto [1]
Palermo, Gianluca [1]
Affiliations
[1] Politecnico di Milano, Milan, Italy
Keywords
HPC; MPI; ULFM; Fault Tolerance; Checkpoint/Restart
DOI
10.1145/3587135.3592180
Chinese Library Classification (CLC) number
TP3 [Computing Technology, Computer Technology]
Discipline classification code
0812
Abstract
The increasing size of HPC clusters makes fault management mandatory. The current MPI standard does not specify the behaviour after the occurrence of a fault, precluding any possible solution. In this work, we present Legio, a framework that leverages the functionalities of the ULFM extension to introduce fault resilience properties in MPI applications.
Pages: 205-206
Page count: 2