RADIC Based Fault Tolerance System with Dynamic Resource Controller

Authors
Villamayor, Jorge [1]
Rexachs, Dolores [1]
Luque, Emilio [1]
Affiliations
[1] Univ Autonoma Barcelona, CAOS Comp Architecture & Operating Syst, Barcelona, Spain
Source
COMPUTATIONAL SCIENCE - ICCS 2018, PT III | 2018 / Vol. 10862
Keywords
High-Performance Computing; Fault Tolerance; Application layer FT; Sender-based message logging;
DOI
10.1007/978-3-319-93713-7_58
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
The continuously growing requirements of High-Performance Computing increase the number of components and, with it, the probability of failure. Long-running parallel applications are directly affected by this phenomenon, since failures disrupt their execution. MPI, a well-known standard for parallel applications, follows fail-stop semantics, requiring application owners to restart the whole execution when hard failures appear, losing time and computation data. Fault Tolerance (FT) techniques address this issue by providing high availability to users' application executions, though at significant resource and time cost. In this paper, we present a Fault Tolerance Manager (FTM) framework based on the RADIC architecture, which provides FT protection to parallel applications implemented with MPI so that they complete successfully despite failures. The solution is implemented in the application layer and follows the uncoordinated and semi-coordinated rollback-recovery protocols. It uses a sender-based message logger to store the messages exchanged between application processes, and it checkpoints only the process data required to restart them in case of failure. The solution uses the concepts of ULFM for failure detection and recovery. Furthermore, a dynamic resource controller is added to the proposal; it monitors the message logger buffers and performs actions to maintain an acceptable level of protection. Experimental validation verifies the FTM functionality on two private cluster infrastructures.
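The interplay the abstract describes between sender-based message logging and a controller that watches the log buffers can be illustrated with a minimal sketch. This is not the FTM implementation: the class `SenderLog`, its methods, the 80% threshold, and the choice of "ask the receiver to checkpoint so old log entries can be dropped" as the controller's action are all assumptions made for illustration. The abstract does not specify which actions the controller performs.

```python
from collections import defaultdict


class SenderLog:
    """Hypothetical sender-side message log with a buffer-usage monitor."""

    def __init__(self, capacity_bytes, threshold=0.8):
        self.capacity_bytes = capacity_bytes
        self.threshold = threshold             # assumed fraction that triggers action
        self.used_bytes = 0
        self.by_receiver = defaultdict(list)   # receiver rank -> [(seq, payload)]
        self.next_seq = defaultdict(int)       # receiver rank -> next sequence number

    def log_send(self, receiver, payload):
        """Keep a copy of an outgoing message so it can be replayed later."""
        seq = self.next_seq[receiver]
        self.next_seq[receiver] += 1
        self.by_receiver[receiver].append((seq, payload))
        self.used_bytes += len(payload)
        return seq

    def over_threshold(self):
        """Controller check: is the log buffer close to full?"""
        return self.used_bytes >= self.threshold * self.capacity_bytes

    def receiver_checkpointed(self, receiver, last_seq_received):
        """Discard messages already covered by the receiver's checkpoint."""
        entries = self.by_receiver[receiver]
        freed = sum(len(p) for s, p in entries if s <= last_seq_received)
        self.by_receiver[receiver] = [(s, p) for s, p in entries
                                      if s > last_seq_received]
        self.used_bytes -= freed
        return freed

    def replay_for(self, receiver):
        """Messages to resend, in order, if the receiver restarts."""
        return [p for _, p in self.by_receiver[receiver]]
```

In this sketch, a process would call `log_send` on every send, poll `over_threshold`, and, when it returns true, request a checkpoint from the receiver; once the checkpoint is confirmed, `receiver_checkpointed` frees the log entries that no longer need to be replayed.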
Pages: 624-631
Page count: 8