RADIC Based Fault Tolerance System with Dynamic Resource Controller

Cited by: 0
Authors
Villamayor, Jorge [1]
Rexachs, Dolores [1]
Luque, Emilio [1]
Affiliations
[1] Univ Autonoma Barcelona, CAOS Comp Architecture & Operating Syst, Barcelona, Spain
Source
COMPUTATIONAL SCIENCE - ICCS 2018, PT III | 2018 / Vol. 10862
Keywords
High-Performance Computing; Fault Tolerance; Application layer FT; Sender-based message logging;
DOI
10.1007/978-3-319-93713-7_58
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
The continuously growing requirements of High-Performance Computing increase the number of components and, at the same time, the probability of failures. Long-running parallel applications are directly affected by this phenomenon, as failures disrupt their execution. MPI, a well-known standard for parallel applications, follows fail-stop semantics, requiring application owners to restart the whole execution when hard failures appear, losing time and computation data. Fault Tolerance (FT) techniques address this issue by providing high availability to the execution of users' applications, though they add significant resource and time costs. In this paper, we present a Fault Tolerance Manager (FTM) framework based on the RADIC architecture, which provides FT protection to parallel applications implemented with MPI so that they complete their executions successfully despite failures. The solution is implemented in the application layer, following the uncoordinated and semi-coordinated rollback recovery protocols. It uses a sender-based message logger to store the messages exchanged between the application processes, and checkpoints only the process data required to restart them in case of failure. The solution uses the concepts of ULFM for failure detection and recovery. Furthermore, a dynamic resource controller is added to the proposal, which monitors the message logger buffers and performs actions to maintain an acceptable level of protection. Experimental validation verifies the FTM functionality using two private cluster infrastructures.
Pages: 624 - 631
Page count: 8
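
The abstract's combination of a sender-based message logger and a dynamic resource controller can be illustrated with a minimal sketch. This is not the FTM implementation: the class `SenderLog`, the function `control_step`, the `high_watermark` threshold, and the `request_checkpoint` callback are all hypothetical names introduced here. The abstract does not say which actions the controller performs, so the sketch assumes one plausible action: when the log buffer fills past a threshold, ask the receiver with the most logged data to checkpoint, then discard the log entries that checkpoint makes unnecessary.

```python
from collections import defaultdict


class SenderLog:
    """Hypothetical sender-side log: a process keeps copies of messages it sent."""

    def __init__(self, capacity_bytes, high_watermark=0.8):
        self.capacity_bytes = capacity_bytes
        self.high_watermark = high_watermark  # assumed threshold, fraction of capacity
        self.used_bytes = 0
        self.entries = defaultdict(list)      # receiver rank -> [(seq, payload)]
        self.next_seq = defaultdict(int)      # receiver rank -> next sequence number

    def record(self, receiver, payload):
        """Log a copy of an outgoing message and return its sequence number."""
        seq = self.next_seq[receiver]
        self.next_seq[receiver] += 1
        self.entries[receiver].append((seq, payload))
        self.used_bytes += len(payload)
        return seq

    def replay(self, receiver, from_seq):
        """Messages to resend to a restarted receiver, starting at from_seq."""
        return [p for s, p in self.entries[receiver] if s >= from_seq]

    def truncate(self, receiver, upto_seq):
        """Drop entries with seq < upto_seq; return the number of bytes freed."""
        old = self.entries[receiver]
        freed = sum(len(p) for s, p in old if s < upto_seq)
        self.entries[receiver] = [(s, p) for s, p in old if s >= upto_seq]
        self.used_bytes -= freed
        return freed

    def needs_action(self):
        """True when buffer usage reaches the assumed high watermark."""
        return self.used_bytes >= self.high_watermark * self.capacity_bytes


def control_step(log, request_checkpoint):
    """One pass of a hypothetical resource controller.

    request_checkpoint(receiver) is an assumed callback that makes the receiver
    checkpoint and returns the first sequence number it still needs logged.
    """
    if not log.needs_action() or not log.entries:
        return None
    # Pick the receiver whose logged messages occupy the most buffer space.
    receiver = max(log.entries,
                   key=lambda r: sum(len(p) for _, p in log.entries[r]))
    log.truncate(receiver, request_checkpoint(receiver))
    return receiver
```

In this sketch a failed receiver would be recovered by restoring its last checkpoint and calling `replay` on each sender's log, which is why entries older than a completed checkpoint can be safely truncated.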