RADIC Based Fault Tolerance System with Dynamic Resource Controller

Authors
Villamayor, Jorge [1]
Rexachs, Dolores [1]
Luque, Emilio [1]
Affiliations
[1] Univ Autonoma Barcelona, CAOS Comp Architecture & Operating Syst, Barcelona, Spain
Source
COMPUTATIONAL SCIENCE - ICCS 2018, PT III | 2018 / Vol. 10862
Keywords
High-Performance Computing; Fault Tolerance; Application layer FT; Sender-based message logging
DOI
10.1007/978-3-319-93713-7_58
Chinese Library Classification
TP301 [Theory, Methods]
Discipline Classification Code
081202
Abstract
Continuously growing High-Performance Computing requirements increase the number of components and, with it, the probability of failure. Long-running parallel applications are directly affected by this phenomenon: failures disrupt their execution. MPI, a well-known standard for parallel applications, follows fail-stop semantics, so application owners must restart the whole execution when a hard failure appears, losing time and computation data. Fault Tolerance (FT) techniques address this issue by providing high availability for the execution of users' applications, though at significant resource and time cost. In this paper, we present a Fault Tolerance Manager (FTM) framework based on the RADIC architecture, which provides FT protection to parallel applications implemented with MPI so that they complete successfully despite failures. The solution is implemented at the application layer and follows the uncoordinated and semi-coordinated rollback-recovery protocols. It uses a sender-based message logger to store the messages exchanged between application processes, and it checkpoints only the process data required to restart the processes after a failure. The solution uses the concepts of ULFM for failure detection and recovery. Furthermore, a dynamic resource controller is added to the proposal; it monitors the message logger buffers and takes action to maintain an acceptable level of protection. Experimental validation verifies the FTM functionality on two private cluster infrastructures.
Pages: 624-631
Page count: 8