Resilience-Aware Resource Management for Exascale Computing Systems

Cited by: 5
Authors
Dauwe, Daniel [1]
Pasricha, Sudeep [1, 2]
Maciejewski, Anthony A. [1]
Siegel, Howard Jay [1, 2]
Affiliations
[1] Colorado State Univ, Dept Elect & Comp Engn, Ft Collins, CO 80523 USA
[2] Colorado State Univ, Dept Comp Sci, Ft Collins, CO 80523 USA
Keywords
Exascale resilience; checkpoint restart; multilevel checkpointing; message logging; fault tolerance; HPC resource management
DOI
10.1109/TSUSC.2018.2797890
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
With the increases in complexity and number of nodes in large-scale high performance computing (HPC) systems over time, the probability of applications experiencing runtime failures has increased significantly. Projections indicate that exascale-sized systems are likely to operate with mean time between failures (MTBF) of as little as a few minutes. Several strategies have been proposed in recent years for enabling systems of these extreme sizes to be resilient against failures. This work provides a comparison of four state-of-the-art HPC resilience protocols that are being considered for use in exascale systems. We explore the behavior of each resilience protocol operating under the simulated execution of a diverse set of applications and study the performance degradation that a large-scale system experiences from the overhead associated with each resilience protocol as well as the re-computation needed to recover when a failure occurs. Using the results from these analyses, we examine how resource management on exascale systems can be improved by allowing the system to select the optimal resilience protocol depending upon each application's execution characteristics, as well as providing the system resource manager the ability to make scheduling decisions that are "resilience aware" through the use of more accurate execution time predictions.
Pages: 332-345
Page count: 14