Resilience-Aware Resource Management for Exascale Computing Systems

Cited by: 5
Authors
Dauwe, Daniel [1]
Pasricha, Sudeep [1, 2]
Maciejewski, Anthony A. [1]
Siegel, Howard Jay [1, 2]
Affiliations
[1] Colorado State Univ, Dept Elect & Comp Engn, Ft Collins, CO 80523 USA
[2] Colorado State Univ, Dept Comp Sci, Ft Collins, CO 80523 USA
Keywords
Exascale resilience; checkpoint restart; multilevel checkpointing; message logging; fault tolerance; HPC resource management
DOI
10.1109/TSUSC.2018.2797890
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
With the increases in complexity and number of nodes in large-scale high performance computing (HPC) systems over time, the probability of applications experiencing runtime failures has increased significantly. Projections indicate that exascale-sized systems are likely to operate with mean time between failures (MTBF) of as little as a few minutes. Several strategies have been proposed in recent years for enabling systems of these extreme sizes to be resilient against failures. This work provides a comparison of four state-of-the-art HPC resilience protocols that are being considered for use in exascale systems. We explore the behavior of each resilience protocol operating under the simulated execution of a diverse set of applications and study the performance degradation that a large-scale system experiences from the overhead associated with each resilience protocol as well as the re-computation needed to recover when a failure occurs. Using the results from these analyses, we examine how resource management on exascale systems can be improved by allowing the system to select the optimal resilience protocol depending upon each application's execution characteristics, as well as providing the system resource manager the ability to make scheduling decisions that are "resilience aware" through the use of more accurate execution time predictions.
Pages: 332-345
Number of pages: 14