Resilience-Aware Resource Management for Exascale Computing Systems

Cited by: 5

Authors:

Dauwe, Daniel [1]

Pasricha, Sudeep [1,2]

Maciejewski, Anthony A. [1]

Siegel, Howard Jay [1,2]

Affiliations:

[1] Colorado State Univ, Dept Elect & Comp Engn, Ft Collins, CO 80523 USA

[2] Colorado State Univ, Dept Comp Sci, Ft Collins, CO 80523 USA

Source:

IEEE TRANSACTIONS ON SUSTAINABLE COMPUTING | 2018 / Vol. 3 / No. 4

Keywords:

Exascale resilience; checkpoint restart; multilevel checkpointing; message logging; fault tolerance; HPC resource management;

DOI:

10.1109/TSUSC.2018.2797890

Chinese Library Classification:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

With the increases in complexity and number of nodes in large-scale high performance computing (HPC) systems over time, the probability of applications experiencing runtime failures has increased significantly. Projections indicate that exascale-sized systems are likely to operate with mean time between failures (MTBF) of as little as a few minutes. Several strategies have been proposed in recent years for enabling systems of these extreme sizes to be resilient against failures. This work provides a comparison of four state-of-the-art HPC resilience protocols that are being considered for use in exascale systems. We explore the behavior of each resilience protocol operating under the simulated execution of a diverse set of applications and study the performance degradation that a large-scale system experiences from the overhead associated with each resilience protocol as well as the re-computation needed to recover when a failure occurs. Using the results from these analyses, we examine how resource management on exascale systems can be improved by allowing the system to select the optimal resilience protocol depending upon each application's execution characteristics, as well as providing the system resource manager the ability to make scheduling decisions that are "resilience aware" through the use of more accurate execution time predictions.


Pages: 332-345

Page count: 14
