Measuring the Resiliency of Extreme-Scale Computing Environments

Cited by: 10

Author affiliations:

[1] Bell Labs-Nokia, 600 Mountain Ave, New Providence, 07974, United States

[2] Unknown, 61801, United States

Source:

Springer Ser. Reliab. Eng., pp. 609-655

Keywords:

File organization; Graphics processing unit; Supercomputers

DOI:

10.1007/978-3-319-30599-8_24

Abstract:

This chapter presents a case study on how to characterize the resiliency of large-scale computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The characterization is performed by a joint analysis of several data sources, which include workload and error/failure logs as well as manual failure reports. We describe LogDiver, a tool that automates the data preprocessing and the computation of metrics that measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. Results include (i) a characterization of the root causes of single-node failures; (ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency; (iii) an analysis of system-wide outages; (iv) an analysis of application resiliency to system-related errors; and (v) insight into the relationship between application scale and resiliency across different error categories. © Springer International Publishing Switzerland 2016.
