Measuring the Resiliency of Extreme-Scale Computing Environments

Cited by: 10
Authors
[1] Bell Labs-Nokia, 600 Mountain Ave, New Providence, NJ 07974, United States
[2] Unknown, IL 61801, United States
Affiliations
Source
Keywords
File organization; Graphics processing unit; Supercomputers
DOI
10.1007/978-3-319-30599-8_24
Chinese Library Classification (CLC) number
Subject classification code
Abstract
This chapter presents a case study on how to characterize the resiliency of large-scale computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The characterization is performed by a joint analysis of several data sources, which include workload and error/failure logs as well as manual failure reports. We describe LogDiver, a tool that automates data preprocessing and the computation of metrics that measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. Results include (i) a characterization of the root causes of single-node failures; (ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency; (iii) an analysis of system-wide outages; (iv) an analysis of application resiliency to system-related errors; and (v) insight into the relationship between application scale and resiliency across different error categories. © Springer International Publishing Switzerland 2016.
Related papers
50 in total
  • [21] Exploring the Design Tradeoffs for Extreme-Scale High-Performance Computing System Software
    Wang, Ke
    Kulkarni, Abhishek
    Lang, Michael
    Arnold, Dorian
    Raicu, Ioan
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2016, 27 (04) : 1070 - 1084
  • [22] Compiler Optimization for Extreme-Scale Scripting
    Armstrong, Timothy G.
    Wozniak, Justin M.
    Wilde, Michael
    Foster, Ian T.
    2014 14TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID), 2014, : 571 - 574
  • [23] EXTREME-SCALE COMPUTING-WHERE 'JUST MORE OF THE SAME' DOES NOT WORK INTRODUCTION
    Hoisie, Adolfy
    Getov, Vladimir
    COMPUTER, 2009, 42 (11) : 24 - 26
  • [24] On SDN-Based Extreme-Scale Networks
    Ghalwash, Haitham
    Huang, Chun-Hsi
    2016 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2016,
  • [25] PittPack: An open-source Poisson's equation solver for extreme-scale computing with accelerators
    Hasbestan, Jaber J.
    Xiao, Cheng-Nian
    Senocak, Inanc
    COMPUTER PHYSICS COMMUNICATIONS, 2020, 254
  • [26] On Scalable Resiliency in Exascale Computing Environments
    Znati, Taieb
    2012 IEEE 31ST INTERNATIONAL PERFORMANCE COMPUTING AND COMMUNICATIONS CONFERENCE (IPCCC), 2012,
  • [27] A Vision for Managing Extreme-Scale Data Hoards
    Logan, Jeremy
    Mehta, Kshitij
    Heber, Gerd
    Klasky, Scott
    Kurc, Tahsin
    Podhorszki, Norbert
    Widener, Patrick
    Wolf, Matthew
    2019 39TH IEEE INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2019), 2019, : 1806 - 1817
  • [28] Extreme-scale earthquake simulations on Sunway TaihuLight
    Haohuan Fu
    Bingwei Chen
    Wenqiang Zhang
    Zhenguo Zhang
    Wei Zhang
    Guangwen Yang
    Xiaofei Chen
    CCF Transactions on High Performance Computing, 2019, 1 : 14 - 24
  • [29] mOS: An Architecture for Extreme-Scale Operating Systems
    Wisniewski, Robert W.
    Inglett, Todd
    Keppel, Pardo
    Murty, Ravi
    Riesen, Rolf
    PROCEEDINGS OF THE 4TH INTERNATIONAL WORKSHOP ON RUNTIME AND OPERATING SYSTEMS FOR SUPERCOMPUTERS, ROSS 2014, 2014,
  • [30] Sublinear Algorithms for Extreme-Scale Data Analysis
    Seshadhri, C.
    Pinar, Ali
    Thompson, David
    Bennett, Janine C.
    TOPOLOGICAL AND STATISTICAL METHODS FOR COMPLEX DATA: TACKLING LARGE-SCALE, HIGH-DIMENSIONAL, AND MULTIVARIATE DATA SPACES, 2015, : 39 - 54