Big systems and big reliability challenges

Authors
Reed, DA [1]
Lu, C [1]
Mendes, CL [1]
Affiliation
[1] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Today, clusters built from commodity PCs dominate high-performance computing, and systems containing thousands of processors are now being deployed. As node counts for multi-teraflop systems grow to tens of thousands, and with proposed petaflop systems likely to contain hundreds of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. This paper quantifies system reliability using data drawn from current systems and describes possible approaches for ensuring reliable, effective use of future large-scale systems. We also present techniques for detecting imminent failures that allow applications to execute despite such failures, and we show how intelligent, adaptive software can lead to failure resilience and efficient system usage.
Pages: 729-736
Page count: 8