Today, clusters built from commodity PCs dominate high-performance computing, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to tens of thousands and with proposed petaflop system likely to contain hundreds of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. This paper quantifies system reliability using data drawn from current systems and describes possible approaches for ensuring reliable, effective use of future, large-scale systems. We also present techniques for detecting imminent failures that allow applications to execute despite such failures. We also show how intelligent and adaptive software can lead to failure resilience and efficient system usage.
机构:
Univ New South Wales, Fac Med, George Inst Global Hlth, Sydney, NSW, Australia
Peking Univ, Hlth Sci Ctr, George Inst Global Hlth China, Beijing, Peoples R ChinaUniv New South Wales, Fac Med, George Inst Global Hlth, Sydney, NSW, Australia
Anderson, Craig S.
Chaturvedi, Seemant
论文数: 0引用数: 0
h-index: 0
机构:
Univ Miami, Miller Sch Med, Dept Neurol, Coral Gables, FL 33124 USA
Univ Miami, Miller Sch Med, Stroke Program, Coral Gables, FL 33124 USAUniv New South Wales, Fac Med, George Inst Global Hlth, Sydney, NSW, Australia