Today, clusters built from commodity PCs dominate high-performance computing, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to tens of thousands, and with proposed petaflop-scale systems likely to contain hundreds of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. This paper quantifies system reliability using data drawn from current systems and describes possible approaches for ensuring reliable, effective use of future large-scale systems. We present techniques for detecting imminent failures that allow applications to continue executing despite such failures, and we show how intelligent, adaptive software can lead to failure resilience and efficient system use.