Modeling Application Resilience in Large-scale Parallel Execution

Cited by: 0
Authors
Wu, Kai [1]
Dong, Wenqian [1]
Guan, Qiang [2]
DeBardeleben, Nathan [3]
Li, Dong [1]
Affiliations
[1] Univ Calif Merced, Merced, CA 95343 USA
[2] Kent State Univ, Kent, OH 44242 USA
[3] Los Alamos Natl Lab, Washington, DC USA
Funding
National Science Foundation (USA)
Keywords
DOI
10.1145/3225058.3225119
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Understanding how resilient an application is to hardware and software errors is critical to high-performance computing. Application-level fault injection is the most common method for evaluating application resilience. However, it can be very expensive when the application runs in parallel at large scale, because fault injection then demands substantial hardware resources. In this paper, we introduce a new methodology for evaluating the resilience of applications running at large scale. Instead of injecting errors into the application during large-scale execution, we inject errors during small-scale and serial execution, and use the results to model and predict the fault injection result for the application running at large scale. Our models are based on a series of empirical observations that characterize error occurrence and propagation across MPI processes in small-scale execution (including serial execution) and in large-scale execution. Our models achieve high prediction accuracy. Evaluating with four NAS parallel benchmarks and two proxy scientific applications, we demonstrate that when the fault injection result is used to predict for 64 MPI processes, the average prediction error is 8%. When the fault injection result is used to make the same prediction for eight MPI processes, the average prediction error decreases to 7%.
Pages: 10