Modeling Application Resilience in Large-scale Parallel Execution

Times cited: 0
Authors
Wu, Kai [1]
Dong, Wenqian [1]
Guan, Qiang [2]
DeBardeleben, Nathan [3]
Li, Dong [1]
Affiliations
[1] Univ Calif Merced, Merced, CA 95343 USA
[2] Kent State Univ, Kent, OH 44242 USA
[3] Los Alamos Natl Lab, Washington, DC USA
Funding
National Science Foundation (USA)
DOI
10.1145/3225058.3225119
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Understanding how resilient an application is to hardware and software errors is critical in high-performance computing. Application-level fault injection is the most common method for evaluating application resilience. However, it can be very expensive when the application runs in parallel at large scale, because fault injection then requires substantial hardware resources. In this paper, we introduce a new methodology to evaluate the resilience of applications running at large scale. Instead of injecting errors into the large-scale execution, we inject errors into small-scale and serial executions of the application, then use those results to model and predict the fault injection result at large scale. Our models are based on a series of empirical observations that characterize error occurrence and propagation across MPI processes in small-scale execution (including serial execution) and in large-scale execution. Our models achieve high prediction accuracy. Evaluating with four NAS parallel benchmarks and two proxy scientific applications, we demonstrate that when the fault injection result is used to predict for 64 MPI processes, the average prediction error is 8%. When the fault injection result is used to make the same prediction for eight MPI processes, the average prediction error decreases to 7%.
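For readers unfamiliar with application-level fault injection, the following is a minimal serial sketch of the general technique, not the authors' tool or their prediction model. It flips one random bit in one input of a toy kernel and compares the result against a fault-free ("golden") run. The kernel, the function names, the tolerance, and the three outcome labels ("masked", "corrupted", "failure") are all assumptions made for illustration.

```python
import math
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0..63) of its IEEE 754 double encoding flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def kernel(data, fault=None):
    """Toy stand-in for an application: a sum of squares.

    `fault` is an optional (index, bit) pair; when given, the element at
    that index has the chosen bit flipped before it is used.
    """
    total = 0.0
    for i, x in enumerate(data):
        if fault is not None and fault[0] == i:
            x = flip_bit(x, fault[1])
        total += x * x
    return total


def classify(golden: float, observed: float, tol: float = 1e-9) -> str:
    """Label one run (hypothetical categories, chosen for this sketch)."""
    if not math.isfinite(observed):
        return "failure"      # result became inf or NaN
    if abs(observed - golden) <= tol * abs(golden):
        return "masked"       # the error had no visible effect
    return "corrupted"        # run finished, but the output is wrong


def campaign(data, trials: int, seed: int = 0) -> dict:
    """Run `trials` single-bit-flip injections and count the outcomes."""
    rng = random.Random(seed)
    golden = kernel(data)
    counts = {"masked": 0, "corrupted": 0, "failure": 0}
    for _ in range(trials):
        fault = (rng.randrange(len(data)), rng.randrange(64))
        counts[classify(golden, kernel(data, fault))] += 1
    return counts
```

Calling `campaign([1.0, 2.0, 3.0], trials=1000)` returns the outcome counts for one serial campaign. Each trial costs one full run of the application, which is why such campaigns become expensive once every run occupies many MPI processes.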
Pages: 10
Related papers
50 in total
  • [21] A visualized parallel network simulator for modeling large-scale distributed applications
    Lin, Siming
    Cheng, Xueqi
    Lv, Jianming
    EIGHTH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES, PROCEEDINGS, 2007, : 339 - 346
  • [22] Improving Execution Concurrency of Large-Scale Matrix Multiplication on Distributed Data-Parallel Platforms
    Gu, Rong
    Tang, Yun
    Tian, Chen
    Zhou, Hucheng
    Li, Guanru
    Zheng, Xudong
    Huang, Yihua
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2017, 28 (09) : 2539 - 2552
  • [23] Large-scale application of some modern CSM methodologies by parallel computation
    Danielson, KT
    Uras, RA
    Adley, MD
    Li, S
    ADVANCES IN ENGINEERING SOFTWARE, 2000, 31 (8-9) : 501 - 509
  • [24] Tools for Enabling Automatic Validation of Large-scale Parallel Application Simulations
    Zhang, Deli
    Hendry, Gilbert
    Dechev, Damian
    2014 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION (ICSME), 2014, : 601 - 604
  • [25] An Overlap Store Optimization for Large-Scale Parallel Earth Science Application
    Chen J.
    Du Y.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2019, 56 (04) : 790 - 797
  • [26] Design of large-scale parallel simulations
    Knepley, MG
    Sameh, AH
    Sarin, V
    PARALLEL COMPUTATIONAL FLUID DYNAMICS: TOWARDS TERAFLOPS, OPTIMIZATION, AND NOVEL FORMULATIONS, 2000, : 273 - 279
  • [27] A Large-scale Parallel Fuzzing System
    Li, Yang
    Feng, Chao
    Tang, Chaojing
    ICAIP 2018: 2018 THE 2ND INTERNATIONAL CONFERENCE ON ADVANCES IN IMAGE PROCESSING, 2018, : 194 - 197
  • [28] LARGE-SCALE PARALLEL PROCESSING SYSTEMS
    SIEGEL, HJ
    SCHWEDERSKI, T
    MEYER, DG
    HSU, WT
    MICROPROCESSORS AND MICROSYSTEMS, 1987, 11 (01) : 3 - 20
  • [29] Large-scale parallel data clustering
    Judd, D
    McKinley, PK
    Jain, AK
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1998, 20 (08) : 871 - 876
  • [30] Large-Scale Parallel Computing on Grids
    Bal, Henri
    Verstoep, Kees
    ELECTRONIC NOTES IN THEORETICAL COMPUTER SCIENCE, 2008, 220 (02) : 3 - 17