Modeling Application Resilience in Large-scale Parallel Execution

Cited by: 0
Authors
Wu, Kai [1]
Dong, Wenqian [1]
Guan, Qiang [2]
DeBardeleben, Nathan [3]
Li, Dong [1]
Affiliations
[1] Univ Calif Merced, Merced, CA 95343 USA
[2] Kent State Univ, Kent, OH 44242 USA
[3] Los Alamos Natl Lab, Washington, DC USA
Funding
National Science Foundation (USA)
DOI
10.1145/3225058.3225119
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Understanding how resilient an application is to hardware and software errors is critical in high-performance computing. Application-level fault injection is the most common method for evaluating application resilience. However, it can be very expensive when the application runs in parallel at large scale, because fault injection then demands substantial hardware resources. In this paper, we introduce a new methodology to evaluate the resilience of applications running at large scale. Instead of injecting errors into the application in large-scale execution, we inject errors in small-scale and serial execution, and use the results to model and predict the fault injection result for the application running at large scale. Our models are based on a series of empirical observations that characterize error occurrence and propagation across MPI processes in small-scale execution (including serial execution) and in large-scale execution. Our models achieve high prediction accuracy. Evaluating with four NAS parallel benchmarks and two proxy scientific applications, we demonstrate that when the fault injection result is used to predict for 64 MPI processes, the average prediction error is 8%; when the fault injection result is used to make the same prediction for eight MPI processes, the average prediction error decreases to 7%.
Pages: 10