Lightweight Measurement and Analysis of HPC Performance Variability

Cited by: 3
Authors
Dominguez-Trujillo, Jered [1]
Haskins, Keira [1]
Khouzani, Soheila Jafari [1]
Leap, Christopher [1]
Tashakkori, Sahba [1]
Wofford, Quincy [1]
Estrada, Trilce [1]
Bridges, Patrick G. [1]
Widener, Patrick M. [2]
Affiliations
[1] Univ New Mexico, Comp Sci Dept, Albuquerque, NM 87131 USA
[2] Sandia Natl Labs, Ctr Comp Res, POB 5800, Albuquerque, NM 87185 USA
Funding
U.S. National Science Foundation; U.S. Department of Energy
Keywords
BOOTSTRAP
DOI
10.1109/PMBS51919.2020.00011
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Performance variation deriving from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impact at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict; modeling assumptions that are stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its implementation in MPI, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.
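The idea of summarizing each workload interval by its slowest process, combined with the BOOTSTRAP keyword listed for this record, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`interval_maxima`, `bootstrap_ci`, `synthetic_intervals`), the synthetic timing model, and the choice of a percentile bootstrap on the mean are all assumptions made for illustration. In a real MPI program the per-interval maximum would more likely be obtained with a reduction such as `MPI_Reduce` using `MPI_MAX`; here plain Python lists stand in for the per-process timings.

```python
import random
import statistics


def interval_maxima(times_by_interval):
    """Reduce each interval's per-process durations to the slowest one."""
    return [max(times) for times in times_by_interval]


def bootstrap_ci(samples, stat=statistics.mean, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def synthetic_intervals(n_intervals=200, n_procs=64, seed=1):
    """Hypothetical timings: 1.0 s of work plus exponential noise per process."""
    rng = random.Random(seed)
    return [
        [1.0 + rng.expovariate(20.0) for _ in range(n_procs)]
        for _ in range(n_intervals)
    ]


if __name__ == "__main__":
    maxima = interval_maxima(synthetic_intervals())
    lo, hi = bootstrap_ci(maxima)
    print(f"mean interval maximum: {statistics.mean(maxima):.4f} s")
    print(f"95% bootstrap CI: [{lo:.4f}, {hi:.4f}] s")
```

Only one value per interval is retained in this sketch, so the data volume grows with the number of intervals rather than with the number of processes, which is the kind of saving the abstract alludes to.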
Pages: 50-60
Page count: 11