Lightweight Measurement and Analysis of HPC Performance Variability

Cited by: 3
Authors
Dominguez-Trujillo, Jered [1]
Haskins, Keira [1]
Khouzani, Soheila Jafari [1]
Leap, Christopher [1]
Tashakkori, Sahba [1]
Wofford, Quincy [1]
Estrada, Trilce [1]
Bridges, Patrick G. [1]
Widener, Patrick M. [2]
Affiliations
[1] Univ New Mexico, Comp Sci Dept, Albuquerque, NM 87131 USA
[2] Sandia Natl Labs, Ctr Comp Res, POB 5800, Albuquerque, NM 87185 USA
Funding
National Science Foundation (USA); U.S. Department of Energy
Keywords
BOOTSTRAP;
DOI
10.1109/PMBS51919.2020.00011
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Performance variation arising from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impact at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to measure, analyze, and predict systematically: modeling assumptions that are stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach to modeling, measuring, and analyzing large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its implementation in MPI, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation, carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.
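The idea of tracking only the maximum length of each distributed workload interval, together with the BOOTSTRAP keyword, can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic Gaussian timings, the helper names `interval_maxima` and `bootstrap_ci`, and the choice of a percentile bootstrap on the mean are all assumptions made for illustration. In an MPI program the per-interval maximum would typically come from a max-reduction across ranks (for example `MPI_Reduce` with `MPI_MAX`) rather than from a list held in memory.

```python
import random
import statistics


def interval_maxima(per_process_times):
    """For each interval, keep only the slowest process's time."""
    return [max(times) for times in per_process_times]


def bootstrap_ci(samples, stat=statistics.mean, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(samples)."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    low = estimates[int((alpha / 2) * n_resamples)]
    high = estimates[min(n_resamples - 1,
                         int((1 - alpha / 2) * n_resamples))]
    return low, high


# Synthetic stand-in for measured data (assumption): 200 intervals,
# 64 processes, each process time drawn from a Gaussian around 1.0 s.
rng = random.Random(42)
num_intervals, num_procs = 200, 64
per_process_times = [
    [rng.gauss(1.0, 0.05) for _ in range(num_procs)]
    for _ in range(num_intervals)
]

maxima = interval_maxima(per_process_times)
mean_max = statistics.mean(maxima)
ci_low, ci_high = bootstrap_ci(maxima)
print(f"mean of interval maxima: {mean_max:.4f} s")
print(f"95% bootstrap CI: [{ci_low:.4f}, {ci_high:.4f}] s")
```

Only one value per interval is retained, so the amount of data to store and resample grows with the number of intervals rather than with the number of processes.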
Pages: 50-60
Page count: 11