Evolution of Monitoring Over the Lifetime of a High Performance Computing Cluster

Authors
DeConinck, A. [1]
Kelly, K. [1]
Affiliations
[1] Los Alamos Natl Lab, POB 1663, Los Alamos, NM 87544 USA
Source
2015 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING - CLUSTER 2015 | 2015
DOI
10.1109/CLUSTER.2015.123
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
High-performance computing (HPC) systems typically have lifetimes of four to six years. During its lifetime, a system undergoes substantial changes in its software stack and hardware configuration. At the same time, the physical environment around it changes as old systems are retired and new systems are brought in. This report focuses on our experience with Mustang, a 1600-node Linux cluster at LANL. Over the three years we have operated Mustang, the machine and its environment have changed substantially, resulting in reliability and stability issues on the cluster. We present our experiences with the standard monitoring and analysis tools available on Mustang since its installation, and describe how recent advances in our tools and their usage have improved our ability to troubleshoot these issues and perform timely root cause analysis. These advances have both improved our management of existing installations and informed our hardware and tooling requirements for future systems.
Pages: 710-713
Page count: 4