Evolution of Monitoring Over the Lifetime of a High Performance Computing Cluster

Cited by: 1

Authors:

DeConinck, A. [1]

Kelly, K. [1]

Affiliation:

[1] Los Alamos Natl Lab, POB 1663, Los Alamos, NM 87544 USA

Source:

2015 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING - CLUSTER 2015 | 2015

DOI:

10.1109/CLUSTER.2015.123

Chinese Library Classification:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

High Performance Computer (HPC) systems typically have lifetimes of four to six years. During this lifetime, a system will undergo substantial changes in the system software stack and hardware configuration. Simultaneously, the physical environment around it will change as old systems are retired and new systems are brought in. This report focuses on our experience with Mustang, a 1600-node Linux cluster at LANL. Over the three years we have operated Mustang, the machine and environment have changed substantially, which has resulted in reliability and stability issues on the cluster. In this report we present our experiences with standard monitoring and analysis tools available on Mustang since its installation, and how recent advances in our tools and usage have improved our ability to troubleshoot these issues and perform timely root cause analysis. These advances have both improved our management of existing installations and informed our hardware and tooling requirements for future systems.

Pages: 710 - 713

Page count: 4

50 items in total

[41] Serverless High-Performance Computing over Cloud
Petrosyan, Davit
Astsatryan, Hrachya
CYBERNETICS AND INFORMATION TECHNOLOGIES, 2022, 22 (03) : 82 - 92
[42] High Performance Computing Over Parallel Mobile Systems
Attia, Doha Ehab
ElKorany, Abeer Mohamed
Moussa, Ahmed Shawy
INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2016, 7 (09) : 99 - 103
[43] Evaluation of column performance over the lifetime of sepharose high performance resin.
Dripps, DJ
Taggart, T
Kessler, T
Cameron, M
Todd, R
Seely, J
ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2003, 225 : U230 - U230
[44] Achieving high availability and performance computing with an HA-OSCAR cluster
Leangsuksun, CB
Shen, LX
Liu, T
Scott, SL
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2005, 21 (04) : 597 - 606
[45] A High Performance/Price Ratio Cluster Computing Platform for Simulation Design
Zeng, Xiaohui
Guo, Ming
Liu, Junrui
Luo, Wenlang
Kang, Jichang
MICRO NANO DEVICES, STRUCTURE AND COMPUTING SYSTEMS, 2011, 159 : 176 - +
[46] HETEROGENEOUS GPU&CPU CLUSTER FOR HIGH PERFORMANCE COMPUTING IN CRYPTOGRAPHY
Marks, Michal
Jantura, Jaroslaw
Niewiadomska-Szynkiewicz, Ewa
Strzelczyk, Przemyslaw
Gozdz, Krzysztof
COMPUTER SCIENCE-AGH, 2012, 13 (02) : 63 - 79
[47] DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing
Fan, Yuping
Li, Boyang
Favorite, Dustin
Singh, Naunidh
Childers, Taylor
Rich, Paul
Allcock, William
Papka, Michael E.
Lan, Zhiling
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2022, 33 (12) : 4903 - 4917
[48] Data monitoring in high-performance clusters for computing applications
Torralba, G
González, V
Sanchis, E
Tao, J
Schulz, M
Karl, W
IEEE TRANSACTIONS ON NUCLEAR SCIENCE, 2002, 49 (02) : 525 - 531
[49] NEMO A Network Monitoring Framework for High-performance Computing
Calle, Elio Perez
DCNET 2010/OPTICS 2010: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON DATA COMMUNICATION NETWORKING AND INTERNATIONAL CONFERENCE ON OPTICAL COMMUNICATION SYSTEM, 2010 : 61 - 66
[50] High available cluster computing
2000, (21):
