Harnessing federated learning for anomaly detection in supercomputer nodes

Authors
Farooq, Emmen [1]
Milano, Michela [1]
Borghesi, Andrea [1]
Affiliations
[1] Univ Bologna, DISI, Bologna, Italy
Keywords
Federated learning; Anomaly detection; High-performance computing; Data center; Machine learning
DOI
10.1016/j.future.2024.07.052
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
High-performance computing (HPC) systems are a crucial component of modern society, with a significant impact in areas ranging from economics to scientific research, thanks to their unrivaled computational capabilities. For this reason, the number of HPC installations worldwide is rising steeply, with no sign of slowing down. However, these machines are complex (comprising millions of heterogeneous components), hard to manage effectively, and very costly, in terms of both economic investment and energy consumption. Therefore, maximizing their productivity is of paramount importance. For instance, anomalies and faults can generate significant downtime because they are difficult to detect promptly, as many potential sources of issues can prevent computing nodes from functioning correctly. In recent years, several data-driven methods have been proposed to automatically detect anomalies in HPC systems, exploiting the fact that modern supercomputers are typically endowed with fine-grained monitoring infrastructures that collect data characterizing the system behavior. Thus, it is possible to train Machine Learning (ML) models to distinguish normal and anomalous states automatically. In this paper, we contribute to this line of research with a novel idea, namely exploiting Federated Learning (FL) to improve the accuracy of anomaly detection models for HPC nodes. Although FL is not typically used in the HPC context, we show that FL can boost several types of underlying ML models, from supervised to unsupervised ones. We demonstrate our approach on a production Tier-0 supercomputer hosted in Italy. Applying FL to anomaly detection improves the average F-score from 0.46 to 0.87. Our research also shows that FL can reduce the data collection time required to build a representative data set, facilitating faster deployment of anomaly detection models. ML models need 5 months of training data to achieve efficient anomaly detection performance, whereas using FL reduces the training set by 15 times, to 1.25 weeks.
Pages: 673-685
Page count: 13