Harnessing federated learning for anomaly detection in supercomputer nodes

Authors
Farooq, Emmen [1]
Milano, Michela [1]
Borghesi, Andrea [1]
Affiliations
[1] Univ Bologna, DISI, Bologna, Italy
Keywords
Federated learning; Anomaly detection; High-performance computing; Data center; Machine learning
DOI
10.1016/j.future.2024.07.052
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
High-performance computing (HPC) systems are a crucial component of modern society, with a significant impact in areas ranging from economics to scientific research, thanks to their unrivaled computational capabilities. For this reason, the number of HPC installations worldwide is rising steeply, with no sign of slowing down. However, these machines are complex, comprising millions of heterogeneous components; hard to manage effectively; and very costly, in terms of both economic investment and energy consumption. Therefore, maximizing their productivity is of paramount importance. For instance, anomalies and faults can cause significant downtime because they are difficult to detect promptly: many potential sources of issues can prevent computing nodes from functioning correctly. In recent years, several data-driven methods have been proposed to automatically detect anomalies in HPC systems, exploiting the fact that modern supercomputers are typically equipped with fine-grained monitoring infrastructures that collect data characterizing system behavior. Thus, Machine Learning (ML) models can be trained to distinguish normal from anomalous states automatically. In this paper, we contribute to this line of research with a novel idea, namely exploiting Federated Learning (FL) to improve the accuracy of anomaly detection models for HPC nodes. Although FL is not typically used in the HPC context, we show that it can boost several types of underlying ML models, from supervised to unsupervised ones. We demonstrate our approach on a production Tier-0 supercomputer hosted in Italy. Applying FL to anomaly detection improves the average F-score from 0.46 to 0.87. Our research also shows that FL can reduce the data collection time required to build a representative data set, facilitating faster deployment of anomaly detection models.
ML models need 5 months of training data to detect anomalies effectively, whereas FL reduces the required training data by a factor of 16, to 1.25 weeks.
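The abstract does not say which aggregation scheme the authors use. As a rough illustration of the general idea behind FL, the sketch below shows sample-size-weighted parameter averaging (FedAvg-style), in which each node trains locally and only model parameters are combined. The function name `federated_average`, the toy parameter vectors, and the sample counts are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[List[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Return the sample-size-weighted mean of client parameter vectors."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need at least one client and one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all clients must share the same parameter shape")

    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        share = size / total  # this client's fraction of all training samples
        for i, value in enumerate(weights):
            averaged[i] += value * share
    return averaged


# Hypothetical example: three compute nodes, each holding a locally trained
# two-parameter model and a different amount of monitoring data.
node_models = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
node_samples = [100, 100, 200]
global_model = federated_average(node_models, node_samples)
print(global_model)
```

In a scheme of this kind, raw monitoring data stays on each node, and nodes with more samples contribute proportionally more to the shared model.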
Pages: 673-685
Number of pages: 13