A machine-learning-guided framework for fault-tolerant DNNs

Cited by: 4
Authors
Traiola, Marcello [1]
Kritikakou, Angeliki [1]
Sentieys, Olivier [1]
Affiliations
[1] Univ Rennes, INRIA, CNRS, IRISA, Rennes, France
Keywords
Reliability Analysis; Fault Tolerance; Machine Learning; Neural Networks; ERROR
DOI
10.23919/DATE56975.2023.10137033
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Deep Neural Networks (DNNs) show promising performance in several application domains. Nevertheless, DNN results may be incorrect, not only because of the network's intrinsic inaccuracy, but also because of faults affecting the hardware. Ensuring the fault tolerance of DNNs is crucial, but common fault tolerance approaches are not cost-effective, due to their prohibitive overheads for large DNNs. This work proposes a comprehensive framework to assess the fault tolerance of DNN parameters and cost-effectively protect them. As a first step, the proposed framework performs statistical fault injection. In the second step, the results are used with classification-based machine learning methods to obtain a bit-accurate prediction of the criticality of all network parameters. Last, Error Correction Codes (ECCs) are selectively inserted to protect only the critical parameters, hence entailing low cost. Using the proposed framework, we explored and protected two Convolutional Neural Networks (CNNs), each with four different data encodings. The results show that it is possible to protect the critical network parameters with selective ECCs while saving up to 79% memory with respect to conventional ECC approaches.
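The two mechanisms named in the abstract, bit-level fault injection and selective ECC protection, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes float32 parameters, a single-bit-flip fault model, and a single-error-correcting Hamming code; the function names and the per-parameter criticality list are hypothetical, and the resulting numbers are illustrative only and do not reproduce the reported 79% figure.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as an IEEE 754 float32, with one bit flipped.

    Assumed fault model: a single bit flip in a stored parameter.
    Bit 0 is the least significant mantissa bit, bit 31 is the sign bit.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return faulty


def hamming_check_bits(data_bits: int) -> int:
    """Check bits of a single-error-correcting Hamming code.

    Returns the smallest r such that 2**r >= data_bits + r + 1.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def selective_ecc_saving(critical_bits, word_bits: int = 32) -> float:
    """Fraction of ECC check bits saved by protecting only critical bits.

    `critical_bits` is a hypothetical list with, for each parameter, the
    number of bits predicted to be critical (0 means unprotected).
    The baseline protects every full word of every parameter.
    """
    full = len(critical_bits) * hamming_check_bits(word_bits)
    selective = sum(hamming_check_bits(b) for b in critical_bits if b > 0)
    return 1.0 - selective / full


if __name__ == "__main__":
    # Flipping the sign bit or a high exponent bit changes the value drastically,
    # while flipping a low mantissa bit barely changes it.
    for bit in (31, 30, 0):
        print(bit, flip_bit(1.0, bit))
    # Four parameters, of which only one has 8 bits predicted to be critical.
    print(selective_ecc_saving([0, 0, 0, 8]))
```

In this sketch the saving counts ECC check bits only; how the framework in the paper accounts for memory is not specified in the abstract.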
Pages: 2
Related papers
50 in total
  • [31] Fault-Tolerant Deep Learning Using Regularization
    Joardar, Biresh Kumar
    Arka, Aqeeb Iqbal
    Doppa, Janardhan Rao
    Pande, Partha Pratim
    2022 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER AIDED DESIGN, ICCAD, 2022
  • [32] A Lightweight Authentication Framework for Fault-Tolerant Distributed WSN
    Sai, Kollu Siva
    Bhat, Radhakrishna
    Hegde, Manjunath
    Andrew, J.
    IEEE ACCESS, 2023, 11 : 83364 - 83376
  • [33] A Fault-Tolerant Distributed Framework for Asynchronous Iterative Computations
    Zhou, Tian
    Gao, Lixin
    Guan, Xiaohong
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, 32 (08) : 2062 - 2073
  • [34] α-Renyi based framework for a Robust and Fault-Tolerant Localization
    Makkawi, Khoder
    Harbaoui, Nesrine
    Tmazirte, Nourdine Ait
    El Najjar, Maan El Badaoui
    2021 IEEE INTERNATIONAL CONFERENCE ON MULTISENSOR FUSION AND INTEGRATION FOR INTELLIGENT SYSTEMS (MFI), 2021
  • [35] General Algorithm for Fault-tolerant Virtual Machine Assignments
    Wu, Jigang
    He, Zinan
    Zhang, Yaoguo
    Gao, Renfei
    Lam, Siew Kei
    2017 15TH IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING WITH APPLICATIONS AND 2017 16TH IEEE INTERNATIONAL CONFERENCE ON UBIQUITOUS COMPUTING AND COMMUNICATIONS (ISPA/IUCC 2017), 2017, : 990 - 995
  • [36] A Unified Iterative Learning Fault Detection and Fault-Tolerant Control
    Yan, Qiuzhen
    Yu, Youfang
    Cai, Jianping
    Zhou, Qingping
    PROCEEDINGS OF 2018 IEEE 7TH DATA DRIVEN CONTROL AND LEARNING SYSTEMS CONFERENCE (DDCLS), 2018, : 984 - 989
  • [37] Leveraging Machine Learning for Fault-Tolerant Air Pollutants Monitoring for a Smart City Design
    Khan, Muneeb A.
    Kim, Hyun-chul
    Park, Heemin
    ELECTRONICS, 2022, 11 (19)
  • [38] A hybrid framework for design and analysis of fault-tolerant architectures
    Bhaduri, Debayan
    Shukla, Sandeep
    Coker, Deji
    Taylor, Valerie
    Graham, Paul
    Gokhale, Maya
    2006 DESIGN AUTOMATION AND TEST IN EUROPE, VOLS 1-3, PROCEEDINGS, 2006, : 333 - +
  • [39] Transparent fault-tolerant Java virtual machine
    Friedman, R
    Kama, A
    22ND INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, PROCEEDINGS, 2003, : 319 - 328
  • [40] A framework for the design of fault-tolerant systems-of-systems
    Ferreira, Francisco Henrique Cerdeira
    Nakagawa, Elisa Yumi
    Bertolino, Antonia
    Lonetti, Francesca
    Neves, Vania de Oliveira
    dos Santos, Rodrigo Pereira
    JOURNAL OF SYSTEMS AND SOFTWARE, 2024, 211