A machine-learning-guided framework for fault-tolerant DNNs

Cited by: 4
Authors
Traiola, Marcello [1]
Kritikakou, Angeliki [1]
Sentieys, Olivier [1]
Affiliations
[1] Univ Rennes, INRIA, CNRS, IRISA, Rennes, France
Keywords
Reliability Analysis; Fault Tolerance; Machine Learning; Neural Networks; ERROR;
DOI
10.23919/DATE56975.2023.10137033
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Deep Neural Networks (DNNs) show promising performance in several application domains. Nevertheless, DNN results may be incorrect, not only because of the network's intrinsic inaccuracy, but also because of faults affecting the hardware. Ensuring the fault tolerance of DNNs is crucial, but common fault tolerance approaches are not cost-effective because their overheads are prohibitive for large DNNs. This work proposes a comprehensive framework to assess the fault tolerance of DNN parameters and to protect them cost-effectively. As a first step, the proposed framework performs a statistical fault injection. In the second step, the injection results are used with classification-based machine learning methods to obtain a bit-accurate prediction of the criticality of all network parameters. Last, Error Correction Codes (ECCs) are selectively inserted to protect only the critical parameters, hence entailing low cost. Using the proposed framework, we explored and protected two Convolutional Neural Networks (CNNs), each with four different data encodings. The results show that it is possible to protect the critical network parameters with selective ECCs while saving up to 79% of memory compared with conventional ECC approaches.
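The two ends of the flow described in the abstract, injecting a fault into a parameter and counting the cost of protecting only critical bits, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not taken from the paper: parameters are treated as IEEE 754 float32 words, the fault model is a single bit flip, the ECC is a single-error-correcting Hamming code, and the function names `flip_bit`, `hamming_parity_bits`, and `selective_parity_bits` are hypothetical. The machine-learning classification step is not modeled; its output is represented only as a per-parameter count of critical bits.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of `value` seen as an IEEE 754 float32 (bit 0 is the LSB).

    Assumption: a single bit flip in a stored parameter is the fault model.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return faulty


def hamming_parity_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error-correcting Hamming)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def selective_parity_bits(critical_bits_per_param) -> int:
    """Parity bits needed if only the bits flagged critical are encoded.

    Assumption: each parameter's critical bits form one code word; a
    parameter with zero critical bits gets no parity at all.
    """
    return sum(hamming_parity_bits(k) for k in critical_bits_per_param)


def full_parity_bits(num_params: int, word_bits: int = 32) -> int:
    """Parity bits needed if every bit of every parameter is encoded."""
    return num_params * hamming_parity_bits(word_bits)


if __name__ == "__main__":
    # Flipping the sign bit or the top exponent bit of 1.0 changes it drastically,
    # while flipping the lowest mantissa bit barely changes it.
    print(flip_bit(1.0, 31), flip_bit(1.0, 30), flip_bit(1.0, 0))

    # Hypothetical criticality prediction: critical bits per parameter.
    predicted = [0, 1, 2, 32]
    print(selective_parity_bits(predicted), full_parity_bits(len(predicted)))
```

In this toy setting, the gap between the two parity counts is the kind of memory saving that selective protection aims for; the actual figures depend on the network, the data encoding, and the ECC scheme, none of which this sketch reproduces.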
Pages: 2