Soft-Error Characterization and Mitigation Strategies for Edge Tensor Processing Units in Space

Cited by: 2
Authors
Garrett, Tyler [1]
Roffe, Seth [2]
George, Alan [1]
Affiliations
[1] Univ Pittsburgh, Pittsburgh, PA 15213 USA
[2] NASA, Goddard Space Flight Ctr, Greenbelt, MD 20771 USA
Keywords
Computational modeling; Tensors; Space vehicles; Neutrons; Performance evaluation; Image edge detection; Load modeling; Deep learning; fault-tolerant computing; machine learning (ML); onboard processing; space computing; spacecraft autonomy; tensor processing units (TPUs)
DOI
10.1109/TAES.2024.3393929
Chinese Library Classification (CLC) number
V [Aviation, Aerospace]
Discipline classification code
08; 0825
Abstract
The Google Coral Edge Tensor Processing Unit (Edge TPU) offers low-power, high-performance capabilities ideal for enabling deep learning in space. However, as a commercial product, it was designed without reliability considerations. Because the device is targeted by current and future space computing platforms, understanding its vulnerabilities and possible failure modes prior to flight is vital to mission success. In this research, we evaluate the soft-error vulnerabilities of the Edge TPU and propose fault-mitigation techniques to improve device reliability. Several Edge TPUs were irradiated using a wide-spectrum neutron beam at the Los Alamos Neutron Science Center to evaluate the reliability of two machine-learning applications with common use cases within the space domain: image classification and semantic segmentation. Through experimentation, a vulnerability within the onboard memory is identified. Because it caches model parameters for increased performance, the onboard memory represents a critical device area. Any upsets within the cache risk compromising data integrity and model determinism. Across a variety of models tested, fault accumulation and persistence are consistently observed, resulting in the degradation of model accuracy and confidence. To alleviate the impact of radiation, we propose two fault-mitigation techniques: Naive Refreshing (NR) and Golden Batch Refreshing (GBR). NR periodically reloads model parameters to clear corrupted data. GBR is proposed as an alternative method to reduce reload frequency and improve performance. By leveraging knowledge of the cache vulnerabilities and applying one or more mitigation strategies, Edge TPUs can be properly considered for integration into existing and future flight hardware.
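The two refresh strategies named in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not say how GBR decides when to reload, so the version below assumes a "golden batch" is a reference input whose correct output is known in advance, and that parameters are reloaded only when the device's output for that input no longer matches. The class `SimulatedAccelerator`, its methods, and the integer dot-product "model" are all hypothetical and do not correspond to any Edge TPU API.

```python
class SimulatedAccelerator:
    """Hypothetical stand-in for a device that caches model parameters on-chip."""

    def __init__(self, golden_weights):
        self._golden = list(golden_weights)  # pristine copy kept off-device
        self.cache = list(golden_weights)    # on-device copy exposed to upsets
        self.reloads = 0

    def reload(self):
        """Overwrite the cached parameters with the pristine copy."""
        self.cache = list(self._golden)
        self.reloads += 1

    def upset(self, index, bit):
        """Flip one bit of one cached parameter (a simulated soft error)."""
        self.cache[index] ^= 1 << bit

    def infer(self, inputs):
        """Toy 'model': integer dot product of cached weights and inputs."""
        return sum(w * x for w, x in zip(self.cache, inputs))


def naive_refresh(device, batches, period):
    """NR: reload parameters every `period` inferences, corrupted or not."""
    outputs = []
    for i, batch in enumerate(batches):
        if i > 0 and i % period == 0:
            device.reload()
        outputs.append(device.infer(batch))
    return outputs


def golden_batch_refresh(device, batches, golden_input, golden_output, period):
    """GBR (assumed form): every `period` inferences, run a known input and
    reload only if the result differs from the known-good output."""
    outputs = []
    for i, batch in enumerate(batches):
        if i > 0 and i % period == 0:
            if device.infer(golden_input) != golden_output:
                device.reload()
        outputs.append(device.infer(batch))
    return outputs
```

In this toy setting, NR reloads on every period even when the cache is clean, whereas the assumed GBR check reloads only after a detected mismatch. A real golden input would need to exercise every cached parameter, since a check that never touches a corrupted weight cannot detect the upset.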
Pages: 5481-5498
Number of pages: 18