Soft-Error Characterization and Mitigation Strategies for Edge Tensor Processing Units in Space

Cited by: 2

Authors:

Garrett, Tyler [1]

Roffe, Seth [2]

George, Alan [1]

Affiliations:

[1] Univ Pittsburgh, Pittsburgh, PA 15213 USA

[2] NASA, Goddard Space Flight Ctr, Greenbelt, MD 20771 USA

Source:

IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS | 2024 / Vol. 60 / No. 4

Keywords:

Computational modeling; Tensors; Space vehicles; Neutrons; Performance evaluation; Image edge detection; Load modeling; Deep learning; fault-tolerant computing; machine learning (ML); onboard processing; space computing; spacecraft autonomy; tensor processing units (TPUs);

DOI:

10.1109/TAES.2024.3393929

Chinese Library Classification:

V [Aeronautics, Astronautics];

Discipline classification code:

08 ; 0825 ;

Abstract:

The Google Coral Edge Tensor Processing Unit (Edge TPU) offers low-power, high-performance capabilities ideal for enabling deep learning in space. However, as a commercial product, no reliability considerations are made in its design. As a device targeted by current and future space computing platforms, it is vital to mission success to understand the vulnerabilities and possible failure modes prior to flight. In this research, we evaluate the soft-error vulnerabilities of the Edge TPU and propose fault-mitigation techniques to improve device reliability. Several Edge TPUs were irradiated using a wide spectrum neutron beam at the Los Alamos Neutron Science Center to evaluate the reliability of two machine-learning applications with common use cases within the space domain: image classification and semantic segmentation. Through experimentation, a vulnerability within the onboard memory is identified. Responsible for caching model parameters for increased performance, the onboard memory represents a critical device area. Any upsets within the cache risk compromising data integrity and model determinism. Across a variety of models tested, fault accumulation and persistence are consistently observed, resulting in the degradation of model accuracy and confidence. To alleviate the impact of radiation, we propose two fault-mitigation techniques: Naive Refreshing (NR) and Golden Batch Refreshing (GBR). NR periodically reloads model parameters to clear corrupted data. GBR is proposed as an alternative method to reduce reload frequency and improve performance. By leveraging knowledge of the cache vulnerabilities and applying one or more mitigation strategies, Edge TPUs can be properly considered for integration into existing and future flight hardware.
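The abstract names the two mitigation techniques but gives no implementation details, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes a toy `SimulatedCache` class standing in for the on-chip parameter cache, a fixed `period` (in inferences) between refresh points, and, for GBR, a small set of reference inputs with known-good outputs that is checked before deciding whether to reload. All of these names and the check-then-reload logic are hypothetical.

```python
class SimulatedCache:
    """Toy stand-in for an accelerator's parameter cache (hypothetical, not the Edge TPU API)."""

    def __init__(self, golden_params):
        self._golden = list(golden_params)  # pristine copy kept in host memory
        self.params = list(golden_params)   # cached copy that upsets can corrupt
        self.reloads = 0

    def reload(self):
        """Overwrite the cached parameters with the pristine copy."""
        self.params = list(self._golden)
        self.reloads += 1

    def inject_upset(self, index, bit):
        """Flip one bit of one cached integer parameter to mimic a soft error."""
        self.params[index] ^= 1 << bit

    def infer(self, x):
        """Stand-in 'model': dot product of cached parameters and the input."""
        return sum(p * v for p, v in zip(self.params, x))


def naive_refresh(cache, inputs, period):
    """Assumed NR behaviour: reload unconditionally every `period` inferences."""
    outputs = []
    for i, x in enumerate(inputs):
        if i > 0 and i % period == 0:
            cache.reload()
        outputs.append(cache.infer(x))
    return outputs


def golden_batch_refresh(cache, inputs, period, golden_inputs, golden_outputs):
    """Assumed GBR behaviour: every `period` inferences, run a known batch and
    reload only if its outputs differ from the stored reference outputs."""
    outputs = []
    for i, x in enumerate(inputs):
        if i > 0 and i % period == 0:
            observed = [cache.infer(g) for g in golden_inputs]
            if observed != golden_outputs:
                cache.reload()
        outputs.append(cache.infer(x))
    return outputs
```

In this toy model a fault persists in the cache until the next reload, which mirrors the fault accumulation and persistence the abstract reports. The conditional variant skips reloads when the reference batch still matches, at the cost of extra inferences and of missing any fault the reference batch does not exercise.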

Pages: 5481-5498

Page count: 18
