Soft-Error Characterization and Mitigation Strategies for Edge Tensor Processing Units in Space

Cited by: 2
Authors
Garrett, Tyler [1]
Roffe, Seth [2]
George, Alan [1]
Affiliations
[1] Univ Pittsburgh, Pittsburgh, PA 15213 USA
[2] NASA, Goddard Space Flight Ctr, Greenbelt, MD 20771 USA
Keywords
Computational modeling; Tensors; Space vehicles; Neutrons; Performance evaluation; Image edge detection; Load modeling; Deep learning; fault-tolerant computing; machine learning (ML); onboard processing; space computing; spacecraft autonomy; tensor processing units (TPUs)
DOI
10.1109/TAES.2024.3393929
Chinese Library Classification
V [Aeronautics, Astronautics]
Discipline classification code
08; 0825
Abstract
The Google Coral Edge Tensor Processing Unit (Edge TPU) offers low-power, high-performance capabilities ideal for enabling deep learning in space. However, as a commercial product, it was designed without reliability considerations. Because the device is targeted by current and future space computing platforms, understanding its vulnerabilities and possible failure modes prior to flight is vital to mission success. In this research, we evaluate the soft-error vulnerabilities of the Edge TPU and propose fault-mitigation techniques to improve device reliability. Several Edge TPUs were irradiated using a wide-spectrum neutron beam at the Los Alamos Neutron Science Center to evaluate the reliability of two machine-learning applications with common use cases within the space domain: image classification and semantic segmentation. Through experimentation, a vulnerability within the onboard memory is identified. Because it caches model parameters to increase performance, the onboard memory is a critical device area: any upset within the cache risks compromising data integrity and model determinism. Across the variety of models tested, fault accumulation and persistence are consistently observed, degrading model accuracy and confidence. To alleviate the impact of radiation, we propose two fault-mitigation techniques: Naive Refreshing (NR) and Golden Batch Refreshing (GBR). NR periodically reloads model parameters to clear corrupted data. GBR is proposed as an alternative method that reduces reload frequency and improves performance. By leveraging knowledge of the cache vulnerabilities and applying one or more mitigation strategies, Edge TPUs can be properly considered for integration into existing and future flight hardware.
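The two refresh strategies named in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not describe how GBR decides when to reload, so the version below assumes that a "golden" input with a known-good output is run periodically and that a reload is triggered only on a mismatch. The class `SimulatedAccelerator` and the functions `naive_refresh` and `golden_batch_check` are hypothetical names, and no real Edge TPU API is used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimulatedAccelerator:
    """Toy stand-in for a device that caches model parameters on-chip.

    Hypothetical model for illustration; it does not use any Edge TPU API.
    """
    golden_weights: List[int]
    cache: List[int] = field(default_factory=list)
    reloads: int = 0

    def load_model(self) -> None:
        # Overwrite the on-chip cache with a clean copy of the parameters.
        self.cache = list(self.golden_weights)
        self.reloads += 1

    def inject_upset(self, index: int, bit: int) -> None:
        # Emulate a single-bit soft error in one cached parameter.
        self.cache[index] ^= 1 << bit

    def infer(self, inputs: List[int]) -> int:
        # A dot product stands in for a full inference pass.
        return sum(w * x for w, x in zip(self.cache, inputs))


def naive_refresh(dev: SimulatedAccelerator, step: int, period: int) -> bool:
    """Naive Refreshing: reload unconditionally every `period` steps."""
    if step % period == 0:
        dev.load_model()
        return True
    return False


def golden_batch_check(dev: SimulatedAccelerator,
                       golden_input: List[int],
                       golden_output: int) -> bool:
    """Assumed GBR behavior: reload only if a known input gives a wrong answer.

    Note: an upset is only visible here if the golden input exercises the
    corrupted parameter (a zero input element would hide it).
    """
    if dev.infer(golden_input) != golden_output:
        dev.load_model()
        return True
    return False


if __name__ == "__main__":
    dev = SimulatedAccelerator(golden_weights=[1, 2, 3])
    dev.load_model()
    golden_input = [1, 1, 1]
    golden_output = dev.infer(golden_input)  # reference taken on clean cache

    dev.inject_upset(index=0, bit=2)         # corrupt one cached weight
    print("corrupted output:", dev.infer(golden_input))
    print("reload triggered:", golden_batch_check(dev, golden_input, golden_output))
    print("restored output:", dev.infer(golden_input))
```

In this toy model the trade-off described in the abstract is visible in the reload counter: the unconditional strategy pays for a reload every `period` steps whether or not the cache is corrupted, while the golden-batch check pays for one extra inference per check and reloads only when corruption is detected.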
Pages: 5481-5498
Page count: 18
Related papers
39 in total
  • [21] Novel radiation-hardened SRAM for immune soft-error in space-radiation environments
    Zhao, Qiang
    Dong, Hanwen
    Wang, Xiuying
    Hao, Licai
    Peng, Chunyu
    Lin, Zhiting
    Wu, Xiulong
    MICROELECTRONICS RELIABILITY, 2023, 140
  • [22] Comparative Analysis of Redundancy Schemes for Soft-Error Detection in Low-Cost Space Applications
    Frenkel, Charlotte
    Legat, Jean-Didier
    Bol, David
    2016 IFIP/IEEE INTERNATIONAL CONFERENCE ON VERY LARGE SCALE INTEGRATION (VLSI-SOC), 2016
  • [23] Alpha particle mitigation strategies to reduce chip soft error upsets
    Cabral, C., Jr.
    Rodbell, K. P.
    Gordon, M. S.
    JOURNAL OF APPLIED PHYSICS, 2007, 101 (01)
  • [24] DEEP LEARNING-BASED ERROR MITIGATION FOR ASSISTIVE EXOSKELETON WITH COMPUTATIONAL-RESOURCE-LIMITED PLATFORM AND EDGE TENSOR PROCESSING UNIT
    Fabarisov, Tagir
    Morozov, Andrey
    Mamaev, Ilshat
    Janschek, Klaus
    PROCEEDINGS OF ASME 2021 INTERNATIONAL MECHANICAL ENGINEERING CONGRESS AND EXPOSITION (IMECE2021), VOL 13, 2021
  • [25] Soft-Error Resilient Read Decoupled SRAM With Multi-Node Upset Recovery for Space Applications
    Pal, Soumitra
    Sri, Dodla Divya
    Ki, Wing-Hung
    Islam, Aminul
    IEEE TRANSACTIONS ON ELECTRON DEVICES, 2021, 68 (05) : 2246 - 2254
  • [26] Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units
    Goncalves de Oliveira, Daniel Alfonso
    Pilla, Laercio Lima
    Santini, Thiago
    Rech, Paolo
    IEEE TRANSACTIONS ON COMPUTERS, 2016, 65 (03) : 791 - 804
  • [27] A characterization of soft-error sensitivity in data-parallel and model-parallel distributed deep learning
    Rojas, Elvis
    Perez, Diego
    Meneses, Esteban
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2024, 190
  • [28] Learning-Based Mitigation of Soft Error Effects on Quaternion Kalman Filter Processing
    Sartori, Tarso Kraemer Sarzi
    Fourati, Hassen
    Bastos, Rodrigo Possamai
    IEEE SENSORS JOURNAL, 2024, 24 (01) : 1079 - 1089
  • [29] Design and Heavy-Ion Testing of MTJ/CMOS Hybrid LSIs for Space-Grade Soft-Error Reliability
    Watanabe, K.
    Shimada, T.
    Hirose, K.
    Shindo, H.
    Kobayashi, D.
    Tanigawa, T.
    Ikeda, S.
    Shinada, T.
    Koike, H.
    Endoh, T.
    Makino, T.
    Ohshima, T.
    2022 IEEE INTERNATIONAL RELIABILITY PHYSICS SYMPOSIUM (IRPS), 2022