A low-overhead soft-hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems

Cited by: 13
Authors
Dang, Khanh N. [1]
Meyer, Michael [1]
Okuyama, Yuichi [1]
Ben Abdallah, Abderazek [1]
Affiliation
[1] Univ Aizu, Grad Sch Comp Sci & Engn, Adapt Syst Lab, Aizu Wakamatsu, Fukushima 9658580, Japan
Source
JOURNAL OF SUPERCOMPUTING | 2017, Vol. 73, Issue 06
Keywords
3D NoCs; Fault-tolerance; Soft-hard faults; Reliability; Architecture; Design; ROUTING ALGORITHM; NETWORKS;
DOI
10.1007/s11227-016-1951-0
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
The Network-on-Chip (NoC) paradigm has been proposed as a favorable solution to handle the strict communication requirements between the increasingly large number of cores on a single chip. However, NoC systems are exposed to the aggressive scaling down of transistors, low operating voltages, and high integration and power densities, making them vulnerable to permanent (hard) faults and transient (soft) errors. A hard fault in a NoC can lead to external blocking, causing congestion across the whole network. A soft error is more challenging because of its silent data corruption, which leads to a large area of erroneous data due to error propagation, packet re-transmission, and deadlock. In this paper, we present the architecture and design of a comprehensive soft error and hard fault-tolerant 3D-NoC system, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). With the aid of efficient mechanisms and algorithms, 3D-FETO is capable of detecting and recovering from soft errors which occur in the routing pipeline stages and leverages reconfigurable components to handle permanent faults in links, input buffers, and crossbars. In-depth evaluation results show that the 3D-FETO system is able to work around different kinds of hard faults and soft errors, ensuring graceful performance degradation, while minimizing additional hardware complexity and remaining power efficient.
Pages: 2705-2729
Page count: 25