
  1. (Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, Korea)
  2. (Department of Electronic Engineering, Hanyang University, Seoul, Korea)



Neural networks, systolic array, functional safety, fault-tolerant, fault mitigation

I. Introduction

The systolic array, a foundational architecture for general matrix multiplication (GEMM) acceleration, has been widely adopted in artificial intelligence (AI) applications [1]. It consists of interconnected processing elements (PEs) that perform multiply-accumulate (MAC) operations in parallel [2]. Due to its regular structure and high data reuse capability via localized inter-PE communication, the systolic array architecture is especially well-suited for linear algebra operations such as matrix multiplication and convolution. This has led to its integration into numerous commercial hardware accelerators [3], particularly in AI systems.

However, the continuous data exchange between neighboring PEs creates a vulnerability in fault scenarios. Specifically, when a fault occurs in a single PE, its erroneous data may propagate to adjacent PEs, even those that are otherwise fault-free, resulting in cascading errors and overall accuracy degradation.

In safety-critical applications such as autonomous driving systems and medical devices [4,5,6], GEMM accelerators must be designed with robust fault mitigation mechanisms that can ensure safe and reliable operation without excessive hardware overhead. As nanometer-scale CMOS technologies advance, systems become increasingly susceptible to both permanent and transient faults, making the incorporation of such mechanisms indispensable [7,8,9,10,11,12,13].

Although transient faults may cause sporadic computation errors, their impact on the overall accuracy tends to be limited---even at relatively high fault rates. In contrast, permanent faults continuously affect system behavior, requiring more deliberate and targeted mitigation strategies. Various approaches have been proposed to address this challenge. For example, Fault-Aware Pruning (FAP) and its retraining-based extension (FAP+T) attempt to minimize accuracy loss by masking faulty weights [14,15]. However, these techniques do not account for the relative importance of individual weights, often resulting in significant accuracy degradation and incurring high retraining costs, which limit their practicality in real-time applications.

Redundancy-based methods, which use redundant PEs to replace faulty PEs, have also been proposed [16,17]. However, these approaches suffer from poor area efficiency. To address this issue, more area-efficient redundancy architectures have been proposed, such as the one presented in [18]. Nevertheless, both types of solutions suffer from scalability issues: once all redundant PEs are exhausted, the system can no longer recover from subsequent faults. Moreover, such solutions are ill-suited for resource-constrained edge devices, where minimizing area and power consumption is critical.

To overcome these limitations, we argue that severity-adaptive fault mitigation techniques are essential. Such mechanisms activate only when a substantial degradation in accuracy is anticipated, allowing for intelligent trade-offs between reliability and resource usage.

In this paper, we propose two lightweight fault mitigation techniques specifically targeting permanent faults in systolic array-based GEMM accelerators. Based on microarchitectural fault characterization, we introduce:

· High-low bit swapping (HL-Swap) mechanism, which redirects data from faulty high-bit registers to error-free low-bit registers, thereby mitigating accuracy degradation caused by bit-level defects;

· Row-column deactivation (RC-Off) technique, which selectively disables specific rows or columns in the systolic array that are identified to cause significant performance degradation, based on fault severity analysis.

These two techniques are designed to maximize fault resilience while minimizing hardware overhead, enabling robust GEMM acceleration even under harsh reliability constraints.

II. Proposed Fault-Tolerant GEMM Accelerator

In order to overcome the limitations of the prior works which promptly address faults upon their occurrence, we propose two fault mitigation techniques which adaptively operate according to the severity of the accuracy drop based on microarchitectural fault analysis.

1. Microarchitectural Fault Analysis

To evaluate the impact of faults on accuracy degradation in output-stationary systolic array architectures, we define $D_{\rm fault}$ as the distance from the array boundary, where input data is initially supplied, to the processing element (PE) at which the fault occurs, as illustrated in Fig. 1(c).

This parameter allows us to quantify the spatial propagation of faults within the systolic array. Our microarchitectural analysis considers multiple fault-related parameters, including the fault type, bit index, faulty PE rate (FPR), and the aforementioned $D_{\rm fault}$.

The experimental setup for the fault analysis is as follows. We perform fault injection on a $16 \times 16$ systolic array using the MNIST dataset [19]. Permanent faults ranging from 1-bit to 16-bit are injected into the 8-bit registers that receive data from neighboring PEs. Accordingly, the bit index spans from 1 (least significant bit, LSB) to 8 (most significant bit, MSB). To simulate worst-case conditions at the microarchitectural level, we assume that all other system components are operating under the most disadvantageous settings, except for the comparison group under evaluation. This setup ensures that we can isolate and clearly observe the individual effects of each fault parameter.

We compute the accuracy degradation using the Mean Absolute Percentage Error (MAPE), and we record both the maximum and minimum observed degradations along with their corresponding parameter settings. As shown in Table 1, faults of the Stuck-at-1 type result in approximately 23% greater accuracy degradation compared to Stuck-at-0 faults. This disparity may be attributed to the asymmetry in value distributions within activation functions like ReLU, which tend to suppress negative outputs and amplify positive ones.
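The fault model and error metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' fault-injection framework: the helper names `inject_stuck_at` and `mape` are hypothetical, register contents are treated as unsigned 8-bit values, and MAPE is assumed to be taken against a fault-free reference.

```python
def inject_stuck_at(value, bit_index, stuck_value):
    """Force one bit of an 8-bit value to 0 or 1.

    bit_index follows the paper's convention: 1 = LSB, 8 = MSB.
    """
    mask = 1 << (bit_index - 1)
    if stuck_value == 1:
        return (value | mask) & 0xFF
    return value & ~mask & 0xFF


def mape(reference, observed):
    """Mean Absolute Percentage Error in percent; zero references are skipped."""
    terms = [abs(r - o) / abs(r) for r, o in zip(reference, observed) if r != 0]
    return 100.0 * sum(terms) / len(terms)


# Illustrative fault-free values (assumed, not taken from the paper).
golden = [10, 20, 40, 80]
msb_faulty = [inject_stuck_at(v, 8, 1) for v in golden]  # stuck-at-1 on the MSB
lsb_faulty = [inject_stuck_at(v, 1, 1) for v in golden]  # stuck-at-1 on the LSB
print(mape(golden, msb_faulty), mape(golden, lsb_faulty))
```

On these toy values the MSB fault yields a far larger MAPE than the LSB fault, which is the qualitative trend reported in Table 1.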

Moreover, faults at the MSB level cause significantly greater degradation than those at the LSB level---by up to 46.7%---highlighting the critical influence of bit significance in fixed-point arithmetic. Similarly, increasing the number of injected faults from 1 to 16 results in a 21.5% rise in error rates, indicating the cumulative effect of fault density. Spatially, faults closer to the input boundary ($D_{\rm fault} = 0$) cause approximately 29.1% more degradation than those occurring at the farthest boundary ($D_{\rm fault} = 15$), which supports the hypothesis that early-stage faults propagate more aggressively through the systolic flow.
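The spatial effect of $D_{\rm fault}$ can be illustrated with a simplified functional model. The sketch below assumes that one operand enters at column 0 and is forwarded left to right, so a stuck-at fault in the input register of one PE also corrupts every PE downstream of it in the same row. The function names and the test matrices are hypothetical, and the model ignores timing and the second operand path.

```python
N = 16  # array dimension used in the paper's analysis


def stuck_at_1(value, bit_index):
    """Force bit `bit_index` (1 = LSB, 8 = MSB) of an 8-bit value to 1."""
    return (value | (1 << (bit_index - 1))) & 0xFF


def gemm_with_faulty_pe(a, b, fault_row, d_fault, bit_index):
    """Functional model of an output-stationary array with one faulty PE.

    Assumption: operand A enters at column 0 and is forwarded left to right,
    so a fault in the A-input register of PE(fault_row, d_fault) corrupts the
    operand seen by that PE and by every PE to its right.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            corrupted = (i == fault_row and j >= d_fault)
            for k in range(n):
                a_ik = stuck_at_1(a[i][k], bit_index) if corrupted else a[i][k]
                c[i][j] += a_ik * b[k][j]
    return c


def count_corrupted_outputs(d_fault):
    """Count outputs that differ from the fault-free result for one MSB fault."""
    a = [[(i + k) % 8 for k in range(N)] for i in range(N)]
    b = [[1 + (k * j) % 5 for j in range(N)] for k in range(N)]
    # d_fault = N places the fault outside the array, giving the golden result.
    golden = gemm_with_faulty_pe(a, b, fault_row=0, d_fault=N, bit_index=8)
    faulty = gemm_with_faulty_pe(a, b, fault_row=0, d_fault=d_fault, bit_index=8)
    return sum(golden[i][j] != faulty[i][j] for i in range(N) for j in range(N))
```

Under this model, a fault at $D_{\rm fault} = 0$ corrupts all 16 outputs of its row, whereas a fault at $D_{\rm fault} = 15$ corrupts only one, consistent with the trend that faults near the input boundary are more harmful.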

Among the evaluated parameters, the bit index and $D_{\rm fault}$ emerged as the most dominant contributors to accuracy degradation. Accordingly, we target these two factors in our proposed mitigation techniques. The HL-Swap mechanism addresses bit-level sensitivity by redirecting high-bit values to low-bit locations, while the RC-Off technique mitigates spatial fault propagation by selectively disabling critical rows or columns based on fault severity analysis.

Fig. 1. Motivation of fault-tolerant GEMM accelerator: (a) accuracy degradation due to permanent faults and transient faults, (b) limitations of previous works, and (c) accuracy drop with two major fault factors.

../../Resources/ieie/JSTS.2025.25.3.318/image1.png

Table 1. Microarchitectural fault analysis: Impact of fault factors on accuracy drop with MNIST.

../../Resources/ieie/JSTS.2025.25.3.318/tb1.png

2. HL-Swap: High-Low Bit Swapping

According to the fault analysis described earlier, computational accuracy can degrade by up to 50.2% depending on the bit index where the fault occurs. In particular, faults at higher bit positions have a significantly greater impact on the output than faults at lower bit positions, as higher bits carry more weight in fixed-point arithmetic. To address this issue, we propose a fault mitigation technique called HL-Swap, which is designed to reduce the impact of bit-index-dependent faults. HL-Swap operates by exchanging the upper bits with the lower bits in a register when a fault is detected in the upper half. More specifically, as illustrated in Fig. 2, when one of the two 8-bit registers receiving data from neighboring PEs encounters a fault in its upper 4 bits, HL-Swap swaps them with the lower 4 bits within the same register. After the swap, a shifter realigns the data path to ensure that the accumulation process functions correctly despite the modified bit positions.
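A minimal behavioral sketch of the swap is given below. It is one interpretation of the mechanism, not the authors' RTL: the function names are hypothetical, and the shifter-based realignment is modeled simply as a second nibble swap when the register is read.

```python
def swap_nibbles(value):
    """Exchange the upper and lower 4 bits of an 8-bit value."""
    return ((value << 4) | (value >> 4)) & 0xFF


def faulty_register(value, stuck_mask, stuck_value):
    """Model a register whose bits in `stuck_mask` are stuck at 0 or 1."""
    if stuck_value == 1:
        return (value | stuck_mask) & 0xFF
    return value & ~stuck_mask & 0xFF


def read_back(value, stuck_mask, stuck_value, use_swap):
    """Write `value` to the faulty register and return what the MAC would see."""
    if not use_swap:
        return faulty_register(value, stuck_mask, stuck_value)
    stored = faulty_register(swap_nibbles(value), stuck_mask, stuck_value)
    return swap_nibbles(stored)  # realignment step, standing in for the shifter


value = 0x35      # illustrative operand
msb_stuck = 0x80  # bit 8 (MSB) stuck at 1
print(abs(read_back(value, msb_stuck, 1, False) - value))  # error without swap
print(abs(read_back(value, msb_stuck, 1, True) - value))   # error with swap
```

In this example the MSB fault changes the operand by 128 without the swap, but only by 8 with it, because the faulty physical bit now holds bit 4 of the data rather than bit 8.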

HL-Swap supports four operational modes depending on the location of the fault: No Swap, Reg0 Swap, Reg1 Swap, and Reg0,1 Swap. The No Swap mode is selected when no fault is present or when only the lower 4 bits are affected. Reg0 Swap is applied when the upper 4 bits of Reg0 are faulty, while Reg1 Swap is used if the upper 4 bits of Reg1 are affected. Reg0,1 Swap is activated when faults exist in the upper 4 bits of both Reg0 and Reg1.
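The mode selection described above amounts to a small decision function. The sketch below assumes that each register's faults are available as an 8-bit mask, with a set bit marking a faulty position; the function name and mask encoding are hypothetical. The text does not specify the case where both halves of a register are faulty, so this sketch simply treats any upper-half fault as a swap trigger.

```python
UPPER_NIBBLE = 0xF0  # bit positions 5 to 8 of an 8-bit register


def select_swap_mode(reg0_fault_mask, reg1_fault_mask):
    """Return the HL-Swap mode for one PE given per-register fault bit masks."""
    reg0_upper = bool(reg0_fault_mask & UPPER_NIBBLE)
    reg1_upper = bool(reg1_fault_mask & UPPER_NIBBLE)
    if reg0_upper and reg1_upper:
        return "Reg0,1 Swap"
    if reg0_upper:
        return "Reg0 Swap"
    if reg1_upper:
        return "Reg1 Swap"
    return "No Swap"  # no fault, or faults confined to the lower 4 bits
```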

These swap modes enable adaptive fault handling based on the location of the fault, and the mechanism provides effective mitigation without requiring redundant hardware. By using only simple internal bit manipulation and a lightweight shifter, HL-Swap offers a practical solution with minimal area overhead while preserving computational accuracy.

Fig. 2. Proposed HL-Swap fault mitigation method operation scheme.

../../Resources/ieie/JSTS.2025.25.3.318/image2.png

3. RC-Off: Location-Aware Row-Column Off with Scoring

As illustrated in Fig. 3, we propose a fault mitigation mechanism called RC-Off, which selectively deactivates specific rows or columns in the systolic array based on their estimated contribution to accuracy degradation. Unlike conventional redundancy-based methods that uniformly replicate resources regardless of fault severity, RC-Off dynamically adapts its response according to the microarchitectural location and fault characteristics. This allows the system to avoid unnecessary overhead while effectively suppressing high-impact faults. The method operates through three distinct phases: fault analysis, monitoring, and adaptive handling.

In the first phase, an offline fault analysis is conducted to assess the correlation between fault-induced degradations and their spatial and logical properties. Since accuracy degradation is highly sensitive to both the AI model architecture and the dataset in use, this phase begins by injecting faults into the systolic array under controlled conditions. We simulate various fault types, bit indices, positions, and quantities to build a comprehensive characterization. The results are aggregated into a score table that captures how much each fault configuration impacts output accuracy. This score table serves as a statistical foundation for runtime decision-making.

In the second phase, real-time monitoring is activated. Upon fault detection, each row and column in the systolic array is assigned a fault severity score. This score is computed based on the predefined score table and the attributes of the detected faults, including their type, bit position, spatial distance from the array boundary, and frequency. The scoring policy is defined as

(1)
$ Score_{r,c} = \sum_{n=0}^{N} \left[ Weight_{ft}(n) + Weight_{bi}(n) + Weight_{d}(n) + Weight_{num}(n) \right]. $

In this equation, $n$ indexes each detected fault, $ft$ represents the fault type (e.g., stuck-at-0, stuck-at-1), $bi$ is the bit index within the register, $d$ corresponds to the fault distance $D_{\rm fault}$, and $num$ is the total number of faults observed in the given row or column. Each term is weighted according to its relative influence on output degradation, based on prior analysis.
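The scoring policy of Eq. (1) can be sketched as a table lookup followed by a sum. All weight values below are hypothetical placeholders, since the actual score table is derived from the offline fault analysis for a given model and dataset. The names `Fault`, `WEIGHT_FT`, `WEIGHT_BI`, `WEIGHT_D`, and `weight_num` are likewise illustrative.

```python
from collections import namedtuple

Fault = namedtuple("Fault", "fault_type bit_index d_fault")

# Hypothetical score-table entries; real values would come from offline analysis.
WEIGHT_FT = {"stuck-at-0": 1.0, "stuck-at-1": 1.5}
WEIGHT_BI = {bit: 0.5 * bit for bit in range(1, 9)}  # MSB weighted highest
WEIGHT_D = {d: (16 - d) / 16 for d in range(16)}     # D_fault = 0 weighted highest


def weight_num(fault_count):
    """Hypothetical weight for the number of faults in the row or column."""
    return 0.25 * fault_count


def severity_score(faults):
    """Eq. (1): sum the four weights over every fault in one row or column."""
    num = len(faults)
    return sum(
        WEIGHT_FT[f.fault_type] + WEIGHT_BI[f.bit_index]
        + WEIGHT_D[f.d_fault] + weight_num(num)
        for f in faults
    )


row_faults = [Fault("stuck-at-1", 8, 0), Fault("stuck-at-0", 1, 15)]
print(severity_score(row_faults))
```

With these placeholder weights, the MSB stuck-at-1 fault next to the input boundary contributes most of the score, mirroring the dominance of the bit index and $D_{\rm fault}$ in the fault analysis.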

In the final phase, once a row or column exceeds a predefined score threshold---indicating that it significantly degrades system accuracy---it is selectively disabled. This is implemented through a binary enable signal: a value of 1 maintains normal operation, while 0 indicates that the row or column is excluded from subsequent computations. The input data fed into the systolic array is dynamically reshaped to bypass these disabled elements, ensuring consistent data flow and structural integrity. This design ensures that computational resources are efficiently reallocated and fault impact is minimized without requiring full redundancy or complex reconfiguration logic.
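The enable signal and the input reshaping can be sketched as follows. The text does not specify the reshaping scheme, so this sketch assumes the simplest option: logical operand rows are assigned to the remaining enabled physical rows over as many passes as needed. Both function names are hypothetical.

```python
def enable_signals(scores, threshold):
    """1 keeps a row (or column) active, 0 excludes it from computation."""
    return [0 if score > threshold else 1 for score in scores]


def map_to_active_rows(num_logical_rows, row_enable):
    """Assign logical operand rows to enabled physical rows, in passes.

    Returns a list of passes; each pass maps physical row -> logical row.
    """
    active = [r for r, en in enumerate(row_enable) if en == 1]
    if not active:
        raise ValueError("all rows are disabled")
    passes = []
    for start in range(0, num_logical_rows, len(active)):
        chunk = range(start, min(start + len(active), num_logical_rows))
        passes.append(dict(zip(active, chunk)))
    return passes


# Illustrative 4-row example: physical row 1 exceeds the threshold.
row_enable = enable_signals([0.0, 9.0, 1.0, 0.5], threshold=5.0)
print(row_enable)
print(map_to_active_rows(4, row_enable))
```

In this example, disabling one of four rows means that four logical rows need two passes, which makes the throughput cost of RC-Off explicit: each disabled row or column trades some performance for accuracy.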

Overall, RC-Off enables location-aware and model-sensitive fault mitigation, achieving a favorable balance between reliability and resource efficiency. The simplicity of its runtime enforcement makes it suitable for deployment in edge systems where area and power constraints are stringent.

Fig. 3. Overall scheme of the proposed RC-Off fault mitigation method based on the microarchitectural location of faults.

../../Resources/ieie/JSTS.2025.25.3.318/image3.png

4. Fault-Tolerant GEMM Accelerator

The accelerator consists of a systolic array, SRAM buffer, data reshaper, Score Table, and Fault Monitor. It employs four $16\times 16$ systolic arrays for computation.

To mitigate the accuracy drop incurred by the most significant fault factor, the bit index, as shown in Fig. 1(c), HL-Swap is applied to redirect data from faulty high-bit registers to fault-free low-bit registers. To support RC-Off, the proposed GEMM accelerator measures the accuracy degradation rate based on the microarchitectural location and type of each fault, and stores it in the score table, as depicted in Fig. 4. The fault monitor continuously computes severity scores for RC-Off, and rows or columns whose score exceeds a threshold value are bypassed. This approach mitigates the impact of the propagation chain in the systolic array, in which a permanent fault in one PE corrupts not only that PE but also the other PEs that receive its data.

The SRAM buffer preloads data to enhance data retrieval efficiency. The data reshaper adjusts systolic array data feeding according to RC-Off criteria to mitigate accuracy drop. The score table updates fault-related information injected via fault injection commands and accommodates changes in fault bit indices through HL-Swap. The fault monitor computes fault severity based on the score table.

Fig. 4. Overall architecture of the proposed fault-tolerant systolic array for the GEMM accelerator.

../../Resources/ieie/JSTS.2025.25.3.318/image4.png

III. Experimental Results

The proposed fault-tolerant GEMM accelerator was designed at the RTL and synthesized using a standard 28 nm Samsung FD-SOI technology. The target operating conditions included a supply voltage of 1.0 V and a nominal clock frequency of 250 MHz under typical process-voltage-temperature (PVT) corners. After logic synthesis, the resulting design comprised approximately 4.1 million gate equivalents (GE), which demonstrates the feasibility of the proposed design for lightweight edge devices with strict area and power constraints.

To evaluate fault resilience under realistic scenarios, we injected a total of 1,000 randomly generated permanent faults into four $16 \times 16$ systolic arrays. Each fault was assigned to random bit positions within the 8-bit registers that store intermediate PE outputs. The simulation was conducted under the assumption that faults had been pre-identified using a built-in self-test (BIST) mechanism, allowing for appropriate fault-aware activation of mitigation techniques during runtime.

The hardware overhead of the proposed techniques was quantitatively measured after synthesis. The RC-Off mechanism incurred an area overhead of 2.4%, while the HL-Swap introduced an 8.9% increase in gate count. These values are considered negligible when weighed against the significant improvement in system reliability and inference accuracy.

Fig. 5 presents the accuracy degradation trends as a function of the FPR. Three configurations were evaluated: a baseline conventional GEMM accelerator, GEMM with RC-Off applied, and GEMM with both RC-Off and HL-Swap applied. For reference, the degradation values are normalized to the accuracy drop of the baseline system operating under a 6% FPR.

Under this condition (6% FPR), the conventional GEMM accelerator exhibited an accuracy drop of 25%. When the RC-Off technique was applied, the degradation was reduced to 9.3%, corresponding to an improvement of approximately 62.8%. Moreover, the simultaneous application of both RC-Off and HL-Swap further reduced the error, achieving a total improvement of approximately 63% relative to the baseline. These results demonstrate that the proposed techniques operate synergistically and effectively mitigate the impact of permanent faults with minimal hardware cost.
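For clarity, the RC-Off improvement figure is the relative reduction in accuracy drop with respect to the baseline, computed from the two values reported above:

$ \text{Improvement}_{\text{RC-Off}} = \frac{25 - 9.3}{25} \times 100\% = 62.8\%. $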

Fig. 5. Accuracy drop comparison of the proposed fault mitigation schemes at various FPRs.

../../Resources/ieie/JSTS.2025.25.3.318/image5.png

IV. Conclusions

In this study, we proposed two fault mitigation techniques to reduce accuracy degradation in GEMM accelerators under permanent faults. The first, HL-Swap, mitigates bit-index-dependent errors by swapping the upper and lower bits in a register when high-bit faults are detected. The second, RC-Off, selectively disables rows or columns that significantly impact accuracy, based on a fault scoring mechanism derived from microarchitectural analysis.

Implemented in Samsung 28 nm FD-SOI technology, the proposed methods incur low hardware overhead (2.4% for RC-Off and 8.9% for HL-Swap) while offering substantial reliability improvements. Under a 6% faulty PE rate, the combined application of HL-Swap and RC-Off reduces accuracy degradation by up to 63%, demonstrating their effectiveness and efficiency.

ACKNOWLEDGMENTS

This work was supported by the R&D Program of the Ministry of Trade, Industry, and Energy (MOTIE) and Korea Evaluation Institute of Industrial Technology (KEIT). (RS-2023-00232192)

References

[1] S. Ryu and J.-J. Kim, ``High-performance sparsity-aware NPU with reconfigurable comparator-multiplier architecture,'' Journal of Semiconductor Technology and Science, vol. 24, no. 6, pp. 572-577, 2024.
[2] H.-T. Kung, ``Why systolic architectures?'' Computer, vol. 15, no. 1, pp. 37-46, 1982.
[3] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., ``In-datacenter performance analysis of a tensor processing unit,'' Proc. of 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1-12, 2017.
[4] F. Yu, Z. Qin, C. Liu, D. Wang, and X. Chen, ``REIN the RobuTS: Robust DNN-based image recognition in autonomous driving systems,'' IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 40, no. 6, pp. 1258-1271, 2020.
[5] H. A. Glory, C. Vigneswaran, S. S. Jagtap, R. Shruthi, G. Hariharan, and V. S. S. Sriram, ``AHW-BGOA-DNN: A novel deep learning model for epileptic seizure detection,'' Neural Computing and Applications, vol. 33, pp. 6065-6093, 2021.
[6] S. J. Yoon, T. Talluri, A. Angani, H. T. Chung, and K. J. Shin, ``Development of battery management system with PCM using neural network based aging algorithm for electric vehicle,'' IEIE Transactions on Smart Processing and Computing, vol. 14, no. 2, pp. 280-296, 2025.
[7] S. S. Sahoo, A. Kumar, and B. Veeravalli, ``Design and evaluation of reliability-oriented task re-mapping in MP-SoCs using time-series analysis of intermittent faults,'' Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 798-803, 2016.
[8] S. Borkar, ``Design perspectives on 22nm CMOS and beyond,'' Proc. of 46th Annual Design Automation Conference (DAC), pp. 93-94, 2009.
[9] C. Constantinescu, ``Trends and challenges in VLSI circuit reliability,'' IEEE Micro, vol. 23, no. 4, pp. 14-19, 2003.
[10] H. Nan and K. Choi, ``High performance, low cost, and robust soft error tolerant latch designs for nanoscale CMOS technology,'' IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, no. 7, pp. 1445-1457, 2012.
[11] J. Henkel, L. Bauer, N. Dutt, P. Gupta, S. Nassif, M. Shafique, M. Tahoori, and N. Wehn, ``Reliable on-chip systems in the nano-era: Lessons learnt and future trends,'' Proc. of 50th Annual Design Automation Conference (DAC), pp. 1-10, 2013.
[12] H. Lee, H.-J. Lee, and H. Kim, ``A read disturbance tolerant phase change memory system for CNN inference workloads,'' Journal of Semiconductor Technology and Science, vol. 22, no. 4, pp. 216-223, 2022.
[13] M. Pandey and A. Islam, ``Radiation tolerant by design 12-transistor static random access memory,'' Journal of Semiconductor Technology and Science, vol. 24, no. 5, pp. 410-423, 2024.
[14] J. J. Zhang, K. Basu, and S. Garg, ``Fault-tolerant systolic array based accelerators for deep neural network execution,'' IEEE Design & Test, vol. 36, no. 5, pp. 44-53, 2019.
[15] M. A. Hanif and M. Shafique, ``SalvageDNN: Salvaging deep neural network accelerators with permanent faults through saliency-driven fault-aware mapping,'' Philosophical Transactions of the Royal Society A, vol. 378, no. 2164, 20190164, 2020.
[16] K. Cho, I. Lee, H. Lim, and S. Kang, ``Efficient systolic-array redundancy architecture for offline/online repair,'' Electronics, vol. 9, no. 2, 338, 2020.
[17] L.-C. Chu and B. W. Wah, ``Fault tolerant neural networks with hybrid redundancy,'' Proc. of IJCNN International Joint Conference on Neural Networks, pp. 639-649, 1990.
[18] H. Lee, J. Park, and S. Kang, ``An area-efficient systolic array redundancy architecture for reliable AI accelerator,'' IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 10, pp. 1950-1954, 2024.
[19] L. Deng, ``The MNIST database of handwritten digit images for machine learning research,'' IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141-142, 2012.
[20] J. J. Zhang, T. Gu, K. Basu, and S. Garg, ``Analyzing and mitigating the impact of permanent faults on a systolic array based neural network accelerator,'' Proc. of IEEE 36th VLSI Test Symposium (VTS), pp. 1-6, 2018.

Sunyoung Park
../../Resources/ieie/JSTS.2025.25.3.318/author1.png

Sunyoung Park received her B.S. degree in electronic and electrical engineering from Ewha Womans University, Seoul, South Korea, in 2019, and an M.S. degree from the Digital System Architecture Laboratory, Ewha Womans University, where she is currently pursuing a Ph.D. degree with the Digital System Architecture Laboratory. Her current research interests include run-time test architecture, functional safety and digital system architecture design.

Hannah Yang
../../Resources/ieie/JSTS.2025.25.3.318/author2.png

Hannah Yang received her B.S. degree in electronic and electrical engineering from Ewha Womans University, Seoul, South Korea, in 2022, and an M.S. degree from the same university conducting research in the Digital System Architecture Laboratory with a focus on memory system architecture optimization for the VESA VDC-M Decoder IP, in 2024. She is currently pursuing a Ph.D. degree at Hanyang University. Her current research interests include RISC-V processors and domain-specific accelerators.

Hana Kim
../../Resources/ieie/JSTS.2025.25.3.318/author3.png

Hana Kim received her B.S. degree in electronic engineering in 2020 and an M.S. degree from the Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, South Korea, in 2022. She is currently pursuing a Ph.D. degree with the Digital System Architecture Laboratory at Hanyang University, Seoul. Her current research interests include artificial intelligence (AI) accelerators for various applications, data types, system-on-chip, and digital system architecture design.

Hyunji Kim
../../Resources/ieie/JSTS.2025.25.3.318/author4.png

Hyunji Kim received her B.S. and M.S. degrees in electronic and electrical engineering from Ewha Womans University, Seoul, South Korea, in 2019 and 2021, respectively. She is currently pursuing a Ph.D. degree at the same university with Digital System Architecture Lab. Her current research interests include domain-specific SoC architecture.

Ji-Hoon Kim
../../Resources/ieie/JSTS.2025.25.3.318/author5.png

Ji-Hoon Kim received his B.S. (summa cum laude) and Ph.D. degrees in Electrical Engineering and Computer Science from KAIST, Daejeon, South Korea, in 2004 and 2009, respectively. In 2009, he joined Samsung Electronics, Suwon, South Korea, as a Senior Engineer, and worked on next-generation architecture for 4G communication modem system-on-chip (SoC). From 2018 to 2025, he was a professor in the Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, South Korea. Since 2025, he has been with the Department of Electronic Engineering, Hanyang University, Seoul, South Korea. His current research interests include CPU microarchitecture, domain-specific SoC, and deep neural network accelerators. Dr. Kim served on the Technical Program Committee and Organizing Committee for various international conferences, including the IEEE International Conference on Computer Design (ICCD), the IEEE Asian Solid-State Circuits Conference (A-SSCC), and the IEEE International Solid-State Circuits Conference (ISSCC). He was a co-recipient of the Distinguished Design Award at the 2019 IEEE A-SSCC, and a recipient of the Best Design Award at 2007 Dongbu HiTek IP Design Contest, the First Place Award at 2008 International SoC Design Conference (ISOCC) Chip Design Contest, and the IEEE/IEIE Joint Award for Young Scientist and Engineer. He also serves as an Associate Editor for the IEEE Transactions on Circuits and Systems II: Express Briefs.