To overcome the limitations of prior works, which
address every fault as soon as it occurs, we
propose two fault mitigation techniques that operate
adaptively according to the severity of the accuracy drop,
guided by a microarchitectural fault analysis.
1. Microarchitectural Fault Analysis
To evaluate the impact of faults on accuracy degradation
in output-stationary systolic array architectures, we define
$D_{fault}$ as the distance from the array boundary, where input
data is initially supplied, to the processing element
(PE) at which the fault occurs, as illustrated in Fig. 1(c).
This parameter allows us to quantify the spatial propagation
of faults within the systolic array. Our microarchitectural
analysis considers multiple fault-related parameters,
including the fault type, bit index, faulty PE rate
(FPR), and the aforementioned $D_{fault}$.
The experimental setup for the fault analysis is as follows.
We perform fault injection on a 16x16 systolic array
using the MNIST dataset [19]. Between 1 and 16 permanent
faults are injected into the 8-bit registers
that receive data from neighboring PEs. Accordingly,
the bit index spans from 1 (least significant bit, LSB) to 8
(most significant bit, MSB). To simulate worst-case conditions
at the microarchitectural level, we hold all other
fault parameters at their most disadvantageous settings
and vary only the parameter under evaluation. This setup
ensures that we can isolate and clearly observe the
individual effect of each fault parameter.
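The injection step described above can be illustrated with a minimal sketch. It assumes a permanent fault is modeled as a single register bit forced to a constant level on every read; the function name `inject_stuck_at` and its parameters are hypothetical and are not taken from the authors' framework.

```python
def inject_stuck_at(value: int, bit_index: int, stuck_at: int, width: int = 8) -> int:
    """Return `value` with one bit forced to `stuck_at` (0 or 1).

    `bit_index` follows the convention in the text: 1 is the LSB and
    `width` is the MSB. The fault model itself is an assumption.
    """
    if not 1 <= bit_index <= width:
        raise ValueError("bit_index must lie in 1..width")
    if stuck_at not in (0, 1):
        raise ValueError("stuck_at must be 0 or 1")
    mask = 1 << (bit_index - 1)      # select the faulty bit
    value &= (1 << width) - 1        # keep only the register width
    if stuck_at:
        return value | mask          # Stuck-at-1: bit always reads 1
    return value & ~mask             # Stuck-at-0: bit always reads 0
```

For example, `inject_stuck_at(0b00000101, 8, 1)` returns `0b10000101`, while a Stuck-at-1 fault on a bit that is already 1 leaves the value unchanged.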
We compute the accuracy degradation using the Mean
Absolute Percentage Error (MAPE), and we record both
the maximum and minimum observed degradations along
with their corresponding parameter settings. As shown in
Table 1, Stuck-at-1 faults cause approximately
23% greater accuracy degradation than
Stuck-at-0 faults. This disparity may be attributed to the
asymmetry that activation functions such as ReLU introduce
into value distributions: ReLU zeroes negative outputs
and passes positive ones through unchanged.
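As a reference for the MAPE metric mentioned above, the following is a minimal sketch of the standard definition. It assumes the fault-free outputs serve as the reference and that zero-valued reference entries are skipped to avoid division by zero; both choices, and the function name `mape`, are assumptions rather than details given in the text.

```python
def mape(reference, observed):
    """Mean absolute percentage error, in percent.

    `reference` holds fault-free values and `observed` holds the values
    produced under fault injection. Entries whose reference value is zero
    are skipped (an assumption made for this sketch).
    """
    terms = [abs((r - o) / r) for r, o in zip(reference, observed) if r != 0]
    if not terms:
        raise ValueError("no non-zero reference values to compare against")
    return 100.0 * sum(terms) / len(terms)
```

With `reference = [100, 200]` and `observed = [90, 220]`, each entry deviates by 10% of its reference, so the result is approximately 10.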
Moreover, faults at the MSB cause significantly
greater degradation than those at the LSB, by up
to 46.7%, highlighting the critical influence of bit significance
in fixed-point arithmetic. Similarly, increasing
the number of injected faults from 1 to 16 results in a
21.5% rise in error rates, indicating the cumulative effect
of fault density. Spatially, faults closest to the input
boundary ($D_{fault} = 0$) cause approximately 29.1% more
degradation than those occurring at the farthest boundary
($D_{fault} = 15$), which supports the hypothesis that early-stage
faults propagate more aggressively through the systolic flow.
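The role of bit significance can be made concrete with a short worked bound. Assume a single stuck-at fault at bit index $k$ (1 = LSB, 8 = MSB) in an 8-bit register holding the value $x$, and write $\tilde{x}$ for the corrupted value (a symbol introduced here only for illustration). The fault either leaves the value unchanged or shifts it by the weight of that bit:

$$|\tilde{x} - x| \in \{0,\ 2^{k-1}\}, \qquad \text{so the perturbation is at most } 2^{0} = 1 \text{ for the LSB and } 2^{7} = 128 \text{ for the MSB.}$$

This bounds only the perturbation of a single register value. How it translates into the accuracy figures of Table 1 depends on the accumulation and the network, which the bound does not model.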
Among the evaluated parameters, the bit index and
$D_{fault}$ emerged as the dominant contributors to accuracy
degradation. Accordingly, we target these two factors
in our proposed mitigation techniques. The HL-Swap
mechanism addresses bit-level sensitivity by redirecting
high-bit values to low-bit locations, while the RC-Off
technique mitigates spatial fault propagation by selectively
disabling critical rows or columns based on fault
severity analysis.
Fig. 2. Operation scheme of the proposed HL-Swap fault mitigation method.
Table 1. Microarchitectural fault analysis: impact of fault factors on accuracy drop with MNIST.
2. HL-Swap: High-Low Bit Swapping
According to the fault analysis described earlier, computational
accuracy can degrade by up to 50.2% depending
on the bit index where the fault occurs. In particular,
faults at higher bit positions have a significantly greater
impact on the output compared to lower bits, as higher
bits carry more weight in fixed-point arithmetic. To address
this issue, we propose a fault mitigation technique
called HL-Swap, which is designed to reduce the impact
of bit-index-dependent faults. HL-Swap operates by exchanging
the upper bits with the lower bits in a register
when a fault is detected in the upper half. More specifically,
as illustrated in Fig. 2, when one of the two 8-bit
registers receiving data from neighboring PEs encounters
a fault in its upper 4 bits, HL-Swap swaps them with the
lower 4 bits within the same register. After the swap, a
shifter realigns the data path to ensure that the accumulation
process functions correctly despite the modified bit
positions.
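The effect of the swap can be sketched in software as follows. This is an illustration under stated assumptions, not the authors' hardware: the realignment performed by the shifter is modeled as a second nibble swap on the value read back, and the helper names `swap_nibbles`, `stuck_at`, and `read_through_faulty_register` are hypothetical.

```python
def swap_nibbles(value: int) -> int:
    """Exchange the upper and lower 4 bits of an 8-bit value."""
    value &= 0xFF
    return ((value << 4) | (value >> 4)) & 0xFF


def stuck_at(value: int, bit_index: int, level: int) -> int:
    """Force bit `bit_index` (1 = LSB, 8 = MSB) of an 8-bit value to `level`."""
    mask = 1 << (bit_index - 1)
    return (value | mask) & 0xFF if level else value & ~mask & 0xFF


def read_through_faulty_register(data: int, bit_index: int, level: int,
                                 use_swap: bool) -> int:
    """Store `data` in a register with one stuck bit and read it back.

    With `use_swap`, the data is stored nibble-swapped, so the high-order
    data bits occupy the fault-free lower half; the read-back value is
    swapped again, which stands in for the shifter's realignment here.
    """
    stored = swap_nibbles(data) if use_swap else data & 0xFF
    corrupted = stuck_at(stored, bit_index, level)
    return swap_nibbles(corrupted) if use_swap else corrupted


data = 0b00110101                                                   # 53
without_swap = read_through_faulty_register(data, 8, 1, use_swap=False)
with_swap = read_through_faulty_register(data, 8, 1, use_swap=True)
print(without_swap - data, with_swap - data)                        # prints "128 8"
```

In this example a Stuck-at-1 fault at the MSB shifts the value by 128 without the swap, but only by 8 with it, because the faulty physical bit now holds data bit 4 instead of data bit 8.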
HL-Swap supports four operational modes depending
on the location of the fault: No Swap, Reg0 Swap, Reg1
Swap, and Reg0,1 Swap. The No Swap mode is selected
when no fault is present or when only the lower 4 bits are
affected. Reg0 Swap is applied when the upper 4 bits of
Reg0 are faulty, while Reg1 Swap is used if the upper 4
bits of Reg1 are affected. Reg0,1 Swap is activated when
faults exist in the upper 4 bits of both Reg0 and Reg1.
These swap modes enable adaptive fault handling based
on the location of the fault, and the mechanism provides
effective mitigation without requiring redundant hardware.
By using only simple internal bit manipulation and
a lightweight shifter, HL-Swap offers a practical solution
with minimal area overhead while preserving computational
accuracy.
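The mode selection described above can be summarized in a short sketch. It assumes that fault locations are available as 8-bit masks with a 1 at each faulty bit position; that representation, along with the names `SwapMode` and `select_swap_mode`, is an assumption made for illustration.

```python
from enum import Enum


class SwapMode(Enum):
    NO_SWAP = "No Swap"
    REG0_SWAP = "Reg0 Swap"
    REG1_SWAP = "Reg1 Swap"
    REG01_SWAP = "Reg0,1 Swap"


UPPER_NIBBLE = 0xF0  # bit indices 5..8 of an 8-bit register


def select_swap_mode(reg0_fault_mask: int, reg1_fault_mask: int) -> SwapMode:
    """Choose the HL-Swap mode from per-register fault masks.

    A register is swapped whenever any of its upper 4 bits is faulty,
    following the rule stated in the text. Faults confined to the lower
    4 bits do not trigger a swap.
    """
    reg0_upper = bool(reg0_fault_mask & UPPER_NIBBLE)
    reg1_upper = bool(reg1_fault_mask & UPPER_NIBBLE)
    if reg0_upper and reg1_upper:
        return SwapMode.REG01_SWAP
    if reg0_upper:
        return SwapMode.REG0_SWAP
    if reg1_upper:
        return SwapMode.REG1_SWAP
    return SwapMode.NO_SWAP
```

The text does not say how a register with faults in both halves is handled; this sketch simply applies the upper-half rule in that case.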
Overall scheme of the proposed RC-Off fault mitigation method based on the microarchitectural location of faults.