Iraj Moghaddasi1
Byeong-Gyu Nam1
1 (Department of Computer Science and Engineering, Chungnam National University, Daejeon 305-764, Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index Terms
DNN accelerator, lifetime, inference engine, aging, NBTI, safety-critical, serial processing
I. Introduction
Deep neural networks (DNNs) are increasingly used in safety-critical domains such as
automotive [1] and aerospace [2], where dependability is paramount because failures can have dire consequences,
such as loss of human life or significant environmental harm. Among dependability
attributes, lifetime resilience [3] is a primary concern for intelligent edge devices in safety-critical applications.
Meanwhile, the ever-increasing complexity of DNN models poses challenges for edge devices
with limited resources. Consequently, previous research on DNN deployment has focused
primarily on computation efficiency, for example through compression and quantization,
which reduce the error resilience of special-purpose accelerators by eliminating
intrinsic redundancy [4,5].
Although continuous scaling of CMOS feature size has enabled high-performance
and computation-efficient processing platforms, it has significantly threatened the
lifetime resilience of modern DNN inference engines, primarily due to aging.
Bias Temperature Instability (BTI), including negative BTI (NBTI) and positive BTI (PBTI),
stands out as a dominant aging factor in advanced CMOS technologies [6,7]. BTI causes timing
degradation through continual electrical stress on transistors, leading to performance
reduction, timing errors, and eventual lifetime loss. Operational parameters such as
temperature, voltage, and stress affect BTI degradation [8,9]. Conventional conservative
guard banding appears insufficient for modern hardware accelerators, because a safety
margin of more than 20% over the lifetime costs performance [10]. State-of-the-art research
enhances the lifetime of DNN accelerators through approximation techniques and temperature
mitigation, but with a slight reduction in accuracy [11]. Other efforts extend the lifetime
of DNN accelerators by reducing stress in activation and weight memories using aging-aware
data encoding [12], iterative power gating, and memory bank switching techniques [13].
Separately, appropriate data representation has become an important research direction
in designing DNN accelerators [18]. Conventional number systems appear suboptimal for the
design of specialized DNN accelerators, and diverse alternative number systems have been
explored for them [17].
This research investigates aging mitigation for lifetime resilience enhancement by
reducing stress and increasing redundancy through a Redundant Number System
(RDNS) [19]. Adapting RDNS to DNN acceleration faces a serious obstacle: the redundancy and
overheads that arise from computing in RDNS. To manage these overheads, we propose
Binary Signed Digit (BSD)-serial processing as an alternative to bit-serial processing
elements (PEs) in the conventional binary number system (BNS) [14-16]. In brief, serial
processing can increase efficiency through 1) dynamic precision adjustment, 2) active
computation pruning, and 3) circuit design simplification.
Overall, we combine the concepts of (a) serial processing and (b) computing
in RDNS to evaluate their joint impact on the lifetime resilience
of DNN accelerators. Herein, we evaluate the lifetime of BSD-serial PEs compared to
conventional bit-serial PEs operating in the BNS. According to the literature, execution
elements are among the most timing-critical units for evaluating the overall lifetime
of a processor [20]. To clarify the effectiveness of RDNS computing for lifetime extension,
we use a cross-layer workflow to evaluate BTI in the data path of the DNN accelerator. The
proposed approach jointly exploits input (activation and weight) and number-system
impacts on lifetime resilience. Experimental results demonstrate that the proposed design
can improve lifetime resilience, via stress and degradation mitigation, while conserving
computational efficiency. Computing in RDNS can thus efficiently address the strict lifetime
constraints of safety-critical domains.
The major contributions can be summarized as:
1) For the first time, we explore number-system and workload effects on the BTI stress
of the accelerator data path. Computing in RDNS yields 36% lower stress on average
across diverse workloads.
2) We evaluate the lifetime of PEs considering all affecting factors, i.e., stress,
temperature, and voltage variations. BSD-serial processing yields on average a 35.5%
longer lifetime in terms of MTTF (mean time to failure) compared to the baseline.
3) We introduce a cross-layer workflow to evaluate aging in the DNN accelerator data
path.
The rest of this paper is organized as follows. Section II presents the preliminaries.
The proposed RDNS-based processing approach is overviewed in Section III. Section
IV clarifies the detailed architecture of processing elements. Section V presents
the experimental results and comparison over the baseline. Finally, Section VI concludes
the paper.
II. Background
In PMOS transistors, NBTI shifts the threshold voltage, which increases critical
path delays and eventually causes timing errors. NBTI occurs when PMOS transistors
are under stress with a negative voltage bias ($V_{GS}=-V_{dd}$). The two major theories
explaining the NBTI process are the reaction-diffusion (RD) and trapping/de-trapping
(TD) models [8]. Both models describe NBTI as a two-stage process with stress and recovery phases.
The long-term NBTI degradation model based on RD theory is described by [21]:
where $A$ depends on operational parameters such as temperature ($T$) and voltage ($V_{dd}$),
$K$ is a fitting parameter, $Y$ is stress, $t$ is service time, and $n$ is assumed to be between
$1/4$ and $1/6$ depending on the diffusing species [21]. In this work, we use the RD model for NBTI aging evaluation with $n=1/6$. According to [22], $A\left(T,\,V_{dd}\right)$ can be calculated as:
where $t_{ox}$ is the effective oxide thickness and $K_{B}$ is the Boltzmann constant
($8.6\times 10^{-5}\,eV/K$), while $E_{0}=0.1897\,eV$ and $B=0.075\,eV\,nm/V$ are fitting
parameters. As seen, the main workload factor in NBTI aging is stress, which can be
evaluated from the duty cycles or signal probability ($SP$) of internal nodes. For PBTI,
$SP$ is the ratio of the time with logic one at a gate input to the total service time,
while the probability of zeros ($1-SP$) determines the NBTI-induced degradation in PMOS
devices. Thus, both $SP$ and $1-SP$ are needed to cover both NBTI and PBTI. These factors
are given to the model to estimate the delay degradation.
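As a minimal sketch of how the stress factors described above could be extracted, the snippet below computes $SP$ from a per-cycle trace of one gate input and derives the PBTI and NBTI stress from it. The function names and the example trace are hypothetical and are not taken from the evaluation flow of this paper.

```python
from typing import Dict, Sequence


def signal_probability(samples: Sequence[int]) -> float:
    """Fraction of sampled cycles in which the node holds logic one (SP)."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for s in samples if s) / len(samples)


def bti_stress(samples: Sequence[int]) -> Dict[str, float]:
    """Stress seen by each device type at this gate input.

    Following the text: SP (time at logic one) is used for PBTI,
    and 1 - SP (time at logic zero) is used for NBTI in PMOS devices.
    """
    sp = signal_probability(samples)
    return {"pbti": sp, "nbti": 1.0 - sp}


# Hypothetical per-cycle values of one gate input (illustration only).
trace = [1, 0, 0, 1, 0, 0, 0, 1]
stress = bti_stress(trace)
print(stress)
```

In a real flow the trace would come from the simulated switching activity of each internal node rather than from a hand-written list.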
Also, the operating temperature of a chip can be calculated from [23]:
where $T_{chip}$ is the average temperature, $T_{a}$ is the ambient temperature ($T_{a}=25^{\circ}C$),
$P_{tot}$ is the total power consumption, $A$ (in $cm^{2}$) is the chip area, and $R_{\theta}$
is the equivalent thermal resistance.
In this study, lifetime is defined and estimated as the period after which $\Delta V_{th}$
reaches 10% of the initial nominal value [24].
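The lifetime definition above can be illustrated with a small sketch. It assumes a power-law degradation of the form $\Delta V_{th}=C\cdot (Y\cdot t)^{n}$, where $C$ lumps together the temperature, voltage, and fitting terms; this functional form, the constant `c`, and all function names are assumptions made for illustration, not the exact model of [21] or [22]. Under this assumption, lifetime scales inversely with stress $Y$.

```python
def degradation(stress: float, t: float, c: float, n: float = 1 / 6) -> float:
    """Assumed power-law threshold voltage shift: c * (stress * t) ** n."""
    return c * (stress * t) ** n


def lifetime(stress: float, vth_limit: float, c: float, n: float = 1 / 6) -> float:
    """Time at which the assumed degradation reaches vth_limit."""
    if stress <= 0 or c <= 0:
        raise ValueError("stress and c must be positive")
    return (vth_limit / c) ** (1.0 / n) / stress


def lifetime_ratio(stress_base: float, stress_new: float) -> float:
    """Lifetime of the new design relative to the baseline (c and n cancel)."""
    return stress_base / stress_new


def degradation_ratio(stress_base: float, stress_new: float, n: float = 1 / 6) -> float:
    """Degradation of the new design relative to the baseline at equal time."""
    return (stress_new / stress_base) ** n


# Overall stress values reported in Table 2: 0.71 (Bit-serial), 0.52 (BSD-serial).
print(lifetime_ratio(0.71, 0.52))      # about 1.365
print(degradation_ratio(0.71, 0.52))   # about 0.949
```

With the overall stress values from Table 2, this simplified scaling gives a lifetime ratio of about 1.365, which is in line with the reported 35.5% lifetime extension, although the full evaluation also accounts for temperature and voltage.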
III. BSD-serial Processing Approach
The target operation of PEs can be described by $\sum _{i=0}^{l-1}W_{i}\times A_{i}$,
where $W_{i}$ and $A_{i}$ represent weights and activations, respectively. The conventional
binary bit-serial PE (Bit-serial) is the baseline of this study. Compared to fixed-precision
PEs, serial processing can improve performance by 2.33X on average via on-the-fly
per-layer bit precision adjustment [14]. Here, activations arrive bit-serially and weights arrive bit-parallel, both in binary.
In each cycle, a bit column of input activations is bitwise ANDed with the corresponding
parallel weight bits to produce partial products. Then, the partial products are fed to
a compressor to generate the sum of products. Fig. 1(a) shows the computing approach of the conventional bit-serial architecture. The serial
engine processes inputs in loops of $p$ cycles, where $p$ is the activation precision in
bits. In the first cycle of a loop, the MSBs of $l$ concurrent input activations are
ANDed with the weights, which produces $l$ $p$-bit terms. The compressor sums these terms into a partial
sum using an adder tree. For the remaining $p-1$ cycles of the loop, the accumulator
shifts the previous residual by one bit while accumulating the new sum of products.
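The MSB-first loop described above can be sketched behaviorally as follows. This is a functional model under simplifying assumptions (unsigned $p$-bit activations, weights as plain Python integers, the adder tree modeled by `sum`); the function name is hypothetical and the code does not model the hardware timing.

```python
from typing import Sequence


def bit_serial_dot(weights: Sequence[int], activations: Sequence[int], p: int) -> int:
    """MSB-first bit-serial sum of W_i * A_i for unsigned p-bit activations."""
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    if any(a < 0 or a >= (1 << p) for a in activations):
        raise ValueError("activations must be unsigned p-bit values")

    acc = 0
    for cycle in range(p):
        bit_pos = p - 1 - cycle  # MSB is processed in the first cycle
        # AND each weight with the current activation bit (partial products).
        partial_products = [
            w if (a >> bit_pos) & 1 else 0 for w, a in zip(weights, activations)
        ]
        # Compressor: sum the partial products of this bit column.
        column_sum = sum(partial_products)
        # Accumulator: shift the previous residual by one bit, then add.
        acc = (acc << 1) + column_sum
    return acc


weights = [3, -2, 5, 1]
activations = [7, 4, 0, 15]
print(bit_serial_dot(weights, activations, p=4))  # 28
```

The result equals the ordinary dot product, since shifting the accumulator once per cycle gives each bit column its binary weight.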
According to Fig. 1(b), the processing approach of the BSD-serial PE is similar to that of Bit-serial, but it replaces BNS
with RDNS in the input activations, partial products, partial sums, and generated output.
Herein, we keep radix = 2 with the corresponding redundant digit set $\left\{-1,0,1\right\}$.
A digit is represented by a pair of positive and negative bits
with arithmetic values of 0 or 1 and -1 or 0, respectively. For example, $N = +11$
can be represented as $1011$ or $110\overline{1}$, which in positive/negative representation
reads as ($N^{+}=1011$, $N^{-}=1111$) or ($N^{+}=1100$, $N^{-}=1110$).
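The example above can be checked with a short decoding sketch. It assumes, as inferred from the $N^{-}$ strings in the example, that a negative bit stored as 1 has arithmetic value 0 and stored as 0 has arithmetic value -1, so each digit equals positive bit + negative bit - 1. This encoding convention and the function names are assumptions for illustration.

```python
from typing import List


def bsd_digits(pos_bits: str, neg_bits: str) -> List[int]:
    """Decode positive/negative bit strings (MSB first) into digits in {-1, 0, 1}.

    Assumed convention: digit = positive bit + negative bit - 1,
    i.e. a stored negative bit of 1 means 0 and a stored 0 means -1.
    """
    if len(pos_bits) != len(neg_bits):
        raise ValueError("bit strings must have the same length")
    return [int(p) + int(n) - 1 for p, n in zip(pos_bits, neg_bits)]


def bsd_value(digits: List[int]) -> int:
    """Radix-2 value of a signed-digit string given MSB first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


first = bsd_digits("1011", "1111")   # digits 1, 0, 1, 1
second = bsd_digits("1100", "1110")  # digits 1, 1, 0, -1
print(first, bsd_value(first))
print(second, bsd_value(second))
```

Both encodings decode to 11, which illustrates the redundancy of the number system: one value has more than one digit-level representation.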
The BSD-serial processing approach performs inference with minimized latency (or
maximized frequency) independent of the weight bit-precision, owing to the limited carry
propagation in RDNS adders. Fig. 2 illustrates an N-digit RDNS adder, whose overall architecture is similar to carry-save
addition.
Fig. 1. Different processing approaches in accelerator PEs: (a) BNS Bit-serial; (b)
RDNS BSD-serial.
Fig. 2. N-digit RDNS adder based on carry-save addition.
IV. Detailed Design of PEs
The BSD-serial PE architecture is composed of three main subunits: an RDNS partial
product generator (R-PPG), an RDNS compressor (R-C), and an RDNS accumulator (R-AC).
Fig. 3 illustrates the overall BSD-serial architecture, which extends the binary bit-serial PE to
RDNS with 16-bit precision activation and weight inputs. Activations arrive in RDNS
while weights are in binary 2’s complement. In the R-C and R-AC, binary adders are replaced
with carry-save adders, which decrease latency by eliminating carry propagation.
The R-PPG receives 16${\times}$16-bit synapse weights in binary and 16 BSD-serial activations.
It generates 16 partial products, each in 16-digit BSD, which feed the $X_{0}$
to $X_{15}$ inputs of the R-C to generate the compressed partial sum output. The R-AC operates in 16-cycle
loops, starting with a zero residual, and accumulates the compressed partial sums produced
by the R-C.
Fig. 3. Overall BSD-Serial PE Architecture.
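The data flow through the three subunits can be sketched behaviorally as below. This is only an arithmetic model: plain Python integers stand in for the RDNS partial products and carry-save adders, and the helper `to_bsd` uses one simple (sign-and-magnitude style) choice among the many valid BSD encodings. All names are hypothetical.

```python
from typing import List, Sequence


def to_bsd(value: int, digits: int) -> List[int]:
    """One valid BSD encoding (MSB first): binary digits of |value| carrying its sign."""
    mag = abs(value)
    if mag >= (1 << digits):
        raise ValueError("value does not fit in the requested number of digits")
    sign = -1 if value < 0 else 1
    return [sign * ((mag >> k) & 1) for k in range(digits - 1, -1, -1)]


def bsd_serial_dot(weights: Sequence[int], activations_bsd: Sequence[Sequence[int]]) -> int:
    """Digit-serial sum of W_i * A_i with activations given as BSD digit lists."""
    if len(weights) != len(activations_bsd):
        raise ValueError("weights and activations must have the same length")
    n_digits = len(activations_bsd[0])
    if any(len(a) != n_digits for a in activations_bsd):
        raise ValueError("all activations must have the same number of digits")

    acc = 0  # R-AC starts each loop with a zero residual
    for cycle in range(n_digits):
        # R-PPG: each activation digit selects +W, 0, or -W.
        partial_products = [a[cycle] * w for w, a in zip(weights, activations_bsd)]
        # R-C: compress the partial products into one partial sum.
        compressed = sum(partial_products)
        # R-AC: shift the residual by one digit and accumulate.
        acc = 2 * acc + compressed
    return acc


weights = [3, -2, 5, 1]
activations = [7, -4, 0, 15]
encoded = [to_bsd(a, 16) for a in activations]
print(bsd_serial_dot(weights, encoded))  # 44
```

Because the result depends only on the value of each digit string, any other valid BSD encoding of the same activations would produce the same sum.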
Fig. 4 summarizes the proposed cross-layer workflow, including the tasks required in different
layers for lifetime evaluation. We first describe the Bit-serial
and BSD-serial architectures in RTL and synthesize them using a 28 nm cell library
at a 0.9 V nominal voltage to produce the netlist and standard delay format (SDF) files.
Table 1 summarizes the reports generated by the Design Compiler synthesis tool.
In parallel, different DNN models, e.g., ResNet18 on the ImageNet dataset, are deployed
in Python to profile the evaluation benchmarks, including weights and activations. Next,
post-synthesis cycle-accurate simulations are run on the benchmarks to produce timing
reports and value change dump (VCD) outputs. The VCDs are then analyzed to generate power
and stress (signal probability) reports using the PrimeTime power analysis tool. After
that, the maximum temperature is estimated from the measured power and
area using the HotSpot tool, based on Eq. (3). Finally, MATLAB is used to predict degradation and lifetime for the different architectures
and workloads, considering all factors.
Table 1. The Summary of synthesis results for one PE
Feature \ Architecture | Bit-Serial | BSD-Serial | Improvement
Area ($mm^{2}$) | 0.002780 | 0.003720 | -25%
Leakage Power | 25 μW | 27.6 μW | -10%
Dynamic Power | 246 μW | 310 μW | -26%
Cycle Time (Latency) | 2.81 ns | 1.37 ns | +51%
Maximum Frequency | 355 MHz | 730 MHz | +106%
Max. Performance/Area | 127697 | 196236 | +53%
Fig. 4. The proposed cross-layer evaluation workflow.
This workflow is cross-layer because it analyzes interactions across multiple layers.
It differs from conventional ASIC design flows mainly in that it adds BTI stress estimation
and lifetime prediction tasks. In this study, LeNet5 on the MNIST dataset is used
for the primary evaluation. VGG16 and ResNet18 on the ImageNet dataset are used
for supplementary evaluation and to extend the experimental results.
1. Experimental Results
In this section, we first evaluate the effects of architecture and number system
on the internal stress of the PEs while running different DNN models. Fig. 5 shows the histogram of the stress ($SP$) variations for the target designs
running different DNNs. Computing in RDNS makes the stress and recovery
phases more balanced. Moreover, Fig. 6 shows the average stress variations among 16 different
filters of ResNet18 for 10 different input images. On average, the BSD-serial architecture exhibits 36% lower stress than
the conventional Bit-serial PE. As mentioned, both $SP$ and $1-SP$ are considered in the
comparison.
Next, we explore the effect of considering stress in aging and lifetime estimation
for the bit-serial and BSD-serial designs. For clarity, we calculated
the $V_{th}$ degradation for successive runs of LeNet5, VGG16, and
ResNet18 on the introduced PEs. We assumed the worst-case temperature and a fixed $V_{dd}$
across the PE matrices, so only stress varies along DNN executions. According
to the synthesis results, the temperature is almost equal for the Bit-serial and BSD-serial
designs, because area and power increase almost equally, considering Eq. (3). Fig. 7 shows the NBTI-induced aging degradation when running different DNNs, in the form
of the $V_{th}$ shift, which starts from 0 and increases up to 0.03
V (10% of the initial value). Moreover, Fig. 8 shows the lifetime improvement in MTTF considering stress variations among
the different designs. As seen, the BSD-serial design extends the lifetime by 35.5% compared
to the baseline Bit-serial design by balancing BTI stress.
Fig. 5. The histograms of stress variations among architectures running LeNet5 (MNIST),
VGG16, and ResNet18 (ImageNet).
Fig. 6. BTI stress variations among different architectures, filters, and input images
running ResNet-18.
Fig. 7. NBTI degradation in target architectures running different DNN filters considering
stress variations.
Fig. 8. Lifetime comparison among designs running LeNet5, VGG16, and ResNet18 filters
considering stress variations.
2. Discussion
Table 2 gives an overview of the evaluation results, obtained from cycle-accurate
post-synthesis simulations running diverse DNN models and datasets. The BSD-serial
design showed a more balanced BTI stress than the baseline, in the
form of a 36% reduction in duty cycle or signal probability ($SP$). Owing to the decrease
in stress, the BSD-serial PE mitigated $V_{th}$ degradation over the bit-parallel baseline
(7%), which resulted in an almost 35.5% lifetime extension over the bit-serial baseline.
Table 2. Summary of simulation results
Architecture \ DNN | LeNet5 | VGG16 | ResNet18 | Overall
BTI Stress - Duty Cycle (SP)
Bit-serial | 0.75 | 0.70 | 0.68 | 0.71
BSD-serial | 0.52 | 0.52 | 0.53 | 0.52
Improvement | 37% | 35% | 33% | 36%
Degradation (mV) during 10 years
Bit-serial | 29.58 | 29.23 | 29.14 | 29.31
BSD-serial | 27.83 | 27.86 | 27.90 | 27.86
Improvement | - | - | - | 32%
Lifetime (year)
Bit-serial | 10.89 | 11.69 | 11.92 | 11.50
BSD-serial | 15.69 | 15.61 | 15.47 | 15.59
Improvement | - | - | - | 35.5%
VI. Conclusion
Hardware accelerators have shown high computation efficiency, making them the first
choice for DNN acceleration on edge devices. Considering the processing element (PE)
array, with its multiply-accumulate functionality, as the heart of DNN accelerators, this
research combines computing in RDNS with a serial processing approach to extend
lifetime resilience, investigating the BTI stress variations on the accelerator’s
data path. The proposed technique extended the lifetime by 35.5% compared to the conventional
bit-serial PE by reducing the stress by 36% on average.
For future research, we aim to extend computing in RDNS, and the BSD-serial computing
concept in particular, to emerging DNN accelerators with systolic array architectures
to improve lifetime resilience. Additionally, we will explore operating voltage scaling,
leveraging the performance gained by computing in RDNS to further improve lifetime
resilience. Ultimately, we intend to expand the RDNS capability in lifetime extension
to support both the training and inference computing phases.
ACKNOWLEDGMENTS
This work was supported in part by the research fund of Chungnam National University,
and in part by the National Research Foundation of Korea (NRF) grant funded by the
Korean government (MSIT) (No. 2022R1A5A8026986).
References
[1] S. Alcaide, L. Kosmidis, C. Hernandez, and J. Abella, “High-integrity GPU designs for critical real-time automotive systems,” in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2019, pp. 824-829.
[2] C. Adams, A. Spain, J. Parker, M. Hevert, J. Roach, and D. Cotten, “Towards an integrated GPU accelerated SoC as a flight computer for small satellites,” in 2019 IEEE Aerospace Conference, 2019, pp. 1-7.
[3] I. Moghaddasi, S. Gorgin, and J.-A. Lee, “Dependable DNN Accelerator for Safety-critical Systems: A Review on the Aging Perspective,” IEEE Access, 2023.
[4] A. Arunachalam, S. Kundu, A. Rahat, S. Banerjee, and K. Basu, “Fault Resilience of DNN Accelerators for Compressed Sensor Inputs,” in 2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2022, pp. 329-332.
[5] M. Riera, J. M. Arnau, and A. Gonzalez, “DNN pruning with principal component analysis and connection importance estimation,” Journal of Systems Architecture, vol. 122, p. 102336, 2022.
[6] D. S. Huang et al., “Comprehensive device and product level reliability studies on advanced CMOS technologies featuring 7nm high-k metal gate FinFET transistors,” in 2018 IEEE International Reliability Physics Symposium (IRPS), 2018, pp. 6F-7.
[7] C. Liu et al., “Systematical study of 14nm FinFET reliability: From device level stress to product HTOL,” in 2015 IEEE International Reliability Physics Symposium, 2015, pp. 2F-3.
[8] I. Hill, P. Chanawala, R. Singh, S. A. Sheikholeslam, and A. Ivanov, “CMOS Reliability from Past to Future: A Survey of Requirements, Trends, and Prediction Methods,” IEEE Transactions on Device and Materials Reliability, vol. 22, no. 1, pp. 1-18, Mar. 2022. doi: 10.1109/TDMR.2021.3131345
[9] I. Moghaddasi, A. Fouman, M. E. Salehi, and M. Kargahi, “Instruction-level NBTI stress estimation and its application in runtime aging prediction for embedded processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, no. 8, pp. 1427-1437, 2018.
[10] W. Wang, S. Yang, S. Bhardwaj, S. Vrudhula, F. Liu, and Y. Cao, “The impact of NBTI effect on combinational circuit: Modeling, simulation, and analysis,” IEEE Trans Very Large Scale Integr VLSI Syst, vol. 18, no. 2, pp. 173-183, 2009.
[11] G. Zervakis et al., “Thermal-aware design for approximate DNN accelerators,” IEEE Transactions on Computers, vol. 71, no. 10, pp. 2687-2697, 2022.
[12] M. A. Hanif and M. Shafique, “DNN-Life: An Energy-Efficient Aging Mitigation Framework for Improving the Lifetime of On-Chip Weight Memories in Deep Neural Network Hardware Architectures,” in Proceedings - Design, Automation and Test in Europe (DATE), Feb. 2021, pp. 729-734. doi: 10.23919/DATE51398.2021.9473943
[13] N. Landeros Muñoz, A. Valero, R. G. Tejero, and D. Zoni, “Gated-CNN: Combating NBTI and HCI aging effects in on-chip activation memories of Convolutional Neural Network accelerators,” Journal of Systems Architecture, vol. 128, Jul. 2022. doi: 10.1016/j.sysarc.2022.102553
[14] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bit-serial deep neural network computing,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1-12.
[15] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, “UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision,” IEEE J Solid-State Circuits, vol. 54, no. 1, pp. 173-185, 2018.
[16] M. Capra, F. Conti, and M. Martina, “A Multi-Precision Bit-Serial Hardware Accelerator IP for Deep Learning Enabled Internet-of-Things,” in 2021 IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 192-197.
[17] V. Sakellariou, V. Paliouras, I. Kouretas, H. Saleh, and T. Stouraitis, “A multiplier-Free RNS-Based CNN accelerator exploiting bit-Level sparsity,” IEEE Trans Emerg Top Comput, pp. 1-16, 2023. doi: 10.1109/TETC.2023.3301590
[18] G. Alsuhli, V. Sakellariou, H. Saleh, M. Al-Qutayri, B. Mohammad, and T. Stouraitis, “Conventional Number Systems for DNN Architectures,” in Number Systems for Deep Neural Network Architectures, Springer, 2023, pp. 17-25.
[19] G. Jaberipur, “Redundant number system-based arithmetic circuits,” Arithmetic Circuits for DSP Applications, pp. 273-312, 2017.
[20] F. Oboril, F. Firouzi, S. Kiamehr, and M. Tahoori, “Reducing NBTI-induced processor wearout by exploiting the timing slack of instructions,” in Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, 2012, pp. 443-452.
[21] Y. Chen, A. Calimera, E. Macii, and M. Poncino, “Characterizing the activity factor in NBTI aging models for embedded cores,” in Proceedings of the 25th edition on Great Lakes Symposium on VLSI, 2015, pp. 75-78.
[22] V. B. Kleeberger, M. Barke, C. Werner, D. Schmitt-Landsiedel, and U. Schlichtmann, “A compact model for NBTI degradation and recovery under use-profile variations and its application to aging analysis of digital integrated circuits,” Microelectronics Reliability, vol. 54, no. 6, pp. 1083-1089, 2014.
[23] M. Pedram and S. Nazarian, “Thermal modeling, analysis, and management in VLSI circuits: Principles and methods,” Proceedings of the IEEE, vol. 94, no. 8, pp. 1487-1501, 2006.
[24] J. W. McPherson, “Time-to-failure modeling,” Reliability Physics and Engineering: Time-To-Failure Modeling, pp. 37-49, 2013.
Iraj Moghaddasi received the B.Sc. degree in computer engineering from Shahid
Beheshti University, Tehran, in September 2000, the M.Sc. degree in computer engineering
from the Iran University of Science and Technology, Tehran, in May 2003, and the Ph.D.
degree from the School of Electrical and Computer Engineering, University of Tehran,
Tehran, in September 2018. From 2019 to 2021, he was a Research Associate with the
Iran Telecommunication Research Center (ITRC). He is currently a Postdoctoral Researcher
at Chungnam National University, Daejeon, Korea. His research interests include computer
architecture, reliable and high-performance computing, hardware modeling & architectural
exploration of DNN edge accelerators, embedded systems, and processing in memory for
computation-efficient machine learning.
Byeong-Gyu Nam (Senior Member, IEEE) received his B.S. degree (summa cum laude)
in computer engineering from Kyungpook National University, Daegu, Korea, in 1999,
M.S. and Ph.D. degrees in electrical engineering and computer science from Korea Advanced
Institute of Science and Technology (KAIST), Daejeon, Korea, in 2001 and 2007, respectively.
Dr. Nam is currently with Chungnam National University, Daejeon, Korea, as a professor.
His current interests include machine learning processors, graphics processors, and
low-power SoC design. He has served as the Chair of the Digital Architectures and
Systems (DAS) subcommittee of ISSCC from 2017 to 2019 and was a member of the TPC
for IEEE ISSCC, IEEE A-SSCC, IEEE COOL Chips, and ASP-DAC. He served as an Associate
Editor of the IEIE Journal of Semiconductor Technology and Science (JSTS) and a Guest
Editor for the IEEE Journal of Solid-State Circuits (JSSC) in 2013.