
  1. (Department of Computer Science and Engineering, Chungnam National University, Daejeon 305-764, Korea)



DNN accelerator, lifetime, inference engine, aging, NBTI, safety-critical, serial processing

I. Introduction

Deep neural networks (DNNs) are increasingly used in safety-critical disciplines such as automotive [1] and aerospace [2], where dependability is paramount because failures can have dire consequences, such as loss of human life or significant environmental harm. Among dependability attributes, lifetime resilience [3] is a primary concern for intelligent edge devices in safety-critical applications. Meanwhile, the ever-increasing complexity of DNN models challenges edge devices with limited resources. Consequently, previous research on DNN deployment has focused primarily on computation efficiency, for example through compression and quantization, which reduce the error resilience of special-purpose accelerators by eliminating intrinsic redundancy [4,5].

Although continuous scaling of CMOS feature size has produced high-performance, computation-efficient processing platforms, it has significantly threatened the lifetime resilience of modern DNN inference engines, primarily through aging. Bias Temperature Instability (BTI), which comprises negative BTI (NBTI) and positive BTI (PBTI), stands out as a dominant aging factor in advanced CMOS technologies [6,7]. BTI causes timing degradation through continual electrical stress on transistors, leading to performance reduction, timing errors, and eventually lifetime loss. Operational parameters such as temperature, voltage, and stress all affect BTI degradation [8,9]. The conventional remedy, conservative guard banding, appears insufficient for modern hardware accelerators because it costs performance: it requires a safety margin of more than 20% over the lifetime [10]. State-of-the-art research enhances the lifetime of DNN accelerators through approximation techniques and temperature mitigation, at the cost of a slight reduction in accuracy [11]. Other leading efforts extend the lifetime by reducing stress in activation and weight memories using aging-aware data encoding [12], iterative power gating, and memory bank switching [13]. In parallel, data representation has become an important research direction in DNN accelerator design [18]. Conventional number systems appear suboptimal for specialized DNN accelerators, and diverse alternative number systems have been explored [17].

This research investigates aging mitigation for lifetime resilience enhancement by reducing stress and increasing redundancy through a Redundant Number System (RDNS) [19]. A serious obstacle to adopting RDNS for DNN acceleration is the redundancy and overhead that computing in RDNS introduces. To manage these overheads, we propose Binary Signed Digit (BSD)-serial processing in place of the bit-serial processing elements (PEs) that operate in the conventional binary number system (BNS) [14-16]. In brief, serial processing can increase efficiency through 1) dynamic precision adjustment, 2) active pruning of computation, and 3) simplified circuit design.

Overall, we combine the concepts of (a) serial processing and (b) computing in RDNS to evaluate their joint impact on the lifetime resilience of DNN accelerators. We evaluate the lifetime of BSD-serial PEs against conventional bit-serial PEs operating in the BNS. According to the literature, execution elements are among the most timing-critical units for evaluating the overall lifetime of a processor [20]. To clarify how effective RDNS computing is for lifetime extension, we use a cross-layer workflow to evaluate BTI in the data path of the DNN accelerator. The proposed approach jointly exploits the impacts of the inputs (activations and weights) and of the number system on lifetime resilience. Experimental results demonstrate that the proposed design can improve lifetime resilience, through stress and degradation mitigation, while conserving computational efficiency. Computing in RDNS can therefore help address the strict lifetime constraints of safety-critical disciplines.

The major contributions can be summarized as:

1) For the first time, we explore the effects of the number system and the workload on BTI stress in the accelerator data path. Computing in RDNS yields 36% lower stress on average across diverse workloads.

2) We evaluate the lifetime of PEs considering all affecting factors, i.e., stress, temperature, and voltage variations. BSD-serial processing yields a 35.5% higher lifetime on average, in terms of MTTF (mean time to failure), compared to the baseline.

3) We introduced a cross-layer workflow to evaluate aging in the DNN accelerator data path.

The rest of this paper is organized as follows. Section II presents the preliminaries. The proposed RDNS-based processing approach is overviewed in Section III. Section IV clarifies the detailed architecture of processing elements. Section V presents the experimental results and comparison over the baseline. Finally, Section VI concludes the paper.

II. Background

In PMOS transistors, NBTI shifts the threshold voltage, which increases critical path delays and eventually causes timing errors. NBTI occurs when PMOS transistors are stressed by a negative gate bias ($V_{GS}=-V_{dd}$). The two major theories that explain the NBTI process are the reaction-diffusion (RD) and trapping/de-trapping (TD) models [8]. Both describe NBTI as a two-stage process with stress and recovery phases. The long-term NBTI degradation model based on RD theory is given by [21]:

(1)
$ \Delta V_{th} = K \cdot A\left(T, V_{dd}\right) Y^{n} t^{n} $

where $A$ depends on operational parameters such as temperature ($T$) and voltage ($V_{dd}$), $K$ is a fitting parameter, $Y$ is stress, $t$ is service time, and $n$ lies between $1/6$ and $1/4$ depending on the diffusing species [21]. In this work, we use the RD model for NBTI aging evaluation with $n=1/6$. According to [22], $A\left(T, V_{dd}\right)$ can be calculated as:

(2)
$ A\left(T, V_{dd}\right)=\exp \left(-\frac{E_{0}}{K_{B} T}\right) \cdot \exp \left(\frac{B\, V_{dd}}{t_{ox} K_{B} T}\right) $

where $t_{ox}$ is the effective oxide thickness and $K_{B}$ is the Boltzmann constant ($8.6\times 10^{-5}$ eV/K), while $E_{0}=0.1897$ eV and $B=0.075$ eV nm/V are fitting parameters. As seen, the main workload-dependent factor in NBTI aging is stress, which can be evaluated from the duty cycles or signal probability ($SP$) of internal nodes. For PBTI, $SP$ is the ratio of the time with logic one at a gate input to the total service time, while the probability of zeros determines the NBTI-induced degradation in PMOS devices. Thus, both $SP$ and $1-SP$ are needed to cover NBTI and PBTI. These factors are given to the model to estimate the delay degradation.
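As a minimal sketch, Eqs. (1) and (2) can be evaluated numerically as follows. The constants $K_{B}$, $E_{0}$, and $B$ are the values quoted above and the stress values are the overall duty cycles reported in Table 2; the oxide thickness, the fitting constant $K$, and the temperature are hypothetical placeholders, so only the ratio between the two results is meaningful.

```python
import math

K_B = 8.6e-5   # Boltzmann constant in eV/K, value quoted in the text
E_0 = 0.1897   # eV, fitting parameter from the text
B = 0.075      # eV*nm/V, fitting parameter from the text


def a_factor(temp_k, vdd, t_ox_nm):
    """Eq. (2): temperature- and voltage-dependent factor A(T, Vdd)."""
    kt = K_B * temp_k
    return math.exp(-E_0 / kt) * math.exp(B * vdd / (t_ox_nm * kt))


def delta_vth(k_fit, temp_k, vdd, t_ox_nm, stress, t, n=1.0 / 6.0):
    """Eq. (1): long-term threshold-voltage shift K * A * Y^n * t^n."""
    return k_fit * a_factor(temp_k, vdd, t_ox_nm) * (stress * t) ** n


# Hypothetical operating point: only Vdd = 0.9 V is taken from the paper.
T_OX_NM = 1.2    # assumed oxide thickness (nm)
K_FIT = 1.0      # assumed fitting constant, so results are relative only
TEMP_K = 350.0   # assumed chip temperature (K)

base = delta_vth(K_FIT, TEMP_K, 0.9, T_OX_NM, stress=0.71, t=1.0)
bsd = delta_vth(K_FIT, TEMP_K, 0.9, T_OX_NM, stress=0.52, t=1.0)
print(f"relative shift, BSD-serial / Bit-serial: {bsd / base:.3f}")
```

Because $K$ and $A$ cancel in the ratio, the printed value depends only on the two stress values and on $n$.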

Also, the operating temperature of a chip can be calculated from [23]:

(3)
$ T_{chip} = T_{a} + R_{\theta} \cdot \frac{P_{tot}}{A} $

where $T_{chip}$ is the average temperature, $T_{a}$ is the ambient temperature ($T_{a}=25^{\circ}C$), $P_{tot}$ is the total power consumption, $A$ (in $cm^{2}$) is the chip area (not to be confused with $A\left(T, V_{dd}\right)$ in Eq. (2)), and $R_{\theta }$ is the equivalent thermal resistance.

In this study, lifetime is defined and estimated as the period after which $\Delta V_{th}$ reaches 10% of the initial nominal threshold voltage [24].
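Under this definition, Eq. (1) can be inverted to give the lifetime in closed form. The symbol $\Delta V_{th}^{\max}$ for the failure threshold is introduced here only for illustration. Setting $\Delta V_{th}=\Delta V_{th}^{\max}$ and solving for $t$ gives:

$ t_{life} = \frac{1}{Y}\left(\frac{\Delta V_{th}^{\max}}{K \cdot A\left(T, V_{dd}\right)}\right)^{1/n} $

For two designs operating at the same temperature and voltage, the bracketed term is identical, so the lifetime ratio reduces to the inverse stress ratio:

$ \frac{t_{life,1}}{t_{life,2}} = \frac{Y_{2}}{Y_{1}} $

In this simplified model, a reduction in stress therefore translates directly into a proportional lifetime gain, independent of the exponent $n$.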

III. BSD-serial Processing Approach

The target operation of the PEs is $\sum _{i=0}^{l-1}W_{i}\times A_{i}$, where $W_{i}$ and $A_{i}$ are weights and activations, respectively. The baseline of this study is the conventional binary serial PE (Bit-serial). Compared to fixed-precision PEs, serial processing can improve performance by 2.33X on average through on-the-fly, per-layer adjustment of bit precision [14]. Here, activations arrive bit-serially and weights arrive bit-parallel, both in binary. In each cycle, a bit column of input activations is bitwise ANDed with the corresponding parallel weight bits to produce partial products. The partial products are then fed to a compressor that generates the sum of products. Fig. 1(a) shows the computing approach of the conventional bit-serial architecture. The serial engine processes inputs in loops of $p$ cycles, where $p$ is the activation precision in bits. In the first cycle of a loop, the MSBs of $l$ concurrent input activations are ANDed with the weights, which produces $l$ $p$-bit terms. The compressor sums these terms into a partial sum using an adder tree. In each of the remaining $p-1$ cycles of the loop, the accumulator shifts the previous residual by one bit and accumulates the new sum of products.

As Fig. 1(b) shows, the BSD-serial PE processes data in the same way as Bit-serial but replaces BNS with RDNS for the input activations, partial products, partial sums, and generated output. We keep radix 2 with the corresponding redundant digit set $\left\{-1,0,1\right\}$. Each digit is represented by a pair of bits, one positive and one negative, with arithmetic values of 0 or 1 and -1 or 0, respectively. For example, $N = +11$ can be represented as $1011$ or $110\overline{1}$, which in positive/negative representation reads as ($N^{+}=1011$, $N^{-}=1111$) or ($N^{+}=1100$, $N^{-}=1110$).
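The digit encoding and the serial accumulation described above can be sketched in a few lines of Python. This is a behavioral illustration, not the authors' RTL. The function names are hypothetical, and the decoder assumes an inverted encoding of the negative bit (stored 1 means 0, stored 0 means -1), which is what the $N-$ patterns in the example imply.

```python
def bsd_digits(pos_bits, neg_bits):
    """Decode positive/negative bit strings (MSB first) into digits in {-1, 0, 1}.

    Assumption: negative bits use inverted encoding (stored 1 -> 0,
    stored 0 -> -1), as implied by the example patterns in the text.
    """
    return [int(p) + int(n) - 1 for p, n in zip(pos_bits, neg_bits)]


def digits_value(digits):
    """Radix-2 value of an MSB-first digit list."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def serial_dot(weights, digit_streams):
    """MSB-first serial evaluation of sum(W_i * A_i).

    Each cycle consumes one digit per activation, forms the partial
    products W_i * digit, sums them, and adds the result to the
    accumulator after shifting it by one position.
    """
    acc = 0
    for column in zip(*digit_streams):      # one digit column per cycle
        partial_sum = sum(w * d for w, d in zip(weights, column))
        acc = 2 * acc + partial_sum         # shift, then accumulate
    return acc


a0 = bsd_digits("1011", "1111")   # digits 1, 0, 1, 1
a1 = bsd_digits("1100", "1110")   # digits 1, 1, 0, -1
print(digits_value(a0), digits_value(a1))        # both encode +11
print(serial_dot([3, -2], [a0, [0, 1, 0, 1]]))   # 3 * 11 - 2 * 5
```

The same loop covers both cases: with digits restricted to {0, 1} it behaves like the binary bit-serial engine, and with digits in {-1, 0, 1} it behaves like the BSD-serial one.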

The BSD-serial processing approach performs inference with a minimized latency (or maximized frequency) independent of weight bit-precision, based on limited carry propagation in RDNS adders. Fig. 2 illustrates an N-digit RDNS adder, featuring an overall architecture similar to carry-save addition.
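To illustrate why carry propagation is limited in radix-2 signed-digit arithmetic, the sketch below implements a textbook-style addition rule in which each result digit depends only on the operand digits at its own position and the position directly below. It is a software illustration under that assumption and is not claimed to match the circuit of Fig. 2; the function names are hypothetical.

```python
from itertools import product


def bsd_add(x, y):
    """Add two equal-length BSD numbers (MSB-first digits in {-1, 0, 1}).

    Returns len(x) + 1 digits. The transfer out of a position is chosen
    by looking at the operand digits one position below, so no carry
    chain forms regardless of the word length.
    """
    xs, ys = x[::-1], y[::-1]           # work LSB first
    n = len(xs)
    interim = [0] * (n + 1)
    transfer = [0] * (n + 1)            # transfer[i] enters position i
    for i in range(n):
        p = xs[i] + ys[i]
        # True when the position below cannot emit a negative transfer.
        lower_nonneg = i == 0 or (xs[i - 1] >= 0 and ys[i - 1] >= 0)
        if p == 2:
            t, w = 1, 0
        elif p == 1:
            t, w = (1, -1) if lower_nonneg else (0, 1)
        elif p == 0:
            t, w = 0, 0
        elif p == -1:
            t, w = (0, -1) if lower_nonneg else (-1, 1)
        else:                           # p == -2
            t, w = -1, 0
        interim[i] = w
        transfer[i + 1] = t
    total = [interim[i] + transfer[i] for i in range(n + 1)]
    return total[::-1]


def value(digits):
    """Radix-2 value of an MSB-first digit list."""
    acc = 0
    for d in digits:
        acc = 2 * acc + d
    return acc


# Exhaustive check over all pairs of 4-digit operands.
for x, y in product(product((-1, 0, 1), repeat=4), repeat=2):
    s = bsd_add(list(x), list(y))
    assert value(s) == value(x) + value(y)
    assert all(d in (-1, 0, 1) for d in s)

print(value(bsd_add([1, 0, 1, 1], [1, 1, 0, -1])))   # 11 + 11
```

Since every output digit is computed from a fixed, small neighborhood of input digits, the adder delay in such a scheme does not grow with operand width, which is the property the text relies on.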

Fig. 1. Different processing approaches in accelerator PEs: (a) BNS Bit-serial; (b) RDNS BSD-serial.


Fig. 2. N-digit RDNS adder based on carry-save addition.


IV. Detailed Design of PEs

The BSD-serial PE architecture is composed of three main subunits: an RDNS partial product generator (R-PPG), an RDNS compressor (R-C), and an RDNS accumulator (R-AC). Fig. 3 illustrates the overall BSD-serial architecture, which extends the binary bit-serial PE to RDNS with 16-bit precision activation and weight inputs. Activations arrive in RDNS while weights are in binary 2's complement. In the R-C and R-AC, binary adders are replaced with carry-save adders, which decrease latency by eliminating carry propagation. The R-PPG receives 16 $\times$ 16-bit synapse weights in binary and 16 BSD-serial activations. It generates 16 partial products, each 16 BSD digits wide, and feeds them to the $X_{0}$ to $X_{15}$ inputs of the R-C, which generates the compressed partial sum output. The R-AC operates in 16-cycle loops, starting with a zero residual, and accumulates the compressed partial sums produced by the R-C.

Fig. 3. Overall BSD-Serial PE Architecture.


Fig. 4 summarizes the proposed cross-layer workflow, including the tasks required in the different layers for lifetime evaluation. We first describe the Bit-serial and BSD-serial architectures in RTL and synthesize them using a 28 nm cell library at a nominal voltage of 0.9 V to produce the netlist and standard delay format (SDF) files. Table 1 summarizes the reports generated by the Design Compiler synthesis tool. In parallel, different DNN models, e.g., ResNet18 on the ImageNet dataset, are deployed in Python to profile the evaluation benchmarks, including weights and activations. Next, post-synthesis cycle-accurate simulations are run on the benchmarks to produce timing reports and value change dump (VCD) outputs. The VCDs are then analyzed with a power analysis tool (PrimeTime) to generate power and stress (signal probability) reports. After that, the maximum temperature is estimated from the measured power and area using the HotSpot tool, based on Eq. (3). Finally, we use MATLAB to predict degradation and lifetime for the different architectures and workloads, considering all factors.
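The stress-extraction step can be illustrated with a toy example. The sketch below computes the signal probability of one net from a list of value changes; the list format is a simplified stand-in for what a VCD file records, and the function name and trace are hypothetical rather than the output of any tool in the flow.

```python
def signal_probability(changes, end_time):
    """Fraction of the interval [0, end_time) that a signal spends at logic 1.

    `changes` is a time-ordered list of (time, value) pairs that starts
    at time 0, a simplified stand-in for the value changes of one net.
    """
    high_time = 0
    boundaries = changes[1:] + [(end_time, None)]
    for (start, value), (stop, _) in zip(changes, boundaries):
        if value == 1:
            high_time += stop - start
    return high_time / end_time


# Hypothetical trace: low for 10, high for 30, low for 30, high for 30.
trace = [(0, 0), (10, 1), (40, 0), (70, 1)]
sp = signal_probability(trace, end_time=100)
print(f"SP = {sp:.2f}, 1 - SP = {1 - sp:.2f}")   # SP = 0.60, 1 - SP = 0.40
```

Here $SP$ would feed the PBTI estimate and $1-SP$ the NBTI estimate, following the definitions in Section II.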

Table 1. Summary of synthesis results for one PE

| Feature | Bit-Serial | BSD-Serial | Improvement |
|---|---|---|---|
| Area (mm²) | 0.002780 | 0.003720 | -25% |
| Leakage power | 25 μW | 27.6 μW | -10% |
| Dynamic power | 246 μW | 310 μW | -26% |
| Cycle time (latency) | 2.81 ns | 1.37 ns | +51% |
| Maximum frequency | 355 MHz | 730 MHz | +106% |
| Max. performance/area (MHz/mm²) | 127697 | 196236 | +53% |
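The derived rows of Table 1 can be cross-checked from the primary ones. The short sketch below assumes that maximum frequency is the reciprocal of the cycle time and that the performance/area metric is the maximum frequency in MHz divided by the area in mm²; the second definition is inferred from the reported numbers rather than stated in the text.

```python
# Values copied from Table 1.
cycle_ns = {"Bit-Serial": 2.81, "BSD-Serial": 1.37}
freq_mhz = {"Bit-Serial": 355.0, "BSD-Serial": 730.0}
area_mm2 = {"Bit-Serial": 0.002780, "BSD-Serial": 0.003720}


def max_freq_mhz(cycle_time_ns):
    """Maximum clock frequency in MHz for a given cycle time in ns."""
    return 1e3 / cycle_time_ns


def perf_per_area(f_mhz, a_mm2):
    """Frequency per unit area in MHz/mm^2 (assumed definition of the metric)."""
    return f_mhz / a_mm2


ppa = {name: perf_per_area(freq_mhz[name], area_mm2[name]) for name in freq_mhz}
gain = ppa["BSD-Serial"] / ppa["Bit-Serial"] - 1.0

for name in freq_mhz:
    print(f"{name}: {max_freq_mhz(cycle_ns[name]):.1f} MHz from cycle time, "
          f"{ppa[name]:.0f} MHz/mm^2")
print(f"performance/area gain: {gain:.1%}")
```

Under these assumptions the recomputed frequencies and performance/area values agree with the table to within rounding.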

Fig. 4. The proposed cross-layer evaluation workflow.


This workflow is cross-layer because it analyzes interactions across multiple layers. It differs from conventional ASIC design flows mainly in that it adds BTI stress estimation and lifetime prediction tasks. In this study, LeNet5 on the MNIST dataset is used for the primary evaluation. VGG16 and ResNet18 on the ImageNet dataset are used for supplementary evaluation and to extend the experimental results.

1. Experimental Results

In this section, we first evaluate how the architecture and number system affect the internal stress of the PEs while running different DNN models. Fig. 5 shows histograms of the stress ($SP$) for the target designs running different DNNs. Computing in RDNS makes the stress and recovery phases more balanced. Moreover, Fig. 6 shows the average stress across 16 different filters of ResNet18 for 10 different input images. On average, the BSD-serial architecture exhibits 36% lower stress than the conventional Bit-serial PE. As mentioned, both $SP$ and $1-SP$ are considered in the comparison.

Next, we explore the effect of stress on aging and lifetime estimation for the bit-serial and BSD-serial designs. For clarity, we calculate the $V_{th}$ degradation for successive runs of LeNet5, VGG16, and ResNet18 on the introduced PEs. We assume the worst-case temperature and a fixed $V_{dd}$ across the PE matrices, so only stress varies along the DNN executions. According to the synthesis results, the temperature is almost equal for the Bit-serial and BSD-serial designs because area and power increase by similar proportions (see Eq. (3)). Fig. 7 illustrates the NBTI-induced aging degradation when running different DNNs, in the form of a $V_{th}$ shift that starts from 0 and increases up to 0.03 V (10% of the initial value). Moreover, Fig. 8 illustrates the lifetime improvement in MTTF considering the stress variations among the designs. As seen, the BSD-serial design extends the lifetime by 35.5% compared to the baseline Bit-serial design because it balances BTI stress.

Fig. 5. Histograms of stress variations among architectures running LeNet5 (MNIST), VGG16, and ResNet18 (ImageNet).


Fig. 6. BTI stress variations among different architectures, filters, and input images running ResNet-18.


Fig. 7. NBTI degradation in target architectures running different DNN filters considering stress variations.


Fig. 8. Lifetime comparison among designs running LeNet5, VGG16, and ResNet18 filters considering stress variations.


2. Discussion

Table 2 summarizes the evaluation results, which were obtained from cycle-accurate post-synthesis simulations running diverse DNN models and datasets. The BSD-serial design shows more balanced BTI stress than the baseline, in the form of a 36% reduction in duty cycle or signal probability ($SP$). Owing to this decrease in stress, the BSD-serial PE mitigates $V_{th}$ degradation relative to the bit-serial baseline (7%), which yields an almost 35.5% lifetime extension.

Table 2. Summary of simulation results

| Metric | Architecture | LeNet5 | VGG16 | ResNet18 | Overall |
|---|---|---|---|---|---|
| BTI stress, duty cycle (SP) | Bit-serial | 0.75 | 0.70 | 0.68 | 0.71 |
| | BSD-serial | 0.52 | 0.52 | 0.53 | 0.52 |
| | Improvement | 37% | 35% | 33% | 36% |
| Degradation (mV) during 10 years | Bit-serial | 29.58 | 29.23 | 29.14 | 29.31 |
| | BSD-serial | 27.83 | 27.86 | 27.90 | 27.86 |
| | Improvement | - | - | - | 32% |
| Lifetime (years) | Bit-serial | 10.89 | 11.69 | 11.92 | 11.50 |
| | BSD-serial | 15.69 | 15.61 | 15.47 | 15.59 |
| | Improvement | - | - | - | 35.5% |
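The lifetime rows of Table 2 can be related to the degradation rows through Eq. (1). As a sketch, assume the shift grows as $t^{n}$ with $n=1/6$ at fixed stress, temperature, and voltage, and that failure occurs at a 30 mV shift (the 0.03 V limit mentioned above). The lifetime is then the 10-year horizon scaled by the sixth power of the ratio between the limit and the 10-year degradation. The helper name is hypothetical.

```python
N = 1.0 / 6.0          # time exponent used in the paper
LIMIT_MV = 30.0        # failure threshold: 0.03 V shift
HORIZON_YEARS = 10.0   # Table 2 reports degradation after 10 years

# (degradation in mV after 10 years, reported lifetime in years), from Table 2
table2 = {
    ("Bit-serial", "LeNet5"): (29.58, 10.89),
    ("Bit-serial", "VGG16"): (29.23, 11.69),
    ("Bit-serial", "ResNet18"): (29.14, 11.92),
    ("BSD-serial", "LeNet5"): (27.83, 15.69),
    ("BSD-serial", "VGG16"): (27.86, 15.61),
    ("BSD-serial", "ResNet18"): (27.90, 15.47),
}


def lifetime_years(deg_mv):
    """Time at which a shift growing as t^N reaches LIMIT_MV."""
    return HORIZON_YEARS * (LIMIT_MV / deg_mv) ** (1.0 / N)


for (arch, dnn), (deg_mv, reported) in table2.items():
    est = lifetime_years(deg_mv)
    print(f"{arch:10s} {dnn:8s} estimated {est:5.2f} y, reported {reported:5.2f} y")
```

Under these assumptions the estimates land within a few hundredths of a year of the reported lifetimes, which suggests that the reported values follow the $t^{1/6}$ model of Eq. (1).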

VI. Conclusion

Hardware accelerators have shown high computation efficiency, making them the first choice for DNN acceleration on edge devices. Considering the array of processing elements (PEs), with its multiply-accumulate functionality, as the heart of DNN accelerators, this research combines computing in RDNS with a serial processing approach to extend lifetime resilience and investigates the BTI stress variations on the accelerator's data path. The proposed technique extends the lifetime by 35.5% compared to the conventional bit-serial PE by reducing stress by 36% on average.

For future research, we aim to extend computing in RDNS, and the BSD-serial computing concept in particular, to emerging DNN accelerators with systolic array architectures to improve their lifetime resilience. Additionally, we will explore operating voltage scaling, leveraging the performance gained by computing in RDNS to further improve lifetime resilience. Ultimately, we intend to expand the lifetime-extension capability of RDNS to support both the training and inference computing phases.

ACKNOWLEDGMENTS

This work was supported in part by the research fund of Chungnam National University, and in part by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2022R1A5A8026986).

References

[1] S. Alcaide, L. Kosmidis, C. Hernandez, and J. Abella, “High-integrity GPU designs for critical real-time automotive systems,” in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2019, pp. 824-829.
[2] C. Adams, A. Spain, J. Parker, M. Hevert, J. Roach, and D. Cotten, “Towards an integrated GPU accelerated SoC as a flight computer for small satellites,” in 2019 IEEE Aerospace Conference, 2019, pp. 1-7.
[3] I. Moghaddasi, S. Gorgin, and J.-A. Lee, “Dependable DNN Accelerator for Safety-critical Systems: A Review on the Aging Perspective,” IEEE Access, 2023.
[4] A. Arunachalam, S. Kundu, A. Rahat, S. Banerjee, and K. Basu, “Fault Resilience of DNN Accelerators for Compressed Sensor Inputs,” in 2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2022, pp. 329-332.
[5] M. Riera, J. M. Arnau, and A. Gonzalez, “DNN pruning with principal component analysis and connection importance estimation,” Journal of Systems Architecture, vol. 122, p. 102336, 2022.
[6] D. S. Huang et al., “Comprehensive device and product level reliability studies on advanced CMOS technologies featuring 7nm high-k metal gate FinFET transistors,” in 2018 IEEE International Reliability Physics Symposium (IRPS), 2018, pp. 6F-7.
[7] C. Liu et al., “Systematical study of 14nm FinFET reliability: From device level stress to product HTOL,” in 2015 IEEE International Reliability Physics Symposium, 2015, pp. 2F-3.
[8] I. Hill, P. Chanawala, R. Singh, S. A. Sheikholeslam, and A. Ivanov, “CMOS Reliability from Past to Future: A Survey of Requirements, Trends, and Prediction Methods,” IEEE Transactions on Device and Materials Reliability, vol. 22, no. 1, pp. 1-18, Mar. 2022. doi: 10.1109/TDMR.2021.3131345
[9] I. Moghaddasi, A. Fouman, M. E. Salehi, and M. Kargahi, “Instruction-level NBTI stress estimation and its application in runtime aging prediction for embedded processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, no. 8, pp. 1427-1437, 2018.
[10] W. Wang, S. Yang, S. Bhardwaj, S. Vrudhula, F. Liu, and Y. Cao, “The impact of NBTI effect on combinational circuit: Modeling, simulation, and analysis,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 2, pp. 173-183, 2009.
[11] G. Zervakis et al., “Thermal-aware design for approximate DNN accelerators,” IEEE Transactions on Computers, vol. 71, no. 10, pp. 2687-2697, 2022.
[12] M. A. Hanif and M. Shafique, “DNN-Life: An Energy-Efficient Aging Mitigation Framework for Improving the Lifetime of On-Chip Weight Memories in Deep Neural Network Hardware Architectures,” in Proceedings of Design, Automation and Test in Europe (DATE), Feb. 2021, pp. 729-734. doi: 10.23919/DATE51398.2021.9473943
[13] N. Landeros Muñoz, A. Valero, R. G. Tejero, and D. Zoni, “Gated-CNN: Combating NBTI and HCI aging effects in on-chip activation memories of Convolutional Neural Network accelerators,” Journal of Systems Architecture, vol. 128, Jul. 2022. doi: 10.1016/j.sysarc.2022.102553
[14] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bit-serial deep neural network computing,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1-12.
[15] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, “UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision,” IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 173-185, 2018.
[16] M. Capra, F. Conti, and M. Martina, “A Multi-Precision Bit-Serial Hardware Accelerator IP for Deep Learning Enabled Internet-of-Things,” in 2021 IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 192-197.
[17] V. Sakellariou, V. Paliouras, I. Kouretas, H. Saleh, and T. Stouraitis, “A multiplier-Free RNS-Based CNN accelerator exploiting bit-Level sparsity,” IEEE Transactions on Emerging Topics in Computing, pp. 1-16, 2023. doi: 10.1109/TETC.2023.3301590
[18] G. Alsuhli, V. Sakellariou, H. Saleh, M. Al-Qutayri, B. Mohammad, and T. Stouraitis, “Conventional Number Systems for DNN Architectures,” in Number Systems for Deep Neural Network Architectures, Springer, 2023, pp. 17-25.
[19] G. Jaberipur, “Redundant number system-based arithmetic circuits,” Arithmetic Circuits for DSP Applications, pp. 273-312, 2017.
[20] F. Oboril, F. Firouzi, S. Kiamehr, and M. Tahoori, “Reducing NBTI-induced processor wearout by exploiting the timing slack of instructions,” in Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, 2012, pp. 443-452.
[21] Y. Chen, A. Calimera, E. Macii, and M. Poncino, “Characterizing the activity factor in NBTI aging models for embedded cores,” in Proceedings of the 25th edition on Great Lakes Symposium on VLSI, 2015, pp. 75-78.
[22] V. B. Kleeberger, M. Barke, C. Werner, D. Schmitt-Landsiedel, and U. Schlichtmann, “A compact model for NBTI degradation and recovery under use-profile variations and its application to aging analysis of digital integrated circuits,” Microelectronics Reliability, vol. 54, no. 6, pp. 1083-1089, 2014.
[23] M. Pedram and S. Nazarian, “Thermal modeling, analysis, and management in VLSI circuits: Principles and methods,” Proceedings of the IEEE, vol. 94, no. 8, pp. 1487-1501, 2006.
[24] J. W. McPherson, “Time-to-failure modeling,” Reliability Physics and Engineering: Time-To-Failure Modeling, pp. 37-49, 2013.
Iraj Moghaddasi

Iraj Moghaddasi received the B.Sc. degree in computer engineering from Shahid Beheshti University, Tehran, in September 2000, the M.Sc. degree in computer engineering from the Iran University of Science and Technology, Tehran, in May 2003, and the Ph.D. degree from the School of Electrical and Computer Engineering, University of Tehran, Tehran, in September 2018. From 2019 to 2021, he was a Research Associate with the Iran Telecommunication Research Center (ITRC). He is currently a Postdoctoral Researcher at Chungnam National University, Daejeon, Korea. His research interests include computer architecture, reliable and high-performance computing, hardware modeling & architectural exploration of DNN edge accelerators, embedded systems, and processing in memory for computation-efficient machine learning.

Byeong-Gyu Nam

Byeong-Gyu Nam (Senior Member, IEEE) received his B.S. degree (summa cum laude) in computer engineering from Kyungpook National University, Daegu, Korea, in 1999, M.S. and Ph.D. degrees in electrical engineering and computer science from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2001 and 2007, respectively. Dr. Nam is currently with Chungnam National University, Daejeon, Korea, as a professor. His current interests include machine learning processors, graphics processors, and low-power SoC design. He has served as the Chair of the Digital Architectures and Systems (DAS) subcommittee of ISSCC from 2017 to 2019 and was a member of the TPC for IEEE ISSCC, IEEE A-SSCC, IEEE COOL Chips, and ASP-DAC. He served as an Associate Editor of the IEIE Journal of Semiconductor Technology and Science (JSTS) and a Guest Editor for the IEEE Journal of Solid-State Circuits (JSSC) in 2013.