
1. (Inter-University Semiconductor Research Center, Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Korea)
2. (Department of Electrical and Information Engineering and the Research Center for Electrical and Information Technology, Seoul National University of Science and Technology, Seoul 01811, Korea)

Index Terms: Deep learning, approximate DRAM, low power DRAM, bit-level refresh, fine-grained refresh

## I. INTRODUCTION

Deep learning applications require large amounts of computation with heavy data traffic to and from memory (1). Therefore, DRAM energy consumption accounts for a substantial portion of the total system energy. Besides the power consumed by data accesses, a significant portion of DRAM power is spent on refresh operations. Because DRAM cells cannot retain stored data permanently, the data must be read out periodically and written back into the same memory cells. This refresh operation is required for all cells in a DRAM, whether or not they store meaningful data. Therefore, refresh results in significant power consumption even when certain DRAM cells do not store data accessed by an active process in the processor. As the DRAM density increases, the refresh power becomes prominent. In a future 64 Gb DRAM device, the refresh operation is expected to account for up to 50% of the total power consumption (2).

Accordingly, various refresh management techniques have been proposed to reduce power consumption (2-7). In (2-4), the retention time of each DRAM row is profiled, and the DRAM controller uses this information to selectively perform refresh operations for each row. These techniques significantly reduce refresh power consumption by skipping unnecessary refresh commands. However, as memory sizes increase, storing the retention time information of all DRAM rows is neither cost-effective nor scalable. Moreover, profiling the retention time of all DRAM cells requires a significant amount of effort and may yield incorrect results, as it must deal with Variable Retention Time and Data Pattern Dependencies (8). The studies in (5) and (6) propose another approach that reduces refresh power by tolerating a small number of errors in DRAM cells. In (5), Flikker provides a software solution for approximate memory by partitioning data into critical and non-critical parts. It leverages Partial-Array Self-Refresh (PASR) (9), which supports different refresh rates for different sections of the same DRAM bank. However, since all bits of non-critical data are unprotected, an erroneous most significant bit (MSB) of floating-point data may change its value on a very large scale, which causes overflow in computation. Providing finer-grained refresh control than (5), Sparkk (6) uses varying refresh periods for different bits based on their importance. Sparkk protects the critical bits of integer data, thus making applications more robust to errors than the method proposed by Flikker; however, it lacks software support and provides no power measurement or performance evaluation based on architectural simulation.
Different from Sparkk and Flikker, the studies in (7) and (18) propose a transposed approximate memory architecture. A row-level refresh scheme refreshes data bits according to their criticality. This scheme saves 69.68% of the refresh power with a negligible accuracy drop. However, a block of data needs to be buffered and transposed in the memory controller before being written to external memory, which potentially causes long data-access latency.

It is widely known that deep learning applications are error-tolerant (10,11). Based on this property, this paper proposes an approximate DRAM with a fine-grained refresh scheme. The proposed scheme maps bits of data to different DRAM devices according to their bit significance. The critical bits are stored in more frequently refreshed devices to prevent an accuracy drop, whereas the non-critical bits are mapped to less frequently refreshed devices, thereby saving refresh power considerably. The simulation results show that the proposed approximate DRAM can reduce the refresh power consumption by up to 66.5% with negligible degradation in accuracy. It is worth mentioning that this paper extends the previous work in (28). In this paper, the hybrid memory architecture and evaluation methodology are elaborated in more detail. In addition, an evaluation of VGGNet inference with real approximate memory on an FPGA platform is added to verify the correctness of the assumed error injection model.

The rest of this paper is organized as follows. Section II presents the proposed approximate architecture. Section III describes the evaluation methodology. In Section IV, the evaluation results are provided, and finally the conclusions are presented in Section V.

## II. Proposed Approximate DRAM

The main goal of the proposed approximate memory architecture is to apply different refresh rates to the different bits of data depending on the significance of the bits. The critical bits (i.e., sign bit and exponent bits) should be placed in the error-free zone with the normal refresh rate. On the other hand, the non-critical bits (i.e., mantissa bits) are placed in the erroneous zone with a slower refresh rate to achieve power saving during refresh operations.

Fig. 1. Approximate memory architecture: APPROX2. Dark device: precise, dotted white device: approximate.

A deep learning application, in general, accesses 32-bit floating-point data (12). The effectiveness of an approximate memory architecture depends on separating important data from the rest and refreshing them at the normal rate, while the insignificant data are refreshed at a much slower rate, which may cause errors. The finer-grained the critical data partitioning is, the more refresh reduction it can achieve. This paper proposes an approximate DRAM architecture, as shown in Fig. 1, which applies a fine-grained refresh scheme while protecting the 9 critical bits. This architecture, so-called APPROX2, refreshes data at a granularity of 2 bits. The least significant bits (LSBs) of the data are refreshed at a slower rate than the MSBs. To reduce design effort and hardware overhead while maintaining performance, this paper applies the refresh scheme at DRAM device granularity. The DRAM devices for critical data are called precise devices and are marked with a dark color in Fig. 1. The precise devices are refreshed at the normal rate. On the other hand, the DRAM devices for non-critical data are called approximate devices and are marked with a dotted white color in Fig. 1. It should be noted that, for simplicity, the refresh period of the approximate devices is a multiple of the normal refresh period (tREF = 64 ms). In a conventional DRAM module, when the memory controller issues a refresh command, all the DRAM devices in a rank are refreshed simultaneously, and thus at the same rate. To skip refresh operations for the approximate devices, the proposed DRAM architecture requires separate chip select signals for the devices in the same rank, but this separation incurs only a small area overhead. Furthermore, the memory controller design for the proposed architecture does not change much from the conventional design.
To this end, in the new memory controller design, besides the refresh interval counter (tREFI = 7.8 μs) and the row counter, a retention time round counter (in multiples of 64 ms) is needed to decide whether a specific device needs to be refreshed during the current 64-ms round.
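The round-counter decision above can be sketched in a few lines (an illustrative Python sketch, not the authors' controller logic; the function name and period values are ours). Since every refresh period is a multiple of 64 ms, a device is refreshed in a given round exactly when the round index is divisible by its period expressed in rounds:

```python
# Sketch of the per-device refresh decision (hypothetical names, not the
# paper's RTL). Each 64-ms round, a device is refreshed only if its refresh
# period (a multiple of the 64-ms base) divides the current round index.

def devices_to_refresh(round_idx, periods_ms, base_ms=64):
    """Return indices of the devices to refresh in this 64-ms round."""
    selected = []
    for dev, period in enumerate(periods_ms):
        rounds = period // base_ms        # refresh period in 64-ms rounds
        if round_idx % rounds == 0:
            selected.append(dev)
    return selected

# Precise devices (chips 11-15) keep the normal 64-ms period, so they are
# refreshed every round; an approximate device with a 256-ms period is
# refreshed only every 4th round.
periods = [256] * 11 + [64] * 5
print(devices_to_refresh(1, periods))   # → [11, 12, 13, 14, 15]
```

In round 0 (and every 4th round) all 16 devices are selected; in the other rounds only the five precise devices are, which is where the refresh skipping comes from.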

In most computing systems, the cache line size is 64 bytes. To optimize the latency when reading or writing cache lines from/to memory, one cache line should be fit in one DRAM full burst access (BL = 8). Since this paper focuses on the approximate memory for the class of deep learning applications, it is assumed that the cache line that is evicted/loaded to/from the approximate DRAM contains only single-precision floating-point data. Therefore, a 64-byte cache line contains 16 data words.

Since the refresh period is different for each group of 2 nearby bits, $32/2=16$ devices are needed to form a rank. To keep the bit width of the DRAM rank unchanged (64 bits), ${\times}$4 DRAM devices are used. The data mapping of APPROX2 is also shown in Fig. 1. The two corresponding bits of each data word are merged and stored in the proper device. For example, {DATA1[31:30], DATA0[31:30]} are stored in chip 15, {DATA1[29:28], DATA0[29:28]} are stored in chip 14, and so on. To keep the 9 critical bits (i.e., DATA[31:23]) protected, chips 11, 12, 13, 14, and 15 must be error-free, and are therefore refreshed at the normal rate. The other chips may contain erroneous bits and are refreshed at a lower frequency. The refresh policy for APPROX2 is expressed as follows:

##### (1)
$RP(n)=\begin{cases}64 & \text{for } 11 \leq n \leq 15 \\ (10-n)\cdot incr + offset & \text{for } 0 \leq n \leq 10\end{cases}$

where RP(n) represents the refresh period of the n-th device and (incr, offset) are chosen experimentally. Therefore, a maximum refresh power saving of 11/16 = 68.75% can be achieved when the refresh operation of the approximate devices is turned off completely. Further evaluation with simulation results is provided in Section IV.
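The bit-to-device mapping and the refresh policy of Eq. (1) can be checked with a short sketch (illustrative Python; the function names are ours, and (incr, offset) = (256, 1024) ms is one of the settings evaluated in Section IV):

```python
# Illustrative sketch of APPROX2 (names are ours, not the paper's):
# bits [2n+1:2n] of every 32-bit word map to device n, and each device's
# refresh period follows Eq. (1).

def chip_for_bits(bit_lo):
    """Device index holding the 2-bit field DATA[bit_lo+1:bit_lo]."""
    assert bit_lo % 2 == 0
    return bit_lo // 2

def refresh_period(n, incr, offset):
    """Refresh period RP(n) in ms of device n per Eq. (1)."""
    if 11 <= n <= 15:
        return 64                      # precise devices: normal rate
    return (10 - n) * incr + offset    # approximate devices: slower

# The 9 critical bits DATA[31:23] (plus bit 22, which shares chip 11 with
# bit 23) land in chips 11..15, i.e., the precise devices:
assert {chip_for_bits(b) for b in range(22, 32, 2)} == {11, 12, 13, 14, 15}

# With (incr, offset) = (256, 1024) ms, the LSB device is refreshed every
# 3,584 ms while the precise devices keep the normal 64-ms period.
print(refresh_period(0, 256, 1024), refresh_period(15, 256, 1024))  # → 3584 64
```

Note that protecting bit 23 at device granularity forces chip 11 to be precise, which is why 10 bits (chips 11 to 15, 2 bits each) end up refreshed at the normal rate even though only 9 bits are critical.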

The instruction set architecture (ISA), operating system (OS), and compiler support for systems with approximate memory are well described in (5) and (13). For the experiments in this study, a custom memory allocator is built to allocate approximate data to the approximate memory. Similar to (5), the critical data partitioning is done by the programmer. Non-critical data are allocated in the approximate memory space, whereas critical data such as instruction code are stored in the precise memory space. In deep learning applications, slight errors in the weights and feature maps might not degrade the performance significantly. Therefore, they are non-critical data and can be stored in the approximate memory; this study uses the aforementioned memory allocator to place them there. A pintool is written based on the Pin dynamic-instrumentation tool (14) to calculate the number of approximate pages over the total number of pages accessed by the program. It monitors all memory accesses of the program to the approximate memory space and accumulates the total number of accessed approximate pages. The memory footprint profiling results of several deep networks, AlexNet (21), VGGNet (15) and GoogLeNet (16), listed in Table 1, show that the portion of critical data in the total memory access is very small. Therefore, the refresh power can be considerably reduced by using the approximate memory. A practical hybrid memory system may contain both precise devices and approximate devices, as depicted in Fig. 2.
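The pintool's bookkeeping reduces to counting distinct pages. A simplified sketch of that accounting (our restatement, with made-up page size handling and a toy address trace; the real tool instruments every memory access through Pin):

```python
# Sketch of the approximate-page accounting (our simplification of the
# Pin-based profiler): count distinct 4-KB pages touched inside the
# approximate region versus all pages touched by the program.

PAGE = 4096

def approx_page_ratio(accesses, approx_lo, approx_hi):
    """accesses: iterable of virtual addresses; [approx_lo, approx_hi) is
    the approximate memory range. Returns the fraction of approximate
    pages among all accessed pages."""
    all_pages, approx_pages = set(), set()
    for addr in accesses:
        page = addr // PAGE
        all_pages.add(page)
        if approx_lo <= addr < approx_hi:
            approx_pages.add(page)
    return len(approx_pages) / len(all_pages)

# Toy trace: 3 pages inside the approximate range, 1 page (e.g., code)
# outside it.
trace = [0x1000, 0x2000, 0x3000, 0x100000]
print(approx_page_ratio(trace, 0x0, 0x4000))   # → 0.75
```

For the CNN workloads in Table 1 this ratio exceeds 99.6%, since almost all accessed pages hold weights and feature maps.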

Table 1. Memory footprint profiling from (18)

| Deep Networks | AlexNet | VGGNet | GoogLeNet |
|---|---|---|---|
| Proportion of approximate pages | 99.78% | 99.85% | 99.67% |

Fig. 2. Hybrid memory system.

Table 2. System configuration

| Parameters | Values |
|---|---|
| Processor | 4 out-of-order cores, 2.53 GHz, 8-issue per core, 128-entry instruction queue, 64-entry ROB, x86 ISA support |
| L1I, L1D cache | 16 KB per core, 4-way associative |
| L2 cache | 1 MB shared cache, 4-bank, 16-way associative |
| Memory Controller | Single channel, open page, FR-FCFS, 64-entry queues per rank |
| DRAM | 16 GB DDR3-1333 using 4 Gb ×4, ×8 Micron devices |

## III. Evaluation Methodology

### 1. Architectural Simulation

To estimate the performance and power consumption of the proposed system when running deep learning applications, a cycle-accurate integrated simulator (17) based on Pin (14) binary instrumentation is used; it combines fast simulation with a detailed micro-architecture model and a detailed DRAM memory subsystem. The memory bandwidth and detailed power consumption of the memory subsystem can be obtained; the DRAM power consumption is calculated following the methodology described in (19). The front-end Pin-based simulator is enhanced to emulate the OS kernel functionality for custom memory allocation. The non-critical data of the CNN programs are allocated to approximate virtual pages, which are implicitly assigned to the approximate physical page area. The system configuration is described in Table 2. The system has 4 out-of-order processor cores sharing the L2 cache, and each core has its own L1 instruction and data caches. A single memory channel is connected to a 16 GB Micron DDR3-1333 memory module consisting of ${\times}$4 and ${\times}$8 4 Gb devices.

Based on the pre-trained models from Caffe library (20), this study builds the inference code of some state-of-the-art CNNs such as AlexNet (21), VGGNet (15), and GoogLeNet (16) for performing architectural simulation and power measurement. The separation of approximate data from the normal data may cause performance overhead, because it potentially impacts locality and parallelism. How this separation affects the system performance is presented in Section IV.

### 2. Simulation of Deep Learning Inference with Bit Error Injection

Recent research (22) has shown that the retention time is not equal for all DRAM cells: most cells (called strong cells) have a high retention time, while only a few cells (called leaky cells) have a low retention time. This research also shows that DDR3 DRAM cells can hold their values longer than 64 ms. At extremely high temperatures such as 80 $^{\circ}$C and 90 $^{\circ}$C, the bit flip rate increases exponentially. The bit error probabilities for the customized refresh periods used in this paper are taken from (7). Each bit of the approximate data is injected with an error with the probability corresponding to its criticality; thus, the simulation of the proposed architecture described in Section II injects errors at bit granularity.
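The per-bit injection step can be sketched as follows (our model of the procedure; the per-bit flip probabilities would come from the retention measurements in (7,22), and the values below are placeholders, not measured data):

```python
# Sketch of per-bit error injection into fp32 data (our restatement of the
# procedure; the probabilities p[b] are made-up placeholders, not the
# measured values from (7,22)).
import random
import struct

def inject_bit_errors(value, p, rng):
    """Flip each bit b of the fp32 encoding of `value` with probability
    p[b]. Critical bits (31..23) are protected, so p[b] = 0 there."""
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    for b in range(32):
        if rng.random() < p[b]:
            bits ^= (1 << b)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

rng = random.Random(0)
# Mantissa bits get a small flip probability; critical bits get none.
p = [1e-3] * 23 + [0.0] * 9
noisy = inject_bit_errors(1.0, p, rng)
# Sign and exponent are untouched, so the error stays within the mantissa:
assert 1.0 <= noisy < 2.0
```

Because the sign and exponent are protected, any injected error perturbs only the mantissa, which is exactly why the approximate scheme avoids large-scale value corruption.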

This study uses the pre-trained model from the Caffe library for GoogLeNet (16) as the baseline for analyzing the accuracy effect of approximate memory. The simulation results for other networks such as AlexNet and VGGNet are expected to be similar to those of GoogLeNet, as shown in (7).

Fig. 3. Hardware platform for CNN testing on real approximate DRAM (a) Block diagram, (b) Hardware setup for test.

A set of 10,000 test images from the ImageNet 2012 dataset (23) is used to measure the accuracy of the CNNs using the proposed approximate memory. The inference accuracy of each test is the Top-1 accuracy normalized to the accuracy of the corresponding test without memory approximation. The experiments are conducted with various (offset, incr) values under different operating temperatures.
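The reported metric is simply a ratio of Top-1 accuracies (our restatement; the image counts below are invented for illustration):

```python
# Sketch of the accuracy metric: Top-1 accuracy with approximation,
# normalized to Top-1 accuracy without it. The counts are hypothetical.

def normalized_top1(correct_approx, correct_exact, n_images=10000):
    """Ratio of approximate to exact Top-1 accuracy on n_images tests."""
    return (correct_approx / n_images) / (correct_exact / n_images)

# E.g., 6,650 vs. 6,720 correct classifications out of 10,000:
print(round(normalized_top1(6650, 6720), 4))   # → 0.9896
```

A value near 1.0 means the approximation caused a negligible accuracy drop.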

### 3. CNN Tests on Real Approximate DRAM on FPGA

To verify the practicality of the error injection model, this study uses the hardware platform shown in Fig. 3. An FPGA board is connected to the host computer via a Peripheral Component Interconnect Express (PCIe) port. The Direct Memory Access (DMA) module transfers data between the host computer and the DDR3 DRAM. The memory controller of the DRAM is customized to study the behavior of the approximate DRAM under long refresh periods. Because a device for temperature control is not available, these experiments are conducted at room temperature. At such low temperatures, the refresh period can be set even up to 100 s. A pre-trained model of VGGNet is used in this test. It should be noted that the CNN algorithms run on the host computer, not on the FPGA board.

The role of the approximate DRAM on the FPGA board is to inject error into the CNN model at runtime. For each test, the CNN model is temporarily saved in the approximate DRAM for a refresh period, then copied back to the host PC to run the classification.
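The host-side flow can be sketched as below (illustrative only; `dma_write`/`dma_read` stand in for the board's PCIe DMA driver calls, which the paper does not name, and the loop-back "DRAM" in the usage example injects no errors):

```python
# Sketch of the host-side FPGA test flow (hypothetical driver interface).
import time

def degrade_model(weights_blob, refresh_period_s, dma_write, dma_read):
    """Park the CNN model in the approximate DRAM for one refresh period,
    then read it back with whatever bit errors have accumulated; the host
    then runs classification on the returned (possibly corrupted) model."""
    dma_write(0x0, weights_blob)             # host -> approximate DDR3
    time.sleep(refresh_period_s)             # let cells leak (no refresh)
    return dma_read(0x0, len(weights_blob))  # DDR3 -> host

# Exercising the flow end to end with a fake loop-back "DRAM":
mem = {}
blob = bytes(range(16))
out = degrade_model(blob, 0.0,
                    dma_write=lambda a, d: mem.__setitem__(a, d),
                    dma_read=lambda a, n: mem[a][:n])
print(out == blob)   # → True: the loop-back injects no errors
```

On the real board, the longer the chosen refresh period, the more retention errors accumulate in the blob before it is copied back for classification.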

## IV. Simulation Results

### 1. Performance and Refresh Power Reduction

For the proposed approximate architecture, the reduction of refresh power consumption is derived mathematically from the following equations:

##### (2)
\begin{align} P_{save2} &= 1 - \frac{10 + 64\sum_{n=0}^{10} 2/\left(n \cdot incr + offset\right)}{32} \\ &= 0.6875 - \sum_{n=0}^{10} \frac{4}{n \cdot incr + offset} \end{align}

where (offset, incr) are the aforementioned parameters. The refresh power reduction of the proposed architecture with respect to these parameters is shown in Fig. 4. When the offset is 4,096 ms, the refresh power reduction is almost saturated; thus, for the approximate devices, the refresh period need not be longer than 4,096 ms.
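Eq. (2) can be evaluated numerically for the parameter settings discussed in this section (a short check in Python; the formula is the paper's, the function name is ours):

```python
# Numeric evaluation of Eq. (2): fraction of refresh power saved by
# APPROX2 for a given (offset, incr) pair in ms.

def psave2(offset, incr):
    """Refresh power saving of APPROX2 per Eq. (2)."""
    return 0.6875 - sum(4.0 / (n * incr + offset) for n in range(11))

# (1024, 256) is the setting that also keeps the accuracy loss within 1%
# at 80-90 deg C (Section IV.2); longer offsets approach the 68.75% ceiling.
print(round(psave2(1024, 256), 3))   # → 0.665
print(round(psave2(4096, 256), 3))   # → 0.679

# Projection for future 64 Gb devices, where refresh is 50% of total power
# and 3/4 of memory is approximate, skipping 68.1% of its refreshes:
print(round(0.75 * 0.50 * 0.681, 3))   # → 0.255
```

The first result reproduces the 66.5% refresh saving quoted for the (1024, 256) option, and the last line reproduces the 25.5% total-energy projection for 64 Gb devices.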

The system performance of the three deep networks is listed in Table 3. The baseline represents the results of the inference programs running on conventional memory, whereas the others represent the results of programs running on the proposed approximate memory. The proposed architecture achieves an instructions-per-cycle (IPC) value similar to that of the baseline, because the approximate architecture requires no change in the DRAM's number of I/Os or in the cache line size.

The breakdown of the power consumption is depicted in Fig. 5. The simulation assumes that one-fourth of the memory space is precise and three-fourths is approximate. This assumption is reasonable, as the memory footprint profiling results listed in Table 1 indicate that most of the accessed pages can be approximated. The approximate part is refreshed with a long period, so its refresh power is almost zero. In addition, the power consumption of the proposed scheme is normalized to the baseline, which uses conventional memory. APPROX2 consumes more background power than the baseline because it uses twice as many DRAM chips (${\times}$4 devices) in a rank, and a ${\times}$4 chip consumes more than half the background power of a ${\times}$8 device. However, because the refresh power is drastically reduced, the total power consumption of APPROX2 is less than that of the baseline. It skips 68.1% of the refresh operations for the approximate part while preserving system performance. Fig. 5 shows that APPROX2 saves up to 11.1% of the total power consumption. More saving is expected in the future, when computing systems use larger memory modules composed of higher-density devices such as 16, 32, and 64 Gb memory. As predicted in (2), the refresh power accounts for 50% of the total power consumption when using 64 Gb devices. Therefore, the refresh saving can reach 25.5% (= 0.75${\times}$50%${\times}$68.1%) of the total energy for the hybrid memory system assumed in this experiment.

Fig. 4. Refresh power reduction of APPROX2 w.r.t. (offset, incr).

Table 3. System performance (IPC)

| Networks | Test scenario | IPC |
|---|---|---|
| AlexNet | Baseline | 1.561 |
| AlexNet | APPROX2 | 1.563 |
| VGGNet | Baseline | 1.212 |
| VGGNet | APPROX2 | 1.211 |
| GoogLeNet | Baseline | 1.798 |
| GoogLeNet | APPROX2 | 1.804 |

Fig. 5. Power breakdown of some deep networks.

### 2. Simulation of Classification Tasks with Bit Error Injection

At temperatures lower than 60 $^{\circ}$C, the accuracy loss is negligible with all (offset, incr) pairs. However, the accuracy drop is significant at temperatures higher than 60 $^{\circ}$C. It can be seen in Fig. 6 that with offset = 3,072 or 4,096 ms, the accuracy degrades significantly. In contrast, with offset = 1,024 ms, the accuracy loss is negligible (within 1%) at both 80 $^{\circ}$C and 90 $^{\circ}$C. The refresh saving achieved with the (1024, 256) option is as high as 66.5%.

### 3. CNN Tests on Real Approximate DRAM on FPGA

Fig. 7 shows that the error injection model reproduces the error pattern of the real approximate DRAM well. Fig. 7(a) shows the result of the FPGA test, and Fig. 7(b) is constructed from the test results of the bit error model. The experimental results show that the performance on the real approximate DRAM is slightly better than that with error injection, because the error probabilities used in the bit injection are the worst-case values measured on real approximate DRAM (22). At room temperature, even without critical-bit protection, the refresh period can be prolonged to tens of seconds with very small performance loss.

### 4. Comparison with the Previous Works

This study takes another approach that greatly reduces power consumption by allowing errors to occur in DRAM cells. Previous studies on approximate memory through refresh control are presented in (5) and (6). In (5), the experiments show only a modest reduction of 20-25% in self-refresh power, and only in the idle mode. The research in (6) does not report any power measurements. The transposed architecture in (7) saves 70% of the refresh power consumption at the cost of additional buffers and control logic. In contrast, the proposed approximate memory design requires only separate chip select signals for the DRAM devices in the same rank and a minor change in the memory controller, while reducing the refresh power by 66.5% with negligible performance loss across operating temperatures. It should be noted that the proposed architecture addresses the problems of previous studies by reducing a significant amount of power consumption at various temperatures.

There are various compression methods, such as pruning (24) and quantization (25-27), which aggressively reduce memory accesses, thereby reducing DRAM power consumption. Unlike low-bit quantization, the proposed scheme requires no additional fine-tuning or retraining to preserve the accuracy of the networks. Therefore, the floating-point data of the original GPU-trained CNN model can be used directly in the proposed architecture.

Fig. 6. Accuracy drop of APPROX2 w.r.t (offset, incr) and high temperatures.

Fig. 7. Comparison of VGGNet test results on real approximate DRAM and bit injection simulation (a) FPGA test, (b) Bit injection simulation.

Compared to the relevant research in (7,18), which reduces the total energy of the hybrid memory by 26.0%, the proposed design achieves competitive performance. Moreover, it does not require the additional hardware overhead of multiple transposed buffers and address translation.

## V. CONCLUSIONS

This paper presents an efficient approximate DRAM design that saves 66.5% of the refresh power consumption and up to 25.5% of the total memory power consumption for future 64 Gb devices. The error robustness of the proposed scheme is confirmed by testing at extremely high operating temperatures. It is noteworthy that the proposed approximate DRAM requires only a minor change in the memory controller, without changing the DRAM device or incurring significant additional hardware resources.

### ACKNOWLEDGMENTS

This study was supported by the Research Program funded by SeoulTech (Seoul National University of Science and Technology).

### REFERENCES

1. Deng Z., Xu C., Cai Q., Faraboschi P., Reduced-precision memory value approximation for deep learning. [Online]. Available: http://www.labs.hpe.com/techreports/2015/HPL-2015-100.html
2. Liu J., Jaiyen B., Veras R., Mutlu O., 2012, RAIDR: Retention-aware intelligent DRAM refresh, in Proc. 39th Annu. Int. Symp. Comput. Archit., pp. 1-12
3. Bhati I., Chishti Z., Lu S., Jacob B., 2015, Flexible auto-refresh: Enabling scalable and energy-efficient DRAM refresh reductions, in Proc. 42nd Annu. Int. Symp. Comput. Archit., pp. 235-246
4. Raha A., Sutar S., Jayakumar H., Raghunathan V., July 2017, Quality configurable approximate DRAM, IEEE Trans. Comput., Vol. 66, No. 7, pp. 1172-1187
5. Liu S., Pattabiraman K., Moscibroda T., Zorn B. G., 2011, Flikker: Saving DRAM refresh-power through critical data partitioning, in Proc. 16th Int. Conf. Archit. Support Program. Languages Operating Syst., pp. 213-224
6. Lucas J., Alvarez-Mesa M., Andersch M., Juurlink B., 2014, Sparkk: Quality-scalable approximate storage in DRAM, in the Memory Forum
7. Nguyen D. T., Kim H., Lee H.-J., Chang I.-J., 2018, An approximate memory architecture for a reduction of refresh power consumption in deep learning applications, in Proc. IEEE Int. Symp. Circuits Syst., pp. 1-5
8. Liu J., Jaiyen B., Kim Y., Wilkerson C., Mutlu O., 2013, An experimental study of data retention behavior in modern DRAM devices: Implications for retention time profiling mechanisms, in Proc. 40th Int. Symp. Comput. Archit., pp. 60-71
9. Micron Technology, 1Gb Mobile LPDDR. [Online]. Available: www.micron.com/resource-details/1b6bc910-f3a6-4aa4-9fe7-0265026cf1d3
10. Courbariaux M., David J., Bengio Y., Training deep neural networks with low precision multiplications. [Online]. Available: arXiv:1412.7024
11. Gupta S., Agrawal A., Gopalakrishnan K., Narayanan P., 2015, Deep learning with limited numerical precision, in Proc. Int. Conf. Mach. Learn., pp. 1737-1746
12. Whitehead N., Fit-Florea A., Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.231.215
13. Sampson A., et al., 2011, EnerJ: Approximate data types for safe and general low-power computation, in Proc. 32nd ACM SIGPLAN Conf. Program. Language Des. Implementation, pp. 164-174
14. Luk C.-K., et al., 2005, Pin: Building customized program analysis tools with dynamic instrumentation, in Proc. ACM SIGPLAN Conf. Prog. Lang. Des. Impl., pp. 190-200
15. Simonyan K., Zisserman A., Very deep convolutional networks for large-scale image recognition. [Online]. Available: arXiv:1409.1556
16. Szegedy C., et al., 2015, Going deeper with convolutions, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1-9
17. Bick K., Nguyen D. T., Lee H.-J., Kim H., 2018, Fast and accurate memory simulation by integrating DRAMSim2 into McSimA+, MDPI Electronics, Vol. 7, No. 8, p. 152
18. Nguyen D. T., Nguyen H. H., Kim H., Lee H.-J., May 2020, An approximate memory architecture for energy saving in deep learning applications, IEEE Trans. Circuits Syst. I Reg. Papers, Vol. 67, No. 5, pp. 1588-1601
19. Micron Technology, 2007, Calculating memory system power for DDR3
20. Jia Y., et al., 2014, Caffe: Convolutional architecture for fast feature embedding, in Proc. 22nd ACM Int. Conf. Multimed., pp. 675-678
21. Krizhevsky A., Sutskever I., Hinton G. E., 2012, ImageNet classification with deep convolutional neural networks, in Proc. Adv. Neural Inf. Process. Syst., Vol. 1, pp. 1097-1105
22. Jung M., et al., 2017, A platform to analyze DDR3 DRAM's power and retention time, IEEE Des. Test, pp. 52-59
23. Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Huang Z., Karpathy A., Khosla A., Bernstein M., Berg A., Fei-Fei L., 2015, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis., pp. 211-252
24. Han S., Pool J., Tran J., Dally W. J., Learning both weights and connections for efficient neural networks. [Online]. Available: arxiv.org/abs/1506.02626
25. Nguyen D. T., Nguyen T. N., Kim H., 2019, A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., Vol. 27, No. 8, pp. 1861-1873
26. Lin D. D., Talathi S. S., Annapureddy V. S., 2016, Fixed point quantization of deep convolutional networks, in Proc. Int. Conf. Mach. Learn. (ICML), pp. 2849-2858
27. Rastegari M., Ordonez V., Redmon J., Farhadi A., XNOR-Net: ImageNet classification using binary convolutional neural networks. [Online]. Available: arxiv.org/abs/1603.05279
28. Nguyen D. T., Lee H.-J., Kim H., Chang I.-J., 2020, An approximate DRAM design with a flexible refresh scheme for low power deep learning applications, in Int. Conf. on Electronics, Information, and Communication

## Author

##### Duy Thanh Nguyen

received the B.S. degree in electrical engineering from Hanoi University of Science and Technology, Hanoi, Vietnam, and the M.S. degree in Electrical and Computer Engineering from Seoul National University, Seoul, Korea, in 2011 and 2014, respectively.

He is currently working toward the Ph.D. degree in Electrical and Computer Engineering at Seoul National University.

His research interests include computer architecture, memory system, SoC design for computer vision applications.

##### Hyun Kim

received the B.S., M.S. and Ph.D. degrees in Electrical Engineering and Computer Science from Seoul National University, Seoul, Korea, in 2009, 2011 and 2015, respectively.

From 2015 to 2018, he was with the BK21 Creative Research Engineer Development for IT, Seoul National University, Seoul, Korea, as a BK Assistant Professor.

In 2018, he joined the Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Seoul, Korea, where he is currently working as an Assistant Professor.

His research interests are the areas of algorithm, computer architecture, memory, and SoC design for low-complexity multimedia applications and deep neural networks.

##### Hyuk-Jae Lee

received the B.S. and M.S. degrees in electronics engineering from Seoul National University, South Korea, in 1987 and 1989, respectively, and the Ph.D. degree in Electrical and Computer Engineering from Purdue University, West Lafayette, IN, in 1996.

From 1996 to 1998, he was with the faculty of the Department of Computer Science, Louisiana Tech University, Ruston, LA.

From 1998 to 2001, he was with the Server and Workstation Chipset Division, Intel Corporation, Hillsboro, OR, as a Senior Component Design Engineer.

In 2001, he joined the School of Electrical Engineering and Computer Science, Seoul National University, South Korea, where he is currently a Professor.

He is a Founder of Mamurian Design, Inc., a fabless SoC design house for multimedia applications.

His research interests are in the areas of computer architecture and SoC design for multimedia applications.