Juntae Park1
Dahun Choi1
Hyun Kim1
(Department of Electrical and Information Engineering and Research Center for Electrical
and Information Technology, Seoul National University of Science and Technology, Seoul
01811, Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index Terms
Convolutional neural network (CNN), field-programmable gate array (FPGA), hardware accelerator, ResNet, skip connection, data reordering
I. INTRODUCTION
Starting with the study of residual neural networks (ResNets) [1], which introduced skip connections using residual blocks, numerous models [2-5] have actively utilized the skip connection technique to achieve high accuracy. A
skip connection is a technique that allows the output of one layer to be fed directly
into a subsequent layer, bypassing one or more intermediate layers. Skip connections
help preserve the gradient flow through the network, making it easier to train deeper
models by alleviating the vanishing or exploding gradient problem and improving the
ability of the model to learn complex patterns. Consequently, skip connections have
become essential not only in modern convolutional neural networks (CNNs) but also
in transformer-based models [6,7]. In addition, ResNet is widely used as a backbone network for various tasks, such
as image classification [4], object detection [8], pose estimation [9], and segmentation [10].
Recently, with the development of high-performance models, there has been an increasing
need for dedicated hardware that can effectively accelerate models on mobile and edge
devices [11]. Graphics processing units (GPUs) have significant advantages in terms of versatility
but suffer from high power consumption and difficulties in applying fine-tuned optimizations
for specific networks. By contrast, accelerators based on field-programmable gate
arrays (FPGAs) exhibit superior performance in terms of power efficiency and throughput
relative to GPUs and are easier to optimize for models, leading to active research
on accelerator designs for generic models [13] and application-specific models [12].
Although various accelerator studies have been conducted [13-21,25], there is a lack of research on how to effectively utilize on-chip memory (OCM) for
general CNN layer operations, including skip connections. Operations involving skip
connections, which enable high accuracy, require external memory access, leading to
increased power consumption and often decreased speed. This structural issue is critical
for mobile/edge devices with limited available energy and presents significant challenges
in the optimization phase. Therefore, it is necessary to develop a method that can
exploit the data remaining in the OCM, together with a supporting architecture that
maximizes data reusability while preserving the structure of skip connections.
In this study, we propose a novel CNN accelerator design technique and architecture
that can effectively accelerate models with abundant skip connections to enhance data
reusability. The contributions of this study are as follows:
• We analyze CNN layers with skip connections, which impose a heavy memory access
burden, and propose a data-aware reordering of processing sequences to maximize data
reuse in OCM, considering its limited size.
• We propose a shared buffer technique that efficiently uses buffers with the proposed
reordering technique. We also propose an architecture that utilizes the proposed reordering
and shared buffer techniques with fully pipelined processing elements (PEs) capable
of processing operations at high speeds with high data reuse.
• Based on the proposed technique and architecture, we design a ResNet-18 accelerator
with 8-bit quantization and implement it on a Xilinx ZYNQ UltraScale+ MPSoC ZCU102
FPGA board, achieving a throughput of 345 GOPS and power efficiency of 54.3 GOPS/W.
The remainder of this brief is organized as follows. Section II explains the background.
Section III details the proposed reordering scheme, and Section IV provides the proposed
architecture. Section V presents the experimental results and their analysis. Finally,
the paper is concluded in Section VI.
II. BACKGROUND
1. Residual Blocks
ResNet is a representative model that applies a skip-connection technique using two
types of residual blocks. Fig. 1 illustrates each type of residual block, where (a) and (b) represent normal blocks,
and (c) and (d) depict bottleneck blocks. The bottleneck block comprises three convolutional
(CONV) layers structured with a 3$\times$3 CONV block sandwiched between two 1$\times$1
CONV blocks. This increases the depth of the model while reducing the number of parameters,
thus lowering the complexity. The 1$\times$1 CONV blocks at the top and bottom serve
to reduce and expand the dimensions, respectively. This bottleneck block reduces the
training time compared to using standard blocks. Figs. 1(a) and 1(c) add the input activation (IA) of the current block directly to its output activation
(OA), whereas Figs. 1(b) and 1(d) show a 1$\times$1 CONV operation on the IA of the current block before adding it
to the OA. ResNet, excluding the first CONV layer, max pooling layer, last average
pooling layer, and fully connected (FC) layer, comprises two types of residual blocks
stacked in a regular sequence. Models based on ResNet, such as ResNeXt [4], MobileNetV2 [2], EfficientNet [3], and ResNeSt [5], also follow this pattern of regularity and achieve significant performance improvements
by utilizing skip connections in a manner similar to ResNet. The requirement of storing and loading
previous activations for skip connections is not a significant issue in environments
with relatively large OCM, such as GPUs. However, for mobile/edge devices, where there
are significant constraints on both OCM and power, the smaller the OCM, the greater
the degradation in both speed and power efficiency.
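For reference, in the notation of [1], the identity blocks in Figs. 1(a) and 1(c) compute $\mathbf{y}=\mathcal{F}(\mathbf{x},\{W_{i}\})+\mathbf{x}$, whereas the residual blocks in Figs. 1(b) and 1(d) compute $\mathbf{y}=\mathcal{F}(\mathbf{x},\{W_{i}\})+W_{s}\mathbf{x}$, where $W_{s}$ denotes the 1$\times$1 CONV applied to the IA. In both cases, the IA $\mathbf{x}$ must remain available until $\mathcal{F}(\mathbf{x},\{W_{i}\})$ has been computed, which is the source of the activation storing and loading overhead discussed above.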
Fig. 1. Residual blocks utilizing skip connection technique. (a) normal identity block,
(b) normal residual block, (c) bottleneck identity block, and (d) bottleneck residual
block.
2. Related Works
As research on ResNet and its effective acceleration has been vigorously pursued,
numerous studies have been conducted from various perspectives. A study [13] achieved a relatively high throughput compared to prior studies by proposing a loop
optimization strategy for different CNN models and designing an architecture that
supports various CONV operations. However, this study has limitations owing to the
insufficient consideration of skip connections in the dataflow optimization methods.
Another study [14] implemented a CNN accelerator framework based on a streamlined architecture by replacing
standard CONV blocks with depth-wise separable CONV blocks and using layer-fusion
techniques to simplify models with skip connections. Although this achieves high throughput,
it requires model modifications and retraining, and the lack of support for standard
CONV operations poses a scalability issue. A different approach [15] increases the speed of the adder tree by setting up a system with a multi-clock domain,
making the clock frequency of the adder tree twice as fast as that of other modules
and systems. However, it does not consider the optimization of skip connections, which
leads to limitations in generality and scalability. Another study [18] proposed sparsity-aware CONV acceleration for a pruned ResNet-18 to make the overall
model sparse, achieving a high throughput at the highest sparse rate compared with
other ResNet-18 accelerator studies, albeit with a significant drop in throughput
at lower sparsity rates. In [17], a blocked Winograd-GEMM architecture was proposed to accelerate ResNet-18 by analyzing
the performance of various Winograd tiles. Nonetheless, this study focused solely
on CONV operation optimization with limited research on dataflow and demonstrated
optimization inefficiencies in logic utilization versus digital signal processing
(DSP) unit usage. Another study [20] improved the operational efficiency of ResNet-18 by unifying various filter sizes
through a filter-based decomposition & clustering algorithm and eliminating invalid
weights through a sparse-aware filter transformation scheme; however, it lacked an
accuracy analysis and did not consider optimizations for operations, such as skip
connections. Finally, [21] proposed a hardware-aware training algorithm that performs hardware-software codesign
by removing or shortening skip connections during training. This approach reduced
the required memory bandwidth and improved hardware resources but necessitated retraining
and was limited by the dataset size.
III. REORDERING OF PROCESSING SEQUENCES
Selecting an appropriate processing sequence is crucial in designing CNN accelerators
because it affects the effective PE architecture and the required number of accesses
to the OCM and off-chip dynamic random access memory (DRAM). The typical processing
sequence presented in Fig. 2(a), which considers the reuse efficiency of weights, IA, and OA as well as memory access,
has the advantage of increasing the reuse efficiency of IA and weight. This allows
for the generation of OA proportional to the size of $P_{oc}$ and OA tiles without
storing the partial sum (PSUM) of OA in the DRAM, effectively reducing energy consumption
in DRAM access. However, using the same processing sequence for all layers has the
disadvantage that the OA tile produced at the final stage of each CONV layer does
not match the IA tile required for the next CONV layer, thereby preventing the reuse
of the remaining activation in the OCM. This drawback is particularly critical for
layers that perform relatively simple element-wise (E/W) additions, such as skip connections,
where the activation must be stored in the DRAM and loaded again shortly thereafter.
Therefore, we propose a data-aware reordering of processing sequences that can efficiently
handle both the typical CONV and skip connection layers. The reordering technique
shown in Fig. 2(b) minimizes DRAM access by maximizing the reuse of the activation existing in the OCM.
Each block in Fig. 2(b) represents the operational direction of each layer. In the proposed method, the operation
order of the IA tiles in the channel direction is reversed for each layer, allowing
the data remaining in the OCM to be utilized immediately by the next layer as soon as
the last OA tile of the current layer has been produced. For example, as shown
in Fig. 2(b), if the CONV operation in layer i starts with the IA tile of the front channel and
ends with the OA tile of the back channel, the operation in layer i+1 starts immediately
with the back IA tile existing in the OCM. The processing sequence order for each
layer is then reversed. Although applying this method slightly increases the complexity
of the controller compared to applying the same processing sequence to all layers,
it significantly reduces the required number of DRAM accesses and improves data reusability
in the OCM. Additionally, this method is particularly effective for skip connections,
which only perform a single addition operation, because the latency and energy consumed
by the associated memory accesses far exceed those of the addition itself, often becoming
a bottleneck.
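To make the traversal order concrete, the following Python sketch is a minimal illustration of the proposed data-aware reordering (the function and tile counts are hypothetical and do not represent the actual controller implementation): the channel-direction tile order simply alternates between consecutive layers, so the last tiles written by layer i are the first tiles read by layer i+1.

```python
# Minimal sketch of the proposed data-aware reordering (illustrative only).
# Channel-direction tile indices are traversed in alternating order per layer,
# so the last OA tiles of layer i (still resident in the OCM) are the first
# IA tiles consumed by layer i+1.

def tile_order(num_tiles: int, layer_idx: int) -> list:
    """Return the channel-direction tile processing order for one layer."""
    order = list(range(num_tiles))
    # Reverse the traversal direction for every other layer.
    return order if layer_idx % 2 == 0 else order[::-1]

# Example: three consecutive layers, four channel tiles each.
for i, num_tiles in enumerate([4, 4, 4]):
    print(f"layer {i}: IA tiles processed in order {tile_order(num_tiles, i)}")
# layer 0: [0, 1, 2, 3] -> ends on tile 3
# layer 1: [3, 2, 1, 0] -> starts on tile 3, which is still held on chip
# layer 2: [0, 1, 2, 3] -> starts on tile 0, which is still held on chip
```

In hardware, this corresponds to the slightly more complex controller mentioned above, since only the address-generation order changes from layer to layer.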
Fig. 2. Processing sequences of CNNs (a) Typical. (b) Proposed. $P_{ic}$ and $P_{oc}$
denote the parallelism of the input channel and output channels, respectively.
IV. PROPOSED ARCHITECTURE
1. Efficient Buffer Usage using Shared Buffer
Typical accelerators [13-15] use dedicated buffers (DBs) for weights, IA, and OA, each used for specific purposes.
However, CNN models have a characteristic where the size of activation decreases,
and the size of parameters increases from the front to the back layers. With dedicated
buffers [23], unless each buffer is made very small, some layers will not fully utilize
their buffers; conversely, making the buffers very small significantly increases the
number of DRAM accesses.
Fig. 3 illustrates the utilization rate of each buffer for ResNet-18, including the size
of each buffer and the average wasted OCM size based on these utilizations. In Fig. 3, the size of each buffer is set based on multiples of the greatest common divisor
of the data sizes required by each layer to minimize the number of tiling operations
per layer for each data type based on ResNet-18. For example, because the IA sizes
of the residual CONV blocks are 392, 196, 98, 49, and 24.5 KiB based on 8-bit quantization,
the size of the IA buffer is set to one of the multiples of 24.5 KiB. Fig. 3(a) shows that when 36, 49, and 49 KiB were allocated to the weight, IA, and OA buffers,
respectively, their utilization rates were 91.9, 94.6, and 87.5%, respectively, resulting
in an average of approximately 12 KiB wasted out of a total of 134 KiB of buffer (total
buffer utilization rate = 91%). This trend worsens as the buffer size increases.
For example, 484 KiB, which is almost half of the total 968 KiB buffer, is not used
as shown in Fig. 3(c).
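As a quick check of the Fig. 3(a) numbers, the wasted capacity is approximately $36\times(1-0.919)+49\times(1-0.946)+49\times(1-0.875)\approx 11.7$ KiB, i.e., roughly 12 KiB out of the 134 KiB of buffer, which matches the stated total utilization rate of about 91%.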
To address the inefficiency of dedicated OCM, in this study, a shared buffer technique
is proposed to minimize the size of ineffectively wasted OCM. Fig. 4 presents the overall architecture tailored to the ResNet-18 model and the configuration
of the buffer array for efficient OCM usage. The buffer array comprises four types
of buffers. Among these, three buffers starting with the prefix DB (i.e., DB_IA, DB_OA,
and DB_Param) are dedicated buffers used for a single purpose, each storing IA, OA,
and parameters other than the weight (i.e., bias, scale factor, and batch normalization
(BN) parameters), respectively. The remaining buffer, prefixed SB_# (# refers to the
number of each shared buffer), is a shared buffer that can store weights, IAs, OAs,
and previous IAs for skip connections, depending on the characteristics of the layer.
To increase OCM utilization, DB_IA and DB_OA are allocated sizes that can contain
the activation size of the smallest CONV layer (i.e., 24.5 KiB for ResNet-18), while
DB_Param is set to a size that can store all parameters of a layer except for weight
(i.e., 4 KiB for ResNet-18, which occupies only minor resources). To achieve high OCM
utilization, the size of the shared buffer is determined by considering the size of
the activation of each layer and weight. For ResNet-18, the shared buffer array consists
of 16 buffers, each of 36 KiB (576 KiB in total). The buffer array is used as follows
for each layer: the IA is primarily stored in DB_IA and, if necessary, spills sequentially
into the shared buffers starting from the top (SB_#). Similarly, the OA is first stored
in DB_OA and, if needed, spills sequentially into the shared buffers. The remaining
shared buffers are used to store the weights and the activations of
the skip connections. By pre-storing the IA used in skip connections under low OCM
demand, this approach effectively minimizes memory access overhead.
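The allocation policy described above can be summarized with the following Python sketch (a minimal illustration assuming a simple greedy spill rule; the buffer names and sizes follow the ResNet-18 configuration in the text, while the spill order and the example layer sizes are assumptions):

```python
import math

# Minimal sketch of the shared-buffer allocation policy (illustrative only;
# the greedy spill order and the example layer sizes are assumptions).
DB_IA_KIB, DB_OA_KIB, DB_PARAM_KIB = 24.5, 24.5, 4.0  # dedicated buffers
SB_KIB, NUM_SB = 36.0, 16                              # 16 shared buffers of 36 KiB

def allocate(ia_kib, oa_kib, w_kib, skip_kib=0.0):
    """Return how many shared buffers each data type claims for one layer."""
    free = NUM_SB

    def claim(req_kib, dedicated_kib):
        nonlocal free
        spill = max(0.0, req_kib - dedicated_kib)       # amount beyond the dedicated buffer
        n = min(math.ceil(spill / SB_KIB), free) if spill > 0 else 0
        free -= n
        return n

    return {
        "SB_for_IA":   claim(ia_kib, DB_IA_KIB),   # IA first fills DB_IA, then spills
        "SB_for_OA":   claim(oa_kib, DB_OA_KIB),   # OA first fills DB_OA, then spills
        "SB_for_W":    claim(w_kib, 0.0),          # weights use shared buffers only
        "SB_for_skip": claim(skip_kib, 0.0),       # skip-connection IA kept on chip
    }

# Hypothetical ResNet-18-like layer: 98 KiB IA/OA, 144 KiB weights,
# and a 98 KiB skip-connection activation pre-stored on chip.
print(allocate(ia_kib=98, oa_kib=98, w_kib=144, skip_kib=98))
# {'SB_for_IA': 3, 'SB_for_OA': 3, 'SB_for_W': 4, 'SB_for_skip': 3}
```

With such a policy, layers with large activations and small weights and layers with small activations and large weights both keep the shared-buffer array largely occupied with valid data, which is consistent with the OCM-utilization gain reported in Section V.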
Fig. 3. Analysis examples of OCM utilization rate and effective OCM for ResNet-18.
(a) W36-IA49-OA49, (b) W144-IA98-OA196, (c) W576-IA196-OA196.
Fig. 4. Overall architecture of proposed accelerator.
2. Architecture of RADAR
We propose RADAR, a novel ResNet-18 accelerator IP that exploits a processing sequence
reordering with a shared buffer scheme. As illustrated in Fig. 4, the proposed accelerator is composed of four main components: a global controller
that orchestrates the entire processing sequence, a buffer array with shared buffers,
and two processors responsible for specific operations. A multiply-accumulate (MAC)
processor handles CNN's main operations, such as CONV and FC layers. In contrast,
a miscellaneous processor performs all other operations, such as quantization (Quant),
dequantization (DeQuant), activation, BN, pooling, and E/W addition.
In the MAC processor, the PE is configured in a systolic array style capable of utilizing
weight, IA, and PSUM reuse. This allows the previous PSUM to be forwarded to the next
PE, and because each PE is configured in a pipeline manner, this results in high data
reuse and speed. Channel parallelism, denoted as $P_{ic}$ and $P_{oc}$ in Fig. 2, is set to values of 8 and 16, respectively. To optimize 3$\times$3 CONV operations,
a parallelism of 3 is applied in the height direction of the activation, resulting
in a total of 1,152 PEs being used for MAC operations. Additionally, all modules in
the miscellaneous processor are configured with the same parallelism as the MAC processor's
$P_{oc}$ (i.e., 16), ensuring that the output data are fully pipelined.
While systolic arrays or 2D PE architectures offer better data reusability than single-instruction
multiple-data (SIMD)-based PEs, they suffer from lower PE utilization when input data
are not consecutively available, particularly in weight-stationary architectures where
significant clock cycles are wasted when replacing fully used weights. Therefore,
to minimize latency losses, we introduce a register to pre-fetch and store the next
set of weights. This enables weight replacement in a single clock cycle, thereby improving
PE utilization and latency with minimal hardware overhead. The PSUM from the PE array
passes through an adder tree, accumulates in a PSUM buffer, and is then transferred
sequentially to the miscellaneous processor.
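The weight pre-fetch mechanism can be sketched behaviorally as follows (a minimal Python illustration, not RTL; the class and method names are hypothetical): while the PE computes with the currently active weights, the next weight set is loaded into a shadow register, so the replacement itself amounts to a single-cycle register swap.

```python
# Behavioral sketch of the single-cycle weight swap via a pre-fetch register
# (illustrative only; the real design is RTL, and this models one PE lane).

class WeightStationaryPE:
    def __init__(self):
        self.active = 0     # weight currently used by the MAC datapath
        self.shadow = 0     # pre-fetched next weight

    def prefetch(self, next_weight):
        """Load the next weight in the background, hidden behind computation."""
        self.shadow = next_weight

    def swap(self):
        """Replace the stationary weight in a single cycle."""
        self.active = self.shadow

    def mac(self, ia, psum):
        """One multiply-accumulate with the stationary weight."""
        return psum + self.active * ia

pe = WeightStationaryPE()
pe.prefetch(3)
pe.swap()                     # weight 3 becomes active
acc = 0
for ia in (1, 2, 4):          # reuse the stationary weight across IA values
    acc = pe.mac(ia, acc)     # meanwhile, the next weight can be pre-fetched
pe.prefetch(5)
pe.swap()                     # single-cycle replacement, no pipeline stall
print(acc)                    # 3 * (1 + 2 + 4) = 21
```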
RADAR supports an 8-bit quantized ResNet-18 model, which requires additional operations such as quantization and dequantization.
In the MAC processor, OAs are generated periodically. To ensure efficiency in terms
on the layer configuration, they are passed through additional operation modules before
being stored in an OA buffer to be utilized in the next CONV layer. For example, if
BN and ReLU operations are required before the next CONV layer, the output from the
MAC processor is immediately passed through the miscellaneous processor's DeQuant-BN-ReLU-Quant
modules in a sequential pipelined manner. The resulting output is then stored in the
OA buffer within the buffer array. To support the proposed reordering, the last OA
tile is stored in DB_IA after traversing non-CONV modules, ensuring its availability
for the following CONV or E/W addition.
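A minimal numerical sketch of this DeQuant-BN-ReLU-Quant path is given below (illustrative only; it assumes the common linear INT8 scheme of [22] with a zero-point of 0, and all scales and BN parameters are made-up example values rather than parameters of the actual design):

```python
import numpy as np

# Minimal sketch of the DeQuant-BN-ReLU-Quant path applied to MAC-processor
# outputs before they are written to the OA buffer (illustrative only; all
# scales and BN parameters below are made-up examples).

def dequant(psum_int32, ia_scale, w_scale):
    """INT32 PSUM back to floating point using the IA and weight scales."""
    return psum_int32.astype(np.float32) * (ia_scale * w_scale)

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def quant(x, oa_scale):
    """Floating point back to signed INT8 for the next CONV layer."""
    return np.clip(np.round(x / oa_scale), -128, 127).astype(np.int8)

# Example: a few PSUMs of one output channel leaving the adder tree.
psum = np.array([1200, -300, 4500], dtype=np.int32)
oa = quant(np.maximum(batch_norm(dequant(psum, ia_scale=0.02, w_scale=0.01),
                                 gamma=1.0, beta=0.1, mean=0.0, var=1.0),
                      0.0),                        # ReLU
           oa_scale=0.05)
print(oa)   # INT8 activations stored in the OA buffer, e.g., [ 7  1 20]
```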
V. EXPERIMENTAL RESULTS
1. Experimental Environments
RADAR was implemented on a Xilinx ZCU102 FPGA board using Xilinx Vivado 23.1 and Vitis
23.1 tools. The power consumption of RADAR was measured by the Vivado power estimator
tool. The commonly used linear INT8 post-training quantization [22] and batch normalization folding [24] techniques were applied, and the quantization experiments were conducted on an NVIDIA RTX 3090 GPU.
2. Ablation Study
Table 1 lists the number of DRAM accesses and effective OCM utilization on a 625 KiB OCM
when the proposed data-aware reordering and shared buffer techniques are applied separately
to ResNet-18. The results indicate that while the reordering technique alone does
not improve effective OCM utilization, it significantly reduces DRAM accesses by approximately
22% relative to the baseline. In contrast, applying only the shared buffer technique reduces
DRAM accesses by about 9% compared to the baseline, but effectively improves OCM utilization
by approximately 22.6%. Ultimately, applying both methods achieved an approximately
33% reduction in DRAM access and an approximately 29.2% improvement in OCM utilization
compared to the baseline. The combined use of both techniques leads to a greater improvement
than either alone because the shared buffer increases the proportion of valid data when
transitioning between layer operations.
Table 1. Ablation studies for each proposed method.
3. Performance Comparison
Table 2 presents a comparison of the proposed RADAR with existing accelerators in terms of
performance and resource usage, where the information not provided in prior research
is denoted by a dashed line (-). RADAR achieved the highest throughput (GOPS) as well
as the highest power efficiency (GOPS/W) and OCM efficiency (GOPS/KiB) compared with
previous studies. These metrics are particularly relevant indicators for mobile/edge
devices with limited hardware resources and batteries. In terms of throughput, RADAR
achieves approximately 17% higher throughput than the previous highest result presented
in [17], while using significantly fewer hardware resources such as OCM, LUTs, and DSPs.
Consequently, it achieves 2.1 times higher OCM efficiency compared to [17]. In the case of [15], aggressive quantization using 2-bit weights and 8-bit activations enables high OCM
efficiency and relatively high throughput with minimal hardware resources among prior
studies, but this approach comes at the cost of a 2.9% drop in accuracy. Overall,
RADAR achieves the highest power efficiency and OCM efficiency compared to prior
studies by maximizing data reuse in OCM through the proposed reordering of processing
sequences and shared buffer techniques.
Table 2. Performance comparison with previous works.
VI. CONCLUSION
In this study, we designed an energy-efficient ResNet-18 accelerator, RADAR, optimized
for DRAM access and OCM utilization, which are the most important factors in CNN accelerator
implementation, by proposing a data-aware reordering of processing sequences and a shared
buffer scheme that considers skip connections. We anticipate that the proposed design will
accelerate the commercialization of CNN inference for various mobile/edge devices.
ACKNOWLEDGMENTS
This research was supported by Seoul National University of Science and Technology.
References
K. He, X. Zhang, S. Ren, and J. Sun, ``Deep residual learning for image recognition,''
Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778,
2016.

M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, ``MobileNetV2: Inverted
residuals and linear bottlenecks,'' Proc. of IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 4510-4520, 2018.

M. Tan and Q. Le, ``EfficientNet: Rethinking model scaling for convolutional neural
networks,'' Proc. of International Conference on Machine Learning, pp. 6105-6114,
2019.

S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, ``Aggregated residual transformations
for deep neural networks,'' Proc. of IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 1492-1500, 2017.

H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, and Z. Zhang, ``ResNeSt: Split-attention
networks,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 2736-2746, 2022.

N. J. Kim, J. Lee, and H. Kim, ``HyQ: Hardware-friendly post-training quantization
for CNN-transformer hybrid networks,'' Proc. of International Joint Conference on
Artificial Intelligence (IJCAI), pp. 4291-4299, 2024.

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner,
M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, ``An
image is worth $16\times16$ words: Transformers for image recognition at scale,''
arXiv preprint arXiv:2010.11929, 2020.

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, ``SSD:
Single shot multibox detector,'' Proc. of European Conference on Computer Vision,
pp. 21-37, 2016.

Y. Cai, Z. Wang, Z. Luo, B. Yin, A. Du, H. Wang, X. Zhang, X. Zhou, E. Zhou, and J.
Sun, ``Learning delicate local representations for multi-person pose estimation,''
Proc. of European Conference on Computer Vision, pp. 455-472, 2020.

S. I. Lee and H. Kim, ``GaussianMask: Uncertainty-aware instance segmentation based
on Gaussian modeling,'' Proc. of International Conference on Pattern Recognition (ICPR),
pp. 3851-3857, 2022.

R. Singh and S. S. Gill, ``Edge AI: A survey,'' Internet of Things and Cyber-Physical
Systems, vol. 3, pp. 71-92, 2023.

D. T. Nguyen, T. N. Nguyen, H. Kim, and H.-J. Lee, ``A high-throughput and power-efficient
FPGA implementation of YOLO CNN for object detection,'' IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 27, no. 8, pp. 1861-1873, 2019.

Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, ``Optimizing the convolution operation
to accelerate deep neural networks on FPGA,'' IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 26, no. 7, pp. 1354-1367, 2018.

R. Zhao, H.-C. Ng, W. Luk, and X. Niu, ``Towards efficient convolutional neural network
for domain-specific applications on FPGA,'' in Proc. of International Conference on
Field Programmable Logic and Applications (FPL), pp. 147-1477, 2018.

Y. Chen, K. Zhang, C. Gong, C. Hao, X. Zhang, and T. Li, ``T-DLA: An open-source deep
learning accelerator for ternarized DNN models on embedded FPGA,'' Proc. of IEEE Computer
Society Annual Symposium on VLSI (ISVLSI), pp. 13-18, 2019.

Q. Xiao and Y. Liang, ``Zac: Towards automatic optimization and deployment of quantized
deep neural networks on embedded devices,'' Proc. of IEEE/ACM International Conference
on Computer-Aided Design (ICCAD), pp. 1-6, 2019.

S. Kala and S. Nalesh, ``Efficient CNN accelerator on FPGA,'' IETE Journal of Research,
vol. 66, no. 6, pp. 733-740, 2020.

J. Wen, Y. Ma, and Z. Wang, ``An efficient FPGA accelerator optimized for high throughput
sparse CNN inference,'' Proc. of IEEE Asia Pacific Conference on Circuits and Systems
(APCCAS), pp. 165-168, 2020.

X. Xie, J. Lin, Z. Wang, and J. Wei, ``An efficient and flexible accelerator design
for sparse convolutional neural networks,'' IEEE Transactions on Circuits and Systems
I: Regular Papers, vol. 68, no. 7, pp. 2936-2949, 2021.

Y. Meng, C. Yang, S. Xiang, J. Wang, K. Mei, and L. Geng, ``An efficient CNN accelerator
achieving high PE utilization using a dense-/sparse-aware redundancy reduction method
and data-index decoupling workflow,'' IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 31, no. 10, pp. 1537-1550, 2023.

O. Weng, G. Marcano, V. Loncar, A. Khodamoradi, G. Abarajithan, N. Sheybani, A. Meza,
F. Koushanfar, K. Denolf, J. M. Duarte, and R. Kastner, ``Tailor: Altering skip connections
for resource-efficient inference,'' ACM Transactions on Reconfigurable Technology
and Systems, vol. 17, no. 1, pp. 1-23, 2024.

M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos, and T. Blankevoort, ``Up or down?
Adaptive rounding for post-training quantization,'' Proc. of International Conference
on Machine Learning, pp. 7197-7206, 2020.

D. T. Nguyen, H. Kim, and H.-J. Lee, ``Layer-specific optimization for mixed data
flow with mixed precision in FPGA design for CNN-based object detectors,'' IEEE Transactions
on Circuits and Systems for Video Technology, vol. 31, no. 6, pp. 2450-2464, 2020.

B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, and A. Howard, ``Quantization and training
of neural networks for efficient integer-arithmetic-only inference,'' Proc. of IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2704-2713, 2018.

S. Ki, J. Park, and H. Kim, ``Dedicated FPGA Implementation of the Gaussian TinyYOLOv3
Accelerator,'' IEEE Transactions on Circuits and Systems II: Express Briefs, vol.
70, no. 10, pp. 3882-3886, 2023.

Juntae Park received his B.S. and M.S. degrees in electrical and information engineering
from the Seoul National University of Science and Technology, Seoul, Korea, in 2023.
His research interests include the areas of efficient hardware accelerator design
for deep neural networks and computer architecture.
Dahun Choi is a Ph.D. student in electrical and information engineering at the
Seoul National University of Science and Technology. He received an M.S. degree in
electrical and information engineering from the Seoul National University of Science
and Technology, Seoul, Korea, in 2022. His research interests include the areas of
network quantization and efficient network design for deep neural networks.
Hyun Kim received his B.S., M.S., and Ph.D. degrees in electrical engineering and
computer science from Seoul National University, Seoul, Korea, in 2009, 2011, and
2015, respectively. From 2015 to 2018, he was with the BK21 Creative Research Engineer
Development for IT, Seoul National University, Seoul, Korea, as a BK Assistant Professor.
In 2018, he joined the Department of Electrical and Information Engineering,
Seoul National University of Science and Technology, Seoul, Korea, where he is currently
serving as an Associate Professor. His research results have been internationally
acclaimed. His research interests are the areas of algorithm, computer architecture,
memory system design, and digital system (SoC) design for low-complexity multimedia
applications and deep neural networks.