Juntae Park1
Dahun Choi1
Hyun Kim1
(Department of Electrical and Information Engineering and Research Center for Electrical
and Information Technology, Seoul National University of Science and Technology, Seoul
01811, Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index Terms
Convolutional neural network (CNN), field-programmable gate array (FPGA), hardware accelerator, ResNet, skip connection, data reordering
I. INTRODUCTION
Starting with the study of residual neural networks (ResNets) [1], which introduced skip connections using residual blocks, numerous models [2-5] have actively utilized the skip connection technique to achieve high accuracy. A
skip connection is a technique that allows the output of one layer to be fed directly
into a subsequent layer, bypassing one or more intermediate layers. Skip connections
help preserve the gradient flow through the network, making it easier to train deeper
models by alleviating the vanishing or exploding gradient problem and improving the
ability of the model to learn complex patterns. Consequently, skip connections have
become essential not only in modern convolutional neural networks (CNNs) but also
in transformer-based models [6,7]. In addition, ResNet is widely used as a backbone network for various tasks, such
as image classification [4], object detection [8], pose estimation [9], and segmentation [10].
Recently, with the development of high-performance models, there has been an increasing
need for dedicated hardware that can effectively accelerate models on mobile and edge
devices [11]. Graphics processing units (GPUs) have significant advantages in terms of versatility
but suffer from high power consumption and difficulties in applying fine-tuned optimizations
for specific networks. By contrast, accelerators based on field-programmable gate
arrays (FPGAs) exhibit superior performance in terms of power efficiency and throughput
relative to GPUs and are easier to optimize for models, leading to active research
on accelerator designs for generic models [13] and application-specific models [12].
Although various accelerator studies have been conducted [13-21,25], there is a lack of research on how to effectively utilize on-chip memory (OCM) for
general CNN layer operations, including skip connections. Operations involving skip
connections, which enable high accuracy, require external memory access, leading to
increased power consumption and often decreased speed. This structural issue is critical
for mobile/edge devices with limited available energy and presents significant challenges
in the optimization phase. Therefore, it is necessary to develop a method that can
exploit the data remaining in the OCM, together with a supporting architecture that
maximizes data reusability while preserving the structure of skip connections.
In this study, we propose a novel CNN accelerator design technique and architecture
that can effectively accelerate models with abundant skip connections to enhance data
reusability. The contributions of this study are as follows:
• We analyze CNN layers with skip connections, which impose a heavy memory access
burden, and propose a data-aware reordering of processing sequences to maximize data
reuse in OCM, considering its limited size.
• We propose a shared buffer technique that efficiently uses buffers with the proposed
reordering technique. We also propose an architecture that utilizes the proposed reordering
and shared buffer techniques with fully pipelined processing elements (PEs) capable
of processing operations at high speeds with high data reuse.
• Based on the proposed technique and architecture, we design a ResNet-18 accelerator
with 8-bit quantization and implement it on a Xilinx ZYNQ UltraScale+ MPSoC ZCU102
FPGA board, achieving a throughput of 345 GOPS and power efficiency of 54.3 GOPS/W.
The remainder of this brief is organized as follows. Section II explains the background.
Section III details the proposed reordering scheme, and Section IV provides the proposed
architecture. Section V presents the experimental results and their analysis. Finally,
the paper is concluded in Section VI.
II. BACKGROUND
1. Residual Blocks
ResNet is a representative model that applies a skip-connection technique using two
types of residual blocks. Fig. 1 illustrates each type of residual block, where (a) and (b) represent normal blocks,
and (c) and (d) depict bottleneck blocks. The bottleneck block comprises three convolutional
(CONV) layers structured with a 3$\times$3 CONV block sandwiched between two 1$\times$1
CONV blocks. This increases the depth of the model while reducing the number of parameters,
thus lowering the complexity. The 1$\times$1 CONV blocks at the top and bottom serve
to reduce and expand the dimensions, respectively. This bottleneck block reduces the
training time compared to using standard blocks. Figs. 1(a) and 1(c) add the input activation (IA) of the current block directly to its output activation
(OA), whereas Figs. 1(b) and 1(d) show a 1$\times$1 CONV operation on the IA of the current block before adding it
to the OA. ResNet, excluding the first CONV layer, max pooling layer, last average
pooling layer, and fully connected (FC) layer, comprises two types of residual blocks
stacked in a regular sequence. Models based on ResNet, such as ResNeXt [4], MobileNetV2 [2], EfficientNet [3], and ResNeSt [5], also follow this pattern of regularity and achieve significant performance improvements
by utilizing skip connections in a manner similar to ResNet. The requirement of storing and loading
previous activations for skip connections is not a significant issue in environments
with relatively large OCM, such as GPUs. However, for mobile/edge devices, where there
are significant constraints on both OCM and power, the smaller the OCM, the greater
the degradation in both speed and power efficiency.
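For reference, in the notation of [1], the identity blocks in Figs. 1(a) and 1(c) compute $\mathbf{y}=\mathcal{F}(\mathbf{x},\{W_{i}\})+\mathbf{x}$, whereas the residual blocks in Figs. 1(b) and 1(d) compute $\mathbf{y}=\mathcal{F}(\mathbf{x},\{W_{i}\})+W_{s}\mathbf{x}$, where $W_{s}$ denotes the 1$\times$1 CONV applied to the IA. In both cases, the IA $\mathbf{x}$ must remain available until $\mathcal{F}(\mathbf{x},\{W_{i}\})$ has been computed, which is the source of the activation storing and loading overhead discussed above.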
Fig. 1. Residual blocks utilizing skip connection technique. (a) normal identity block,
(b) normal residual block, (c) bottleneck identity block, and (d) bottleneck residual
block.
2. Related Works
As research on ResNet and its effective acceleration has been vigorously pursued,
numerous studies have been conducted from various perspectives. A study [13] achieved a relatively high throughput compared to prior studies by proposing a loop
optimization strategy for different CNN models and designing an architecture that
supports various CONV operations. However, this study has limitations owing to the
insufficient consideration of skip connections in the dataflow optimization methods.
Another study [14] implemented a CNN accelerator framework based on a streamlined architecture by replacing
standard CONV blocks with depth-wise separable CONV blocks and using layer-fusion
techniques to simplify models with skip connections. Although this achieves high throughput,
it requires model modifications and retraining, and the lack of support for standard
CONV operations poses a scalability issue. A different approach [15] increases the speed of the adder tree by setting up a system with a multi-clock domain,
making the clock frequency of the adder tree twice as fast as that of other modules
and systems. However, it does not consider the optimization of skip connections, which
leads to limitations in generality and scalability. Another study [18] proposed sparsity-aware CONV acceleration for a pruned ResNet-18 to make the overall
model sparse, achieving a high throughput at the highest sparse rate compared with
other ResNet-18 accelerator studies, albeit with a significant drop in throughput
at lower sparsity rates. In [17], a blocked Winograd-GEMM architecture was proposed to accelerate ResNet-18 by analyzing
the performance of various Winograd tiles. Nonetheless, this study focused solely
on CONV operation optimization with limited research on dataflow and demonstrated
optimization inefficiencies in logic utilization versus digital signal processing
(DSP) unit usage. Another study [20] improved the operational efficiency of ResNet-18 by unifying various filter sizes
through a filter-based decomposition & clustering algorithm and eliminating invalid
weights through a sparse-aware filter transformation scheme; however, it lacked an
accuracy analysis and did not consider optimizations for operations, such as skip
connections. Finally, [21] proposed a hardware-aware training algorithm that performs hardware-software codesign
by removing or shortening skip connections during training. This approach reduced
the required memory bandwidth and improved hardware resources but necessitated retraining
and was limited by the dataset size.
III. REORDERING OF PROCESSING SEQUENCES
Selecting an appropriate processing sequence is crucial in designing CNN accelerators
because it affects the effective PE architecture and the required number of accesses
to the OCM and off-chip dynamic random access memory (DRAM). The typical processing
sequence presented in Fig. 2(a), which considers the reuse efficiency of weights, IA, and OA as well as memory access,
has the advantage of increasing the reuse efficiency of IA and weight. This allows
for the generation of OA proportional to the size of $P_{oc}$ and OA tiles without
storing the partial sum (PSUM) of OA in the DRAM, effectively reducing energy consumption
in DRAM access. However, using the same processing sequence for all layers has the
disadvantage that the OA tile produced at the final stage of each CONV layer does
not match the IA tile required for the next CONV layer, thereby preventing the reuse
of the remaining activation in the OCM. This drawback is particularly critical for
layers that perform relatively simple element-wise (E/W) additions, such as skip connections,
where the activation must be stored in the DRAM and loaded again shortly thereafter.
Therefore, we propose a data-aware reordering of processing sequences that can efficiently
handle both the typical CONV and skip connection layers. The reordering technique
shown in Fig. 2(b) minimizes DRAM access by maximizing the reuse of the activation existing in the OCM.
Each block in Fig. 2(b) represents the operational direction of each layer. In the proposed method, the operation
order of the IA tiles in the channel direction is reversed for each layer, allowing
the data remaining in the OCM to be utilized immediately by the next layer as soon as
the last OA tile of the current layer has been produced. For example, as shown
in Fig. 2(b), if the CONV operation in layer i starts with the IA tile of the front channel and
ends with the OA tile of the back channel, the operation in layer i+1 starts immediately
with the back IA tile existing in the OCM. The processing sequence order for each
layer is then reversed. Although applying this method slightly increases the complexity
of the controller compared to applying the same processing sequence to all layers,
it significantly reduces the required number of DRAM accesses and improves data reusability
in the OCM. Additionally, this method is particularly effective for skip connections,
which only perform a single addition operation, because the latency and energy consumed
by the associated memory accesses far exceed those of the addition itself, often becoming
a bottleneck.
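To make the traversal order concrete, the following Python sketch is a minimal illustration of the proposed data-aware reordering (the function and tile counts are hypothetical and do not represent the actual controller implementation): the channel-direction tile order simply alternates between consecutive layers, so the last tiles written by layer i are the first tiles read by layer i+1.

```python
# Minimal sketch of the proposed data-aware reordering (illustrative only).
# Channel-direction tile indices are traversed in alternating order per layer,
# so the last OA tiles of layer i (still resident in the OCM) are the first
# IA tiles consumed by layer i+1.

def tile_order(num_tiles: int, layer_idx: int) -> list:
    """Return the channel-direction tile processing order for one layer."""
    order = list(range(num_tiles))
    # Reverse the traversal direction for every other layer.
    return order if layer_idx % 2 == 0 else order[::-1]

# Example: three consecutive layers, four channel tiles each.
for i, num_tiles in enumerate([4, 4, 4]):
    print(f"layer {i}: IA tiles processed in order {tile_order(num_tiles, i)}")
# layer 0: [0, 1, 2, 3] -> ends on tile 3
# layer 1: [3, 2, 1, 0] -> starts on tile 3, which is still held on chip
# layer 2: [0, 1, 2, 3] -> starts on tile 0, which is still held on chip
```

In hardware, this corresponds to the slightly more complex controller mentioned above, since only the address-generation order changes from layer to layer.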
Fig. 2. Processing sequences of CNNs (a) Typical. (b) Proposed. $P_{ic}$ and $P_{oc}$
denote the parallelism of the input channel and output channels, respectively.
IV. PROPOSED ARCHITECTURE
1. Efficient Buffer Usage using Shared Buffer
Typical accelerators [13-15] use dedicated buffers (DBs) for weights, IA, and OA, each used for specific purposes.
However, CNN models have a characteristic where the size of activation decreases,
and the size of parameters increases from the front to the back layers. With dedicated
buffers [23], unless each buffer is made very small, some layers will not fully utilize
their buffers; conversely, making the buffers very small significantly increases the
number of DRAM accesses.
Fig. 3 illustrates the utilization rate of each buffer for ResNet-18, including the size
of each buffer and the average wasted OCM size based on these utilizations. In Fig. 3, the size of each buffer is set based on multiples of the greatest common divisor
of the data sizes required by each layer to minimize the number of tiling operations
per layer for each data type based on ResNet-18. For example, because the IA sizes
of the residual CONV blocks are 392, 196, 98, 49, and 24.5 KiB based on 8-bit quantization,
the size of the IA buffer is set to one of the multiples of 24.5 KiB. Fig. 3(a) shows that when 36, 49, and 49 KiB were allocated to the weight, IA, and OA buffers,
respectively, their utilization rates were 91.9, 94.6, and 87.5%, respectively, resulting
in an average of approximately 12 KiB wasted out of a total of 134 KiB of buffer (total
buffer utilization rate = 91%). This trend worsens as the buffer size increases.
For example, 484 KiB, which is almost half of the total 968 KiB buffer, is not used
as shown in Fig. 3(c).
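As a quick check of the Fig. 3(a) numbers, the wasted capacity is approximately $36\times(1-0.919)+49\times(1-0.946)+49\times(1-0.875)\approx 11.7$ KiB, i.e., roughly 12 KiB out of the 134 KiB of buffer, which matches the stated total utilization rate of about 91%.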
To address the inefficiency of dedicated OCM, in this study, a shared buffer technique
is proposed to minimize the size of ineffectively wasted OCM. Fig. 4 presents the overall architecture tailored to the ResNet-18 model and the configuration
of the buffer array for efficient OCM usage. The buffer array comprises four types
of buffers. Among these, three buffers starting with the prefix DB (i.e., DB_IA, DB_OA,
and DB_Param) are dedicated buffers used for a single purpose, each storing IA, OA,
and parameters other than the weight (i.e., bias, scale factor, and batch normalization
(BN) parameters), respectively. The remaining buffer, prefixed SB_# (# refers to the
number of each shared buffer), is a shared buffer that can store weights, IAs, OAs,
and previous IAs for skip connections, depending on the characteristics of the layer.
To increase OCM utilization, DB_IA and DB_OA are allocated sizes that can contain
the activation size of the smallest CONV layer (i.e., 24.5 KiB for ResNet-18), while
DB_Param is set to a size that can store all parameters of a layer except for weight
(i.e., 4 KiB for ResNet-18, which occupies only minor resources). To achieve high OCM
utilization, the size of the shared buffer is determined by considering the size of
the activation of each layer and weight. For ResNet-18, the shared buffer array consists
of 16 buffers, each of 36 KiB (576 KiB in total). The buffer array is used as follows
for each layer: the IA is primarily stored in DB_IA and, if necessary, spills sequentially
into the shared buffers starting from the top (SB_#). Similarly, the OA is first stored
in DB_OA and, if needed, spills sequentially into the shared buffers. The remaining
shared buffers are used to store the weights and the activations of
the skip connections. By pre-storing the IA used in skip connections under low OCM
demand, this approach effectively minimizes memory access overhead.
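The allocation policy described above can be summarized with the following Python sketch (a minimal illustration assuming a simple greedy spill rule; the buffer names and sizes follow the ResNet-18 configuration in the text, while the spill order and the example layer sizes are assumptions):

```python
import math

# Minimal sketch of the shared-buffer allocation policy (illustrative only;
# the greedy spill order and the example layer sizes are assumptions).
DB_IA_KIB, DB_OA_KIB, DB_PARAM_KIB = 24.5, 24.5, 4.0  # dedicated buffers
SB_KIB, NUM_SB = 36.0, 16                              # 16 shared buffers of 36 KiB

def allocate(ia_kib, oa_kib, w_kib, skip_kib=0.0):
    """Return how many shared buffers each data type claims for one layer."""
    free = NUM_SB

    def claim(req_kib, dedicated_kib):
        nonlocal free
        spill = max(0.0, req_kib - dedicated_kib)       # amount beyond the dedicated buffer
        n = min(math.ceil(spill / SB_KIB), free) if spill > 0 else 0
        free -= n
        return n

    return {
        "SB_for_IA":   claim(ia_kib, DB_IA_KIB),   # IA first fills DB_IA, then spills
        "SB_for_OA":   claim(oa_kib, DB_OA_KIB),   # OA first fills DB_OA, then spills
        "SB_for_W":    claim(w_kib, 0.0),          # weights use shared buffers only
        "SB_for_skip": claim(skip_kib, 0.0),       # skip-connection IA kept on chip
    }

# Hypothetical ResNet-18-like layer: 98 KiB IA/OA, 144 KiB weights,
# and a 98 KiB skip-connection activation pre-stored on chip.
print(allocate(ia_kib=98, oa_kib=98, w_kib=144, skip_kib=98))
# {'SB_for_IA': 3, 'SB_for_OA': 3, 'SB_for_W': 4, 'SB_for_skip': 3}
```

With such a policy, layers with large activations and small weights and layers with small activations and large weights both keep the shared-buffer array largely occupied with valid data, which is consistent with the OCM-utilization gain reported in Section V.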
Fig. 3. Analysis examples of OCM utilization rate and effective OCM for ResNet-18.
(a) W36-IA49-OA49, (b) W144-IA98-OA196, (c) W576-IA196-OA196.
Fig. 4. Overall architecture of proposed accelerator.
2. Architecture of RADAR
We propose RADAR, a novel ResNet-18 accelerator IP that exploits a processing sequence
reordering with a shared buffer scheme. As illustrated in Fig. 4, the proposed accelerator is composed of four main components: a global controller
that orchestrates the entire processing sequence, a buffer array with shared buffers,
and two processors responsible for specific operations. A multiply-accumulate (MAC)
processor handles CNN's main operations, such as CONV and FC layers. In contrast,
a miscellaneous processor performs all other operations, such as quantization (Quant),
dequantization (DeQuant), activation, BN, pooling, and E/W addition.
In the MAC processor, the PE is configured in a systolic array style capable of utilizing
weight, IA, and PSUM reuse. This allows the previous PSUM to be forwarded to the next
PE, and because each PE is configured in a pipeline manner, this results in high data
reuse and speed. Channel parallelism, denoted as $P_{ic}$ and $P_{oc}$ in Fig. 2, is set to values of 8 and 16, respectively. To optimize 3$\times$3 CONV operations,
a parallelism of 3 is applied in the height direction of the activation, resulting
in a total of 1,152 PEs being used for MAC operations. Additionally, all modules in
the miscellaneous processor are configured with the same parallelism as the MAC processor's
$P_{oc}$ (i.e., 16), ensuring that the output data are fully pipelined.
While systolic arrays or 2D PE architectures offer better data reusability than single-instruction
multiple-data (SIMD)-based PEs, they suffer from lower PE utilization when input data
are not consecutively available, particularly in weight-stationary architectures where
significant clock cycles are wasted when replacing fully used weights. Therefore,
to minimize latency losses, we introduce a register to pre-fetch and store the next
set of weights. This enables weight replacement in a single clock cycle, thereby improving
PE utilization and latency with minimal hardware overhead. The PSUM from the PE array
passes through an adder tree, accumulates in a PSUM buffer, and is then transferred
sequentially to the miscellaneous processor.
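The weight pre-fetch mechanism can be sketched behaviorally as follows (a minimal Python illustration, not RTL; the class and method names are hypothetical): while the PE computes with the currently active weights, the next weight set is loaded into a shadow register, so the replacement itself amounts to a single-cycle register swap.

```python
# Behavioral sketch of the single-cycle weight swap via a pre-fetch register
# (illustrative only; the real design is RTL, and this models one PE lane).

class WeightStationaryPE:
    def __init__(self):
        self.active = 0     # weight currently used by the MAC datapath
        self.shadow = 0     # pre-fetched next weight

    def prefetch(self, next_weight):
        """Load the next weight in the background, hidden behind computation."""
        self.shadow = next_weight

    def swap(self):
        """Replace the stationary weight in a single cycle."""
        self.active = self.shadow

    def mac(self, ia, psum):
        """One multiply-accumulate with the stationary weight."""
        return psum + self.active * ia

pe = WeightStationaryPE()
pe.prefetch(3)
pe.swap()                     # weight 3 becomes active
acc = 0
for ia in (1, 2, 4):          # reuse the stationary weight across IA values
    acc = pe.mac(ia, acc)     # meanwhile, the next weight can be pre-fetched
pe.prefetch(5)
pe.swap()                     # single-cycle replacement, no pipeline stall
print(acc)                    # 3 * (1 + 2 + 4) = 21
```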
RADAR supports an 8-bit quantized ResNet-18 model, which requires additional operations such as quantization and dequantization.
In the MAC processor, OAs are generated periodically. To ensure efficiency in terms
on the layer configuration, they are passed through additional operation modules before
being stored in an OA buffer to be utilized in the next CONV layer. For example, if
BN and ReLU operations are required before the next CONV layer, the output from the
MAC processor is immediately passed through the miscellaneous processor's DeQuant-BN-ReLU-Quant
modules in a sequential pipelined manner. The resulting output is then stored in the
OA buffer within the buffer array. To support the proposed reordering, the last OA
tile is stored in DB_IA after traversing non-CONV modules, ensuring its availability
for the following CONV or E/W addition.
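A minimal numerical sketch of this DeQuant-BN-ReLU-Quant path is given below (illustrative only; it assumes the common linear INT8 scheme of [22] with a zero-point of 0, and all scales and BN parameters are made-up example values rather than parameters of the actual design):

```python
import numpy as np

# Minimal sketch of the DeQuant-BN-ReLU-Quant path applied to MAC-processor
# outputs before they are written to the OA buffer (illustrative only; all
# scales and BN parameters below are made-up examples).

def dequant(psum_int32, ia_scale, w_scale):
    """INT32 PSUM back to floating point using the IA and weight scales."""
    return psum_int32.astype(np.float32) * (ia_scale * w_scale)

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def quant(x, oa_scale):
    """Floating point back to signed INT8 for the next CONV layer."""
    return np.clip(np.round(x / oa_scale), -128, 127).astype(np.int8)

# Example: a few PSUMs of one output channel leaving the adder tree.
psum = np.array([1200, -300, 4500], dtype=np.int32)
oa = quant(np.maximum(batch_norm(dequant(psum, ia_scale=0.02, w_scale=0.01),
                                 gamma=1.0, beta=0.1, mean=0.0, var=1.0),
                      0.0),                        # ReLU
           oa_scale=0.05)
print(oa)   # INT8 activations stored in the OA buffer, e.g., [ 7  1 20]
```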
V. EXPERIMENTAL RESULTS
1. Experimental Environments
RADAR was implemented on a Xilinx ZCU102 FPGA board using Xilinx Vivado 23.1 and Vitis
23.1 tools. The power consumption of RADAR was measured by the Vivado power estimator
tool. The commonly used linear INT8 post-training quantization [22] and batch normalization folding [24] techniques were applied, and the quantization experiments were conducted on an NVIDIA RTX 3090 GPU.
2. Ablation Study
Table 1 lists the number of DRAM accesses and effective OCM utilization on a 625 KiB OCM
when the proposed data-aware reordering and shared buffer techniques are applied separately
to ResNet-18. The results indicate that while the reordering technique alone does
not improve effective OCM utilization, it significantly reduces DRAM accesses by approximately
22% relative to the baseline. In contrast, applying only the shared buffer technique reduces
DRAM accesses by about 9% compared to the baseline, but effectively improves OCM utilization
by approximately 22.6%. Ultimately, applying both methods achieved an approximately
33% reduction in DRAM access and an approximately 29.2% improvement in OCM utilization
compared to the baseline. The combined use of both techniques leads to a greater improvement
than either alone because the shared buffer increases the proportion of valid data when
transitioning between layer operations.
Table 1. Ablation studies for each proposed method.
3. Performance Comparison
Table 2 presents a comparison of the proposed RADAR with existing accelerators in terms of
performance and resource usage, where the information not provided in prior research
is denoted by a dashed line (-). RADAR achieved the highest throughput (GOPS) as well
as the highest power efficiency (GOPS/W) and OCM efficiency (GOPS/KiB) compared with
previous studies. These metrics are particularly relevant indicators for mobile/edge
devices with limited hardware resources and batteries. In terms of throughput, RADAR
achieves approximately 17% higher throughput than the previous highest result presented
in [17], while using significantly fewer hardware resources such as OCM, LUTs, and DSPs.
Consequently, it achieves 2.1 times higher OCM efficiency compared to [17]. In the case of [15], aggressive quantization using 2-bit weights and 8-bit activations enables high OCM
efficiency and relatively high throughput with minimal hardware resources among prior
studies, but this approach comes at the cost of a 2.9% drop in accuracy. Overall,
RADAR achieves the highest power efficiency and OCM efficiency compared to prior
studies by maximizing data reuse in OCM through the proposed reordering of processing
sequences and shared buffer techniques.
Table 2. Performance comparison with previous works.
VI. CONCLUSION
In this study, we designed an energy-efficient ResNet-18 accelerator, RADAR, optimized
for DRAM access and OCM utilization, which are the most important factors in CNN accelerator
implementation, by proposing a data-aware reordering of processing sequences and a shared
buffer scheme that considers skip connections. We anticipate that the proposed design will
accelerate the commercialization of CNN inference for various mobile/edge devices.
ACKNOWLEDGMENTS
This research was supported by Seoul National University of Science and Technology.
References
K. He, X. Zhang, S. Ren, and J. Sun, ``Deep residual learning for image recognition,''
Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778,
2016.

M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, ``MobileNetV2: Inverted
residuals and linear bottlenecks,'' Proc. of IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 4510-4520, 2018.

M. Tan and Q. Le, ``EfficientNet: Rethinking model scaling for convolutional neural
networks,'' Proc. of International Conference on Machine Learning, pp. 6105-6114,
2019.

S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, ``Aggregated residual transformations
for deep neural networks,'' Proc. of IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 1492-1500, 2017.

H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, and Z. Zhang, ``ResNeSt: Split-attention
networks,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 2736-2746, 2022.

N. J. Kim, J. Lee, and H. Kim, ``HyQ: Hardware-friendly post-training quantization
for CNN-transformer hybrid networks,'' Proc. of International Joint Conference on
Artificial Intelligence (IJCAI), pp. 4291-4299, 2024.

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner,
M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, ``An
image is worth $16\times16$ words: Transformers for image recognition at scale,''
arXiv preprint arXiv:2010.11929, 2020.

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, ``SSD:
Single shot multibox detector,'' Proc. of European Conference on Computer Vision,
pp. 21-37, 2016.

Y. Cai, Z. Wang, Z. Luo, B. Yin, A. Du, H. Wang, X. Zhang, X. Zhou, E. Zhou, and J.
Sun, ``Learning delicate local representations for multi-person pose estimation,''
Proc. of European Conference on Computer Vision, pp. 455-472, 2020.

S. I. Lee and H. Kim, ``GaussianMask: Uncertainty-aware instance segmentation based
on Gaussian modeling,'' Proc. of International Conference on Pattern Recognition (ICPR),
pp. 3851-3857, 2022.

R. Singh and S. S. Gill, ``Edge AI: A survey,'' Internet of Things and Cyber-Physical
Systems, vol. 3, pp. 71-92, 2023.

D. T. Nguyen, T. N. Nguyen, H. Kim, and H.-J. Lee, ``A high-throughput and power-efficient
FPGA implementation of YOLO CNN for object detection,'' IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 27, no. 8, pp. 1861-1873, 2019.

Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, ``Optimizing the convolution operation
to accelerate deep neural networks on FPGA,'' IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 26, no. 7, pp. 1354-1367, 2018.

R. Zhao, H.-C. Ng, W. Luk, and X. Niu, ``Towards efficient convolutional neural network
for domain-specific applications on FPGA,'' in Proc. of International Conference on
Field Programmable Logic and Applications (FPL), pp. 147-1477, 2018.

Y. Chen, K. Zhang, C. Gong, C. Hao, X. Zhang, and T. Li, ``T-DLA: An open-source deep
learning accelerator for ternarized DNN models on embedded FPGA,'' Proc. of IEEE Computer
Society Annual Symposium on VLSI (ISVLSI), pp. 13-18, 2019.

Q. Xiao and Y. Liang, ``Zac: Towards automatic optimization and deployment of quantized
deep neural networks on embedded devices,'' Proc. of IEEE/ACM International Conference
on Computer-Aided Design (ICCAD), pp. 1-6, 2019.

S. Kala and S. Nalesh, ``Efficient CNN accelerator on FPGA,'' IETE Journal of Research,
vol. 66, no. 6, pp. 733-740, 2020.

J. Wen, Y. Ma, and Z. Wang, ``An efficient FPGA accelerator optimized for high throughput
sparse CNN inference,'' Proc. of IEEE Asia Pacific Conference on Circuits and Systems
(APCCAS), pp. 165-168, 2020.

X. Xie, J. Lin, Z. Wang, and J. Wei, ``An efficient and flexible accelerator design
for sparse convolutional neural networks,'' IEEE Transactions on Circuits and Systems
I: Regular Papers, vol. 68, no. 7, pp. 2936-2949, 2021.

Y. Meng, C. Yang, S. Xiang, J. Wang, K. Mei, and L. Geng, ``An efficient CNN accelerator
achieving high PE utilization using a dense-/sparse-aware redundancy reduction method
and data-index decoupling workflow,'' IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 31, no. 10, pp. 1537-1550, 2023.

O. Weng, G. Marcano, V. Loncar, A. Khodamoradi, G. Abarajithan, N. Sheybani, A. Meza,
F. Koushanfar, K. Denolf, J. M. Duarte, and R. Kastner, ``Tailor: Altering skip connections
for resource-efficient inference,'' ACM Transactions on Reconfigurable Technology
and Systems, vol. 17, no. 1, pp. 1-23, 2024.

M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos, and T. Blankevoort, ``Up or down?
Adaptive rounding for post-training quantization,'' Proc. of International Conference
on Machine Learning, pp. 7197-7206, 2020.

D. T. Nguyen, H. Kim, and H.-J. Lee, ``Layer-specific optimization for mixed data
flow with mixed precision in FPGA design for CNN-based object detectors,'' IEEE Transactions
on Circuits and Systems for Video Technology, vol. 31, no. 6, pp. 2450-2464, 2020.

B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, and A. Howard, ``Quantization and training
of neural networks for efficient integer-arithmetic-only inference,'' Proc. of IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2704-2713, 2018.

S. Ki, J. Park, and H. Kim, ``Dedicated FPGA Implementation of the Gaussian TinyYOLOv3
Accelerator,'' IEEE Transactions on Circuits and Systems II: Express Briefs, vol.
70, no. 10, pp. 3882-3886, 2023.

Juntae Park received his B.S. and M.S. degrees in electrical and information engineering
from the Seoul National University of Science and Technology, Seoul, Korea, in 2023.
His research interests include the areas of efficient hardware accelerator design
for deep neural networks and computer architecture.
Dahun Choi is a Ph.D. student in electrical and information engineering at the
Seoul National University of Science and Technology. He received an M.S. degree in
electrical and information engineering from the Seoul National University of Science
and Technology, Seoul, Korea, in 2022. His research interests include the areas of
network quantization and efficient network design for deep neural networks.
Hyun Kim received his B.S., M.S., and Ph.D. degrees in electrical engineering and
computer science from Seoul National University, Seoul, Korea, in 2009, 2011, and
2015, respectively. From 2015 to 2018, he was with the BK21 Creative Research Engineer
Development for IT, Seoul National University, Seoul, Korea, as a BK Assistant Professor.
In 2018, he joined the Department of Electrical and Information Engineering,
Seoul National University of Science and Technology, Seoul, Korea, where he is currently
serving as an Associate Professor. His research results have been internationally
acclaimed. His research interests are the areas of algorithm, computer architecture,
memory system design, and digital system (SoC) design for low-complexity multimedia
applications and deep neural networks.