(Juhyoung Lee)
1†
(Sungpill Choi)
1
(Jinmook Lee)
1
(Sanghoon Kang)
1
(Hoi-Jun Yoo)
1
-
(School of EE, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index Terms
Optical flow estimation, action recognition, tile-based processing, early termination, real-time mobile user interface, high-throughput ASIC
I. INTRODUCTION
Recently, action recognition has been highlighted as a next-generation user interface
in mobile devices such as pet robots for human-computer interaction (HCI)[1,2]. The most important thing in action recognition on mobile devices is the high recognition
accuracy, thus several previous works[3,4] used deep neural networks with RGB images as their input. However, in order to achieve
more robust action recognition performance on illumination changes and clutters in
the background, we need motion information as well as RGB images[5]. For example, action recognition accuracy is enhanced by about 14% when the motion
information is used along with RGB images on a UCF-101 dataset[6].
In order to obtain the motion information, optical flow estimation (OFE) is broadly
used. It is an algorithm which calculates a set of vectors which contain an object
or pattern’s movement in the image sequence. Fig. 1(a) shows the overall algorithm of the Brox Flow[7,8], which is an OFE algorithm that increases the action recognition accuracy. The OFE
consists of two stages: 1) Scale space generation and 2) Flow generation. First, in
the scale space generation stage, it generates multiple images in different scales
(Gaussian Pyramid) to achieve scale invariance. Secondly, in the flow generation stage,
conjugate gradient descent (CGD) algorithm estimates the final optical flow result
by minimizing a cost function that is calculated by comparing the previous frame with
the current frame for every scale in the pyramid.
Fig. 1. (a) The algorithm flow of the optical flow estimation (OFE), (b) External
memory bandwidth, computation workload breakdown of the OFE.
However, it is difficult to implement real-time OFE in mobile devices due to massive
amount of external memory access and OFE’s huge computational cost.
As shown in Fig. 1(b), for real-time (30 fps) OFE with QVGA (320×240) images, 18.1 GB/s external memory
bandwidth and 35.7 GFLOPS of computational throughput are required. Over 99.3% of
the external memory bandwidth and 85% of the computation are occupied by the CGD iterations
in flow generation stage. It is due to the CGD, which performs a lot of large (over
320 × 240 for QVGA image) matrix multiplications for every iteration.
In this paper, we propose a real-time OFE processor with two key features. 1) Tile-based
hierarchical optical flow estimation algorithm (THOFE) is proposed to mitigate external
memory bandwidth requirement. The THOFE divides input images into several tiles and
reuses the intermediate data on-chip during the CGD as much as possible. Thus, it
minimizes the required external memory bandwidth to 175.8 MB/s and required on-chip
memory capacity from 4 MB to 326.4 KB.
2) Background detection unit with early termination algorithm is proposed to reduce
the required computational cost by omitting redundant image patches with zero optical
flow, thus increasing the computational throughput by 50.7%.
From the two features, the proposed hardware achieves 99.4 fps OFE with only 175.8
MB/s external memory bandwidth. Compared to the previous work[9], the selection of optimal tile size and increased parallelism helped improve the
OFE throughput by 116%.
II. PROPOSED TILE-BASED HIERARCHICAL OFE
1. Basic Concept & Description of the Proposed OFE
The basic concept of the proposed tile-based hierarchical OFE algorithm is to divide
the input image to multiple smaller image tiles to reduce external memory access (EMA).
The CGD iterations, which is performed in the flow generation stage, occupy the largest
portion of EMA. It is because, the CGD includes many iterations of matrix multiplication
(over 320 × 240 × 26 dimension for QVGA image), whose dimension is too large to store
the entire elements on-chip. In addition, the EMA requirement of the CGD is proportional
to the size of the input image. The conventional OFE[8] performs the CGD for the entire image at once. However, it is not efficient to accelerate
the CGD for the entire image in mobile devices which have limited memory capacity.
Even though integration of larger on-chip memory might help reduce the EMA, over 4
MB of on-chip memory capacity (for QVGA image) is required, which is not feasible
for low power consumption. Therefore, if we can reduce the input image size of the
CGD, we will be able to reduce the EMA without the need for large onchip memory size.
Fig. 2 shows the proposed tile-based OFE algorithm. It divides an input image into several
tiles, and performs the CGD in a separate manner. In other words, the only data which
must be located in the on-chip memory during the CGD iterations are from the current
processing tile. Therefore, the required memory size of the proposed tile-based hierarchical
OFE is proportional to the tile size, which is much smaller than the whole input image
size, to which the conventional method’s memory size was proportional. This tile division
method is performed in both spatially (different tiles processed in the different
cores) and temporally (different tiles processed at the different time). After CGD
iterations are performed on each tile, the generated optical flow images from different
tiles are collected in a final optical flow image. In addition, the number of tile
division is determined by the size of each scale space image. In other words, the
tile division is utilized only for lower layers of the scale space which has a larger
image size.
Fig. 2. The overall flow of proposed tile-based hierarchical optical flow estimation.
However, due to the loss of information near tile boundaries, accuracy degradation
must occur in tilebased hierarchical OFE. For example, the target motion vector which
lies across different tiles cannot be estimated with a single tile. Therefore, tile
overlapping is applied to prevent the degradation in the proposed OFE. By setting
the overlapping region length to include motion vectors near the tile boundaries,
we can avoid accuracy degradation. An analysis of proper overlapping length will be
shown in Section II-B.
The proposed method requires a smaller on-chip memory size but a larger computation
workload due to duplicated computations in the overlapping regions. This trade-off
relationship is directly affected by the tile size. The smaller the tile size, the
smaller the required memory the larger computation overhead. The analysis of this
trade-off and proper tile size will be also shown in Section II-B.
There are three advantages of the proposed method. First, the required memory size
is reduced by a ratio of (tile size)/(image size) compared to the conventional algorithm[8]. Second, the proposed method shows scalability because the tile size is not affected
by the image size. Even though the input images become larger, the proposed method
can accelerate the OFE using the same on-chip memory whereas the conventional method
requires larger on-chip memory. Finally, a parallel processing inside the input image
is available because there is no data dependency among different tiles.
2. Analysis on Tile-based Hierarchical OFE
Fig. 3(a) shows the endpoint error difference of the conventional method and the proposed tile-based
method over different overlapping lengths. The endpoint error (EPE) is measured around
tile boundaries with the Middlebury dataset[10], and is defined by (1),
Fig. 3. (a) The endpoint error difference between original OFE algorithm and proposed
tile-based OFE algorithm, (b) The required memory and workload analysis by sweeping
tile size L, (c) The product of required memory and workload of several tile sizes.
where ($u$, $v$) is the estimated optical flow vector and ($u_{GT}$ , $v_{GT}$) is
the ground truth optical flow vector.
For the overlapping length larger than 5 pixels, the EPE around tile boundaries of
the tiling method converges to the lowest point. Therefore, the overlapping length
is set to 5 pixels.
Fig. 3(b) shows the computation workload and the required memory over different tile sizes
with the overlapping length set as 5 pixels. Fig. 3(c) shows the (required memory × workload) over different tile sizes. In these analyses,
the square-shaped tile is used. The overlapping region of length 5 is concatenated
to the 4 sides of each tile. The results show that setting tile size as 20 generates
the lowest (required memory × workload).
Fig. 4(a)-(f) show the proposed tile-based OFE results, with the overlapping length as 5 and the
tile size as 20, in comparison with the conventional OFE method. The results show
that there is only a 0.05 EPE difference between the conventional method and the proposed
tilebased approach.
Fig. 4. (a) Ground truth optical flow, (b) Conventional algorithm’s optical flow,
(c) Tile-based algorithm’s optical flow (d) RGB image, (e) Endpoint error map of conventional
algorithm, (f) Endpoint error map of tile-based.
III. OVERALL ARCHITECTURE
Fig. 5 illustrates the overall architecture of the proposed processor. It consists of the
16 tile CGD cores, 1-D SIMD core, scale space core, and top RISC controller. The tile
CGD cores are integrated to accelerate the CGD while exploiting the parallelism of
the proposed tile-based hierarchical OFE. The 1-D SIMD core is integrated to accelerate
the matrix computations which are utilized as generating the parameters for the CGD.
The scale space core accelerates the image filtering which generates the scale spaces.
The top RISC controller is used for the communication among every component in this
processor through network-on-chip (NoC).
Fig. 5. The overall architecture of the proposed OFE processor.
Each tile CGD core consists of a tile CGD unit, a background detection unit (BDU),
a 20.4 KB tile CGD memory, and a core controller. The tile CGD unit, which includes
14 5-way parallel multipliers, 16 5-way parallel adders, and an adder tree. The BDU
performs background detection for early termination.
The 1-D SIMD core consists of an element-wise multiplication engine, an element-wise
addition engine, and a 32 KB DMEM. The scale space core has a convolution engine for
filtering process of scale space generation and a 32 KB DMEM.
The overall flow of the tile-based hierarchical OFE on the proposed processor is as
follows: 1) The scale space core generates multiple scales of the input image. 2)
The 1-D SIMD core generates parameters for the CGD operation using the generated images.
3) Each tile CGD core operates in a parallel manner to generate an optical flow for
each tile with the generated parameters. 4) After the optical flow tiles are collected,
it finally forms an entire optical flow image.
The tile size and parallelism are the two major differences between the proposed processor
and the previous work [9][9]. In [9][9], the processor was optimized for 90 × 70 tile size and took shared memory architecture.
In [9][9], data of one tile were stored in the shared memory and were accessed by two CGD cores.
As a result, 328 KB of CGD memory was required, with 9.12 GOPS of computation overhead
[9][9]. On the other hand, the proposed processor is optimized for 20 × 20 tile size, which
shows the lowest (required memory × workload). Since the memory requirement for processing
a single tile has been greatly reduced, several tiles can be processed simultaneously
while using the similar sized memory to [9][9]. Therefore, the memory of each tile is allocated locally to each core, and 16 cores
are integrated to operate in parallel. As a result, external memory access of 1.74
MB / frame is required for real-time OFE using 326 KB of CGD memory. Even though the
computation overhead of 93.3 GOPS occurs, the throughput increases 2.16 times due
to the parallelism of 16 CGD cores.
IV. PROPOSED BACKGROUND DECISION UNIT WITH EARLY TERMINATION
In the optical flow results, the pixels of zeros do not have any information and they
may correspond to a background of the scene. Especially in the case of action recognition,
there are numerous pixels with zero motion vector. The OFE result with UCF-101 dataset[11] reveals that average 78.5 % pixels of a scene have zero optical flow. Therefore,
CGD throughput can be increased by terminating the computations early for zero optical
flow pixels.
Fig. 6 shows the flow of the proposed early termination scheme. The CGD processing with
the proposed scheme is divided into 2 phases. During phase 1, CGD iterations are performed
through the whole image. Then, the background detection may be performed following
phase 1 by thresholding the magnitude of the optical flow vector. Thus, the pixels
with small optical flow vector are abandoned. After the background detection, CGD
iterations are performed without the abandoned pixels during phase 2. Early termination
of the CGD process accomplishes 50.7 % throughput enhancement than the conventional
method with 0.7% misprediction rate in the UCF-101 test dataset.
Fig. 6. The overall flow of the early termination scheme.
After the background detection, the addresses of nonterminated pixels must be saved
for subsequent iterations. Non-terminated pixel’s addresses are stored efficiently
utilizing only the distance between non-terminated pixels. In the tile CGD cores,
distance map registers of total 320 bytes are integrated to store.
Fig. 7(a) explains the block diagram of the background detector. It updates the distance map
register through thresholding the magnitude of each region’s optical flow vectors
(du, dv). If the optical flow of the selected region is smaller than the threshold,
the background detector increments the stored distance value. If not, the background
detector stores the distance counter value into the distance map and resets the distance
counter to 0 for the next region. Fig. 7(b) shows the operation of the tile CGD core for early termination using the stored distance
map. The tile CGD core looks-up the distance map register and fetches the non-terminated
pixels’ optical flow vectors from the tile CGD memory. And it feeds the vector into
the tile CGD unit. If the distance map value is 0, the operation of the corresponding
row is terminated and proceeds to the next row.
Fig. 7. (a) The background detector, (b) the operation on tile CGD core.
V. IMPLEMENTATION RESULTS
The proposed OFE processor shown in Fig. 8 is simulated using 65 nm CMOS technology and it occupies 12.8 mm2 die area. The processor can operate at a maximum 200 MHz clock frequency with 1.2
V supply, and it consumes 278.4 mW power. It achieves 99.4 fps throughput for QVGA
resolution (320 × 240) at 200 MHz clock frequency. Table 1 shows the comparison table with previous OFE hardware implementations. The proposed
tile-based OFE processor shows 116 % higher throughput than [9][9]. This is because the proposed processor can exploit higher parallelism with the same
memory size by selecting a more optimal tile size than [9][9]. The proposed tile-based hierarchical OFE processor fulfills the real-time constraint
despite the fact that the OFE is performed for all pixels (100 % density). [13][13] can also estimate a 100% density optical flow map, but the proposed algorithm shows
a much lower AAE than [13][13]. Fig. 9 shows the constant current injection capability of DCG. It can inject constant amplitudes
of 100 kHz pseudo-sine currents, up to kΩs of load condition.
Fig. 8. The layout photograph and performance summary.
Table 1. Comparison Table
|
[13][13] (2014)
|
[12][12] (2015)
|
[9][9] (2018)
|
This work
|
Processor/Process
|
FPGA
(Xylinx V7)
|
FPGA
(Xylinx V6)
|
ASIC/65mn
CMOS
|
ASIC/65mn
CMOS
|
Resolution
|
Full HD
(1920x1080)
|
SVGA
(800x600)
|
QVGA
(320x240)
|
QVGA
(320x240)
|
Power
|
12.22W
|
x
|
437mW
|
278.4mW
|
Frequency
|
84MHz
|
94MHz
|
200MHz
|
200MHz
|
Throughput
|
175 fps
|
196 fps
|
46.1 fps
|
99.4 fps
|
AAEa
|
11.85/16.51°
|
3.75°
|
4.53°
|
4.64°
|
Densityb
|
59.04/100 %
|
35.24 %
|
100 %
|
100 %
|
Algorithm
|
H & S
|
L & K
|
Tile-based processing
|
Tile-based processing
|
a Average angular error measured on the Yosemite sequence without clouds.
b Density of pixels computed across all the pixels of the Yosemite sequence without
clouds.
Fig. 9. (a) Estimated optical flow images, (b) Action recognition accuracy of proposed
optical flow and baseline [7][7] (@ 1000 randomly sampled UCF-101 test dataset).
Fig. 9(a) indicates the estimated optical flow images from the UCF-101 dataset[11]. Fig. 9(b) indicates the comparison of action recognition accuracy utilizing the LRCN network[6] with the optical flow maps of the proposed OFE and the baseline[7]. Randomly selected 1000 videos from UCF-101 were utilized for the test dataset, and
the proposed OFE shows only 0.8% accuracy drop.
VI. CONCLUSIONS
In this paper, we implemented and simulated a tilebased OFE processor. It can support
the OFE for action recognition in mobile devices in real-time with 2 key features:
1) The tile-based hierarchical OFE to reduce external memory access efficiently with
a limited on-chip memory, 2) the background detection unit with early termination
to increase the throughput by 50.7 %. With these features, the proposed OFE processor
achieves 99.4 fps throughput for QVGA resolution (320 × 240) at 200 MHz core frequency,
showing 278.4 mW power consumption, and satisfying the real-time constraint.
ACKNOWLEDGMENTS
Manuscript submitted October 5. This work was supported by Institute for Information
& communications Technology Promotion(IITP) grant funded by the Korea government(MSIP)
(No. 2016-0-00207, Intelligent Processor Architectures and Application Softwares for
CNN (Convolutional Neural Network)-RNN (Recurrent Neural Network)).
REFERENCES
Chrungoo A., Manimaran S. S., Ravindran B., 2014, Activity Recognition for Natural
Human Robot Interaction, ICSR 2014. Lecture Notes in Computer Science, Springer, Vol.
8755
Rezazadegan Fahimeh, Shirazi Sareh, Upcrofit Ben, Milford Michael, 2017, Action recognition:
From static datasets to moving robots, In Proceedings of IEEE International Conference
on Robotics and Automation, pp. 3185-3191
Ji S., Xu W., Yang M., Yu K., 2013, 3D convolutional neural networks for human action
recognition, IEEE PAMI, Vol. 35, No. 1, pp. 221-231
Karpathy A., Toderici G., Shetty S., Leung T., Sukthankar R., Fei-Fei L., 2014, Large-scale
video classification with convolutional neural networks, In Proc. CVPR
Sevilla-Lara L., Liao Y., Guney F., Jampani V., Geiger A., Black M. J., 2017, On the
integration of optical flow and action recognition, arXiv preprint arXiv:1712.08416
Donahue Jeff, Hendricks Lisa Anne, Rohrbach Marcus, Venugopalan Subhashini, Guadarrama
Sergio, Saenko Kate, Darrell Trevor, 2015, Long-Term Recurrent Convolutional Networks
for Visual Recognition and Description, IEEE Conference on Computer Vision and Pattern
Recognition (CVPR)
Brox T., Bruhn A., Papenberg N., Weickert J., 2004, High accuracy optical flow estimation
based on a theory for warping, In ECCV
Liu C., 2009, Beyond Pixels: Exploring New Representations and Applications for Motion
Analysis, PhD thesis, MIT
Lee J., Kim C., Choi S., Shin D., Kang S., Yoo H., 2018, A 46.1 fps Global Matching
Optical Flow Estimation Processor for Action Recognition in Mobile Devices, 2018 IEEE
International Symposium on Circuits and Systems (ISCAS), Florence, Italy, pp. 1-5
Baker S., Scharstein D., Lewis J., Roth S., Black M. J., Szeliski R., 2011, A database
and evaluation methodology for optical flow, International Journal of Computer Vision,
Vol. 92, No. 1, pp. 1-31
Soomro K., Zamir A. R., Shah M., 2012, UCF101: A dataset of 101 human actions classes
from videos in the wild, CRCV-TR-12-01, Tech. Rep.
Seong H., Rhee C., Lee H., 2015, A novel hardware architecture of the lucas-kanade
optical flow for reduced frame memory access, Circuits and Systems for Video Technology,
IEEE Transactions on, Vol. PP, No. 99, pp. 1-1
Komorkiewicz M., Kryjak T., Gorgoń M., 2014, Efficient hardware implementation of
the Horn–Schunck algorithm for high-resolution real-time dense optical flow sensor,
Sensors, 1424-8220, Vol. 14, No. 2, pp. 2860-2891
Author
is a master student at the Korea Advanced Institute of Science and Technology (KAIST).
His current research interests include energy-efficient multicore architectures and
systems especially focused on artificial intelligence and computer vision fields.
Lee received a BS in electrical engineering from the KAIST.
He is a student member of IEEE.
Contact him at juhyoung@kaist.ac.kr.
(S'13 – M’15) received the B.S. and M.S. degrees in Department of Electrical Engineering
from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea
in 2012 and 2014, respectively.
He is currently working toward the Ph.D. degree in the same department at KAIST.
His research interests include the low-power and memory-efficient architecture design
for mobile vision SoC especially object segmentation, stereoscopic and deep learning.
(S’15) received the B.S. degrees in electrical engineering from Hanyang University,
Seoul, Korea in 2014, and the M.S. degree in electrical engineering from the Korea
Advanced Institute of Science and Technology, Daejeon, South Korea, in 2016, where
he is currently pursuing the Ph. D. degree.
His research interests include energy-efficient deep learning inference/training accelerator
ASIC design, embedded deep learning platform design and verification, embedded system
development with FPGA programming, and deep learning algorithm for sequence recognition.
Affiliations: KAIST (Korea Advanced Institute of Science and Technology)
Postal Address: #1233, Dept. of Electrical Engineering, KAIST, 335 Gwahangno, Yuseong-gu,
Daejeon 305-701, Republic of Korea
Phone: +82-42-350-5468
Fax: +82-42-350-3410
(S’16) received the B.S. and M.S. degrees with the School of Electrical Engineering,
Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2016.
He is currently pursuing the Ph.D. degree with the School of Electrical Engineering.
His current research interests include low-power vision system-on-chip design and
deep learning processor design.