Mobile QR Code QR CODE

Main Menu

The Journal of Semiconductor Technology and Science (JSTS) is an international, peer-reviewed, and open-access journal that is published bimonthly.
- Scope: semiconductor processes, devices, circuits, and MEMS.
- Editor-in-Chief: Prof. Woo Young Choi (ECE, Seoul National University)
- Indexed within Science Citation Index Expanded (SCIE), SCOPUS, Korea Citation Index (KCI), and other databases.

Journal Search

[

Research article

]

JSTS(Journal of Semiconductor Technology and Science)

IEIE Vol. 19, No. 1, p.116-123

ISSN (print) :

1598-1657

ISSN (online) :

2233-4866

Received : 30 January 2019Revised : 0 0 0Accepted : 31 January 2019

DOI :

https://doi.org/10.5573/JSTS.2019.19.1.116

A 99.4 fps Optical Flow Estimation Processor with Image Tiling for Action Recognition in Mobile Devices

(Juhyoung Lee) ¹^† (Sungpill Choi) ¹ (Jinmook Lee) ¹ (Sanghoon Kang) ¹ (Hoi-Jun Yoo) ¹

(School of EE, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, Korea)

^†Corresponding author, E-mail: juhyoung@kaist.ac.kr

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.(www.theieie.org).

Abstract

A 99.4 fps optical flow estimation (OFE) processor with image tiling is proposed for action recognition in mobile devices. The OFE is essential for the high action recognition accuracy. However, it is unsuitable for real-time constraint in a mobile computing environment because it requires a huge amount of external memory accesses (EMAs) and matrix computations. For mitigating the external memory bandwidth requirement, this paper proposes the tile-based hierarchical OFE. It divides input images into several tiles and enables intermediate data reusing with 326.4 KB on-chip memory and 175.8 MB/s external memory bandwidth. Moreover, a background decision unit with early termination is proposed to reduce computation workload. It gets rid of unnecessary matrix computation by terminates the computation early for zero optical flow region. As a result, the proposed features reduce external memory bandwidth by 99.3 % and increase throughput by 50.7 %, respectively. The proposed 12.8 mm2 OFE processor is implemented in 65 nm CMOS technology, and it achieves the real-time OFE with 99.4 framesper- second (fps) throughput for an image resolution of QVGA (320 × 240).

Index Terms

Optical flow estimation, action recognition, tile-based processing, early termination, real-time mobile user interface, high-throughput ASIC

I. INTRODUCTION

Recently, action recognition has been highlighted as a next-generation user interface in mobile devices such as pet robots for human-computer interaction (HCI)^[1,^2]. The most important thing in action recognition on mobile devices is the high recognition accuracy, thus several previous works^[3,^4] used deep neural networks with RGB images as their input. However, in order to achieve more robust action recognition performance on illumination changes and clutters in the background, we need motion information as well as RGB images^[5]. For example, action recognition accuracy is enhanced by about 14% when the motion information is used along with RGB images on a UCF-101 dataset^[6].

In order to obtain the motion information, optical flow estimation (OFE) is broadly used. It is an algorithm which calculates a set of vectors which contain an object or pattern’s movement in the image sequence. Fig. 1(a) shows the overall algorithm of the Brox Flow^[7,^8], which is an OFE algorithm that increases the action recognition accuracy. The OFE consists of two stages: 1) Scale space generation and 2) Flow generation. First, in the scale space generation stage, it generates multiple images in different scales (Gaussian Pyramid) to achieve scale invariance. Secondly, in the flow generation stage, conjugate gradient descent (CGD) algorithm estimates the final optical flow result by minimizing a cost function that is calculated by comparing the previous frame with the current frame for every scale in the pyramid.

Fig. 1. (a) The algorithm flow of the optical flow estimation (OFE), (b) External memory bandwidth, computation workload breakdown of the OFE.

However, it is difficult to implement real-time OFE in mobile devices due to massive amount of external memory access and OFE’s huge computational cost.

As shown in Fig. 1(b), for real-time (30 fps) OFE with QVGA (320×240) images, 18.1 GB/s external memory bandwidth and 35.7 GFLOPS of computational throughput are required. Over 99.3% of the external memory bandwidth and 85% of the computation are occupied by the CGD iterations in flow generation stage. It is due to the CGD, which performs a lot of large (over 320 × 240 for QVGA image) matrix multiplications for every iteration.

In this paper, we propose a real-time OFE processor with two key features. 1) Tile-based hierarchical optical flow estimation algorithm (THOFE) is proposed to mitigate external memory bandwidth requirement. The THOFE divides input images into several tiles and reuses the intermediate data on-chip during the CGD as much as possible. Thus, it minimizes the required external memory bandwidth to 175.8 MB/s and required on-chip memory capacity from 4 MB to 326.4 KB.

2) Background detection unit with early termination algorithm is proposed to reduce the required computational cost by omitting redundant image patches with zero optical flow, thus increasing the computational throughput by 50.7%.

From the two features, the proposed hardware achieves 99.4 fps OFE with only 175.8 MB/s external memory bandwidth. Compared to the previous work^[9], the selection of optimal tile size and increased parallelism helped improve the OFE throughput by 116%.

II. PROPOSED TILE-BASED HIERARCHICAL OFE

1. Basic Concept & Description of the Proposed OFE

The basic concept of the proposed tile-based hierarchical OFE algorithm is to divide the input image to multiple smaller image tiles to reduce external memory access (EMA).

The CGD iterations, which is performed in the flow generation stage, occupy the largest portion of EMA. It is because, the CGD includes many iterations of matrix multiplication (over 320 × 240 × 26 dimension for QVGA image), whose dimension is too large to store the entire elements on-chip. In addition, the EMA requirement of the CGD is proportional to the size of the input image. The conventional OFE^[8] performs the CGD for the entire image at once. However, it is not efficient to accelerate the CGD for the entire image in mobile devices which have limited memory capacity.

Even though integration of larger on-chip memory might help reduce the EMA, over 4 MB of on-chip memory capacity (for QVGA image) is required, which is not feasible for low power consumption. Therefore, if we can reduce the input image size of the CGD, we will be able to reduce the EMA without the need for large onchip memory size.

Fig. 2 shows the proposed tile-based OFE algorithm. It divides an input image into several tiles, and performs the CGD in a separate manner. In other words, the only data which must be located in the on-chip memory during the CGD iterations are from the current processing tile. Therefore, the required memory size of the proposed tile-based hierarchical OFE is proportional to the tile size, which is much smaller than the whole input image size, to which the conventional method’s memory size was proportional. This tile division method is performed in both spatially (different tiles processed in the different cores) and temporally (different tiles processed at the different time). After CGD iterations are performed on each tile, the generated optical flow images from different tiles are collected in a final optical flow image. In addition, the number of tile division is determined by the size of each scale space image. In other words, the tile division is utilized only for lower layers of the scale space which has a larger image size.

Fig. 2. The overall flow of proposed tile-based hierarchical optical flow estimation.

However, due to the loss of information near tile boundaries, accuracy degradation must occur in tilebased hierarchical OFE. For example, the target motion vector which lies across different tiles cannot be estimated with a single tile. Therefore, tile overlapping is applied to prevent the degradation in the proposed OFE. By setting the overlapping region length to include motion vectors near the tile boundaries, we can avoid accuracy degradation. An analysis of proper overlapping length will be shown in Section II-B.

The proposed method requires a smaller on-chip memory size but a larger computation workload due to duplicated computations in the overlapping regions. This trade-off relationship is directly affected by the tile size. The smaller the tile size, the smaller the required memory the larger computation overhead. The analysis of this trade-off and proper tile size will be also shown in Section II-B.

There are three advantages of the proposed method. First, the required memory size is reduced by a ratio of (tile size)/(image size) compared to the conventional algorithm^[8]. Second, the proposed method shows scalability because the tile size is not affected by the image size. Even though the input images become larger, the proposed method can accelerate the OFE using the same on-chip memory whereas the conventional method requires larger on-chip memory. Finally, a parallel processing inside the input image is available because there is no data dependency among different tiles.

2. Analysis on Tile-based Hierarchical OFE

Fig. 3(a) shows the endpoint error difference of the conventional method and the proposed tile-based method over different overlapping lengths. The endpoint error (EPE) is measured around tile boundaries with the Middlebury dataset^[10], and is defined by (1),

Fig. 3. (a) The endpoint error difference between original OFE algorithm and proposed tile-based OFE algorithm, (b) The required memory and workload analysis by sweeping tile size L, (c) The product of required memory and workload of several tile sizes.

(1)

$\mathrm{EPE}=\sqrt{\left(u-u_{G T}\right)^{2}+\left(v-v_{G T}\right)^{2}}$

where ($u$, $v$) is the estimated optical flow vector and ($u_{GT}$ , $v_{GT}$) is the ground truth optical flow vector.

For the overlapping length larger than 5 pixels, the EPE around tile boundaries of the tiling method converges to the lowest point. Therefore, the overlapping length is set to 5 pixels.

Fig. 3(b) shows the computation workload and the required memory over different tile sizes with the overlapping length set as 5 pixels. Fig. 3(c) shows the (required memory × workload) over different tile sizes. In these analyses, the square-shaped tile is used. The overlapping region of length 5 is concatenated to the 4 sides of each tile. The results show that setting tile size as 20 generates the lowest (required memory × workload).

Fig. 4(a)-(f) show the proposed tile-based OFE results, with the overlapping length as 5 and the tile size as 20, in comparison with the conventional OFE method. The results show that there is only a 0.05 EPE difference between the conventional method and the proposed tilebased approach.

Fig. 4. (a) Ground truth optical flow, (b) Conventional algorithm’s optical flow, (c) Tile-based algorithm’s optical flow (d) RGB image, (e) Endpoint error map of conventional algorithm, (f) Endpoint error map of tile-based.

III. OVERALL ARCHITECTURE

Fig. 5 illustrates the overall architecture of the proposed processor. It consists of the 16 tile CGD cores, 1-D SIMD core, scale space core, and top RISC controller. The tile CGD cores are integrated to accelerate the CGD while exploiting the parallelism of the proposed tile-based hierarchical OFE. The 1-D SIMD core is integrated to accelerate the matrix computations which are utilized as generating the parameters for the CGD. The scale space core accelerates the image filtering which generates the scale spaces. The top RISC controller is used for the communication among every component in this processor through network-on-chip (NoC).

Fig. 5. The overall architecture of the proposed OFE processor.

Each tile CGD core consists of a tile CGD unit, a background detection unit (BDU), a 20.4 KB tile CGD memory, and a core controller. The tile CGD unit, which includes 14 5-way parallel multipliers, 16 5-way parallel adders, and an adder tree. The BDU performs background detection for early termination.

The 1-D SIMD core consists of an element-wise multiplication engine, an element-wise addition engine, and a 32 KB DMEM. The scale space core has a convolution engine for filtering process of scale space generation and a 32 KB DMEM.

The overall flow of the tile-based hierarchical OFE on the proposed processor is as follows: 1) The scale space core generates multiple scales of the input image. 2) The 1-D SIMD core generates parameters for the CGD operation using the generated images. 3) Each tile CGD core operates in a parallel manner to generate an optical flow for each tile with the generated parameters. 4) After the optical flow tiles are collected, it finally forms an entire optical flow image.

The tile size and parallelism are the two major differences between the proposed processor and the previous work [9]^[9]. In [9]^[9], the processor was optimized for 90 × 70 tile size and took shared memory architecture. In [9]^[9], data of one tile were stored in the shared memory and were accessed by two CGD cores. As a result, 328 KB of CGD memory was required, with 9.12 GOPS of computation overhead [9]^[9]. On the other hand, the proposed processor is optimized for 20 × 20 tile size, which shows the lowest (required memory × workload). Since the memory requirement for processing a single tile has been greatly reduced, several tiles can be processed simultaneously while using the similar sized memory to [9]^[9]. Therefore, the memory of each tile is allocated locally to each core, and 16 cores are integrated to operate in parallel. As a result, external memory access of 1.74 MB / frame is required for real-time OFE using 326 KB of CGD memory. Even though the computation overhead of 93.3 GOPS occurs, the throughput increases 2.16 times due to the parallelism of 16 CGD cores.

IV. PROPOSED BACKGROUND DECISION UNIT WITH EARLY TERMINATION

In the optical flow results, the pixels of zeros do not have any information and they may correspond to a background of the scene. Especially in the case of action recognition, there are numerous pixels with zero motion vector. The OFE result with UCF-101 dataset^[11] reveals that average 78.5 % pixels of a scene have zero optical flow. Therefore, CGD throughput can be increased by terminating the computations early for zero optical flow pixels.

Fig. 6 shows the flow of the proposed early termination scheme. The CGD processing with the proposed scheme is divided into 2 phases. During phase 1, CGD iterations are performed through the whole image. Then, the background detection may be performed following phase 1 by thresholding the magnitude of the optical flow vector. Thus, the pixels with small optical flow vector are abandoned. After the background detection, CGD iterations are performed without the abandoned pixels during phase 2. Early termination of the CGD process accomplishes 50.7 % throughput enhancement than the conventional method with 0.7% misprediction rate in the UCF-101 test dataset.

Fig. 6. The overall flow of the early termination scheme.

After the background detection, the addresses of nonterminated pixels must be saved for subsequent iterations. Non-terminated pixel’s addresses are stored efficiently utilizing only the distance between non-terminated pixels. In the tile CGD cores, distance map registers of total 320 bytes are integrated to store.

Fig. 7(a) explains the block diagram of the background detector. It updates the distance map register through thresholding the magnitude of each region’s optical flow vectors (du, dv). If the optical flow of the selected region is smaller than the threshold, the background detector increments the stored distance value. If not, the background detector stores the distance counter value into the distance map and resets the distance counter to 0 for the next region. Fig. 7(b) shows the operation of the tile CGD core for early termination using the stored distance map. The tile CGD core looks-up the distance map register and fetches the non-terminated pixels’ optical flow vectors from the tile CGD memory. And it feeds the vector into the tile CGD unit. If the distance map value is 0, the operation of the corresponding row is terminated and proceeds to the next row.

Fig. 7. (a) The background detector, (b) the operation on tile CGD core.

V. IMPLEMENTATION RESULTS

The proposed OFE processor shown in Fig. 8 is simulated using 65 nm CMOS technology and it occupies 12.8 mm² die area. The processor can operate at a maximum 200 MHz clock frequency with 1.2 V supply, and it consumes 278.4 mW power. It achieves 99.4 fps throughput for QVGA resolution (320 × 240) at 200 MHz clock frequency. Table 1 shows the comparison table with previous OFE hardware implementations. The proposed tile-based OFE processor shows 116 % higher throughput than [9]^[9]. This is because the proposed processor can exploit higher parallelism with the same memory size by selecting a more optimal tile size than [9]^[9]. The proposed tile-based hierarchical OFE processor fulfills the real-time constraint despite the fact that the OFE is performed for all pixels (100 % density). [13]^[13] can also estimate a 100% density optical flow map, but the proposed algorithm shows a much lower AAE than [13]^[13]. Fig. 9 shows the constant current injection capability of DCG. It can inject constant amplitudes of 100 kHz pseudo-sine currents, up to kΩs of load condition.

Fig. 8. The layout photograph and performance summary.

Table 1. Comparison Table

	[13]^[13] (2014)	[12]^[12] (2015)	[9]^[9] (2018)	This work
Processor/Process	FPGA (Xylinx V7)	FPGA (Xylinx V6)	ASIC/65mn CMOS	ASIC/65mn CMOS
Resolution	Full HD (1920x1080)	SVGA (800x600)	QVGA (320x240)	QVGA (320x240)
Power	12.22W	x	437mW	278.4mW
Frequency	84MHz	94MHz	200MHz	200MHz
Throughput	175 fps	196 fps	46.1 fps	99.4 fps
AAE^a	11.85/16.51°	3.75°	4.53°	4.64°
Density^b	59.04/100 %	35.24 %	100 %	100 %
Algorithm	H & S	L & K	Tile-based processing	Tile-based processing

^a Average angular error measured on the Yosemite sequence without clouds.

^b Density of pixels computed across all the pixels of the Yosemite sequence without clouds.

Fig. 9. (a) Estimated optical flow images, (b) Action recognition accuracy of proposed optical flow and baseline [7]^[7] (@ 1000 randomly sampled UCF-101 test dataset).

Fig. 9(a) indicates the estimated optical flow images from the UCF-101 dataset^[11]. Fig. 9(b) indicates the comparison of action recognition accuracy utilizing the LRCN network^[6] with the optical flow maps of the proposed OFE and the baseline^[7]. Randomly selected 1000 videos from UCF-101 were utilized for the test dataset, and the proposed OFE shows only 0.8% accuracy drop.

VI. CONCLUSIONS

In this paper, we implemented and simulated a tilebased OFE processor. It can support the OFE for action recognition in mobile devices in real-time with 2 key features: 1) The tile-based hierarchical OFE to reduce external memory access efficiently with a limited on-chip memory, 2) the background detection unit with early termination to increase the throughput by 50.7 %. With these features, the proposed OFE processor achieves 99.4 fps throughput for QVGA resolution (320 × 240) at 200 MHz core frequency, showing 278.4 mW power consumption, and satisfying the real-time constraint.

ACKNOWLEDGMENTS

Manuscript submitted October 5. This work was supported by Institute for Information & communications Technology Promotion(IITP) grant funded by the Korea government(MSIP) (No. 2016-0-00207, Intelligent Processor Architectures and Application Softwares for CNN (Convolutional Neural Network)-RNN (Recurrent Neural Network)).

REFERENCES

Chrungoo A., Manimaran S. S., Ravindran B., 2014, Activity Recognition for Natural Human Robot Interaction, ICSR 2014. Lecture Notes in Computer Science, Springer, Vol. 8755

Rezazadegan Fahimeh, Shirazi Sareh, Upcrofit Ben, Milford Michael, 2017, Action recognition: From static datasets to moving robots, In Proceedings of IEEE International Conference on Robotics and Automation, pp. 3185-3191

Ji S., Xu W., Yang M., Yu K., 2013, 3D convolutional neural networks for human action recognition, IEEE PAMI, Vol. 35, No. 1, pp. 221-231

Karpathy A., Toderici G., Shetty S., Leung T., Sukthankar R., Fei-Fei L., 2014, Large-scale video classification with convolutional neural networks, In Proc. CVPR

Sevilla-Lara L., Liao Y., Guney F., Jampani V., Geiger A., Black M. J., 2017, On the integration of optical flow and action recognition, arXiv preprint arXiv:1712.08416

Donahue Jeff, Hendricks Lisa Anne, Rohrbach Marcus, Venugopalan Subhashini, Guadarrama Sergio, Saenko Kate, Darrell Trevor, 2015, Long-Term Recurrent Convolutional Networks for Visual Recognition and Description, IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Brox T., Bruhn A., Papenberg N., Weickert J., 2004, High accuracy optical flow estimation based on a theory for warping, In ECCV

Liu C., 2009, Beyond Pixels: Exploring New Representations and Applications for Motion Analysis, PhD thesis, MIT

Lee J., Kim C., Choi S., Shin D., Kang S., Yoo H., 2018, A 46.1 fps Global Matching Optical Flow Estimation Processor for Action Recognition in Mobile Devices, 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, pp. 1-5

Baker S., Scharstein D., Lewis J., Roth S., Black M. J., Szeliski R., 2011, A database and evaluation methodology for optical flow, International Journal of Computer Vision, Vol. 92, No. 1, pp. 1-31

Soomro K., Zamir A. R., Shah M., 2012, UCF101: A dataset of 101 human actions classes from videos in the wild, CRCV-TR-12-01, Tech. Rep.

Seong H., Rhee C., Lee H., 2015, A novel hardware architecture of the lucas-kanade optical flow for reduced frame memory access, Circuits and Systems for Video Technology, IEEE Transactions on, Vol. PP, No. 99, pp. 1-1

Komorkiewicz M., Kryjak T., Gorgoń M., 2014, Efficient hardware implementation of the Horn–Schunck algorithm for high-resolution real-time dense optical flow sensor, Sensors, 1424-8220, Vol. 14, No. 2, pp. 2860-2891

Author

Juhyoung Lee

is a master student at the Korea Advanced Institute of Science and Technology (KAIST).

His current research interests include energy-efficient multicore architectures and systems especially focused on artificial intelligence and computer vision fields.

Lee received a BS in electrical engineering from the KAIST.

He is a student member of IEEE.

Contact him at juhyoung@kaist.ac.kr.

Sungpill Choi

(S'13 – M’15) received the B.S. and M.S. degrees in Department of Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea in 2012 and 2014, respectively.

He is currently working toward the Ph.D. degree in the same department at KAIST.

His research interests include the low-power and memory-efficient architecture design for mobile vision SoC especially object segmentation, stereoscopic and deep learning.

Jinmook Lee

(S’15) received the B.S. degrees in electrical engineering from Hanyang University, Seoul, Korea in 2014, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2016, where he is currently pursuing the Ph. D. degree.

His research interests include energy-efficient deep learning inference/training accelerator ASIC design, embedded deep learning platform design and verification, embedded system development with FPGA programming, and deep learning algorithm for sequence recognition.

Affiliations: KAIST (Korea Advanced Institute of Science and Technology)

Postal Address: #1233, Dept. of Electrical Engineering, KAIST, 335 Gwahangno, Yuseong-gu, Daejeon 305-701, Republic of Korea

Phone: +82-42-350-5468

Fax: +82-42-350-3410

Sanghoon Kang

(S’16) received the B.S. and M.S. degrees with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2016.

He is currently pursuing the Ph.D. degree with the School of Electrical Engineering.

His current research interests include low-power vision system-on-chip design and deep learning processor design.