Juhyoung Lee¹†, Sungpill Choi¹, Jinmook Lee¹, Sanghoon Kang¹, Hoi-Jun Yoo¹

¹ School of EE, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, Korea

Copyright © The Institute of Electronics and Information Engineers (IEIE)
                  
Index Terms
               
               Optical flow estimation, action recognition, tile-based processing, early termination, real-time mobile user interface, high-throughput ASIC
             
            
          
         
            
                  I. INTRODUCTION
Recently, action recognition has been highlighted as a next-generation user interface for mobile devices such as pet robots in human-computer interaction (HCI) [1,2]. High recognition accuracy is the most important requirement for action recognition on mobile devices, so several previous works [3,4] used deep neural networks with RGB images as their input. However, to achieve action recognition that is more robust to illumination changes and background clutter, motion information is needed in addition to RGB images [5]. For example, action recognition accuracy improves by about 14% when motion information is used along with RGB images on the UCF-101 dataset [6].
               
To obtain this motion information, optical flow estimation (OFE) is widely used. OFE calculates a set of vectors that describe an object's or pattern's movement across an image sequence. Fig. 1(a) shows the overall algorithm of the Brox Flow [7,8], an OFE algorithm that increases action recognition accuracy. The OFE consists of two stages: 1) scale space generation and 2) flow generation. First, the scale space generation stage produces multiple images at different scales (a Gaussian pyramid) to achieve scale invariance. Second, in the flow generation stage, the conjugate gradient descent (CGD) algorithm estimates the final optical flow by minimizing a cost function computed by comparing the previous frame with the current frame at every scale of the pyramid.
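The scale space generation stage described above can be sketched in software. The following is a minimal Python sketch, not the paper's hardware implementation; the kernel size, sigma, and downscale ratio are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Separable 1-D Gaussian kernel (size/sigma are illustrative)."""
    x = np.arange(size) - size // 2
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, k):
    """Separable 2-D Gaussian filtering with reflect padding."""
    pad = len(k) // 2
    tmp = np.pad(img, pad, mode="reflect")
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, "valid"), 1, tmp)
    return np.apply_along_axis(lambda c: np.convolve(c, k, "valid"), 0, tmp)

def scale_space(img, levels=4, ratio=0.5):
    """Gaussian pyramid: blur, then subsample, at each level."""
    pyramid = [img.astype(np.float32)]
    k = gaussian_kernel()
    for _ in range(levels - 1):
        blurred = blur(pyramid[-1], k)
        h, w = blurred.shape
        nh, nw = max(1, int(h * ratio)), max(1, int(w * ratio))
        ys = (np.arange(nh) / ratio).astype(int)
        xs = (np.arange(nw) / ratio).astype(int)
        pyramid.append(blurred[np.ix_(ys, xs)])
    return pyramid
```

For a QVGA (320×240) input with 4 levels, this produces images of 320×240, 160×120, 80×60, and 40×30 pixels.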
               
                
                  
                  
                        
                        
Fig. 1. (a) The algorithm flow of the optical flow estimation (OFE), (b) External memory bandwidth and computation workload breakdown of the OFE.
                        
 
                      
                  
               
However, it is difficult to implement real-time OFE in mobile devices due to the massive amount of external memory access and the huge computational cost of OFE.
               
As shown in Fig. 1(b), real-time (30 fps) OFE on QVGA (320×240) images requires 18.1 GB/s of external memory bandwidth and 35.7 GFLOPS of computational throughput. Over 99.3% of the external memory bandwidth and 85% of the computation are occupied by the CGD iterations in the flow generation stage, because the CGD performs many large matrix multiplications (over 320×240 for a QVGA image) in every iteration.
               
In this paper, we propose a real-time OFE processor with two key features. 1) A tile-based hierarchical optical flow estimation algorithm (THOFE) is proposed to mitigate the external memory bandwidth requirement. THOFE divides the input images into several tiles and reuses intermediate data on-chip during the CGD as much as possible. Thus, it reduces the required external memory bandwidth to 175.8 MB/s and the required on-chip memory capacity from 4 MB to 326.4 KB.
               
2) A background detection unit with an early termination algorithm is proposed to reduce the required computational cost by omitting redundant image patches with zero optical flow, thereby increasing the computational throughput by 50.7%.
               
With these two features, the proposed hardware achieves 99.4 fps OFE with only 175.8 MB/s of external memory bandwidth. Compared to the previous work [9], the selection of an optimal tile size and increased parallelism improve the OFE throughput by 116%.
               
             
            
                  II. PROPOSED TILE-BASED HIERARCHICAL OFE
               
1. Basic Concept & Description of the Proposed OFE
The basic concept of the proposed tile-based hierarchical OFE algorithm is to divide the input image into multiple smaller image tiles to reduce external memory access (EMA).
                  
The CGD iterations, performed in the flow generation stage, occupy the largest portion of the EMA. This is because the CGD includes many iterations of matrix multiplication (over 320 × 240 × 26 dimensions for a QVGA image), whose elements are too numerous to store entirely on-chip. In addition, the EMA requirement of the CGD is proportional to the size of the input image. The conventional OFE [8] performs the CGD on the entire image at once, which is not efficient on mobile devices with limited memory capacity.
                  
Even though integrating a larger on-chip memory might help reduce the EMA, over 4 MB of on-chip memory capacity (for a QVGA image) would be required, which is not feasible for low power consumption. Therefore, if we reduce the input image size of the CGD, we can reduce the EMA without the need for a large on-chip memory.
                  
Fig. 2 shows the proposed tile-based OFE algorithm. It divides an input image into several tiles and performs the CGD on each tile separately. In other words, the only data that must reside in the on-chip memory during the CGD iterations belong to the currently processed tile. Therefore, the required memory size of the proposed tile-based hierarchical OFE is proportional to the tile size, which is much smaller than the whole input image size to which the conventional method's memory size was proportional. This tile division is performed both spatially (different tiles processed in different cores) and temporally (different tiles processed at different times). After CGD iterations are performed on each tile, the optical flow results generated from the different tiles are collected into a final optical flow image. In addition, the number of tile divisions is determined by the size of each scale space image; that is, tile division is applied only to the lower layers of the scale space, which have larger image sizes.
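The tile division and collection steps above can be sketched as follows. This is a minimal Python sketch of the data movement only, assuming the tile size of 20 and overlap of 5 selected in Section II-B; the CGD solve itself is outside its scope.

```python
import numpy as np

TILE = 20      # core tile size (from the analysis in Sec. II-B)
OVERLAP = 5    # overlap length added on each side (Sec. II-B)

def split_into_tiles(img, tile=TILE, overlap=OVERLAP):
    """Yield (y, x, padded_tile): each tile is extended by the overlap,
    clipped at the image borders, so boundary motion is visible to it."""
    h, w = img.shape[:2]
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            y0, y1 = max(0, y - overlap), min(h, y + tile + overlap)
            x0, x1 = max(0, x - overlap), min(w, x + tile + overlap)
            yield y, x, img[y0:y1, x0:x1]

def merge_tiles(shape, results, tile=TILE):
    """Stitch per-tile flow results (core region only, overlap cropped
    off) back into a single optical flow map."""
    out = np.zeros(shape, dtype=np.float32)
    for y, x, flow_core in results:
        h, w = flow_core.shape[:2]
        out[y:y + h, x:x + w] = flow_core
    return out
```

Because the tiles carry no inter-tile data dependency, each `(y, x, padded_tile)` item can be dispatched to a different core or processed at a different time, matching the spatial/temporal division described above.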
                  
                   
                     
                     
                           
                           
Fig. 2. The overall flow of proposed tile-based hierarchical optical flow estimation.
 
                         
                     
                  
However, due to the loss of information near tile boundaries, accuracy degradation occurs in tile-based hierarchical OFE. For example, a target motion vector that spans different tiles cannot be estimated within a single tile. Therefore, tile overlapping is applied in the proposed OFE to prevent this degradation. By setting the overlapping region length to include the motion vectors near the tile boundaries, we can avoid accuracy degradation. An analysis of the proper overlapping length is given in Section II-B.
                  
The proposed method requires a smaller on-chip memory but a larger computation workload due to duplicated computations in the overlapping regions. This trade-off is directly affected by the tile size: the smaller the tile, the smaller the required memory but the larger the computation overhead. The analysis of this trade-off and the proper tile size is also given in Section II-B.
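The trade-off can be illustrated with a toy cost model. The sketch below uses simplified proportionality constants (padded tile area for memory, total processed pixels for workload); the paper's full model, which accounts for the per-pixel CGD state and core count, is what yields the tile size of 20 in Section II-B.

```python
def tile_costs(img_h=240, img_w=320, overlap=5, tile_sizes=(10, 20, 40, 80)):
    """Illustrative model: on-chip memory grows with the padded tile
    area, while duplicated-computation overhead shrinks per pixel as
    tiles get larger. Returns (L, memory, workload, memory*workload)."""
    rows = []
    for L in tile_sizes:
        padded = L + 2 * overlap                  # tile plus overlap on each side
        memory = padded ** 2                      # per-tile on-chip storage (arb. units)
        n_tiles = -(-img_h // L) * -(-img_w // L) # ceil division on both axes
        workload = n_tiles * padded ** 2          # total pixels processed, incl. duplicates
        rows.append((L, memory, workload, memory * workload))
    return rows
```

Sweeping `tile_sizes` shows memory rising and workload falling monotonically with L, which is the trade-off plotted in Fig. 3(b).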
                  
There are three advantages of the proposed method. First, the required memory size is reduced by a ratio of (tile size)/(image size) compared to the conventional algorithm [8]. Second, the proposed method is scalable because the tile size is not affected by the image size: even as the input images become larger, the proposed method can accelerate the OFE using the same on-chip memory, whereas the conventional method requires a larger on-chip memory. Finally, parallel processing within the input image is possible because there is no data dependency among different tiles.
                  
                   
                
               
                     2. Analysis on Tile-based Hierarchical OFE
Fig. 3(a) shows the endpoint error difference between the conventional method and the proposed tile-based method over different overlapping lengths. The endpoint error (EPE) is measured around tile boundaries with the Middlebury dataset [10], and is defined by (1),
                  
                   
                     
                     
                           
                           
Fig. 3. (a) The endpoint error difference between original OFE algorithm and proposed
                              tile-based OFE algorithm, (b) The required memory and workload analysis by sweeping
                              tile size L, (c) The product of required memory and workload of several tile sizes.
                           
 
                         
                     
                  
$$\mathrm{EPE}=\sqrt{\left(u-u_{GT}\right)^{2}+\left(v-v_{GT}\right)^{2}} \qquad (1)$$
                  where ($u$, $v$) is the estimated optical flow vector and ($u_{GT}$ , $v_{GT}$) is
                     the ground truth optical flow vector.
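Eq. (1) amounts to the per-pixel Euclidean distance between the estimated and ground-truth flow vectors, which can be evaluated directly:

```python
import numpy as np

def endpoint_error(u, v, u_gt, v_gt):
    """Per-pixel endpoint error (Eq. (1)): Euclidean distance between the
    estimated flow (u, v) and the ground truth (u_gt, v_gt)."""
    return np.sqrt((u - u_gt) ** 2 + (v - v_gt) ** 2)
```

Applied over a full flow field, the mean of this map gives the average EPE figures reported in Section II-B.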
                  
For overlapping lengths larger than 5 pixels, the EPE around tile boundaries of the tiling method converges to its lowest point. Therefore, the overlapping length is set to 5 pixels.
                  
Fig. 3(b) shows the computation workload and the required memory over different tile sizes with the overlapping length set to 5 pixels. Fig. 3(c) shows the product (required memory × workload) over different tile sizes. In these analyses, square tiles are used, and an overlapping region of length 5 is concatenated to all 4 sides of each tile. The results show that a tile size of 20 yields the lowest (required memory × workload).
                  
Fig. 4(a)-(f) show the proposed tile-based OFE results, with an overlapping length of 5 and a tile size of 20, in comparison with the conventional OFE method. The results show only a 0.05 EPE difference between the conventional method and the proposed tile-based approach.
                  
                   
                     
                     
                           
                           
Fig. 4. (a) Ground truth optical flow, (b) Conventional algorithm’s optical flow, (c) Tile-based algorithm’s optical flow, (d) RGB image, (e) Endpoint error map of the conventional algorithm, (f) Endpoint error map of the tile-based algorithm.
                           
 
                         
                     
                  
                   
                
             
            
                  III. OVERALL ARCHITECTURE
Fig. 5 illustrates the overall architecture of the proposed processor. It consists of 16 tile CGD cores, a 1-D SIMD core, a scale space core, and a top RISC controller. The tile CGD cores accelerate the CGD while exploiting the parallelism of the proposed tile-based hierarchical OFE. The 1-D SIMD core accelerates the matrix computations used to generate the parameters for the CGD. The scale space core accelerates the image filtering that generates the scale spaces. The top RISC controller manages communication among all components of the processor through a network-on-chip (NoC).
               
                
                  
                  
                        
                        
Fig. 5. The overall architecture of the proposed OFE processor.
 
                      
                  
               
Each tile CGD core consists of a tile CGD unit, a background detection unit (BDU), a 20.4 KB tile CGD memory, and a core controller. The tile CGD unit includes 14 5-way parallel multipliers, 16 5-way parallel adders, and an adder tree. The BDU performs background detection for early termination.
               
The 1-D SIMD core consists of an element-wise multiplication engine, an element-wise addition engine, and a 32 KB DMEM. The scale space core has a convolution engine for the filtering process of scale space generation and a 32 KB DMEM.
               
The overall flow of the tile-based hierarchical OFE on the proposed processor is as follows: 1) The scale space core generates multiple scales of the input image. 2) The 1-D SIMD core generates parameters for the CGD operation using the generated images. 3) Each tile CGD core operates in parallel to generate an optical flow for each tile with the generated parameters. 4) The optical flow tiles are collected to form the entire optical flow image.
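Step 3's parallel dispatch can be sketched as a simple static schedule. This is a software illustration only, assuming the 20-pixel tile and 16 cores stated in the text; the real processor assigns tiles to cores in hardware.

```python
def schedule_tiles(img_h, img_w, tile=20, n_cores=16):
    """Assign tile coordinates round-robin to n_cores tile CGD cores.
    Tiles have no inter-dependency, so each core's batch runs in
    parallel with the others."""
    tiles = [(y, x) for y in range(0, img_h, tile)
                    for x in range(0, img_w, tile)]
    return [tiles[i::n_cores] for i in range(n_cores)]
```

For a QVGA frame (240×320) this yields 192 tiles, 12 per core, which is the parallelism the 16-core design exploits.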
               
The tile size and parallelism are the two major differences between the proposed processor and the previous work [9]. In [9], the processor was optimized for a 90 × 70 tile size and adopted a shared memory architecture: the data of one tile were stored in the shared memory and accessed by two CGD cores. As a result, 328 KB of CGD memory was required, with 9.12 GOPS of computation overhead [9]. On the other hand, the proposed processor is optimized for a 20 × 20 tile size, which shows the lowest (required memory × workload). Since the memory requirement for processing a single tile is greatly reduced, several tiles can be processed simultaneously with a memory of similar size to that of [9]. Therefore, the memory for each tile is allocated locally to each core, and 16 cores are integrated to operate in parallel. As a result, external memory access of 1.74 MB/frame is required for real-time OFE using 326 KB of CGD memory. Even though a computation overhead of 93.3 GOPS occurs, the throughput increases 2.16 times thanks to the parallelism of the 16 CGD cores.
               
                
             
            
                  IV. PROPOSED BACKGROUND DECISION UNIT WITH EARLY TERMINATION
In the optical flow results, pixels with zero flow carry no information and typically correspond to the background of the scene. Especially in action recognition, there are numerous pixels with a zero motion vector: OFE results on the UCF-101 dataset [11] reveal that, on average, 78.5% of the pixels in a scene have zero optical flow. Therefore, CGD throughput can be increased by terminating the computations early for zero-optical-flow pixels.
               
Fig. 6 shows the flow of the proposed early termination scheme. CGD processing with the proposed scheme is divided into 2 phases. During phase 1, CGD iterations are performed over the whole image. Then, background detection is performed after phase 1 by thresholding the magnitude of the optical flow vector, so that pixels with small optical flow vectors are abandoned. After the background detection, CGD iterations are performed without the abandoned pixels during phase 2. Early termination of the CGD process achieves a 50.7% throughput enhancement over the conventional method, with a 0.7% misprediction rate on the UCF-101 test dataset.
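The phase-1 background decision reduces to a magnitude threshold on the intermediate flow. A minimal sketch follows; the threshold value here is an illustrative assumption, not the value used in the chip.

```python
import numpy as np

def early_termination_mask(u, v, threshold=0.5):
    """Phase-1 background detection: pixels whose intermediate flow
    magnitude stays below `threshold` are abandoned as background.
    Returns True where phase-2 CGD iterations should continue."""
    magnitude = np.sqrt(u ** 2 + v ** 2)
    return magnitude >= threshold
```

Phase 2 then runs CGD iterations only where the mask is True, which is where the 50.7% throughput gain comes from when most of the scene is static background.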
               
                
                  
                  
                        
                        
Fig. 6. The overall flow of the early termination scheme.
 
                      
                  
               
After the background detection, the addresses of non-terminated pixels must be saved for the subsequent iterations. The non-terminated pixels' addresses are stored efficiently by keeping only the distances between consecutive non-terminated pixels. In the tile CGD cores, distance map registers totaling 320 bytes are integrated to store these distances.
               
Fig. 7(a) shows the block diagram of the background detector. It updates the distance map register by thresholding the magnitude of each region's optical flow vector (du, dv). If the optical flow of the selected region is smaller than the threshold, the background detector increments the stored distance value; otherwise, it stores the distance counter value into the distance map and resets the counter to 0 for the next region. Fig. 7(b) shows the operation of the tile CGD core for early termination using the stored distance map. The tile CGD core looks up the distance map register, fetches the non-terminated pixels' optical flow vectors from the tile CGD memory, and feeds them into the tile CGD unit. If the distance map value is 0, the operation of the corresponding row is terminated and processing proceeds to the next row.
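A software sketch of one plausible encoding consistent with this description is shown below: each row's active pixels are stored as gaps between consecutive active pixels, with 0 acting as the row terminator. The exact register format of the chip is not specified here, so this is an illustrative model only.

```python
def encode_distance_map(row_mask):
    """Encode one row's non-terminated-pixel mask as the distances
    between consecutive active pixels; a trailing 0 tells the core to
    move on to the next row."""
    distances, prev = [], -1
    for x, active in enumerate(row_mask):
        if active:
            distances.append(x - prev)  # gap from the previous active pixel
            prev = x
    distances.append(0)                 # row terminator
    return distances

def decode_distance_map(distances):
    """Walk the distance map (as the tile CGD core does) and recover the
    active pixel indices to fetch from the tile CGD memory."""
    x, active = -1, []
    for d in distances:
        if d == 0:                      # distance 0: row is finished
            break
        x += d
        active.append(x)
    return active
```

Storing gaps instead of absolute addresses is what keeps the map small (320 bytes of registers per core) when active pixels are sparse.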
               
                
                  
                  
                        
                        
Fig. 7. (a) The background detector, (b) the operation on tile CGD core.
 
                      
                  
               
                
             
            
                  V. IMPLEMENTATION RESULTS
The proposed OFE processor shown in Fig. 8 is simulated using 65 nm CMOS technology and occupies a 12.8 mm² die area. The processor operates at a maximum 200 MHz clock frequency with a 1.2 V supply and consumes 278.4 mW. It achieves 99.4 fps throughput for QVGA resolution (320 × 240) at the 200 MHz clock frequency. Table 1 compares the proposed processor with previous OFE hardware implementations. The proposed tile-based OFE processor shows 116% higher throughput than [9], because it can exploit higher parallelism with the same memory size by selecting a more optimal tile size than [9]. The proposed tile-based hierarchical OFE processor fulfills the real-time constraint even though the OFE is performed for all pixels (100% density). [13] can also estimate a 100% density optical flow map, but the proposed algorithm shows a much lower AAE than [13].
               
                
                  
                  
                        
                        
Fig. 8. The layout photograph and performance summary.
 
                      
                  
               
                
                  
                  
                        
                        
Table 1. Comparison Table
                     
                      
                        
                              
                                 
                                    | 
                                       
                                    			
                                       
                                    			
                                  | 
                                 
                                       
                                    			
                                     [13][13] (2014)
                                     
                                    			
                                  | 
                                 
                                       
                                    			
                                     [12][12] (2015)
                                     
                                    			
                                  | 
                                 
                                       
                                    			
                                     [9][9] (2018)
                                     
                                    			
                                  | 
                                 
                                       
                                    			
                                     This work 
                                    			
                                  | 
                              
                           
                           
                                 
                                    | 
                                       
                                    			
                                     Processor/Process 
                                    			
                                  | 
                                 
                                       
                                    			
                                     FPGA 
                                    
                                    			
                                    (Xylinx V7) 
                                    			
                                  | 
                                 
                                       
                                    			
                                     FPGA 
                                    
                                    			
                                    (Xylinx V6) 
                                    			
                                  | 
                                 
                                       
                                    			
                                     ASIC/65mn 
                                    
                                    			
                                    CMOS 
                                    			
                                  | 
                                 
                                       
                                    			
                                     ASIC/65mn 
                                    
                                    			
                                    CMOS 
                                    			
                                  | 
                              
                              
                                    | 
                                       
                                    			
                                     Resolution 
                                    			
                                  | 
                                 
                                       
                                    			
                                     Full HD 
                                    
                                    			
                                    (1920x1080) 
                                    			
                                  | 
                                 
                                       
                                    			
                                     SVGA 
                                    
                                    			
                                    (800x600) 
                                    			
                                  | 
                                 
                                       
                                    			
                                     QVGA 
                                    
                                    			
                                    (320x240) 
                                    			
                                  | 
                                 
                                       
                                    			
                                     QVGA 
                                    
                                    			
                                    (320x240) 
                                    			
                                  | 
                              
                              
                                    | 
                                       
                                    			
                                     Power 
                                    			
                                  | 
                                 
                                       
                                    			
                                     12.22W 
                                    			
                                  | 
                                 
                                       
                                    			
                                     x 
                                    			
                                  | 
                                 
                                       
                                    			
                                     437mW 
                                    			
                                  | 
                                 
                                       
                                    			
                                     278.4mW 
                                    			
                                  | 
                              
                              
                                    | 
                                       
                                    			
                                     Frequency 
                                    			
                                  | 
                                 
                                       
                                    			
                                     84MHz 
                                    			
                                  | 
                                 
                                       
                                    			
                                     94MHz 
                                    			
                                  | 
                                 
                                       
                                    			
                                     200MHz 
                                    			
                                  | 
                                 
                                       
                                    			
                                     200MHz 
                                    			
                                  | 
                              
                              
                                    | 
                                       
                                    			
                                     Throughput 
                                    			
                                  | 
                                 
                                       
                                    			
175 fps | 196 fps | 46.1 fps | 99.4 fps
AAE^a | 11.85 / 16.51° | 3.75° | 4.53° | 4.64°
Density^b | 59.04 / 100 % | 35.24 % | 100 % | 100 %
Algorithm | H & S | L & K | Tile-based processing | Tile-based processing
                              
                           
                        
 
                     
                     
                        a Average angular error measured on the Yosemite sequence without clouds.
                        
                        b Density of pixels computed across all the pixels of the Yosemite sequence without
                           clouds.
                        
                      
                   
                  
               
                
                  
                  
                        
                        
Fig. 9. (a) Estimated optical flow images, (b) Action recognition accuracy of the proposed
                           optical flow and the baseline [7] (@ 1000 randomly sampled UCF-101 test videos).
                        
 
                      
                  
               
Fig. 9(a) shows estimated optical flow images from the UCF-101 dataset [11]. Fig. 9(b) compares the action recognition accuracy of the LRCN network [6] when it is fed the optical flow maps of the proposed OFE and of the baseline [7]. 1000 randomly selected videos from UCF-101 were used as the test set, and
                  the proposed OFE shows only a 0.8% accuracy drop.
               
                
             
            
                  VI. CONCLUSIONS
In this paper, we implemented and simulated a tile-based OFE processor. It supports
                  real-time OFE for action recognition in mobile devices with two key features:
                  1) tile-based hierarchical OFE, which efficiently reduces external memory access
                  under a limited on-chip memory budget; 2) a background detection unit with early
                  termination, which increases throughput by 50.7%. With these features, the proposed
                  OFE processor achieves 99.4 fps for QVGA resolution (320 × 240) at a 200 MHz core
                  frequency while consuming 278.4 mW, satisfying the real-time constraint.
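The two key features can be illustrated with a minimal software sketch. This is not the hardware pipeline itself: the tile size, the per-pixel frame-difference threshold, and the `motion_fn` placeholder (standing in for the per-tile flow estimation that the processor performs in hardware) are illustrative assumptions.

```python
def estimate_flow_tiled(prev, curr, tile=16, diff_thresh=8, motion_fn=None):
    """Sketch of tile-based OFE with background early termination.
    `prev` and `curr` are 2-D lists of grayscale values. Each tile's sum
    of absolute frame differences is checked first; tiles below
    `diff_thresh` per pixel are treated as static background and skipped
    (zero flow), so the expensive flow estimation runs only on moving
    tiles -- the source of the throughput gain."""
    h, w = len(prev), len(prev[0])
    flow = [[(0.0, 0.0)] * w for _ in range(h)]
    skipped = 0
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            y_end, x_end = min(ty + tile, h), min(tx + tile, w)
            sad = sum(abs(curr[y][x] - prev[y][x])
                      for y in range(ty, y_end)
                      for x in range(tx, x_end))
            npix = (y_end - ty) * (x_end - tx)
            if sad < diff_thresh * npix:
                skipped += 1          # early termination: background tile
                continue
            # Full per-tile flow estimation would run here; a callback
            # stands in for the processor's matching pipeline.
            if motion_fn is not None:
                for y in range(ty, y_end):
                    for x in range(tx, x_end):
                        flow[y][x] = motion_fn(y, x)
    return flow, skipped
```

Processing tile by tile also bounds the working set to one tile's worth of pixels, which is what allows the hierarchical scheme to fit in limited on-chip memory instead of repeatedly fetching whole frames from external DRAM.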
               
             
          
         
            
                  ACKNOWLEDGMENTS
Manuscript submitted October 5. This work was supported by the Institute for Information
                  & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP)
                  (No. 2016-0-00207, Intelligent Processor Architectures and Application Softwares for
                  CNN (Convolutional Neural Network)-RNN (Recurrent Neural Network)).
               
             
            
                  
                     REFERENCES
                  
                     
                        
                        Chrungoo A., Manimaran S. S., Ravindran B., 2014, Activity Recognition for Natural
                           Human Robot Interaction, ICSR 2014. Lecture Notes in Computer Science, Springer, Vol.
                           8755

 
                      
                     
                        
Rezazadegan F., Shirazi S., Upcroft B., Milford M., 2017, Action recognition:
                           From static datasets to moving robots, In Proceedings of IEEE International Conference
                           on Robotics and Automation, pp. 3185-3191

 
                      
                     
                        
                        Ji S., Xu W., Yang M., Yu K., 2013, 3D convolutional neural networks for human action
                           recognition, IEEE PAMI, Vol. 35, No. 1, pp. 221-231

 
                      
                     
                        
                        Karpathy A., Toderici G., Shetty S., Leung T., Sukthankar R., Fei-Fei L., 2014, Large-scale
                           video classification with convolutional neural networks, In Proc. CVPR

 
                      
                     
                        
                        Sevilla-Lara L., Liao Y., Guney F., Jampani V., Geiger A., Black M. J., 2017, On the
                           integration of optical flow and action recognition, arXiv preprint arXiv:1712.08416

 
                      
                     
                        
Donahue J., Hendricks L. A., Rohrbach M., Venugopalan S., Guadarrama S.,
                           Saenko K., Darrell T., 2015, Long-Term Recurrent Convolutional Networks
                           for Visual Recognition and Description, IEEE Conference on Computer Vision and Pattern
                           Recognition (CVPR)

 
                      
                     
                        
                        Brox T., Bruhn A., Papenberg N., Weickert J., 2004, High accuracy optical flow estimation
                           based on a theory for warping, In ECCV

 
                      
                     
                        
                        Liu C., 2009, Beyond Pixels: Exploring New Representations and Applications for Motion
                           Analysis, PhD thesis, MIT

 
                      
                     
                        
                        Lee J., Kim C., Choi S., Shin D., Kang S., Yoo H., 2018, A 46.1 fps Global Matching
                           Optical Flow Estimation Processor for Action Recognition in Mobile Devices, 2018 IEEE
                           International Symposium on Circuits and Systems (ISCAS), Florence, Italy, pp. 1-5

 
                      
                     
                        
                        Baker S., Scharstein D., Lewis J., Roth S., Black M. J., Szeliski R., 2011, A database
                           and evaluation methodology for optical flow, International Journal of Computer Vision,
                           Vol. 92, No. 1, pp. 1-31

 
                      
                     
                        
                        Soomro K., Zamir A. R., Shah M., 2012, UCF101: A dataset of 101 human actions classes
                           from videos in the wild, CRCV-TR-12-01, Tech. Rep.

 
                      
                     
                        
                        Seong H., Rhee C., Lee H., 2015, A novel hardware architecture of the lucas-kanade
                           optical flow for reduced frame memory access, Circuits and Systems for Video Technology,
                           IEEE Transactions on, Vol. PP, No. 99, pp. 1-1

 
                      
                     
                        
                        Komorkiewicz M., Kryjak T., Gorgoń M., 2014, Efficient hardware implementation of
                           the Horn–Schunck algorithm for high-resolution real-time dense optical flow sensor,
                           Sensors, 1424-8220, Vol. 14, No. 2, pp. 2860-2891

 
                      
                   
                
             
            Author
             
             
             
            
is a master's student at the Korea Advanced Institute of Science and Technology (KAIST).
               
            
            His current research interests include energy-efficient multicore architectures and
               systems especially focused on artificial intelligence and computer vision fields.
               
            
Lee received a B.S. in electrical engineering from KAIST. 
            He is a student member of IEEE. 
            Contact him at juhyoung@kaist.ac.kr.
            
             
             
            
(S'13–M'15) received the B.S. and M.S. degrees in the Department of Electrical Engineering
               from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea,
               in 2012 and 2014, respectively. 
            
            He is currently working toward the Ph.D. degree in the same department at KAIST. 
His research interests include low-power and memory-efficient architecture design
               for mobile vision SoCs, especially for object segmentation, stereoscopic vision, and deep learning.
            
             
             
            
(S'15) received the B.S. degree in electrical engineering from Hanyang University,
               Seoul, Korea, in 2014, and the M.S. degree in electrical engineering from the Korea
               Advanced Institute of Science and Technology, Daejeon, South Korea, in 2016, where
               he is currently pursuing the Ph.D. degree. 
            
            His research interests include energy-efficient deep learning inference/training accelerator
               ASIC design, embedded deep learning platform design and verification, embedded system
               development with FPGA programming, and deep learning algorithm for sequence recognition.
            
            Affiliations: KAIST (Korea Advanced Institute of Science and Technology)
            Postal Address: #1233, Dept. of Electrical Engineering, KAIST, 335 Gwahangno, Yuseong-gu,
               Daejeon 305-701, Republic of Korea
            
            Phone: +82-42-350-5468
            Fax: +82-42-350-3410
             
             
            
(S'16) received the B.S. and M.S. degrees from the School of Electrical Engineering,
               Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2016.
               
            
            He is currently pursuing the Ph.D. degree with the School of Electrical Engineering.
               
            
            His current research interests include low-power vision system-on-chip design and
               deep learning processor design.