I. INTRODUCTION
               
Recently, virtual reality (VR) and mixed reality (MR) have become important applications for smart mobile devices such as smartphones and head-mounted displays (HMDs). Conventionally, these devices rely on a touchscreen or a hand-held controller to interact with virtual 3D objects. However, because a touchscreen supports only 2D interactions in a 2D plane, it is cumbersome in VR/MR environments, where 2D inputs must be mapped to 3D interactions. A hand-held controller, on the other hand, does support 3D interactions, but the additional control device makes it inconvenient for VR/MR applications. Therefore, a 3D hand gesture interface (HGI) that supports intuitive 3D interactions without any additional controller has drawn active attention as a replacement for conventional UIs on smart mobile devices.
                  
               
               
Fig. 1 describes the 3D HGI on an HMD system. First, the system generates 3D depth maps of the user's hands, and the host processor in the HMD calculates the location and rotation of the hands in a virtual 3D space from the extracted depth maps. This information supports translation, rotation, and manipulation of virtual 3D objects. These interactions require smart devices to acquire an accurate depth map of the input scene, because the robustness of HGI in MR depends strongly on the accuracy of the depth information.
                  
                  
                  
                  
                  
                  
                     
                           
                           
Fig. 1. 3D hand gesture interface system. 
                         
                     
                     
                  
                  
There are three general approaches to extracting accurate depth maps: a time-of-flight (ToF) camera, a structured light system, and a stereo vision system (1). The ToF camera calculates distance by measuring the travel time of light emitted from a vertical-cavity surface-emitting laser (VCSEL) between the camera and the objects. However, it suffers from large power consumption (> 2.1 W) for infrared light emission (2). For example, the state-of-the-art HMD system (Hololens) (3) integrates a 16.5 Wh battery, while it requires over 4.1 W to perform the 3D HGI including the ToF sensor and the mobile processor (Intel Atom x5-z8500). This power consumption limits the lifetime of the HMD system to only 2~3 hours, which is not feasible for an always-on 3D HGI. Although today's ToF sensors targeting mobile applications consume less power, around 200~300 mW (4-6), depth sensing must dissipate even lower power since it is used as an always-on interface for the 3D HGI. Therefore, ToF is not feasible for low-power 3D HGI in mobile systems, considering their limited power budgets. The structured light system projects patterned light and measures distance from the distortion of the projected pattern. This system also consumes more than 2.25 W (7) because of both its light projection and depth calculation, which is likewise not feasible for mobile devices. To overcome these limits of active sensor-based approaches, a stereo vision system, which estimates a depth map by triangulation between two cameras in a way similar to how human eyes perceive depth, is used for mobile devices. It extracts the disparity between the left and right images by sliding-window matching, and the distance of an object is inversely proportional to the measured disparity of the matched pixels. A stereo vision system without active sensors has an advantage in power consumption. Therefore, a low-power and low-latency depth-estimation processor is required instead of active depth sensors (2-5), because low-power and real-time operations are essential in mobile UI applications.
Several works implemented stereo matching processors in ASICs (8-13) and FPGAs (14,15), but they still consumed too much power to be used as a 3D HGI sensor, since their target was wide-range depth estimation for high-end applications such as unmanned vehicles. Although (16) consumes less power than the other previous works (8-13), it cannot provide accurate 3D HGI due to the poor depth accuracy of its block matching algorithm, as described in detail in Section II. Thus, stereo matching with local aggregation is adequate for low-power 3D HGI. However, it causes massive memory accesses and computation, so real-time operation is almost impossible on CPU or GPU systems (17). In terms of latency, depth estimation must finish within 10 ms, leaving the rest of the budget for the hand pose estimation (18), since the overall UI latency should be below 40 ms (19). Meanwhile, it should consume less than 50 mW, which is only 5% of the power consumed by a general application processor, so that the UI can run always-on during the entire operation of HMD or MR devices.
                  
               
               
Fig. 2 shows the overall stereo matching flow, which consists of 4 stages: initial matching, cost aggregation, winner-takes-all (WTA), and consistency check. First, the initial matching stage calculates a similarity cost map between small patches (~5x5) of the left and right images, where the sum of absolute differences, the sum of squared differences, and the census transform (22) are widely used for the matching. In the initial matching stage, the size of the image template is a significant factor for depth accuracy. For example, using larger templates generates more reliable initial matching costs, as shown in Fig. 3. However, it degrades the matching cost of the hand, which is our only interest, due to the large regions of background clutter. On the other hand, small templates provide matching costs that are more robust to background clutter, while the initial matching costs become vulnerable to pixel-level noise such as illumination changes or blurring due to the reduced number of sample points within the template. In short, initial matching alone gives a poor depth map because the optimal template size is crucial to depth accuracy, yet it varies with object distance. Therefore, the state-of-the-art algorithms (21-25) essentially exploit cost aggregation, which aggregates the matching costs of neighboring pixels, to refine the initial depth map. They usually exploit small (1x1 to 5x5) template matching to reject the background clutter effect, combined with large (15x15 to the entire image) aggregation regions. After that, the WTA stage selects the best-matched depth index from the aggregated cost map. Finally, the left-right consistency checking stage eliminates mismatches and occlusions by comparing the left and right depth maps. Among the stages, the initial matching and cost aggregation cause a large amount of computation and memory accesses because they are performed at every disparity level (Fig. 2). For example, stereo matching requires over 630.7 Gflops and 18.3 GB/s for 100 fps with 60 disparity levels at QVGA (320x240) resolution. Moreover, 81% of the whole computation and 92% of the memory accesses are concentrated in the cost aggregation stage, making it the most power-consuming part.
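To make the four-stage flow concrete, the following sketch implements it in Python with deliberately simple placeholders: per-pixel absolute differences for the initial matching, plain box filtering for the aggregation, argmin for WTA, and a left-right check. The cost metric, window sizes, and tolerance are illustrative choices, not the algorithm adopted later in this paper.

```python
import numpy as np

def box_filter(cost, radius):
    """Aggregate costs over a (2*radius+1)^2 window using an integral image."""
    H, W = cost.shape
    ii = np.pad(cost, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    out = np.empty_like(cost)
    for y in range(H):
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        for x in range(W):
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            out[y, x] = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
    return out

def stereo_match(left, right, max_disp=60, agg_radius=7, lr_tol=1):
    """Four-stage flow: initial matching -> cost aggregation -> WTA -> L-R check."""
    H, W = left.shape
    left, right = left.astype(np.float32), right.astype(np.float32)
    cost_l = np.full((max_disp, H, W), np.inf, dtype=np.float32)
    cost_r = np.full((max_disp, H, W), np.inf, dtype=np.float32)
    for d in range(max_disp):
        # 1) initial matching: per-pixel absolute difference at disparity d
        diff = np.abs(left[:, d:] - right[:, :W - d])
        # 2) cost aggregation: plain box filtering (stands in for ASW here)
        agg = box_filter(diff, agg_radius)
        cost_l[d, :, d:] = agg
        cost_r[d, :, :W - d] = agg
    # 3) winner-takes-all: disparity with the minimum aggregated cost
    disp_l, disp_r = cost_l.argmin(axis=0), cost_r.argmin(axis=0)
    # 4) left-right consistency check: invalidate mismatched/occluded pixels
    ys = np.arange(H)[:, None]
    back = np.clip(np.arange(W)[None, :] - disp_l, 0, W - 1)
    valid = np.abs(disp_l - disp_r[ys, back]) <= lr_tol
    return np.where(valid, disp_l, -1)
```

Even this simplified version shows why aggregation dominates: the filtering is repeated once per disparity level, i.e., 60 times per frame for the target configuration.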
                  
               
               
                  
                  
                  
                  
                  
                  
                     
                           
                           
Fig. 2. Dataflow of stereo matching process. 
                         
                     
                     
                  
                  
                  
                  
                  
                  
                  
                     
                           
                           
Fig. 3. Effect of template size in the initial matching operation.
                         
                     
                     
                  
                  
To meet the above-mentioned computational requirements, massively parallel designs using more than 60-way processing units have been adopted (8-15). They used DDR3 DRAM with caches or wide-bandwidth (1612 b) SRAM to handle the huge on-chip and off-chip memory bandwidth. However, both high-speed external memory and wide-bandwidth SRAM cause large power consumption and area overhead.
                  
               
               
In this paper, we propose a low-power and low-latency depth-estimation processor (DEP) with reduced memory bandwidth through algorithm-hardware co-optimization with the following 3 key features: 1) shifter-based adaptive support-weight cost aggregation that replaces complex floating-point operations with integer operations for power and memory bandwidth reduction; 2) a line-streaming 7-stage pipeline architecture to achieve high utilization and reduce the additionally required memory; and 3) a shift register-based pipeline buffer to reduce area. The proposed chip is designed for 320x240 image resolution, which is sufficient for the 3D HGI because the adopted algorithm (18) requires 60x60 input hand images and the size of the hand region is usually 60x60 ~ 128x128 at a 15 cm ~ 30 cm range in general webcam environments. As a result, the total normalized power dissipation and the required memory can be reduced by 74.7% and 54.6%, respectively, compared with the state-of-the-art hardware (9,10), while achieving up to 175 fps at 150 MHz under QVGA resolution.
                  
               
               
The rest of this paper is organized as follows. Section II describes the optimal algorithm selection for the 3D HGI and the proposed shifter-based cost aggregation algorithm as well as its hardware architecture. In Section III, the overall architecture of the depth-estimation processor (DEP) with the 7-stage pipeline, pipeline buffer optimization, and resolution-scalable pipeline control is explained with detailed hardware implementations. Section IV presents the system implementation with the proposed chip and the evaluation results, followed by the conclusion in Section V.
                  
               
               
                  
                  
                  
                  
                  
                  
                     
                           
                           
Fig. 4. Hand depth images of (a) original input image, (b) global aggregation, (c)
                              local aggregation, (d) block aggregation.
                           
                         
                     
                     
                  
                  
               
             
            
                  II.  SHIFTER-BASED COST AGGREGATION
               
                     1.  Optimal Aggregation for 3D HGI
                  	
The cost aggregation is the most important stage in depth estimation, in view of not only accuracy but also memory accesses and computation. There are three basic categories of cost aggregation algorithms: global aggregation (21-23), local aggregation (24,25), and block aggregation (26). Global aggregation was utilized in (8-11,13,15), while (12,14) adopted a local aggregation method and (16) utilized block aggregation. Fig. 4 shows the depth-estimation results for each aggregation algorithm: semi-global (23), adaptive support weight (ASW) (24), and simple block aggregation (SSD + mean filtering) (26). The global aggregation method aggregates the initial cost maps to minimize the overall sum of matching costs. Its aggregation paths are fully connected, and the final depth points are selected by comparing all of the cost values along every possible aggregation path. Because it explores all possible aggregation paths, global aggregation automatically interpolates ambiguous depth regions such as textureless regions, occluded regions, or repeated patterns compared with the other methods, as shown in Fig. 4(b). Moreover, it generates a dense depth map without any additional post-processing. However, its fully-connected aggregation paths require large computation and intermediate data, whose complexities are O(W×H×D²) and O(32×W×H×D), respectively, as shown in Table 1. Next, the local aggregation method aggregates the cost maps over the same disparity level. It usually utilizes supporting filters generated by intensity differences (24) or segmentation regions (25) to improve the accuracy of the depth map, since it does not aggregate across different disparity levels. The critical drawback of this method is that it cannot interpolate ambiguous regions because the aggregation is explored on only a single disparity level. However, local aggregation provides a much sharper image and accurate depth information for close objects. It also provides as high-quality a depth map as global aggregation does for the 3D HGI in the active region, because the active distance to the hands is 20 ~ 40 cm and the hands are always located closer than other background objects. Compared to global methods, its computation and intermediate data complexities are reduced to O(W×H×D) and O(16×W×H) because it explores only the same disparity level. Finally, the block aggregation method aggregates costs within a fixed-size box region. As shown in Fig. 4(d), this method provides the worst depth-map quality among the three methods because it just sums the initially matched costs without any supporting weights. On the other hand, it reduces the computation complexity and the required memory compared with the other two methods because INT16 is sufficient for its simple summation-only aggregation. However, its average pixel error is 14.2%, and its result is too poor to realize accurate 3D HGI, as shown in Fig. 4 and Table 2. Therefore, local aggregation is the optimal algorithm for the mobile HGI to realize both low latency (< 10 ms) and low power (< 50 mW) in terms of both accuracy and algorithm complexity. In this paper, we utilize and optimize ASW (24,27) among the variants of local aggregation methods for the proposed hardware.
                     
                  
                  
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 1. Complexity comparison among aggregations 
                           
                           
                              
                              
                              
                              
                           
                         
                        
                        
                        
                     
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 5. Operations of adaptive support weight aggregation
                            
                        
                        
                     
                     
Fig. 5 describes the operations of ASW with 60 disparity levels, where it aggregates initial costs level by level. For each disparity level, it performs sequential aggregation along four directions (right, left, top, and bottom) for every pixel, where the horizontal and vertical aggregations are performed in order for higher accuracy. Cost aggregation along each direction performs a weighted summation where the weights are generated by gestalt grouping (24), which is formulated by using a Laplacian kernel of the color difference between a center pixel and an aggregated pixel. However, it must use a 32-bit floating-point (FP) number system for costs and weights since it requires exponent computation, which results in power-consuming FP ALUs as well as huge memory bandwidth. Moreover, the weighted summations are performed for all pixels and disparity levels, requiring large computation even though local aggregation has lower computational complexity than global aggregation. For example, it requires 579.4 Gflops and 14.6 GB/s for 100 fps under QVGA resolution, implying that careful optimization is required. Therefore, in the next section we introduce a hardware-friendly ASW algorithm that uses integer instead of FP cost values.
                     	
                  
               
 
               
                     2. Shifter-based Cost Aggregation Processing
                  	
The Laplacian scale factor used in the ASW algorithm (24,27) is computed from the absolute difference of adjacent pixels' intensities:
                     
                  
                  
$$ w_{i} = \frac{\left| I_{center} - I_{i} \right|}{\sigma} \qquad\qquad (1) $$
                     
where σ is the supporting parameter, whose values used in the proposed hardware are 2, 4, 8, and 16. Then, the ASW cost with these weights is described as
                     
                  
                  
$$ C_{ASW} = C_{center} + \sum_{i} e^{-w_{i}} \cdot C_{i} \qquad\qquad (2) $$
                     
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 2. Depth error comparison of aggregation methods 
                           
                           
                              
                              
                              
                              
                           
                         
                        
                        
                        
                     
                     
In Eq. (2), the ASW cost is calculated by a weighted summation of successive costs $C_i$ (24,27). $C_{center}$ indicates the initial matching cost of the center point and the $C_i$ are the costs of the neighboring pixels. To calculate the exponent operations, (24,27) utilizes a 32-bit FP number system to reduce truncation errors during aggregation. Due to the large area and power consumption of FP logic, (12) deployed a 24-bit INT number system to reduce these overheads with a 6.8% average pixel error, which is comparable with the accuracy of (24) (6.5%), as shown in Table 2. However, its 24-bit number system still incurs a large overhead in intermediate memory size and in the area of 24-bit multipliers, so the proposed algorithm applies additional approximations to further reduce the bit-width of the costs.
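For reference, a minimal software version of Eq. (2) for one horizontal pass at a single disparity level might look as follows; each neighboring cost is weighted by $e^{-w_i}$, which is exactly the exponent evaluation and FP multiply-accumulate that the following approximations remove. The window span and σ value here are illustrative (σ = 8 is one of the supporting-parameter values mentioned above).

```python
import numpy as np

def asw_aggregate_row(costs, intensities, span=7, sigma=8.0):
    """One horizontal ASW pass over a single row at one disparity level (Eq. (2)).

    costs       : initial matching costs of this row
    intensities : gray values of the reference image row
    span        : neighbors aggregated on each side (span=7 -> 15-pixel window)
    sigma       : supporting parameter of the Laplacian weight (Eq. (1))
    """
    W = len(costs)
    out = np.empty(W, dtype=np.float64)
    for x in range(W):
        acc = float(costs[x])                       # C_center
        for i in range(max(0, x - span), min(W, x + span + 1)):
            if i == x:
                continue
            w = abs(float(intensities[x]) - float(intensities[i])) / sigma
            acc += np.exp(-w) * costs[i]            # FP exponent + multiply-accumulate
        out[x] = acc
    return out
```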
                     
                  
                  
The stereo matching algorithm finds the best-matched pair of points by the WTA algorithm, which searches for the index of the minimum cost along the depth levels. Therefore, the resulting depth map from WTA does not change if the inequality between any two costs is preserved after the approximations. In the first step of the approximations, the base of the Laplacian kernel is changed from Euler's number to 2 as
                  
                  
$$ C_{ASW} \approx C_{center} + \sum_{i} 2^{-w_{i}} \cdot C_{i} \qquad\qquad (3) $$

$$ C_{ASW} \approx C_{center} + \sum_{i} \left( C_{i} \gg w_{i} \right) \qquad\qquad (4) $$
                     
The modification in Eq. (3) does not change the inequality condition since $2^{-x} = e^{-x \ln 2}$ is, like $e^{-x}$, a monotonically decreasing function of $x$. After that, the base-2 ASW cost is approximated by a shifting operation as in Eq. (4), because a right shift by $x$ corresponds to multiplication by $2^{-x}$, and it still preserves the inequality condition without any loss of generality. As a result, as shown in Fig. 6 and Table 2, the accuracy difference between the proposed shifter-based aggregation and the previous integer-based ASW algorithm (12) is -3.82%, 3.79%, +1.59%, and +0.65% on the Tsukuba, Venus, Teddy, and Cones cases of the Middlebury stereo dataset (28), respectively, while achieving a large bit-width reduction.
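A corresponding sketch of Eq. (4) replaces the exponential weighting with an integer right shift; the clipping of the weight to a 3-bit shift amount anticipates the bit-widths chosen in the next subsection, and the remaining parameters are illustrative.

```python
def shifter_asw_aggregate_row(costs, intensities, span=7, sigma=8, w_bits=3):
    """Shifter-based ASW pass (Eq. (4)): integer costs, weights used as shift amounts."""
    W = len(costs)
    out = [0] * W
    max_shift = (1 << w_bits) - 1                   # 3-bit weight -> shifts of 0..7
    for x in range(W):
        acc = int(costs[x])                         # C_center
        for i in range(max(0, x - span), min(W, x + span + 1)):
            if i == x:
                continue
            # sigma is a power of two (2/4/8/16), so this divide is itself a shift
            w = abs(int(intensities[x]) - int(intensities[i])) // sigma
            w = min(w, max_shift)                   # clip to the 3-bit shifter operand
            acc += int(costs[i]) >> w               # C_i * 2^-w as an integer right shift
        out[x] = acc
    return out
```

Because $2^{-w}$ decreases monotonically in $w$ just as $e^{-w}$ does, this integer version relies on the same inequality-preservation argument made above.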
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 6. Results of the proposed shifter-based aggregation (a) Input image, (b) Ground
                                 truth, (c) Adaptive support weight, (d) Shifter-based adaptive support weight
                              
                            
                        
                        
                     
                     	
                  
                
               
                     3. Shifter-based Aggregation Unit
                  	
Fig. 7(a) describes a hardware implementation of ASW that consists of an exponent unit, a multiplier, and an adder implemented in 32-bit FP as in Eq. (2). It takes weights and costs as inputs and calculates $C_{acc} + C \cdot e^{-w}$ every cycle. The cost aggregated over one direction is then stored in an accumulation register. The FP exponent logic and the FP MAC require complicated hardware using either lookup tables or piecewise-linear approximation schemes to reduce hardware complexity. However, both approaches still require large on-chip memory and complex processing logic compared with integer-based hardware. In addition, a DEP requires highly parallel aggregation unit arrays (e.g., > 270-way), so the area and power overheads are critical.
                     
                  
                  
Unlike the FP-based unit, the proposed aggregation unit in Fig. 7(b) requires only a barrel shifter and an integer adder. It realizes the multiplication between an input cost and the exponential of a weight with a single shifting operation. The proposed shifter-based ASW also enables the use of an integer number system during aggregation. The initial costs are generated by 8-point selective census matching within a 5x5 template, so their maximum value is 8. They are then aggregated by the proposed ASW within a 15x15 aggregation region, and the maximum values of the intermediate and the final aggregated costs are 120 and 1800, respectively. Thus, the formerly 32-bit widths of the initial, intermediate, and final costs are set to 4 bits, 7 bits, and 11 bits, respectively, without overflow. The maximum value of the weights, which represent the aggregation strength between neighboring pixels according to their similarity, is determined empirically. Simulation results show that 3 bits are enough for the shifter-based ASW processing without accuracy degradation. As a result, the proposed unit contains only a 4 (7)-bit barrel shifter with a 3-bit operand and a 6 (11)-bit accumulator for the vertical (horizontal) direction, respectively, reducing power consumption by 92.2% compared with the original FP-based implementation.
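The resulting per-direction datapath can be modeled with a few lines of code: a shift, an add, and a width check. The worked example reproduces the worst case quoted above (fifteen costs of value 8 aggregated with a zero shift give 120, which fits in 7 bits); the assertion-based width check is only a software stand-in for the fixed-width registers.

```python
def aggregation_pe(cost_in, weight, acc, acc_bits):
    """One cycle of the shifter-based aggregation PE: barrel shift + integer add.

    cost_in  : incoming cost (4-bit initial or 7-bit intermediate cost)
    weight   : 3-bit shift operand derived from the intensity difference
    acc      : current accumulator value
    acc_bits : accumulator width (7 bits covers the 120-max intermediate costs,
               11 bits covers the 1800-max final costs)
    """
    assert 0 <= weight < 8, "weight must fit in the 3-bit shifter operand"
    acc += cost_in >> weight
    assert acc < (1 << acc_bits), "chosen accumulator width must not overflow"
    return acc

# Worst case from the text: 15 costs of value 8 aggregated with weight 0 -> 120
acc = 0
for _ in range(15):
    acc = aggregation_pe(8, 0, acc, acc_bits=7)
print(acc)  # 120, which fits in 7 bits
```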
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 7. (a) FP-based aggregation unit, (b) Proposed shifter-based aggregation unit
                            
                        
                        
                     
                     
In addition to the power reduction, the bit-width reduction of the processing data also drastically reduces the overall intermediate data size by 69.1%. The resulting reduction of the required intermediate memory to 31.9 KB makes it possible to integrate all the intermediate buffers on the chip, removing external memory accesses during stereo matching.
                     	
                  
               
 
             
            
                  III.  PROPOSED DEPTH-ESTIMATION PROCESSOR
               
                     1. Overall Architecture
                  	
Fig. 8 describes the overall architecture of the proposed DEP, which is composed of a top controller, an input image loader, an output depth buffer, and a stereo pipeline module (SPM). The 7-stage pipelined SPM estimates depth line by line. It is composed of an input buffer, a census transformation unit, an initial matching unit, a vertical aggregation unit, a horizontal aggregation unit, a WTA unit, and a left-right (L-R) consistency check unit. First, the input image loader fetches 8 b left and right pixels from an external memory and stores them into the 320x20 input buffer in the SPM every clock cycle. After 20 lines of input have been fetched into the input buffer, the census transformation unit generates 30 left and right binary patterns and the corresponding aggregation weights from the 20-line inputs every cycle. Then, the initial matching unit calculates the Hamming distance between left and right census pairs and extracts 74 initial cost maps every 60 cycles. Next, the initial cost maps from the previous stage are aggregated by the vertical and horizontal aggregation units in order, with 248-way and 240-way parallelism, respectively. After that, the WTA unit searches for the best-matched index between the left and right images and generates left and right depth maps. Finally, the L-R consistency check unit compares the left and right depth maps to eliminate falsely matched depth points, which come from occluded or textureless points, and the 60 final depth points are stored into the output depth buffer every 60 cycles. The proposed shifter-based ASW completely eliminates external memory accesses during SPM operation by holding all of the intermediate data inside the pipeline buffers. To achieve 10 ms stereo matching latency, the initial matching, vertical aggregation, and horizontal aggregation units are composed of homogeneous 148-way, 148-way, and 120-way parallelized PEs, respectively.
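The census transformation and initial matching stages follow the standard census/Hamming-distance scheme; the sketch below shows a dense 5x5 census and its Hamming-distance cost as a functional stand-in for the 8-point selective census used on the chip.

```python
import numpy as np

def census_transform(img, radius=2):
    """Dense census transform: one bit per neighbor, set when neighbor > center."""
    H, W = img.shape
    pad = np.pad(img, radius, mode="edge")
    codes = np.zeros((H, W), dtype=np.uint32)       # 24 neighbor bits for a 5x5 window
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            nbr = pad[radius + dy:radius + dy + H, radius + dx:radius + dx + W]
            codes = (codes << np.uint32(1)) | (nbr > img).astype(np.uint32)
    return codes

def census_cost(code_l, code_r):
    """Initial matching cost = Hamming distance between left/right census codes."""
    diff = np.bitwise_xor(code_l, code_r)
    lut = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)
    as_bytes = diff.view(np.uint8).reshape(diff.shape + (4,))
    return lut[as_bytes].sum(axis=-1)
```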
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 8. The overall depth-estimation processor architecture
                            
                        
                        
                     
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 9. The timing diagram of the proposed DEP with hierarchical pipelining
                            
                        
                        
                     
                     
Fig. 9 describes a timing diagram of the proposed DEP operations with hierarchical pipelining. The first level is a line-level pipeline with 3 stages: line loading, line processing, and line storing. The SPM estimates 1 line of the depth map every 480 clock cycles. Each line processing stage consists of a 7-stage pixel-level pipeline: input pre-fetching, census transformation, initial matching, vertical aggregation, horizontal aggregation, WTA, and consistency check. Each stage processes its pixel-level operations every 8 clock cycles. All the pipeline stages are well balanced to achieve 94% utilization.
                     	
                  
               
 
               
                     2.  Line Streaming 7-stage Pipeline Architecture
                  	
Fig. 10 describes the data processing patterns of the sliding-window matching and the 4-direction cost aggregation. In the initial matching stage, a right (reference) patch and a left (target) patch are compared to generate initial costs. In this operation, the right patch is reused 60 times while the left patch slides toward the right. In a general implementation, the target patches are fetched into the left buffer and a wide I/O multiplexer (MUX) reorders its data to align with those in the right buffer. However, such a wide MUX causes large area overhead and routing congestion because it must be connected to all of the ports of the matching PEs. In contrast, the proposed architecture with a shifting register (SR)-based buffer for the target patch (marked in red) moves by 1 index every pipeline cycle, as indicated in Fig. 10(a). Meanwhile, the reference patch stored in the blue RFs is loaded every 60 pipeline cycles. The 4-direction cost aggregation is obtained by recursively performing the bi-directional aggregation for top/bottom and then right/left, as shown in Fig. 10(b). The aggregation window size is 15x15, and at most 8 costs are aggregated in a single aggregation PE. Therefore, the initial costs in a buffer are selected with cyclic indexing and issued to the forward and backward aggregation units. The bi-directionally aggregated costs are generated through both units after 8 clock cycles.
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 10. Hardware implementations in stereo matching (a) Matching hardware with shifting register-based buffer, (b) Aggregation units with 2-direction MUX-based buffer
                              
                            
                        
                        
                     
                     
There are 5 pipeline buffers in the SPM: the input buffer, the left and right census registers, the initial cost register, the intermediate cost registers for vertically aggregated costs, and the final aggregated cost register. The overall latency of this pipeline is 480 clock cycles, and the buffers latch and fetch data in synchronized pipeline cycles. First, the 3-banked input buffer issues 3 pixels of the left and right images to the left and right census transform units per clock cycle. This operation takes 320 cycles to issue 1 line of the input images, and the remaining 160 cycles are used to fetch the next line into the input buffer from external memory. Second, the left and right census units simultaneously transform the pixels of the input images into 15 census pixels every clock cycle. These are then stored into the left and right census registers shown in Fig. 11. The right census buffers utilize double buffering and are swapped every 60 pipeline cycles (480 clock cycles), and the left census buffer is composed of the SR-based buffer architecture introduced in Fig. 10. Third, the upper and lower lines from the active left/right census buffers are fetched, and the initial matching units calculate 2 lines (148 words) of the initial costs element by element every clock cycle, as Fig. 11 describes. Fourth, bi-directional vertical aggregation is performed with 148-way vertical aggregation units that generate 74 upper and 74 lower initial costs every 8 clock cycles. To eliminate a pipeline stall in the vertical aggregation, as shown in Fig. 11, the initial matching and vertical aggregation process different lines of data with a 1-index shift. Finally, the horizontal aggregation is performed with 120-way horizontal aggregation units, and the resulting aggregated costs are stored in the final aggregated cost registers. As with the vertical aggregation, the intermediate cost buffer exploits a double-buffering architecture to reduce pipeline stalls in the horizontal aggregation. Both the vertical and horizontal aggregation buffers utilize the MUX-based buffer shown in Fig. 10(b). As a result, the proposed SPM processes 300 depth points with initial matching and aggregation every 2400 clock cycles at 60 disparity levels, and its average utilization is 94%.
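The line-streaming behavior itself can be summarized with a small conceptual model: only a rolling window of input rows is ever held, and one depth line is emitted per incoming line pair once the window is full. The `process_line` argument stands in for stages 2~7 and is an assumption of this sketch, not part of the original design description.

```python
from collections import deque

def line_streaming_depth(stream_left, stream_right, process_line, window_lines=20):
    """Conceptual line-streaming model: keep a rolling window of input rows and
    emit one depth line per incoming line pair once the window is full.

    stream_left/right : iterables yielding one image row at a time
    process_line      : stand-in for stages 2-7 (census, matching, aggregation,
                        WTA, L-R check) applied to the buffered rows
    window_lines      : rows kept in the on-chip input buffer (20 in the DEP)
    """
    buf_l = deque(maxlen=window_lines)
    buf_r = deque(maxlen=window_lines)
    for row_l, row_r in zip(stream_left, stream_right):
        buf_l.append(row_l)            # pre-fetch the next line (stage 1)
        buf_r.append(row_r)
        if len(buf_l) < window_lines:
            continue                   # still filling the 20-line buffer
        yield process_line(list(buf_l), list(buf_r))
```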
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 11. Cost generation and pipeline buffer architecture of SPM: 1) Shifting register and double buffering for left and right census, 2) 2-path initial matching and initial cost buffer, 3) 2-path vertical aggregation and horizontal aggregation
                              
                            
                        
                        
                     
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 12. Comparison between multiplexer-based and shifting register-based architecture
                                 (a) Area vs. parallelism, (b) Power vs. parallelism, (c) Area-power product vs. parallelism
                              
                            
                        
                        
                     
                     
The proposed SPM does not use any external memory during stereo matching, so the size of its pipeline RFs is critical in terms of both logic area and power consumption. To reduce the RFs, we also change the order of the aggregation directions from X-Y to Y-X, as in (12). For example, in our case, X-Y order aggregation needs 960 words of pipeline buffers, composed of 60x15 and 60x1 RFs. In Y-X order aggregation, however, only 134 words are needed, composed of 74x1 and 60x1 RFs. This optimization also reduces the weight buffers, and its effect is doubled because both the left and right buffers are reduced. Therefore, the proposed hardware further reduces the memory in the cost aggregation stages by 43.9% with only a 0.5% error penalty. As a result, thanks to the line-level processing and the changed aggregation order, the proposed SPM requires only a 17.9 KB buffer without any external memory accesses for QVGA stereo matching.
                     	
                  
               
 
               
                     3. Shifting Register-based Pipeline Buffer
                  	
There are two basic pipeline buffer architectures: MUX-based and SR-based designs. The difference is the way they align a large amount of data to a parallel PE array, by using either a multiplexer (MUX) or an SR. In the MUX-based architecture, input data from the previous stage are stored into a pipeline buffer through a wide I/O MUX. In contrast, the SR-based architecture orders data by shifting by 1 index for every inserted input. In general, the MUX-based architecture consumes less dynamic power and a smaller area than the SR-based one at low parallelism, which is why CPUs and DSPs do not deploy SR-based architectures. However, at high parallelism, which consequently requires a large number of connections, its area increases tremendously and the static power becomes dominant. These area and power overheads make it inefficient in highly parallel designs such as the proposed DEP.
                     
                  
                  
Simulations were performed to obtain the relationships of area, power, and the area-power product as a figure of merit with respect to parallelism, in order to optimize the buffer architectures; both architectures run at 150 MHz with a 1.0 V supply voltage. A barrel shifter with O(n) logic complexity is used for the MUX-based architecture for a fair comparison, because buffers in stereo matching only move indices along one direction. The baseline of normalization is the MUX-based architecture, and both are tested from 5-way to 100-way. As shown in Fig. 12(a), the MUX occupies a smaller area than the SR below 25-way, but its normalized area becomes larger than the SR's above 25-way. In terms of normalized power, shown in Fig. 12(b), the MUX always consumes less power due to the dynamic power consumption of the SRs. However, the gap between the SR and the MUX is only 4% at 100-way parallelism. Since both area and power are important in hardware design, we analyze the area-power product to find the optimal designs for the SPM buffers. As shown in Fig. 12(c), the MUX-based design performs better than the SR-based design below 40-way, while the opposite holds above 40-way. Since both the sliding-window matching and the aggregation are performed repeatedly by moving 1 index per operation, the pipeline buffers between the 7 stages can be implemented with either MUX-based or SR-based designs, and the optimization is made by selecting the architecture according to the parallelism level of each buffer. Therefore, the initial matching buffer (8-way), the vertical aggregation buffer (8-way), and the horizontal aggregation buffer (1-way) utilize the MUX-based architecture, and the left and right census buffers (74-way) utilize the SR-based architecture. As a result, the optimized buffer design improves the critical-path timing by 44% and reduces the overall area by 29.8%.
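The functional difference between the two buffer styles is easy to state in code, even though their real cost difference lies in wiring, multiplexer width, and register switching that software cannot capture; the sketch below is therefore only a behavioral illustration.

```python
from collections import deque

class MuxBuffer:
    """MUX-style buffer: entries stay in place; a read index (a wide multiplexer
    in hardware) selects the window presented to the PE array."""
    def __init__(self, depth):
        self.mem = [0] * depth
        self.head = 0
    def insert(self, value):
        self.mem[self.head % len(self.mem)] = value
        self.head += 1
    def window(self, n):
        return [self.mem[(self.head - n + i) % len(self.mem)] for i in range(n)]

class ShiftRegisterBuffer:
    """SR-style buffer: every insert shifts all entries by one index, so the PE
    array reads fixed positions and no selection multiplexer is needed."""
    def __init__(self, depth):
        self.regs = deque([0] * depth, maxlen=depth)
    def insert(self, value):
        self.regs.append(value)        # all entries shift by one position
    def window(self, n):
        return list(self.regs)[-n:]    # fixed taps
```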
                     	
                  
                
               
                     4. Resolution Scalable Pipeline Control
                  	
Due to the line-streaming processing, the depth estimation can support any input image height. For width scalability, the proposed SPM architecture supports any resolution whose width is a multiple of 60. Since the input buffer size of our DEP is 320 (max width) x 21 (max aggregation range), the DEP supports 60-, 120-, 180-, 240-, and 300-pixel-wide images without degrading utilization, using 300 of the 320 input buffer columns while the remaining 20 are used for aggregation. If this buffer is enlarged to 640 or 1920, the proposed architecture can also support VGA or Full-HD images without any change to the PE architecture. To realize this scalability, the resolution-scalable pipeline control shown in Fig. 13 is proposed for the SPM, so that the controller does not have to be altered even when the number of buffers is scaled. As shown in Fig. 13(a), the hardware block of a single pipeline stage receives only 4 signals: EN (enable), RST (reset), Pn (pipeline number), and Ln (loop number). EN controls whether data are latched into the accumulation registers or the pipeline buffers. RST resets both the accumulation register inside a PE array and the alignment index in a data alignment unit to zero. These two signals are mandatory. Pn and Ln, on the other hand, are optional signals for the aggregation, WTA, and L-R consistency check stages. Pn controls the current aggregation position and the latching of the pipeline buffers. Ln is used for the WTA and L-R consistency check operations. These 4 control signals are generated by a signal generator in the SPM, as described in Fig. 13(b). The SPM contains a 3-bit counter and a 6-bit counter for Pn and Ln, a variable-width pulse generator, and configuration registers, and together they generate all of the control signals required by the 7 pipeline stages. The SPM receives SPM_EN (global enable) and SPM_RST (line reset) from the top DEP controller and performs a 480 (60x8) cycle stereo operation. After 480 cycles, depending on the configuration setting, it automatically proceeds to the next 60 depth points or stops until the next line processing. The configuration registers store the number of pre-fetched lines, the image resolution, and debugging settings for dumping intermediate data, and the signal generator produces the variable-width EN signal from this information. Thanks to this line-level automated control, the top controller only needs to send the SPM_EN and SPM_RST signals to the SPM while processing the whole stereo matching. Fig. 13(c) describes the timing diagram of the proposed DEP. First, after the top controller in the DEP asserts SPM_EN and sends a single pulse of SPM_RST, the SPM automatically processes 1 line of depth estimation. SPM_RST resets the loop counter inside the SPM to zero, and the SPM generates enable signals for the 7 stages until the entire line is processed. After SPM_RST is asserted, the variable-width pulse generator in the SPM sends the EN signal for stage 1, which is successively propagated to stages 2~7. These signal propagations can be turned on or off by the configuration registers in the SPM. For example, before estimating the first line of a depth map, the proposed hardware must pre-fetch 20 lines, and stages 2~7 must not process any data because the input buffer still holds invalid image data. In this case, the signal generator blocks the propagation of the enable signal and performs stage 1 only for the remaining 19 lines. In this situation, all other stages are stalled and clock-gated to avoid redundant power consumption. Thanks to this simple control architecture, the control logic occupies only 0.26% of the overall DEP area while supporting various input image resolutions.
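As a rough behavioral illustration of this control scheme (not the RTL), the toy model below emits per-line enable vectors for the 7 stages and blocks the propagation to stages 2~7 while the input buffer is still being pre-fetched.

```python
def spm_control(num_lines, prefetch_lines=19, stages=7, line_cycles=480):
    """Toy model of the SPM enable propagation: stage 1 always runs, while the
    enable for stages 2-7 is blocked until the input buffer pre-fetch is done."""
    for line in range(num_lines):
        propagate = line >= prefetch_lines          # block stages 2-7 while pre-fetching
        en = [True] + [propagate] * (stages - 1)    # EN vector for the 7 stages
        yield {"line": line, "EN": en, "cycles": line_cycles}

# First 19 lines: only stage 1 (input pre-fetch); afterwards all 7 stages run.
for ctrl in spm_control(num_lines=22):
    print(ctrl["line"], ctrl["EN"])
```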
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 13. Stereo pipeline module control (a) Structure of single pipeline stage hardware,
                                 (b) Control path of stereo pipeline module, (c) Timing diagram of pipeline control
                                 signal
                              
                            
                        
                        
                     
                     	
                  
                
             
            
                  IV. IMPLEMENTATION RESULTS 
               
                     1. Chip Implementation Results 
                  	
The proposed 1400x2000 μm² DEP shown in Fig. 14 is fabricated in a 65 nm 1P8M logic CMOS process, and Table 3 summarizes the chip specification. We redesigned the previous DEP block (29) into a standalone chip with improvements in debugging functionality, resolution scalability, external interface, and timing performance. It consumes 47.2 mW at a throughput of 175 fps (5.71 ms), which is the maximum performance at a 1.2 V supply voltage and a 166 MHz operating frequency, and only 15.56 mW at 105 fps (9.52 ms) with 1.0 V and 100 MHz. The proposed hardware estimates QVGA-resolution depth images with a maximum disparity of 60 levels. Its maximum energy efficiency is 34 pJ/level·pixel at a 1.0 V supply voltage. The required memory is reduced by 54.6% to 17.9 KB compared with the state-of-the-art result (10), which makes it possible to integrate all intermediate data into on-chip memory thanks to the algorithm and pipeline buffer optimization. Also, the measured 15.56 mW power dissipation corresponds to a 34 pJ/level·pixel energy consumption, a 75.6% reduction compared with the state-of-the-art (9).
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 14. Chip photograph
                            
                        
                        
                     
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 3. Specification of the proposed DEP
                           
                           
                              
                              
                              
                              
                           
                         
                        
                        
                        
                     
                     	
                  
                
               
                     2. Evaluation System Implementation
                  	
Fig. 15 shows the evaluation system of the proposed DEP integrated into the HMD system, where the DEP communicates with a host processor (Exynos-5422 application processor) over a USB 3.0 interface. Stereo images are retrieved from the customized stereo camera and converted to grayscale by the host processor. The host processor then sends the images to the target HMD platform, which forwards them to the DEP. The overall stereo processing latency is 9.95 ms including the USB 3.0 communication latency between the DEP and the host processor, which is hidden behind the depth-estimation operations due to the streaming processing. The host processor performs 3D hand pose estimation by (18), and the 3D hand poses are utilized for a customized UI. The final extracted depth maps from the DEP are visualized on a monitor.
                     
                  
                  
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 4. Performance comparison table 
                           
                           
                              
                              
                              
                              
                           
                         
                        
                        
                        
                     
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 5. Average depth error on Middlebury dataset (28) 
                              
                           
                           
                              
                              
                              
                              
                           
                         
                        
                        
                        
                     
                     	
                  
                
               
                     3. Evaluation Results
                  	
We evaluate the proposed DEP on both the Middlebury stereo dataset (28) and hand pose estimation errors. To acquire the hand pose estimation errors, (18) is applied to the extracted depth maps. Table 5 shows the average depth error on (28), which includes the Tsukuba, Venus, Teddy, and Cones images. It is evaluated for all regions, non-occluded regions, and depth-discontinuity regions of the test images, and the average errors are 10.7%, 7.1%, and 16.7%, respectively. Compared with the original algorithm (24,27), only 0.1% of accuracy is lost across the three categories, which is negligible for the 3D HGI. In addition, we also evaluate the proposed DEP with the hand pose estimation algorithm (18) and the HMD system shown in Fig. 15. To keep the hand pose estimation within a 30 ms latency, we reduced the sample points and iterations to 128 points and 16 iterations, respectively. Also, our evaluation software pipelines image retrieval, depth estimation, hand pose estimation, and visualization to realize an overall 40 ms latency. Fig. 16 shows the evaluation results of hand pose estimation with the DEP. First, input images are sent to the DEP, which generates the depth maps shown in the 2nd and 5th columns. Even though they show depth errors in background regions due to occlusion by the foreground hands, they show reasonably accurate depth quality in the hand regions. The 3rd and 6th columns of Fig. 16 show the final hand pose results. Because (18) performs hand model regression with sampled depth points, which are the 128 most reliable depth points in the hand regions, the results show accurate hand poses.
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 15. Evaluation system
                            
                        
                        
                     
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 16. The evaluation results of hand pose estimation for the 3D hand gesture interface
                            
                        
                        
                     
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 6. Hand pose estimation error 
                           
                           
                              
                              
                              
                              
                           
                         
                        
                        
                        
                     
                     
Table 6 shows the hand pose estimation errors in the range of 25~35 cm, which is the usual active distance of the 3D HGI on HMD systems. The maximum errors are 13.64 mm and 12.00 mm for the finger and palm regions, respectively, where the corresponding average errors are 7.18 mm and 6.28 mm. Since the original algorithm (18), which utilizes a ToF sensor instead of stereo matching, shows an average hand tracking error of 5 mm, the accuracy of the hand tracking system with the proposed DEP is adequate to provide a natural UI for AR/MR systems.