I. INTRODUCTION
               
Recently, virtual reality (VR) and mixed reality (MR) have become important applications for smart mobile devices such as smartphones and head-mounted displays (HMDs). Conventionally, these devices rely on a touchscreen or a hand-held controller to interact with virtual 3D objects. However, because a touchscreen supports only 2D interactions in a 2D plane, it is cumbersome in VR/MR environments, where 2D inputs must be mapped to 3D interactions. A hand-held controller, on the other hand, does support 3D interactions, but the additional control device makes it inconvenient for VR/MR applications. Therefore, a 3D hand gesture interface (HGI) that supports intuitive 3D interactions without any additional controller has drawn active attention as a replacement for conventional UIs on smart mobile devices.
                  
               
               
Fig. 1 describes the 3D HGI on an HMD system. First, the system generates 3D depth maps of the user's hands, and the host processor in the HMD calculates the location and rotation of the hands in a virtual 3D space from the extracted depth maps. This information supports translation, rotation, and manipulation of virtual 3D objects. These interactions require smart devices to acquire an accurate depth map of the input scene, because the robustness of HGI in MR depends strongly on the accuracy of the depth information.
                  
                  
                  
                  
                  
                  
                     
                           
                           
Fig. 1. 3D hand gesture interface system. 
                         
                     
                     
                  
                  
There are three general approaches to extracting accurate depth maps: a time-of-flight (ToF) camera, a structured light system, and a stereo vision system (1). The ToF camera calculates distance by measuring the travel time of light emitted from a vertical-cavity surface-emitting laser (VCSEL) between the camera and the objects. However, it suffers from large power consumption (> 2.1 W) for infrared light emission (2). For example, the state-of-the-art HMD system (Hololens) (3) integrates a 16.5 Wh battery, while it requires over 4.1 W to perform the 3D HGI including the ToF sensor and the mobile processor (Intel Atom x5-z8500). This power consumption limits the lifetime of the HMD system to only 2~3 hours, which is not feasible for an always-on 3D HGI. Although today's ToF sensors targeting mobile applications consume less power, around 200~300 mW (4-6), depth sensing must dissipate even lower power since it is used as an always-on interface for the 3D HGI. Therefore, ToF is not feasible for low-power 3D HGI in mobile systems, considering their limited power budgets. The structured light system projects patterned light and measures distance from the distortion of the projected pattern. This system also consumes more than 2.25 W (7) because of both its light projection and depth calculation, which is likewise not feasible for mobile devices. To overcome these limits of active sensor-based approaches, a stereo vision system, which estimates a depth map by triangulation between two cameras in a way similar to how human eyes perceive depth, is used for mobile devices. It extracts the disparity between the left and right images by sliding-window matching, and the distance of an object is inversely proportional to the measured disparity of the matched pixels. A stereo vision system without active sensors has an advantage in power consumption. Therefore, a low-power and low-latency depth-estimation processor is required instead of active depth sensors (2-5), because low-power and real-time operations are essential in mobile UI applications.
Several works implemented stereo matching processors in ASICs (8-13) and FPGAs (14,15), but they still consumed too much power to be used as a 3D HGI sensor, since their target was wide-range depth estimation for high-end applications such as unmanned vehicles. Although (16) consumes less power than the other previous works (8-13), it cannot provide accurate 3D HGI due to the poor depth accuracy of its block matching algorithm, as described in detail in Section II. Thus, stereo matching with local aggregation is adequate for low-power 3D HGI. However, it causes massive memory accesses and computation, so real-time operation is almost impossible on CPU or GPU systems (17). In terms of latency, depth estimation must finish within 10 ms, leaving the rest of the budget for the hand pose estimation (18), since the overall UI latency should be below 40 ms (19). Meanwhile, it should consume less than 50 mW, which is only 5% of the power consumed by a general application processor, so that the UI can run always-on during the entire operation of HMD or MR devices.
                  
               
               
Fig. 2 shows the overall stereo matching flow, which consists of 4 stages: initial matching, cost aggregation, winner-takes-all (WTA), and consistency check. First, the initial matching stage calculates a similarity cost map between small patches (~5x5) of the left and right images, where the sum of absolute differences, the sum of squared differences, and the census transform (22) are widely used for the matching. In the initial matching stage, the size of the image template is a significant factor for depth accuracy. For example, using larger templates generates more reliable initial matching costs, as shown in Fig. 3. However, it degrades the matching cost of the hand, which is our only interest, due to the large regions of background clutter. On the other hand, small templates provide matching costs that are more robust to background clutter, while the initial matching costs become vulnerable to pixel-level noise such as illumination changes or blurring due to the reduced number of sample points within the template. In short, initial matching alone gives a poor depth map because the optimal template size is crucial to depth accuracy, yet it varies with object distance. Therefore, the state-of-the-art algorithms (21-25) essentially exploit cost aggregation, which aggregates the matching costs of neighboring pixels, to refine the initial depth map. They usually exploit small (1x1 to 5x5) template matching to reject the background clutter effect, combined with large (15x15 to the entire image) aggregation regions. After that, the WTA stage selects the best-matched depth index from the aggregated cost map. Finally, the left-right consistency checking stage eliminates mismatches and occlusions by comparing the left and right depth maps. Among the stages, the initial matching and cost aggregation cause a large amount of computation and memory accesses because they are performed at every disparity level (Fig. 2). For example, stereo matching requires over 630.7 Gflops and 18.3 GB/s for 100 fps with 60 disparity levels at QVGA (320x240) resolution. Moreover, 81% of the whole computation and 92% of the memory accesses are concentrated in the cost aggregation stage, making it the most power-consuming part.
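To make the four-stage flow concrete, the following sketch implements it in Python with deliberately simple placeholders: per-pixel absolute differences for the initial matching, plain box filtering for the aggregation, argmin for WTA, and a left-right check. The cost metric, window sizes, and tolerance are illustrative choices, not the algorithm adopted later in this paper.

```python
import numpy as np

def box_filter(cost, radius):
    """Aggregate costs over a (2*radius+1)^2 window using an integral image."""
    H, W = cost.shape
    ii = np.pad(cost, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    out = np.empty_like(cost)
    for y in range(H):
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        for x in range(W):
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            out[y, x] = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
    return out

def stereo_match(left, right, max_disp=60, agg_radius=7, lr_tol=1):
    """Four-stage flow: initial matching -> cost aggregation -> WTA -> L-R check."""
    H, W = left.shape
    left, right = left.astype(np.float32), right.astype(np.float32)
    cost_l = np.full((max_disp, H, W), np.inf, dtype=np.float32)
    cost_r = np.full((max_disp, H, W), np.inf, dtype=np.float32)
    for d in range(max_disp):
        # 1) initial matching: per-pixel absolute difference at disparity d
        diff = np.abs(left[:, d:] - right[:, :W - d])
        # 2) cost aggregation: plain box filtering (stands in for ASW here)
        agg = box_filter(diff, agg_radius)
        cost_l[d, :, d:] = agg
        cost_r[d, :, :W - d] = agg
    # 3) winner-takes-all: disparity with the minimum aggregated cost
    disp_l, disp_r = cost_l.argmin(axis=0), cost_r.argmin(axis=0)
    # 4) left-right consistency check: invalidate mismatched/occluded pixels
    ys = np.arange(H)[:, None]
    back = np.clip(np.arange(W)[None, :] - disp_l, 0, W - 1)
    valid = np.abs(disp_l - disp_r[ys, back]) <= lr_tol
    return np.where(valid, disp_l, -1)
```

Even this simplified version shows why aggregation dominates: the filtering is repeated once per disparity level, i.e., 60 times per frame for the target configuration.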
                  
               
               
                  
                  
                  
                  
                  
                  
                     
                           
                           
Fig. 2. Dataflow of stereo matching process. 
                         
                     
                     
                  
                  
                  
                  
                  
                  
                  
                     
                           
                           
Fig. 3. Effect of template size in the initial matching operation.
                         
                     
                     
                  
                  
To meet the above-mentioned computational requirements, massively parallel designs using more than 60-way processing units have been adopted (8-15). They used DDR3 DRAM with caches or wide-bandwidth (1612 b) SRAM to handle the huge on-chip and off-chip memory bandwidth. However, both high-speed external memory and wide-bandwidth SRAM cause large power consumption and area overhead.
                  
               
               
In this paper, we propose a low-power and low-latency depth-estimation processor (DEP) with reduced memory bandwidth through algorithm-hardware co-optimization with the following 3 key features: 1) shifter-based adaptive support-weight cost aggregation that replaces complex floating-point operations with integer operations for power and memory bandwidth reduction; 2) a line-streaming 7-stage pipeline architecture to achieve high utilization and reduce the additionally required memory; and 3) a shift register-based pipeline buffer to reduce area. The proposed chip is designed for 320x240 image resolution, which is sufficient for the 3D HGI because the adopted algorithm (18) requires 60x60 input hand images and the size of the hand region is usually 60x60 ~ 128x128 at a 15 cm ~ 30 cm range in general webcam environments. As a result, the total normalized power dissipation and the required memory can be reduced by 74.7% and 54.6%, respectively, compared with the state-of-the-art hardware (9,10), while achieving up to 175 fps at 150 MHz under QVGA resolution.
                  
               
               
The rest of this paper is organized as follows. Section II describes the optimal algorithm selection for the 3D HGI and the proposed shifter-based cost aggregation algorithm as well as its hardware architecture. In Section III, the overall architecture of the depth-estimation processor (DEP) with the 7-stage pipeline, pipeline buffer optimization, and resolution-scalable pipeline control is explained with detailed hardware implementations. Section IV presents the system implementation with the proposed chip and the evaluation results, followed by the conclusion in Section V.
                  
               
               
                  
                  
                  
                  
                  
                  
                     
                           
                           
Fig. 4. Hand depth images of (a) original input image, (b) global aggregation, (c)
                              local aggregation, (d) block aggregation.
                           
                         
                     
                     
                  
                  
               
             
            
                  II.  SHIFTER-BASED COST AGGREGATION
               
                     1.  Optimal Aggregation for 3D HGI
                  	
The cost aggregation is the most important stage in depth estimation, in view of not only accuracy but also memory accesses and computation. There are three basic categories of cost aggregation algorithms: global aggregation (21-23), local aggregation (24,25), and block aggregation (26). Global aggregation was utilized in (8-11,13,15), while (12,14) adopted a local aggregation method and (16) utilized block aggregation. Fig. 4 shows the depth-estimation results for each aggregation algorithm: semi-global (23), adaptive support weight (ASW) (24), and simple block aggregation (SSD + mean filtering) (26). The global aggregation method aggregates the initial cost maps to minimize the overall sum of matching costs. Its aggregation paths are fully connected, and the final depth points are selected by comparing all of the cost values along every possible aggregation path. Because it explores all possible aggregation paths, global aggregation automatically interpolates ambiguous depth regions such as textureless regions, occluded regions, or repeated patterns compared with the other methods, as shown in Fig. 4(b). Moreover, it generates a dense depth map without any additional post-processing. However, its fully-connected aggregation paths require large computation and intermediate data, whose complexities are O(W×H×D²) and O(32×W×H×D), respectively, as shown in Table 1. Next, the local aggregation method aggregates the cost maps over the same disparity level. It usually utilizes supporting filters generated by intensity differences (24) or segmentation regions (25) to improve the accuracy of the depth map, since it does not aggregate across different disparity levels. The critical drawback of this method is that it cannot interpolate ambiguous regions because the aggregation is explored on only a single disparity level. However, local aggregation provides a much sharper image and accurate depth information for close objects. It also provides as high-quality a depth map as global aggregation does for the 3D HGI in the active region, because the active distance to the hands is 20 ~ 40 cm and the hands are always located closer than other background objects. Compared to global methods, its computation and intermediate data complexities are reduced to O(W×H×D) and O(16×W×H) because it explores only the same disparity level. Finally, the block aggregation method aggregates costs within a fixed-size box region. As shown in Fig. 4(d), this method provides the worst depth-map quality among the three methods because it just sums the initially matched costs without any supporting weights. On the other hand, it reduces the computation complexity and the required memory compared with the other two methods because INT16 is sufficient for its simple summation-only aggregation. However, its average pixel error is 14.2%, and its result is too poor to realize accurate 3D HGI, as shown in Fig. 4 and Table 2. Therefore, local aggregation is the optimal algorithm for the mobile HGI to realize both low latency (< 10 ms) and low power (< 50 mW) in terms of both accuracy and algorithm complexity. In this paper, we utilize and optimize ASW (24,27) among the variants of local aggregation methods for the proposed hardware.
                     
                  
                  
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 1. Complexity comparison among aggregations 
                           
                           
                              
                              
                              
                              
                           
                         
                        
                        
                        
                     
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 5. Operations of adaptive support weight aggregation
                            
                        
                        
                     
                     
Fig. 5 describes the operations of ASW with 60 disparity levels, where it aggregates initial costs level by level. For each disparity level, it performs sequential aggregation along four directions (right, left, top, and bottom) for every pixel, where the horizontal and vertical aggregations are performed in order for higher accuracy. Cost aggregation along each direction performs a weighted summation where the weights are generated by gestalt grouping (24), which is formulated by using a Laplacian kernel of the color difference between a center pixel and an aggregated pixel. However, it must use a 32-bit floating-point (FP) number system for costs and weights since it requires exponent computation, which results in power-consuming FP ALUs as well as huge memory bandwidth. Moreover, the weighted summations are performed for all pixels and disparity levels, requiring large computation even though local aggregation has lower computational complexity than global aggregation. For example, it requires 579.4 Gflops and 14.6 GB/s for 100 fps under QVGA resolution, implying that careful optimization is required. Therefore, in the next section we introduce a hardware-friendly ASW algorithm that uses integer instead of FP cost values.
                     	
                  
               
 
               
                     2. Shifter-based Cost Aggregation Processing
                  	
The Laplacian scale factor used in the ASW algorithm (24,27) is computed from the absolute difference of adjacent pixels' intensities:
                     
                  
                  
$$ w_{i} = \frac{\left| I_{center} - I_{i} \right|}{\sigma} \qquad\qquad (1) $$
                     
where σ is the supporting parameter, whose values used in the proposed hardware are 2, 4, 8, and 16. Then, the ASW cost with these weights is described as
                     
                  
                  
$$ C_{ASW} = C_{center} + \sum_{i} e^{-w_{i}} \cdot C_{i} \qquad\qquad (2) $$
                     
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 2. Depth error comparison of aggregation methods 
                           
                           
                              
                              
                              
                              
                           
                         
                        
                        
                        
                     
                     
In Eq. (2), the ASW cost is calculated by a weighted summation of successive costs $C_i$ (24,27). $C_{center}$ indicates the initial matching cost of the center point and the $C_i$ are the costs of the neighboring pixels. To calculate the exponent operations, (24,27) utilizes a 32-bit FP number system to reduce truncation errors during aggregation. Due to the large area and power consumption of FP logic, (12) deployed a 24-bit INT number system to reduce these overheads with a 6.8% average pixel error, which is comparable with the accuracy of (24) (6.5%), as shown in Table 2. However, its 24-bit number system still incurs a large overhead in intermediate memory size and in the area of 24-bit multipliers, so the proposed algorithm applies additional approximations to further reduce the bit-width of the costs.
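For reference, a minimal software version of Eq. (2) for one horizontal pass at a single disparity level might look as follows; each neighboring cost is weighted by $e^{-w_i}$, which is exactly the exponent evaluation and FP multiply-accumulate that the following approximations remove. The window span and σ value here are illustrative (σ = 8 is one of the supporting-parameter values mentioned above).

```python
import numpy as np

def asw_aggregate_row(costs, intensities, span=7, sigma=8.0):
    """One horizontal ASW pass over a single row at one disparity level (Eq. (2)).

    costs       : initial matching costs of this row
    intensities : gray values of the reference image row
    span        : neighbors aggregated on each side (span=7 -> 15-pixel window)
    sigma       : supporting parameter of the Laplacian weight (Eq. (1))
    """
    W = len(costs)
    out = np.empty(W, dtype=np.float64)
    for x in range(W):
        acc = float(costs[x])                       # C_center
        for i in range(max(0, x - span), min(W, x + span + 1)):
            if i == x:
                continue
            w = abs(float(intensities[x]) - float(intensities[i])) / sigma
            acc += np.exp(-w) * costs[i]            # FP exponent + multiply-accumulate
        out[x] = acc
    return out
```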
                     
                  
                  
The stereo matching algorithm finds the best-matched pair of points by the WTA algorithm, which searches for the index of the minimum cost along the depth levels. Therefore, the resulting depth map from WTA does not change if the inequality between any two costs is preserved after the approximations. In the first step of the approximations, the base of the Laplacian kernel is changed from Euler's number to 2 as
                  
                  
$$ C_{ASW} \approx C_{center} + \sum_{i} 2^{-w_{i}} \cdot C_{i} \qquad\qquad (3) $$

$$ C_{ASW} \approx C_{center} + \sum_{i} \left( C_{i} \gg w_{i} \right) \qquad\qquad (4) $$
                     
The modification in Eq. (3) does not change the inequality condition since $2^{-x} = e^{-x \ln 2}$ is, like $e^{-x}$, a monotonically decreasing function of $x$. After that, the base-2 ASW cost is approximated by a shifting operation as in Eq. (4), because a right shift by $x$ corresponds to multiplication by $2^{-x}$, and it still preserves the inequality condition without any loss of generality. As a result, as shown in Fig. 6 and Table 2, the accuracy difference between the proposed shifter-based aggregation and the previous integer-based ASW algorithm (12) is -3.82%, 3.79%, +1.59%, and +0.65% on the Tsukuba, Venus, Teddy, and Cones cases of the Middlebury stereo dataset (28), respectively, while achieving a large bit-width reduction.
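A corresponding sketch of Eq. (4) replaces the exponential weighting with an integer right shift; the clipping of the weight to a 3-bit shift amount anticipates the bit-widths chosen in the next subsection, and the remaining parameters are illustrative.

```python
def shifter_asw_aggregate_row(costs, intensities, span=7, sigma=8, w_bits=3):
    """Shifter-based ASW pass (Eq. (4)): integer costs, weights used as shift amounts."""
    W = len(costs)
    out = [0] * W
    max_shift = (1 << w_bits) - 1                   # 3-bit weight -> shifts of 0..7
    for x in range(W):
        acc = int(costs[x])                         # C_center
        for i in range(max(0, x - span), min(W, x + span + 1)):
            if i == x:
                continue
            # sigma is a power of two (2/4/8/16), so this divide is itself a shift
            w = abs(int(intensities[x]) - int(intensities[i])) // sigma
            w = min(w, max_shift)                   # clip to the 3-bit shifter operand
            acc += int(costs[i]) >> w               # C_i * 2^-w as an integer right shift
        out[x] = acc
    return out
```

Because $2^{-w}$ decreases monotonically in $w$ just as $e^{-w}$ does, this integer version relies on the same inequality-preservation argument made above.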
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 6. Results of the proposed shifter-based aggregation (a) Input image, (b) Ground
                                 truth, (c) Adaptive support weight, (d) Shifter-based adaptive support weight
                              
                            
                        
                        
                     
                     	
                  
                
               
                     3. Shifter-based Aggregation Unit
                  	
Fig. 7(a) describes a hardware implementation of ASW that consists of an exponent unit, a multiplier, and an adder implemented in 32-bit FP as in Eq. (2). It takes weights and costs as inputs and calculates $C_{acc} + C \cdot e^{-w}$ every cycle. The cost aggregated over one direction is then stored in an accumulation register. The FP exponent logic and the FP MAC require complicated hardware using either lookup tables or piecewise-linear approximation schemes to reduce hardware complexity. However, both approaches still require large on-chip memory and complex processing logic compared with integer-based hardware. In addition, a DEP requires highly parallel aggregation unit arrays (e.g., > 270-way), so the area and power overheads are critical.
                     
                  
                  
Unlike the FP-based unit, the proposed aggregation unit in Fig. 7(b) requires only a barrel shifter and an integer adder. It realizes the multiplication between an input cost and the exponential of a weight with a single shifting operation. The proposed shifter-based ASW also enables the use of an integer number system during aggregation. The initial costs are generated by 8-point selective census matching within a 5x5 template, so their maximum value is 8. They are then aggregated by the proposed ASW within a 15x15 aggregation region, and the maximum values of the intermediate and the final aggregated costs are 120 and 1800, respectively. Thus, the formerly 32-bit widths of the initial, intermediate, and final costs are set to 4 bits, 7 bits, and 11 bits, respectively, without overflow. The maximum value of the weights, which represent the aggregation strength between neighboring pixels according to their similarity, is determined empirically. Simulation results show that 3 bits are enough for the shifter-based ASW processing without accuracy degradation. As a result, the proposed unit contains only a 4 (7)-bit barrel shifter with a 3-bit operand and a 6 (11)-bit accumulator for the vertical (horizontal) direction, respectively, reducing power consumption by 92.2% compared with the original FP-based implementation.
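The resulting per-direction datapath can be modeled with a few lines of code: a shift, an add, and a width check. The worked example reproduces the worst case quoted above (fifteen costs of value 8 aggregated with a zero shift give 120, which fits in 7 bits); the assertion-based width check is only a software stand-in for the fixed-width registers.

```python
def aggregation_pe(cost_in, weight, acc, acc_bits):
    """One cycle of the shifter-based aggregation PE: barrel shift + integer add.

    cost_in  : incoming cost (4-bit initial or 7-bit intermediate cost)
    weight   : 3-bit shift operand derived from the intensity difference
    acc      : current accumulator value
    acc_bits : accumulator width (7 bits covers the 120-max intermediate costs,
               11 bits covers the 1800-max final costs)
    """
    assert 0 <= weight < 8, "weight must fit in the 3-bit shifter operand"
    acc += cost_in >> weight
    assert acc < (1 << acc_bits), "chosen accumulator width must not overflow"
    return acc

# Worst case from the text: 15 costs of value 8 aggregated with weight 0 -> 120
acc = 0
for _ in range(15):
    acc = aggregation_pe(8, 0, acc, acc_bits=7)
print(acc)  # 120, which fits in 7 bits
```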
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 7. (a) FP-based aggregation unit, (b) Proposed shifter-based aggregation unit
                            
                        
                        
                     
                     
In addition to the power reduction, the bit-width reduction of the processing data also drastically reduces the overall intermediate data size by 69.1%. The resulting reduction of the required intermediate memory to 31.9 KB makes it possible to integrate all the intermediate buffers on the chip, removing external memory accesses during stereo matching.
                     	
                  
               
 
             
            
                  III.  PROPOSED DEPTH-ESTIMATION PROCESSOR
               
                     1. Overall Architecture
                  	
Fig. 8 describes the overall architecture of the proposed DEP, which is composed of a top controller, an input image loader, an output depth buffer, and a stereo pipeline module (SPM). The 7-stage pipelined SPM estimates depth line by line. It is composed of an input buffer, a census transformation unit, an initial matching unit, a vertical aggregation unit, a horizontal aggregation unit, a WTA unit, and a left-right (L-R) consistency check unit. First, the input image loader fetches 8 b left and right pixels from an external memory and stores them into the 320x20 input buffer in the SPM every clock cycle. After 20 lines of input have been fetched into the input buffer, the census transformation unit generates 30 left and right binary patterns and the corresponding aggregation weights from the 20-line inputs every cycle. Then, the initial matching unit calculates the Hamming distance between left and right census pairs and extracts 74 initial cost maps every 60 cycles. Next, the initial cost maps from the previous stage are aggregated by the vertical and horizontal aggregation units in order, with 248-way and 240-way parallelism, respectively. After that, the WTA unit searches for the best-matched index between the left and right images and generates left and right depth maps. Finally, the L-R consistency check unit compares the left and right depth maps to eliminate falsely matched depth points, which come from occluded or textureless points, and the 60 final depth points are stored into the output depth buffer every 60 cycles. The proposed shifter-based ASW completely eliminates external memory accesses during SPM operation by holding all of the intermediate data inside the pipeline buffers. To achieve 10 ms stereo matching latency, the initial matching, vertical aggregation, and horizontal aggregation units are composed of homogeneous 148-way, 148-way, and 120-way parallelized PEs, respectively.
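The census transformation and initial matching stages follow the standard census/Hamming-distance scheme; the sketch below shows a dense 5x5 census and its Hamming-distance cost as a functional stand-in for the 8-point selective census used on the chip.

```python
import numpy as np

def census_transform(img, radius=2):
    """Dense census transform: one bit per neighbor, set when neighbor > center."""
    H, W = img.shape
    pad = np.pad(img, radius, mode="edge")
    codes = np.zeros((H, W), dtype=np.uint32)       # 24 neighbor bits for a 5x5 window
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            nbr = pad[radius + dy:radius + dy + H, radius + dx:radius + dx + W]
            codes = (codes << np.uint32(1)) | (nbr > img).astype(np.uint32)
    return codes

def census_cost(code_l, code_r):
    """Initial matching cost = Hamming distance between left/right census codes."""
    diff = np.bitwise_xor(code_l, code_r)
    lut = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)
    as_bytes = diff.view(np.uint8).reshape(diff.shape + (4,))
    return lut[as_bytes].sum(axis=-1)
```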
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 8. The overall depth-estimation processor architecture
                            
                        
                        
                     
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 9. The timing diagram of the proposed DEP with hierarchical pipelining
                            
                        
                        
                     
                     
Fig. 9 describes a timing diagram of the proposed DEP operations with hierarchical pipelining. The first level is a line-level pipeline with 3 stages: line loading, line processing, and line storing. The SPM estimates 1 line of the depth map every 480 clock cycles. Each line processing stage consists of a 7-stage pixel-level pipeline: input pre-fetching, census transformation, initial matching, vertical aggregation, horizontal aggregation, WTA, and consistency check. Each stage processes its pixel-level operations every 8 clock cycles. All the pipeline stages are well balanced to achieve 94% utilization.
                     	
                  
               
 
               
                     2.  Line Streaming 7-stage Pipeline Architecture
                  	
Fig. 10 describes the data processing patterns of the sliding-window matching and the 4-direction cost aggregation. In the initial matching stage, a right (reference) patch and a left (target) patch are compared to generate initial costs. In this operation, the right patch is reused 60 times while the left patch slides toward the right. In a general implementation, the target patches are fetched into the left buffer and a wide I/O multiplexer (MUX) reorders its data to align with those in the right buffer. However, such a wide MUX causes large area overhead and routing congestion because it must be connected to all of the ports of the matching PEs. In contrast, the proposed architecture with a shifting register (SR)-based buffer for the target patch (marked in red) moves by 1 index every pipeline cycle, as indicated in Fig. 10(a). Meanwhile, the reference patch stored in the blue RFs is loaded every 60 pipeline cycles. The 4-direction cost aggregation is obtained by recursively performing the bi-directional aggregation for top/bottom and then right/left, as shown in Fig. 10(b). The aggregation window size is 15x15, and at most 8 costs are aggregated in a single aggregation PE. Therefore, the initial costs in a buffer are selected with cyclic indexing and issued to the forward and backward aggregation units. The bi-directionally aggregated costs are generated through both units after 8 clock cycles.
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 10. Hardware implementations in stereo matching (a) Matching hardware with shifting register-based buffer, (b) Aggregation units with 2-direction MUX-based buffer
                              
                            
                        
                        
                     
                     
There are 5 pipeline buffers in the SPM: the input buffer, the left and right census registers, the initial cost register, the intermediate cost registers for vertically aggregated costs, and the final aggregated cost register. The overall latency of this pipeline is 480 clock cycles, and the buffers latch and fetch data in synchronized pipeline cycles. First, the 3-banked input buffer issues 3 pixels of the left and right images to the left and right census transform units per clock cycle. This operation takes 320 cycles to issue 1 line of the input images, and the remaining 160 cycles are used to fetch the next line into the input buffer from external memory. Second, the left and right census units simultaneously transform the pixels of the input images into 15 census pixels every clock cycle. These are then stored into the left and right census registers shown in Fig. 11. The right census buffers utilize double buffering and are swapped every 60 pipeline cycles (480 clock cycles), and the left census buffer is composed of the SR-based buffer architecture introduced in Fig. 10. Third, the upper and lower lines from the active left/right census buffers are fetched, and the initial matching units calculate 2 lines (148 words) of the initial costs element by element every clock cycle, as Fig. 11 describes. Fourth, bi-directional vertical aggregation is performed with 148-way vertical aggregation units that generate 74 upper and 74 lower initial costs every 8 clock cycles. To eliminate a pipeline stall in the vertical aggregation, as shown in Fig. 11, the initial matching and vertical aggregation process different lines of data with a 1-index shift. Finally, the horizontal aggregation is performed with 120-way horizontal aggregation units, and the resulting aggregated costs are stored in the final aggregated cost registers. As with the vertical aggregation, the intermediate cost buffer exploits a double-buffering architecture to reduce pipeline stalls in the horizontal aggregation. Both the vertical and horizontal aggregation buffers utilize the MUX-based buffer shown in Fig. 10(b). As a result, the proposed SPM processes 300 depth points with initial matching and aggregation every 2400 clock cycles at 60 disparity levels, and its average utilization is 94%.
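The line-streaming behavior itself can be summarized with a small conceptual model: only a rolling window of input rows is ever held, and one depth line is emitted per incoming line pair once the window is full. The `process_line` argument stands in for stages 2~7 and is an assumption of this sketch, not part of the original design description.

```python
from collections import deque

def line_streaming_depth(stream_left, stream_right, process_line, window_lines=20):
    """Conceptual line-streaming model: keep a rolling window of input rows and
    emit one depth line per incoming line pair once the window is full.

    stream_left/right : iterables yielding one image row at a time
    process_line      : stand-in for stages 2-7 (census, matching, aggregation,
                        WTA, L-R check) applied to the buffered rows
    window_lines      : rows kept in the on-chip input buffer (20 in the DEP)
    """
    buf_l = deque(maxlen=window_lines)
    buf_r = deque(maxlen=window_lines)
    for row_l, row_r in zip(stream_left, stream_right):
        buf_l.append(row_l)            # pre-fetch the next line (stage 1)
        buf_r.append(row_r)
        if len(buf_l) < window_lines:
            continue                   # still filling the 20-line buffer
        yield process_line(list(buf_l), list(buf_r))
```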
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 11. Cost generation and pipeline buffer architecture of SPM: 1) Shifting register and double buffering for left and right census, 2) 2-path initial matching and initial cost buffer, 3) 2-path vertical aggregation and horizontal aggregation
                              
                            
                        
                        
                     
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 12. Comparison between multiplexer-based and shifting register-based architecture
                                 (a) Area vs. parallelism, (b) Power vs. parallelism, (c) Area-power product vs. parallelism
                              
                            
                        
                        
                     
                     
The proposed SPM does not use any external memory during stereo matching, so the size of its pipeline RFs is critical in terms of both logic area and power consumption. To reduce the RFs, we also change the order of the aggregation directions from X-Y to Y-X, as in (12). For example, in our case, X-Y order aggregation needs 960 words of pipeline buffers, composed of 60x15 and 60x1 RFs. In Y-X order aggregation, however, only 134 words are needed, composed of 74x1 and 60x1 RFs. This optimization also reduces the weight buffers, and its effect is doubled because both the left and right buffers are reduced. Therefore, the proposed hardware further reduces the memory in the cost aggregation stages by 43.9% with only a 0.5% error penalty. As a result, thanks to the line-level processing and the changed aggregation order, the proposed SPM requires only a 17.9 KB buffer without any external memory accesses for QVGA stereo matching.
                     	
                  
               
 
               
                     3. Shifting Register-based Pipeline Buffer
                  	
There are two basic pipeline buffer architectures: MUX-based and SR-based designs. The difference is the way they align a large amount of data to a parallel PE array, by using either a multiplexer (MUX) or an SR. In the MUX-based architecture, input data from the previous stage are stored into a pipeline buffer through a wide I/O MUX. In contrast, the SR-based architecture orders data by shifting by 1 index for every inserted input. In general, the MUX-based architecture consumes less dynamic power and a smaller area than the SR-based one at low parallelism, which is why CPUs and DSPs do not deploy SR-based architectures. However, at high parallelism, which consequently requires a large number of connections, its area increases tremendously and the static power becomes dominant. These area and power overheads make it inefficient in highly parallel designs such as the proposed DEP.
                     
                  
                  
Simulations were performed to obtain the relationships of area, power, and the area-power product as a figure of merit with respect to parallelism, in order to optimize the buffer architectures; both architectures run at 150 MHz with a 1.0 V supply voltage. A barrel shifter with O(n) logic complexity is used for the MUX-based architecture for a fair comparison, because buffers in stereo matching only move indices along one direction. The baseline of normalization is the MUX-based architecture, and both are tested from 5-way to 100-way. As shown in Fig. 12(a), the MUX occupies a smaller area than the SR below 25-way, but its normalized area becomes larger than the SR's above 25-way. In terms of normalized power, shown in Fig. 12(b), the MUX always consumes less power due to the dynamic power consumption of the SRs. However, the gap between the SR and the MUX is only 4% at 100-way parallelism. Since both area and power are important in hardware design, we analyze the area-power product to find the optimal designs for the SPM buffers. As shown in Fig. 12(c), the MUX-based design performs better than the SR-based design below 40-way, while the opposite holds above 40-way. Since both the sliding-window matching and the aggregation are performed repeatedly by moving 1 index per operation, the pipeline buffers between the 7 stages can be implemented with either MUX-based or SR-based designs, and the optimization is made by selecting the architecture according to the parallelism level of each buffer. Therefore, the initial matching buffer (8-way), the vertical aggregation buffer (8-way), and the horizontal aggregation buffer (1-way) utilize the MUX-based architecture, and the left and right census buffers (74-way) utilize the SR-based architecture. As a result, the optimized buffer design improves the critical-path timing by 44% and reduces the overall area by 29.8%.
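The functional difference between the two buffer styles is easy to state in code, even though their real cost difference lies in wiring, multiplexer width, and register switching that software cannot capture; the sketch below is therefore only a behavioral illustration.

```python
from collections import deque

class MuxBuffer:
    """MUX-style buffer: entries stay in place; a read index (a wide multiplexer
    in hardware) selects the window presented to the PE array."""
    def __init__(self, depth):
        self.mem = [0] * depth
        self.head = 0
    def insert(self, value):
        self.mem[self.head % len(self.mem)] = value
        self.head += 1
    def window(self, n):
        return [self.mem[(self.head - n + i) % len(self.mem)] for i in range(n)]

class ShiftRegisterBuffer:
    """SR-style buffer: every insert shifts all entries by one index, so the PE
    array reads fixed positions and no selection multiplexer is needed."""
    def __init__(self, depth):
        self.regs = deque([0] * depth, maxlen=depth)
    def insert(self, value):
        self.regs.append(value)        # all entries shift by one position
    def window(self, n):
        return list(self.regs)[-n:]    # fixed taps
```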
                     	
                  
                
               
                     4. Resolution Scalable Pipeline Control
                  	
Due to the line-streaming processing, the depth estimation can support any input image height. For width scalability, the proposed SPM architecture supports any resolution whose width is a multiple of 60. Since the input buffer size of our DEP is 320 (max width) x 21 (max aggregation range), the DEP supports 60-, 120-, 180-, 240-, and 300-pixel-wide images without degrading utilization, using 300 of the 320 input buffer columns while the remaining 20 are used for aggregation. If this buffer is enlarged to 640 or 1920, the proposed architecture can also support VGA or Full-HD images without any change to the PE architecture. To realize this scalability, the resolution-scalable pipeline control shown in Fig. 13 is proposed for the SPM, so that the controller does not have to be altered even when the number of buffers is scaled. As shown in Fig. 13(a), the hardware block of a single pipeline stage receives only 4 signals: EN (enable), RST (reset), Pn (pipeline number), and Ln (loop number). EN controls whether data are latched into the accumulation registers or the pipeline buffers. RST resets both the accumulation register inside a PE array and the alignment index in a data alignment unit to zero. These two signals are mandatory. Pn and Ln, on the other hand, are optional signals for the aggregation, WTA, and L-R consistency check stages. Pn controls the current aggregation position and the latching of the pipeline buffers. Ln is used for the WTA and L-R consistency check operations. These 4 control signals are generated by a signal generator in the SPM, as described in Fig. 13(b). The SPM contains a 3-bit counter and a 6-bit counter for Pn and Ln, a variable-width pulse generator, and configuration registers, and together they generate all of the control signals required by the 7 pipeline stages. The SPM receives SPM_EN (global enable) and SPM_RST (line reset) from the top DEP controller and performs a 480 (60x8) cycle stereo operation. After 480 cycles, depending on the configuration setting, it automatically proceeds to the next 60 depth points or stops until the next line processing. The configuration registers store the number of pre-fetched lines, the image resolution, and debugging settings for dumping intermediate data, and the signal generator produces the variable-width EN signal from this information. Thanks to this line-level automated control, the top controller only needs to send the SPM_EN and SPM_RST signals to the SPM while processing the whole stereo matching. Fig. 13(c) describes the timing diagram of the proposed DEP. First, after the top controller in the DEP asserts SPM_EN and sends a single pulse of SPM_RST, the SPM automatically processes 1 line of depth estimation. SPM_RST resets the loop counter inside the SPM to zero, and the SPM generates enable signals for the 7 stages until the entire line is processed. After SPM_RST is asserted, the variable-width pulse generator in the SPM sends the EN signal for stage 1, which is successively propagated to stages 2~7. These signal propagations can be turned on or off by the configuration registers in the SPM. For example, before estimating the first line of a depth map, the proposed hardware must pre-fetch 20 lines, and stages 2~7 must not process any data because the input buffer still holds invalid image data. In this case, the signal generator blocks the propagation of the enable signal and performs stage 1 only for the remaining 19 lines. In this situation, all other stages are stalled and clock-gated to avoid redundant power consumption. Thanks to this simple control architecture, the control logic occupies only 0.26% of the overall DEP area while supporting various input image resolutions.
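As a rough behavioral illustration of this control scheme (not the RTL), the toy model below emits per-line enable vectors for the 7 stages and blocks the propagation to stages 2~7 while the input buffer is still being pre-fetched.

```python
def spm_control(num_lines, prefetch_lines=19, stages=7, line_cycles=480):
    """Toy model of the SPM enable propagation: stage 1 always runs, while the
    enable for stages 2-7 is blocked until the input buffer pre-fetch is done."""
    for line in range(num_lines):
        propagate = line >= prefetch_lines          # block stages 2-7 while pre-fetching
        en = [True] + [propagate] * (stages - 1)    # EN vector for the 7 stages
        yield {"line": line, "EN": en, "cycles": line_cycles}

# First 19 lines: only stage 1 (input pre-fetch); afterwards all 7 stages run.
for ctrl in spm_control(num_lines=22):
    print(ctrl["line"], ctrl["EN"])
```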
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 13. Stereo pipeline module control (a) Structure of single pipeline stage hardware,
                                 (b) Control path of stereo pipeline module, (c) Timing diagram of pipeline control
                                 signal
                              
                            
                        
                        
                     
                     	
                  
                
             
            
                  IV. IMPLEMENTATION RESULTS 
               
                     1. Chip Implementation Results 
                  	
The proposed 1400x2000 μm² DEP shown in Fig. 14 is fabricated in a 65 nm 1P8M logic CMOS process, and Table 3 summarizes the chip specification. We redesigned the previous DEP block (29) into a standalone chip with improvements in debugging functionality, resolution scalability, external interface, and timing performance. It consumes 47.2 mW at a throughput of 175 fps (5.71 ms), which is the maximum performance at a 1.2 V supply voltage and a 166 MHz operating frequency, and only 15.56 mW at 105 fps (9.52 ms) with 1.0 V and 100 MHz. The proposed hardware estimates QVGA-resolution depth images with a maximum disparity of 60 levels. Its maximum energy efficiency is 34 pJ/level·pixel at a 1.0 V supply voltage. The required memory is reduced by 54.6% to 17.9 KB compared with the state-of-the-art result (10), which makes it possible to integrate all intermediate data into on-chip memory thanks to the algorithm and pipeline buffer optimization. Also, the measured 15.56 mW power dissipation corresponds to a 34 pJ/level·pixel energy consumption, a 75.6% reduction compared with the state-of-the-art (9).
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 14. Chip photograph
                            
                        
                        
                     
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 3. Specification of the proposed DEP
                           
                           
                              
                              
                              
                              
                           
                         
                        
                        
                        
                     
                     	
                  
                
               
                     2. Evaluation System Implementation
                  	
Fig. 15 shows the evaluation system of the proposed DEP integrated into the HMD system, where the DEP communicates with a host processor (Exynos-5422 application processor) over a USB 3.0 interface. Stereo images are retrieved from the customized stereo camera and converted to grayscale by the host processor. The host processor then sends the images to the target HMD platform, which forwards them to the DEP. The overall stereo processing latency is 9.95 ms including the USB 3.0 communication latency between the DEP and the host processor, which is hidden behind the depth-estimation operations due to the streaming processing. The host processor performs 3D hand pose estimation by (18), and the 3D hand poses are utilized for a customized UI. The final extracted depth maps from the DEP are visualized on a monitor.
                     
                  
                  
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 4. Performance comparison table 
                           
                           
                              
                              
                              
                              
                           
                         
                        
                        
                        
                     
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 5. Average depth error on Middlebury dataset (28) 
                              
                           
                           
                              
                              
                              
                              
                           
                         
                        
                        
                        
                     
                     	
                  
                
               
                     3. Evaluation Results
                  	
We evaluate the proposed DEP on both the Middlebury stereo dataset (28) and hand pose estimation errors. To acquire the hand pose estimation errors, (18) is applied to the extracted depth maps. Table 5 shows the average depth error on (28), which includes the Tsukuba, Venus, Teddy, and Cones images. It is evaluated for all regions, non-occluded regions, and depth-discontinuity regions of the test images, and the average errors are 10.7%, 7.1%, and 16.7%, respectively. Compared with the original algorithm (24,27), only 0.1% of accuracy is lost across the three categories, which is negligible for the 3D HGI. In addition, we also evaluate the proposed DEP with the hand pose estimation algorithm (18) and the HMD system shown in Fig. 15. To keep the hand pose estimation within a 30 ms latency, we reduced the sample points and iterations to 128 points and 16 iterations, respectively. Also, our evaluation software pipelines image retrieval, depth estimation, hand pose estimation, and visualization to realize an overall 40 ms latency. Fig. 16 shows the evaluation results of hand pose estimation with the DEP. First, input images are sent to the DEP, which generates the depth maps shown in the 2nd and 5th columns. Even though they show depth errors in background regions due to occlusion by the foreground hands, they show reasonably accurate depth quality in the hand regions. The 3rd and 6th columns of Fig. 16 show the final hand pose results. Because (18) performs hand model regression with sampled depth points, which are the 128 most reliable depth points in the hand regions, the results show accurate hand poses.
                     
                  
                  
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 15. Evaluation system
                            
                        
                        
                     
                     
                     
                     
                     
                     
                     
                        
                              
                              
Fig. 16. The evaluation results of hand pose estimation for the 3D hand gesture interface
                            
                        
                        
                     
                     
                     
                     
                     
                     
                        
                        
                        
                        
                              
                              
Table 6. Hand pose estimation error 
                           
                           
                              
                              
                              
                              
                           
                         
                        
                        
                        
                     
                     
Table 6 shows the hand pose estimation errors in the range of 25~35 cm, which is the usual active distance of the 3D HGI on HMD systems. The maximum errors are 13.64 mm and 12.00 mm for the finger and palm regions, respectively, where the corresponding average errors are 7.18 mm and 6.28 mm. Since the original algorithm (18), which utilizes a ToF sensor instead of stereo matching, shows an average hand tracking error of 5 mm, the accuracy of the hand tracking system with the proposed DEP is adequate to provide a natural UI for AR/MR systems.