
  1. (Department of Computer Science and Engineering, Chungnam National University, 99, Daehak-ro, Yuseong-gu, Daejeon, 305-764, Korea)



Keywords: Multifunction unit, hybrid number system, logarithmic arithmetic, 3D computer graphics, shader, mobile GPU

I. INTRODUCTION

Embedded 3D graphics APIs like OpenGL-ES [1] define programmable graphics shaders to provide advanced graphics effects such as displacement mapping and per-pixel lighting. Therefore, programmable vector processors called vertex and pixel shaders were introduced to the graphics pipeline to provide these programmable shading effects [2]. These shaders are responsible for running a variety of graphics kernels that use matrix, vector, and elementary functions to simulate various kinds of advanced graphics effects. However, the limited operation set of traditional shaders does not directly support matrix-vector multiplication, vector division, the power function, and other operations whose cycle counts are critical to the performance of the OpenGL transformation and lighting (TnL) operations.

There have been studies on multifunction units for mobile shaders in [3] and [4] to support rich operation sets in power- and area-efficient ways. The work in [3] presented a unification of the vector and elementary functions on a single framework. However, it operated on fixed-point numbers, which are not compatible with modern graphics APIs that require floating-point operations [1]. Moreover, it dealt with the unification of vector and elementary functions only, not supporting the matrix operations that take the largest part in geometry processing. The work in [4] presented a unification of matrix, vector, and elementary functions on the floating-point data format for the first time, but its delay and area were not well optimized due to the complications in the unification. Specifically, the unification of the large operation set incurred a long propagation delay and a large area overhead in the programmable modules, i.e. the programmable multiplier (PMUL) and programmable adder (PADD), introduced in [4] for the unification.

In this paper, we present a multifunction unit with its delay and area optimized for mobile and wearable GPU shaders. The unit exploits logarithmic arithmetic [5] at its arithmetic core for an efficient unification of a large operation set, as in [4]. It unifies 25 non-trivial arithmetic operations on a single arithmetic unit, whose operation set is summarized in Table 1. We propose novel unification architectures that reduce the delay and area of the multifunction unit and consequently achieve reductions in delay and area of 10% and 3.5%, respectively, compared with the previous study [4].

Table 1. Operation set of 25 arithmetic operations

Category | Operations
Matrix operation | matrix-vector multiplication
Vector SIMD operations | add, sub, mul, div, sqrt, div-by-sqrt, mul-and-add, lerp
Vector product operations | dot-product, cross-product
Trigonometric functions | sin, cos, tan, sinh, cosh, tanh, asin, acos, atan, asinh, acosh, atanh
Elementary functions | power ($x^{y}$), logarithm ($\log _{x} y$)

This paper is organized as follows. Section 2 describes the multifunction unit and the proposed unification architectures for the programmable modules. Evaluations of the delay and area of the proposed unit are conducted in Section 3. Finally, we conclude in Section 4.

II. MULTIFUNCTION UNIT

The basic organization of the multifunction unit is based on [4], but its delay and area are optimized through novel unification architectures for the programmable modules in the unit. The unit unifies the matrix, vector, and elementary functions on a single four-way arithmetic hardware by supporting the 25 non-trivial arithmetic operations listed in Table 1. The hybrid number system (HNS) [6] of floating-point and logarithmic numbers is adopted in this unit to leverage its arithmetic efficiency: the floating-point inputs are converted into the logarithmic domain, where nonlinear operations become simple, and the results are restored to the floating-point domain, where linear additions and subtractions are carried out. Two novel programmable modules, i.e. the programmable converter (PCNV) and the programmable accumulator (PACC), provide the unification of multiple operations. They can be programmed into a variety of functions to implement the target operations in delay- and area-efficient ways compared with the PMUL and PADD of [4]. In this section, we present the proposed architectures for the PCNV and PACC that optimize the delay and area of the multifunction unit.
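The HNS dataflow described above can be sketched in a few lines of Python. This is an ideal-precision illustration of the principle only; the hardware uses LUT-based piecewise-linear converters, and sign and special-case handling are omitted here:

```python
import math

def hns_mul(x, y):
    """Multiply two positive floats the HNS way: convert to the
    logarithmic domain (LOGC), add, and convert back (ALOGC)."""
    s = math.log2(x) + math.log2(y)  # nonlinear op becomes linear
    return 2.0 ** s                  # antilogarithmic conversion

def hns_div(x, y):
    """Division becomes subtraction in the logarithmic domain."""
    return 2.0 ** (math.log2(x) - math.log2(y))

def hns_sqrt(x):
    """Square root becomes a halving (1-bit right shift) of the log."""
    return 2.0 ** (math.log2(x) / 2)
```

In the actual unit, the linear additions and subtractions (e.g. the final accumulations) stay in the floating-point domain, which is why the system is a hybrid of the two representations.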

1. Overall Architecture

The overall architecture of the proposed multifunction unit has five pipeline stages with four vector lanes, as depicted in Fig. 1. The logarithmic converters (LOGCs) and antilogarithmic converters (ALOGCs) for the HNS operations are adopted from [4], as shown in Fig. 2. Both the LOGC and ALOGC consist of an LUT producing shift terms and an adder tree accumulating them. Four 32-bit floating-point inputs are converted into logarithmic numbers through the 4 LOGCs in E1 stage. E2 stage includes the PCNV, which can be programmed into 4 LOGCs, 4 ALOGCs, or 4 constant multipliers (CMULs) according to the target operation under execution. E3 stage is the logarithmic arithmetic domain that contains 4 carry propagate adders (CPAs), which, along with the 4 shifters in E2 stage, provide the arithmetic cores for the logarithmic operations. The logarithmic arithmetic results are converted to floating-point numbers through the 4 ALOGCs in E3 stage. The PACC in E4 stage can be programmed into a 4-way SIMD FLP adder, a 5-input FLP summation tree, or an ALOGC according to the target operation. The 4 FLP adders in E5 stage carry out the final accumulation of the two phase results from the matrix-vector multiplication. This pipeline architecture allows a single-cycle throughput with a maximum five-cycle latency for all the operations supported in Table 1 except the matrix-vector multiplication, which demonstrates half-cycle throughput with six-cycle latency.

Fig. 1. Overall architecture of multifunction unit.


Fig. 2. HNS number converters (a) logarithmic converter, (b) antilogarithmic converter.



2. Programmable Converter

In the previous work [4], a programmable multiplier (PMUL) was incorporated in E2 stage to implement the matrix, vector, and elementary functions on a single arithmetic unit. However, the integration of the power function, which requires a 32b×24b multiplication in the logarithmic domain as expressed in Eq. (1a), incurred a long propagation delay and a large area overhead in the PMUL, since the multiplication requires a large adder tree across the entire PMUL module as shown in Fig. 3(a). Therefore, in this work, we propose a novel architecture that eliminates the multiplication by transforming the expression of the power function into Eq. (1b). This transformation eliminates the 32b×24b Booth multiplier in the PMUL, which results in substantial reductions in the delay and area of the module. In this new architecture, cascades of 2 LOGCs and 2 ALOGCs are required instead to implement the transformation presented in Eq. (1b).

(1a)
$x^{y}=2^{\left(y \times \log _{2} x\right)}$

(1b)
$2^{\left(y \times \log _{2} x\right)}=2^{\left(2^{\left(\log _{2} y+\log _{2}\left(\log _{2} x\right)\right)}\right)}$
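As an illustrative check of the transformation in Eq. (1b), the following Python sketch evaluates the power function purely through logarithmic conversions and additions (ideal precision; valid for $x > 1$ and $y > 0$ so that $\log_2(\log_2 x)$ is defined):

```python
import math

def pow_direct(x, y):
    # Eq. (1a): x**y = 2**(y * log2(x)) -- needs a multiplication
    return 2.0 ** (y * math.log2(x))

def pow_hns(x, y):
    # Eq. (1b): the multiplication y * log2(x) is itself moved to the
    # log domain, so only additions and conversions remain
    t = math.log2(y) + math.log2(math.log2(x))
    return 2.0 ** (2.0 ** t)
```

Both forms agree up to rounding, but the second needs only cascaded LOGCs and ALOGCs rather than a wide multiplier.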

Fig. 3. Comparison of programmable multiplier [4] (a) and proposed programmable converter (b).


In addition, the trigonometric functions, represented by polynomial expansions like the Taylor series as in Eq. (2), required 32b×6b Booth multipliers in the PMUL [4] to implement the term $k_{i} \times \log _{2} x$ in the logarithmic domain, which also incurred delay and area overhead.

(2)
$\sum_{i=0}^{4}\left[\oplus_{i}\left(c_{i} x^{k_{i}}\right)\right]=c_{0} x^{k_{0}}+\sum_{i=1}^{4}\left[\oplus_{i} 2^{\left(\log _{2} c_{i}+k_{i} \times \log _{2} x\right)}\right]$

where $\oplus_{i} \in\{+,-\}$ , and $c_{i}$ and $k_{i}$ are positive real and integer constants, respectively, and thus, the $\log _{2} c_{i}$’s are converted offline.

Therefore, in this work we replace the expensive 32b×6b Booth multipliers with constant multipliers (CMULs) that use just a few shifts and additions, as in Eq. (3), since the $k_{i}$'s are small integer constants.

(3)
$k_{i} \cdot X=s_{i} \cdot X+\left(X \ll p_{i}\right)+\left(X \ll q_{i}\right)+\left(X \ll r_{i}\right)$

where shift amounts $p_{i}$, $q_{i}$ and $r_{i}$ are 1, 2, or 3 to compose a small integer $k_{i}$, and $s_{i}$ is set to 0 for even values of $k_{i}$ or to 1 for odd values.

Fig. 4 shows the organization of the CMUL implementing Eq. (3). It consists of an LUT storing $p_{i}$, $q_{i}$ and $r_{i}$ per trigonometric function to produce the shift terms accordingly and an adder tree for accumulating them. This CMUL replaces the 32b×6b Booth multiplier, which brings substantial reductions in the delay and area of the module.

Fig. 4. Constant multiplier (CMUL).

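A behavioral sketch of the CMUL in Python, assuming a hypothetical LUT that decomposes each small $k_{i}$ into the $(s_{i}, p_{i}, q_{i}, r_{i})$ fields of Eq. (3) (the table entries below are illustrative, not the chip's actual coefficients):

```python
# Hypothetical LUT: k -> (s, list of shift amounts), so that
# k*X = s*X + sum(X << p) as in Eq. (3). Shift amounts are 1..3.
CMUL_LUT = {
    2: (0, [1]),
    3: (1, [1]),
    5: (1, [2]),
    7: (1, [1, 2]),
    9: (1, [3]),
    11: (1, [1, 3]),
}

def cmul(x, k):
    """Constant multiply k*x with shifts and adds only (no Booth)."""
    s, shifts = CMUL_LUT[k]
    acc = x if s else 0          # s covers the odd-k case
    for p in shifts:
        acc += x << p            # the shared adder tree sums these
    return acc
```

For example, `cmul(13, 7)` computes 13 + (13 << 1) + (13 << 2) = 91, i.e. 13×7, with two shifts and two additions.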

The final summation $\oplus_{i}$ of the terms in Eq. (2) can be implemented by programming the PACC in E4 stage into an FLP summation tree, as explained in the next subsection. The first term $c_{0} x^{k_{0}}$ is fed directly to the summation tree through the augmented bias port, as it is just a constant or simply $x$.

Eliminating the Booth multipliers for both the power and trigonometric functions, we propose a programmable converter (PCNV) in E2 stage, as shown in Fig. 3(b), to replace the PMUL from the previous work [4]. The PCNV combines the LOGC, ALOGC, and CMUL by sharing the common adder tree and thus can be programmed into 4 LOGCs, 4 ALOGCs, or 4 CMULs according to the target operation. Each lane of the PCNV accommodates an $LUT_{LOGC}$, an $LUT_{ALOGC}$ and an $LUT_{TRG}$ to produce the shift terms for each block and a shared adder tree to accumulate the terms, realizing the LOGC, ALOGC, and CMUL together in a lane. Now, the cascade of 2 LOGCs for the HNS term $\log _{2}\left(\log _{2} x\right)$ in Eq. (1b) can be realized with an LOGC in E1 stage together with this PCNV programmed into an LOGC in E2 stage. The cascade of 2 ALOGCs in Eq. (1b) can be implemented using one ALOGC in E3 stage and another in E4 stage by programming the PACC into an ALOGC, as illustrated in the next subsection. The constant multiplications in Eq. (2) can be implemented by programming the PCNV into 4 CMULs. Consequently, the PCNV eliminates both the 32b×24b and 32b×6b Booth multipliers required for the power and trigonometric functions from the module. The PCNV can also be programmed into 4 LOGCs to obtain the 8 LOGCs required for the vector operations together with the 4 LOGCs in E1 stage, or into 4 ALOGCs to get the 8 ALOGCs for matrix-vector multiplication together with the 4 ALOGCs in E3 stage.

3. Programmable Accumulator

In [4], the programmable adder (PADD) in E4 stage exploited 4 CPAs to configure the PADD into a 4-way SIMD FLP adder or a single 5-input FLP summation tree according to the target operation. This approach incurred a long propagation delay in the PADD, as it takes three CPA delays to complete the 5-input summation, as shown in Fig. 5(a). Moreover, in this work, we need to incorporate an ALOGC into this PADD to realize the cascade of 2 ALOGCs for the power function, as discussed in the previous subsection. Therefore, we instead propose a programmable accumulator (PACC) accommodating a carry save adder (CSA) tree and an $LUT_{ALOGC}$, as shown in Fig. 5(b), to replace the long CPA tree and to incorporate an ALOGC to complete the power function. Now, the final summation tree for the trigonometric functions and the final ALOGC for the power function can be implemented together in the PACC by sharing the CSA tree, as depicted in Fig. 5(b). This architecture replaces the long CPA tree with a short CSA tree and thereby substantially reduces the propagation delay. The resulting PACC can be programmed into a 4-way SIMD FLP adder for matrix and vector operations, an FLP summation tree for trigonometric functions and dot product, or an ALOGC for the power function.

Fig. 5. Comparison of programmable adder [4] (a), proposed programmable accumulator (b).


The architectures for the PCNV and the PACC proposed in subsections II.2 and II.3 bring reductions in delay. In terms of area, however, there is a decrease in the PCNV compared with the PMUL but an increase in the PACC compared with the PADD, since the elimination of the Booth multipliers in the PCNV resulted in the addition of an ALOGC to the PACC. In section III, we present an analysis and comparison of the area estimates for the PCNV and PACC along with those for the PMUL and PADD to show that the proposed architectures achieve an overall area reduction in the multifunction unit as well.

4. Operation Set Implementation

The basic schemes to implement the operation set of this multifunction unit are based on the descriptions presented in [4]. In this subsection, we briefly describe the implementation scheme for each operation to capture the idea behind the unification and the improvements over the previous work.

A. Matrix-Vector Multiplication

The geometry transformation in 3D graphics is computed by a multiplication of a 4×4 matrix with a 4-element vector, as expressed in Eq. (4), which requires 20 LOGCs, 16 adders, 16 ALOGCs, and 12 FLP adders in HNS arithmetic.

(4)
$\left[ \begin{array}{llll}{c_{00}} & {c_{01}} & {c_{02}} & {c_{03}} \\ {c_{10}} & {c_{11}} & {c_{12}} & {c_{13}} \\ {c_{20}} & {c_{21}} & {c_{22}} & {c_{23}} \\ {c_{30}} & {c_{31}} & {c_{32}} & {c_{33}}\end{array}\right] \left[ \begin{array}{l}{x_{0}} \\ {x_{1}} \\ {x_{2}} \\ {x_{3}}\end{array}\right]= \left[ \begin{array}{c}{c_{00}} \\ {c_{10}} \\ {c_{20}} \\ {c_{30}}\end{array}\right] x_{0}+\left[ \begin{array}{c}{c_{01}} \\ {c_{11}} \\ {c_{21}} \\ {c_{31}}\end{array}\right] x_{1}+\left[ \begin{array}{c}{c_{02}} \\ {c_{12}} \\ {c_{22}} \\ {c_{32}}\end{array}\right] x_{2}+\left[ \begin{array}{c}{c_{03}} \\ {c_{13}} \\ {c_{23}} \\ {c_{33}}\end{array}\right] x_{3}$

$= \overrightarrow{2}^{\left(\left[ \begin{array}{l}{\log _{2} c_{00}} \\ {\log _{2} c_{10}} \\ {\log _{2} c_{20}} \\ {\log _{2} c_{30}}\end{array}\right]+\log _{2} x_{0}\right)} + \overrightarrow{2}^{\left(\left[ \begin{array}{l}{\log _{2} c_{01}} \\ {\log _{2} c_{11}} \\ {\log _{2} c_{21}} \\ {\log _{2} c_{31}}\end{array}\right]+\log _{2} x_{1}\right)} + \overrightarrow{2}^{\left(\left[ \begin{array}{l}{\log _{2} c_{02}} \\ {\log _{2} c_{12}} \\ {\log _{2} c_{22}} \\ {\log _{2} c_{32}}\end{array}\right]+\log _{2} x_{2}\right)} + \overrightarrow{2}^{\left(\left[ \begin{array}{l}{\log _{2} c_{03}} \\ {\log _{2} c_{13}} \\ {\log _{2} c_{23}} \\ {\log _{2} c_{33}}\end{array}\right]+\log _{2} x_{3}\right)}$

This HNS matrix-vector multiplication (MAT) can be implemented in two phases on our four-way arithmetic unit, as illustrated in Fig. 6. The 16 coefficients of a transformation matrix can be pre-converted into the logarithmic domain offline. So, the required number of LOGCs is reduced from 20 to 4, only for converting the 4-way input vector ($x_{0}$, $x_{1}$, $x_{2}$, $x_{3}$), and just 2 LOGCs are required per phase, which can be obtained from the LOGCs in E1 stage. The required 16 ALOGCs are also implemented in two phases, i.e. 8 ALOGCs per phase, by programming the PCNV in E2 stage into 4 ALOGCs together with the 4 ALOGCs in E3 stage. The 16 adders for the multiplications in logarithmic arithmetic are prepared in two phases as well, i.e. 8 adders per phase, using the 4 CPAs in E1 stage and the other 4 CPAs in E3 stage. In the first phase of the MAT, the 4-way outcome of the 4 CPAs in E1 stage goes through the PCNV in E2 stage programmed into 4 ALOGCs, and the other 4-way result from the 4 CPAs in E3 stage goes through the 4 ALOGCs in E3 stage, which produces two 4-way FLP multiplication results. These two results are added together in E4 stage by programming the PACC into a 4-way SIMD FLP adder to produce the first phase result. Repeating this process in the second phase and accumulating the outcome with the first phase result through the 4-way FLP adder in E5 stage completes the MAT operation. The 4-way FLP adder in E4 stage involved in these two phases and the final 4-way FLP adder in E5 stage implement the 12 FLP adders required for the MAT. This two-phase implementation results in the half-cycle throughput of the MAT on this unit.

Fig. 6. Two-phase matrix-vector multiplication.

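The two-phase dataflow can be sketched as follows (Python, positive operands only; the matrix columns are assumed pre-converted to $\log_2$ offline, and the per-phase hardware reuse is modeled simply as two loop iterations):

```python
import math

def column_scale(log_col, xj):
    """One column-times-scalar product in HNS: add log2(xj) to the
    pre-converted log2 coefficients, then antilog-convert (ALOGC)."""
    lx = math.log2(xj)
    return [2.0 ** (lc + lx) for lc in log_col]

def matvec_two_phase(log_cols, x):
    """4x4 matrix-vector product, two columns per phase.
    log_cols[j][i] holds log2(c_ij)."""
    acc = [0.0] * 4
    for phase in (0, 1):
        j = 2 * phase
        p0 = column_scale(log_cols[j], x[j])
        p1 = column_scale(log_cols[j + 1], x[j + 1])
        # PACC as 4-way SIMD FLP adder, then the E5 FLP accumulation
        acc = [a + u + v for a, u, v in zip(acc, p0, p1)]
    return acc
```

Real hardware additionally tracks sign bits outside the logarithmic representation, which this sketch omits.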

B. Vector Operations

All the vector SIMD operations (VEC) such as vector addition, multiplication, division, multiply-and-add, etc. listed in Table 1 can be expressed as a single generic operation, Eq. (5).

(5)
$\left(x_{i} \otimes y_{i}^{p} \oplus z_{i}\right)_{i \in\{0,1,2,3 \}}=\left(\overrightarrow{2}^{\log _{2} x_{i} \oplus \left(\log _{2} y_{i} \gg q\right)} \oplus z_{i}\right)_{i \in\{0,1,2,3\}}$

where $\otimes \in\{\times, \div\}$, $\oplus \in\{+,-\}$, $p \in\{0.5,1\}$, $q \in\{0,1\}$. The operator $\otimes$ and the power $p$ are converted into $\oplus$ and a right shift by $q$, respectively, in the logarithmic domain.

From Eq. (5), the vector SIMD operations require a pair of 4-way logarithmic conversions, which can be prepared with the 4 LOGCs in E1 stage and the PCNV in E2 stage programmed into 4 LOGCs. The shift and $\oplus$ operations in the logarithmic domain are implemented with the 4 shifters in E2 stage and the 4 CPAs in E3 stage, respectively. The results of the logarithmic operations are converted into floating-point numbers through the 4 ALOGCs in E3 stage, and the final $\oplus$ is implemented by programming the PACC in E4 stage into a 4-way SIMD FLP adder.
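One lane of the generic operation in Eq. (5) can be sketched as follows (Python, positive inputs, ideal precision; the flag encoding is illustrative, and $p = 0.5$ is modeled as halving the logarithm, the software analogue of the 1-bit right shift):

```python
import math

def vec_lane(x, y, z, mul=True, p=1.0, add=True):
    """One lane of Eq. (5): x (*|/) y**p (+|-) z."""
    ly = math.log2(y) * p              # p = 0.5 halves the log (>> 1)
    l = math.log2(x) + ly if mul else math.log2(x) - ly
    m = 2.0 ** l                       # ALOGC back to floating point
    return m + z if add else m - z
```

For example, `vec_lane(2.0, 9.0, 1.0, True, 0.5, True)` evaluates 2·√9 + 1 = 7, covering the mul-and-add and div-by-sqrt style variants with the same datapath.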

Vector SIMD lerp (LRP) in Eq. (6) can be realized with the 4 LOGCs in E1 stage and the PCNV in E2 stage programmed into 4 LOGCs to implement the pair of 4-way logarithmic conversions in Eq. (6). This vector SIMD hardware is augmented with 4 FLP adders in E1 stage to implement the required 4-way FLP subtraction in the HNS term $\log _{2}\left(z_{i}-y_{i}\right)_{i \in\{0,1,2,3\}}$. The 4 CPAs in E3 stage implement the 4-way addition in the logarithmic domain, and the result goes through the 4 ALOGCs in E3 stage realizing the 4-way FLP multiplication. The final 4-way FLP addition can be carried out with the PACC in E4 stage programmed into a 4-way SIMD FLP adder.

(6)
$\left(x_{i}\left(z_{i}-y_{i}\right)+y_{i}\right)_{i \in\{0,1,2,3\}}=\left(\overrightarrow{2}^{\log _{2} x_{i}+\log _{2}\left(z_{i}-y_{i}\right)}+y_{i}\right)_{i \in\{0,1,2,3\}}$
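A per-lane sketch of Eq. (6) (Python; this toy version requires $z_{i} > y_{i}$ so the subtraction result is positive for the logarithmic conversion, whereas the hardware handles signs separately):

```python
import math

def hns_lerp(x, y, z):
    """Lerp x*(z - y) + y per lane; the subtraction z - y uses the
    E1 FLP adders, the multiply is a log-domain addition."""
    return [2.0 ** (math.log2(xi) + math.log2(zi - yi)) + yi
            for xi, yi, zi in zip(x, y, z)]
```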

The dot product (DOT) given in Eq. (7) is also implemented with the 4 LOGCs in E1 stage and the PCNV in E2 stage programmed into 4 LOGCs to realize the pair of 4-way logarithmic conversions in Eq. (7). The 4 CPAs in E3 stage implement the 4-way addition in the logarithmic domain, and the result goes through the 4 ALOGCs in E3 stage, producing the 4-way FLP multiplication result. The PACC in this case is programmed into a single FLP summation tree in E4 stage for the final summation of the 4-way multiplication result.

(7)
$\sum_{i=0}^{i=3} x_{i} \times y_{i}=\sum_{i=0}^{i=3} 2^{\log _{2} x_{i}+\log _{2} y_{i}}$
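A sketch of Eq. (7), with the four multiplications done in the logarithmic domain and the final reduction standing in for the PACC's FLP summation tree (positive inputs, ideal precision):

```python
import math

def hns_dot(x, y):
    prods = [2.0 ** (math.log2(a) + math.log2(b))  # log-domain muls
             for a, b in zip(x, y)]
    return sum(prods)                              # FLP summation tree
```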

Finally, the cross-product (CRS) in Eq. (8) requires 6 LOGCs since 6 different operands (i.e. $x_{0}$, $x_{1}$, $x_{2}$, $y_{0}$, $y_{1}$ and $y_{2}$) are involved in the 6 multiplications in CRS. It also requires 6 adders and 6 ALOGCs to realize the multiplications in logarithmic arithmetic. These 6 LOGCs and 6 ALOGCs can be obtained from the 4 LOGCs in E1 stage and 4 ALOGCs in E3 stage together with the PCNV programmed into 2 LOGCs and 2 ALOGCs. The 6 adders can be obtained from 2 CPAs in E1 stage together with 4 CPAs in E3 stage. The final 3-way FLP subtraction between the products can be realized with the PACC in E4 stage programmed into a SIMD FLP adder.

(8)
$\left[ \begin{array}{l}{x_{1} y_{2}-y_{1} x_{2}} \\ {x_{2} y_{0}-y_{2} x_{0}} \\ {x_{0} y_{1}-y_{0} x_{1}}\end{array}\right]=\left[ \begin{array}{c}{2^{\log _{2} x_{1}+\log _{2} y_{2}}-2^{\log _{2} y_{1}+\log _{2} x_{2}}} \\ {2^{\log _{2} x_{2}+\log _{2} y_{0}}-2^{\log _{2} y_{2}+\log _{2} x_{0}}} \\ {2^{\log _{2} x_{0}+\log _{2} y_{1}}-2^{\log _{2} y_{0}+\log _{2} x_{1}}}\end{array}\right]$

C. Elementary Functions

Logarithm with an arbitrary base (LOG) can be realized with the 2 LOGCs in E1 stage along with the PCNV programmed into 2 LOGCs in E2 stage to make up a pair of cascaded LOGCs required for the HNS terms $\log _{2}\left(\log _{2} y\right)$ and $\log _{2}\left(\log _{2} x\right)$ in Eq. (9). The subtraction in the logarithmic domain between these two HNS terms is implemented with a CPA in E3 stage, and the result goes through an ALOGC in E3 stage completing the LOG operation.

(9)
$\log _{x} y=\log _{2} y / \log _{2} x=2^{\log _{2}\left(\log _{2} y\right)-\log _{2}\left(\log _{2} x\right)}$
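Eq. (9) can be checked with a small sketch (valid for $x, y > 1$ so the outer logarithms are defined):

```python
import math

def log_base(x, y):
    # Eq. (9): log_x(y) = 2**(log2(log2 y) - log2(log2 x))
    return 2.0 ** (math.log2(math.log2(y)) - math.log2(math.log2(x)))
```

For example, `log_base(2.0, 8.0)` gives 3, using only cascaded logarithmic conversions, one subtraction, and one antilogarithmic conversion.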

Power function (POW) and trigonometric functions (TRG) are also implemented with proper programming of the PCNV and PACC as described in subsections II.2 and II.3. The number of blocks required to implement each operation is summarized in Table 2, and configuration of the multifunction unit for each operation is illustrated in Fig. 7.

Table 2. Block usages for each operation in the multifunction unit

Operation | LOGC | ALOGC | FLP adder | FLP sum tree | Const. mul.
Matrix-vector multiplication (MAT) | 2/phase | 8/phase | 0 | 0 | 0
Vector mul, mad, div, sqrt, etc. (VEC) | 8 | 4 | 4 | 0 | 0
Vector lerp (LRP) | 8 | 4 | 8 | 0 | 0
Vector dot-product (DOT) | 8 | 4 | 0 | 1 | 0
Vector cross-product (CRS) | 6 | 6 | 3 | 0 | 0
Trigonometric functions (TRG) | 1 | 4 | 0 | 1 | 4
Power (POW) | 3 | 2 | 0 | 0 | 0
Logarithm with variable base (LOG) | 4 | 1 | 0 | 0 | 0

Fig. 7. Configurations of the proposed multifunction unit for each category of operation.


III. EVALUATION RESULTS

The propagation delay of the proposed multifunction unit is evaluated using the technology-independent model proposed in [7], in which delays are expressed in terms of FO4 delays. The delay estimates for the building blocks are listed in Table 3 based on the values presented in [7]. According to these values, the delay estimates for the proposed PCNV and PACC, and for the PMUL and PADD from the previous work [4], are evaluated as follows:

Table 3. Delay and area estimates for main building blocks

Component | Delay ($fo_{4}$) | Area ($fa$)
1-bit 2:1 MUX | 1.3 | 0.33
1-bit shifter | 6.5 | 1.66
1-bit 3:2 CSA | 3.2 | 1
1-bit 4:2 CSA | 4.3 | 2
N-bit radix-4 CPA | $1.8 \times\left(2+\left\lceil\log _{4} N\right\rceil\right)$ | $1.2 N+0.6 \sum_{k=0}^{\left\lceil\log _{4} N\right\rceil-1}\left(N-4^{k}\right)$
1-kbit LUT | 6.4 | 89

(10)
$\tau_{PCNV}=\tau_{lut\_log}\left(1 lut_{log}\right)+\tau_{shift\_log}\left(3.9 fo_{4}\right)+2 \times \tau_{mux2:1}\left(1.3 fo_{4}\right)$

$+\tau_{csa3:2}\left(3.2 fo_{4}\right)+\tau_{csa4:2}\left(4.3 fo_{4}\right)+\tau_{cpa32b}\left(9 fo_{4}\right)$

$=1 lut_{log}+23 fo_{4}$

$\tau_{PACC}=\tau_{swap}\left(1.3 fo_{4}\right)+\tau_{shift\_align}\left(6.5 fo_{4}\right)+\tau_{mux2:1}\left(1.3 fo_{4}\right)$

$+\tau_{csa3:2}\left(3.2 fo_{4}\right)+\tau_{csa4:2}\left(4.3 fo_{4}\right)+\tau_{cpa30b}\left(9 fo_{4}\right)+\tau_{abs30b}\left(6 fo_{4}\right)+\tau_{shift\_norm}\left(6.5 fo_{4}\right)$

$=38.1 fo_{4}$

(11)
$\tau_{PMUL}=\tau_{lut\_log}\left(1 lut_{log}\right)+\tau_{shift\_log}\left(3.9 fo_{4}\right)+\tau_{mux2:1}\left(1.3 fo_{4}\right)$

$+\tau_{csa4:2}\left(4.3 fo_{4}\right)+\tau_{mux3:1}\left(2 fo_{4}\right)+\tau_{csa4:2}\left(4.3 fo_{4}\right)$

$+\tau_{mux3:1}\left(2 fo_{4}\right)+\tau_{csa4:2}\left(4.3 fo_{4}\right)+\tau_{mux2:1}\left(1.3 fo_{4}\right)$

$+\tau_{csa4:2}\left(4.3 fo_{4}\right)+\tau_{cpa38b}\left(9 fo_{4}\right)$

$=1 lut_{log}+36.7 fo_{4}$

$\tau_{PADD}=\tau_{mux2:1\_swap}\left(1.3 fo_{4}\right)+\tau_{shift\_align}\left(6.5 fo_{4}\right)+\tau_{cpa29b}\left(9 fo_{4}\right)$

$+\tau_{mux2:1}\left(1.3 fo_{4}\right)+\tau_{cpa29b}\left(9 fo_{4}\right)+\tau_{mux2:1}\left(1.3 fo_{4}\right)$

$+\tau_{cpa30b}\left(9 fo_{4}\right)+\tau_{abs30b}\left(6 fo_{4}\right)+\tau_{shift\_norm}\left(6.5 fo_{4}\right)$

$=49.9 fo_{4}$

The delay estimates for the proposed PCNV and PACC in Eq. (10) show 33% and 24% reductions compared with those of the PMUL and PADD in Eq. (11), respectively, assuming the delay estimate for a 512-bit LUT is equivalent to 4.5 $fo_{4}$.
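The percentages can be reproduced directly from the component sums (taking the PADD total as 49.9 $fo_{4}$, the sum of its listed component delays, and the 512-bit LUT delay as 4.5 $fo_{4}$ per the text):

```python
LUT = 4.5                 # assumed 512-bit LUT delay, in fo4
pcnv = LUT + 23.0         # Eq. (10)
pacc = 38.1               # Eq. (10)
pmul = LUT + 36.7         # Eq. (11)
padd = 49.9               # sum of the PADD component delays
print(f"PCNV vs PMUL: {1 - pcnv / pmul:.0%}")   # 33%
print(f"PACC vs PADD: {1 - pacc / padd:.0%}")   # 24%
```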

We use the area model proposed in [8], also for a technology-independent comparison, in which areas are expressed in the number of full adders ($fa$), as adders are fundamental building blocks in arithmetic units. Area estimates for the individual components are also given in Table 3, and the evaluations for the proposed PCNV and PACC are given in Eq. (12). The areas for the PMUL and PADD from the previous work [4] are evaluated in Eq. (13) for comparison.

(12)
$\sigma_{PCNV}=4 \times (\sigma_{lut\_log}(1 lut_{log})+\sigma_{shift\_log}(142.6 fa)+\sigma_{lut\_alog}(1 lut_{alog})$

$+\sigma_{shift\_alog}(118.8 fa)+\sigma_{lut\_trg}(1 lut_{trg})+\sigma_{shift\_trg}(10.7 fa)$

$+4 \times \sigma_{mux2:1\_24b}(8 fa)+6 \times \sigma_{mux2:1\_32b}(10.7 fa))$

$=4 lut_{log}+4 lut_{alog}+4 lut_{trg}+2190.8 fa$

$\sigma_{PACC}=3 \times (\sigma_{mux2:1\_swap\_24b}(8 fa)+\sigma_{shift\_align\_24b}(40 fa)+\sigma_{cpa\_29b}(74.4 fa)$

$+\sigma_{lza\_29b}(46.7 fa)+\sigma_{abs\_29b}(37.2 fa)+\sigma_{shift\_norm\_24b}(48.3 fa))$

$+\sigma_{f2i}(108 fa)+\sigma_{lut\_alog}(1 lut_{alog})+\sigma_{shift\_alog}(118.8 fa)$

$+\sigma_{mux2:1\_swap\_24b}(8 fa)+\sigma_{shift\_align\_24b}(40 fa)+4 \times \sigma_{mux2:1\_30b}(10 fa)$

$+\sigma_{csa3:2\_30b}(30 fa)+\sigma_{csa4:2\_30b}(60 fa)+\sigma_{cpa\_30b}(77.4 fa)$

$+\sigma_{lza\_30b}(48.3 fa)+\sigma_{abs\_30b}(38.7 fa)+\sigma_{shift\_norm\_30b}(50 fa)$

$=1 lut_{alog}+1423 fa$

$\sigma_{PCNV}+\sigma_{PACC}=4 lut_{log}+5 lut_{alog}+4 lut_{trg}+3613.8 fa$

(13)
$\sigma_{PMUL}=4 \times (\sigma_{lut\_log}(1 lut_{log})+\sigma_{shift\_log}(142.6 fa)+\sigma_{lut\_alog}(1 lut_{alog})$

$+\sigma_{shift\_alog}(118.8 fa)+\sigma_{csa3:2\_38b}(38 fa)+\sigma_{csa4:2\_38b}(76 fa)$

$+\sigma_{cpa\_38b}(101.4 fa)+\sigma_{Booth\_enc}(673.5 fa)+48 \times \sigma_{mux2:1\_38b}(12.7 fa))$

$=4 lut_{log}+4 lut_{alog}+3190.3 fa$

$\sigma_{PADD}=2 \times (\sigma_{mux2:1\_swap\_24b}(8 fa)+\sigma_{shift\_align\_24b}(40 fa)+\sigma_{cpa\_29b}(74.4 fa)$

$+\sigma_{lza\_29b}(46.7 fa)+\sigma_{abs\_29b}(37.2 fa)+\sigma_{shift\_norm\_29b}(48.3 fa))$

$+2 \times (\sigma_{mux2:1\_swap\_24b}(8 fa)+\sigma_{shift\_align\_24b}(40 fa)+\sigma_{cpa\_30b}(77.4 fa)$

$+\sigma_{lza\_30b}(48.3 fa)+\sigma_{abs\_30b}(46.7 fa)+\sigma_{shift\_norm\_30b}(50 fa))$

$+3 \times \sigma_{mux2:1\_29b}(9.7 fa)+3 \times \sigma_{mux2:1\_30b}(10 fa)$

$=1109.1 fa$

$\sigma_{PMUL}+\sigma_{PADD}=4 lut_{log}+4 lut_{alog}+4299.4 fa$

As discussed in subsection II.3, there is an area reduction in the PCNV relative to the PMUL but an increase in the PACC relative to the PADD, as analyzed in Eqs. (12) and (13). However, comparing the combined area of the PCNV and PACC pair with that of the PMUL and PADD pair, the proposed pair shows a 13% reduction in area, assuming the area of a 512-bit LUT is equivalent to 59 $fa$. Therefore, the proposed unification architectures contribute to reductions in both the delay and area of the multifunction unit.

Accommodating the PCNV and PACC, the proposed multifunction unit was modeled at the structural level in Verilog HDL and synthesized using the Synopsys Design Compiler with a 0.11 μm CMOS standard cell library. Synthesis was conducted under 25°C, 1.2 V, typical corner operating conditions. The synthesis results demonstrate 93k NAND2 gates with a delay of 13.38 ns for the entire multifunction unit, as summarized in Table 4. The unit from the previous work [4] was also synthesized in parallel for comparison, and the results show 10% and 3.5% reductions in the delay and area of the proposed multifunction unit compared with the previous design [4].

Table 4. Synthesis results for the multifunction unit

Stage | Delay (ns), this work | Delay (ns), prior work [4] | Area (NAND2), this work | Area (NAND2), prior work [4]
E1 | 2.88 | 2.88 | 18.3k | 18.3k
E2 | 2.37 | 3.46 | 36.7k | 43.6k
E3 | 2.89 | 2.89 | 18.5k | 18.5k
E4 | 2.83 | 3.5 | 10.6k | 7.1k
E5 | 2.41 | 2.41 | 8.9k | 8.9k
Total | 13.38 | 14.94 | 93k | 96.4k

IV. CONCLUSION

A novel architecture for a wide-operation-set multifunction unit is presented for mobile and wearable GPU shaders. Our unit adopts the hybrid number system (HNS) of floating-point and logarithmic numbers for an efficient unification of 25 non-trivial arithmetic operations on a single arithmetic unit. Novel unification architectures are presented for the delay and area optimization of the programmable modules, i.e. the PCNV and PACC, in the multifunction unit. Evaluations of these modules are conducted based on technology-independent models for comparison purposes. The results show that the delays of the PCNV and PACC are reduced by 33% and 24% from those of the PMUL and PADD, respectively. The area estimate for the PCNV and PACC pair shows a 13% reduction from that of the PMUL and PADD pair. Based on these optimizations, the proposed multifunction unit adopting the PCNV and PACC demonstrates 10% and 3.5% reductions in delay and area compared with the previous work. Therefore, we conclude that the proposed unification architectures optimize the delay and area of the HNS multifunction unit for GPU shaders in mobile and wearable devices.

ACKNOWLEDGMENTS

This work was supported by research fund of Chungnam National University.

REFERENCES

[1] Khronos Group, OpenGL-ES 2.0, 3.0, http://www.khronos.org
[2] Lindholm E., Kilgard M. J., Moreton H., Aug. 2001, A User-Programmable Vertex Engine, Proc. SIGGRAPH 2001, pp. 149-158.
[3] Nam B. G., Kim H., Yoo H. J., Apr. 2008, Power and Area Efficient Unified Computation of Vector and Elementary Functions for Handheld 3D Graphics Systems, IEEE Trans. Computers, Vol. 57, No. 4, pp. 490-504.
[4] Nam B. G., Yoo H. J., May 2009, An Embedded Stream Processor Core based on Logarithmic Arithmetic for a Low-Power 3D Graphics SoC, IEEE J. Solid-State Circuits, Vol. 44, No. 5, pp. 1554-1570.
[5] Mitchell Jr. J. N., Aug. 1962, Computer Multiplication and Division Using Binary Logarithms, IRE Trans. Electronic Computers, Vol. 11, pp. 512-517.
[6] Lai F. S., Wu C. F. E., Aug. 1991, A Hybrid Number System Processor with Geometric and Complex Arithmetic Capabilities, IEEE Trans. Computers, Vol. 40, No. 8, pp. 952-962.
[7] Vazquez A., Bruguera J. D., Apr. 2011, Composite Iterative Algorithm and Architecture for q-th Root Calculation, Proc. 20th IEEE Symp. on Computer Arithmetic, pp. 52-61.
[8] Pineiro J. A., Oberman S. F., Muller J. M., Bruguera J. D., Mar. 2005, High-Speed Function Approximation Using a Minimax Quadratic Interpolator, IEEE Trans. Computers, Vol. 54, No. 3, pp. 304-318.

Author

Byeong-Gyu Nam

received his B.S. degree (summa cum laude) in computer engineering from Kyungpook National University, Daegu, Korea, in 1999, M.S. and Ph.D. degrees in electrical engineering and computer science from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2001 and 2007, respectively.

His Ph.D. work focused on low-power GPU design for wireless mobile devices.

In 2001, he joined the Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea, where he was involved in a network processor design for the InfiniBand™ protocol.

From 2007 to 2010, he was with Samsung Electronics, Giheung, Korea, where he worked on the world's first 1-GHz ARM Cortex™ microprocessor design.

Dr. Nam is currently with Chungnam National University, Daejeon, Korea, as an associate professor.

He is serving as a vice director of the System Design Innovation and Application Research Center (SDIA), KAIST and a member of steering committee of the IC Design Education Center (IDEC), KAIST.

His current interests include mobile GPU, machine learning processor, microprocessor, low-power SoC and embedded software.

He co-authored the book Mobile 3D Graphics SoC: From Algorithm to Chip (Wiley, 2010) and presented tutorials on mobile processor design at IEEE ISSCC 2012 and IEEE A-SSCC 2011.

He received the CNU Recognition of Excellent Professors in 2013 and the A-SSCC Distinguished Design Award in 2016.

He is serving as the Chair of Digital Architectures and Systems (DAS) subcommittee in ISSCC and a member of the TPC for IEEE ISSCC, IEEE A-SSCC, IEEE COOL Chips, VLSIDAT, ASP-DAC, and ISOCC.

He served as a Guest Editor of the IEEE Journal of Solid-State Circuits (JSSC) and is an Associate Editor of the IEIE Journal of Semiconductor Technology and Science (JSTS).