Byeong-Gyu Nam
(Department of Computer Science and Engineering, Chungnam National University, 99,
Daehak-ro, Yuseong-gu, Daejeon, 305-764, Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index Terms
Multifunction unit, hybrid number system, logarithmic arithmetic, 3D computer graphics, shader, mobile GPU
I. INTRODUCTION
Embedded 3D graphics APIs such as OpenGL-ES[1] define programmable graphics shaders to provide advanced graphics effects such as displacement mapping and per-pixel lighting. Accordingly, programmable vector processors called vertex and pixel shaders have been introduced into the graphics pipeline to provide these programmable shading effects[2]. These shaders are responsible for running a variety of graphics kernels that use matrix, vector, and elementary functions to simulate various kinds of advanced graphics effects. However, the limited operation set of traditional shaders does not directly support operations such as matrix-vector multiplication, vector division, and the power function, whose cycle counts are critical to the performance of the OpenGL transformation and lighting (TnL) operations.
Multifunction units for mobile shaders have been studied in [3] and [4] to support rich operation sets in power- and area-efficient ways. The work in [3] presented a unification of the vector and elementary functions on a single framework. However, it operated on fixed-point numbers, which are not compatible with modern graphics APIs that require floating-point operations[1]. Moreover, it dealt with the unification of vector and elementary functions only, without supporting the matrix operations that take the largest part in geometry processing. The work in [4] presented a unification of matrix, vector, and elementary functions on the floating-point data format for the first time, but its delay and area were not fully optimized due to the complications in the unification. Specifically, the unification of the large operation set incurred a long propagation delay and a large area overhead in the programmable modules introduced for the unification, i.e. the programmable multiplier (PMUL) and programmable adder (PADD) in the unit[4].
In this paper, we present a multifunction unit whose delay and area are optimized for mobile and wearable GPU shaders. The unit exploits logarithmic arithmetic[5] at its arithmetic core for an efficient unification of a large operation set, as in [4]. It unifies 25 non-trivial arithmetic operations on a single arithmetic unit, whose operation set is summarized in Table 1. We propose novel unification architectures that reduce the delay and area of the multifunction unit and, consequently, achieve reductions in the delay and area of the unit by 10% and 3.5%, respectively, compared with the previous study[4].
Table 1. Operation set of 25 arithmetic operations

| Category | Operations |
|---|---|
| Matrix operation | matrix-vector multiplication |
| Vector SIMD operations | add, sub, mul, div, sqrt, div-by-sqrt, mul-and-add, lerp |
| Vector product operations | dot-product, cross-product |
| Trigonometric functions | sin, cos, tan, sinh, cosh, tanh, asin, acos, atan, asinh, acosh, atanh |
| Elementary functions | power ($x^{y}$), logarithm ($\log_{x} y$) |
This paper is organized as follows. Section II describes the multifunction unit and the proposed unification architectures for the programmable modules. Section III evaluates the delay and area of the proposed unit. Finally, we conclude in Section IV.
II. MULTIFUNCTION UNIT
The basic organization of the multifunction unit is based on [4], but its delay and area are optimized by novel unification architectures proposed for the programmable modules in the unit. The unit unifies the matrix, vector, and elementary functions on a single four-way arithmetic hardware by supporting the 25 non-trivial arithmetic operations listed in Table 1. The hybrid number system (HNS)[6] of floating-point and logarithmic numbers is adopted in this unit to leverage its arithmetic efficiency: the floating-point inputs are converted into the logarithmic domain, where nonlinear operations become simple, and the results are restored to the floating-point domain, where linear additions and subtractions are carried out. Two novel programmable modules, i.e. the programmable converter (PCNV) and the programmable accumulator (PACC), are included in the unit for the unification of multiple operations. They can be programmed into a variety of functions to implement the target operations in more delay- and area-efficient ways than the PMUL and PADD used in [4]. In this section, we present the proposed architectures for the PCNV and PACC to optimize the delay and area of the multifunction unit.
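To make the HNS idea concrete, the following is a minimal behavioral sketch, not the hardware implementation: ideal log2/antilog2 conversions stand in for the LUT-based LOGC/ALOGC blocks, and the function names are illustrative only.

```python
import math

def logc(x: float) -> float:
    """Floating-point to logarithmic domain (ideal stand-in for the LOGC)."""
    return math.log2(x)

def alogc(l: float) -> float:
    """Logarithmic to floating-point domain (ideal stand-in for the ALOGC)."""
    return 2.0 ** l

def hns_mul(x, y):          # multiplication becomes an addition of logarithms
    return alogc(logc(x) + logc(y))

def hns_div(x, y):          # division becomes a subtraction of logarithms
    return alogc(logc(x) - logc(y))

def hns_div_by_sqrt(x, y):  # the 0.5 power is a 1-bit right shift in the log domain
    return alogc(logc(x) - 0.5 * logc(y))

print(hns_mul(3.0, 4.0), hns_div(8.0, 2.0), hns_div_by_sqrt(6.0, 4.0))  # 12.0 4.0 3.0
```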
1. Overall Architecture
The overall architecture of the proposed multifunction unit has five pipeline stages with four vector lanes, as depicted in Fig. 1. The logarithmic converters (LOGCs) and antilogarithmic converters (ALOGCs) for the HNS operations are adopted from [4], as shown in Fig. 2. Both the LOGC and ALOGC consist of an LUT for producing shift terms and an adder tree for accumulating them. Four 32-bit floating-point inputs are converted into logarithmic numbers through the 4 LOGCs in E1 stage. E2 stage includes the PCNV that can be programmed into 4 LOGCs, 4 ALOGCs, or 4 constant multipliers (CMULs) according to the target operation under execution. E3 stage is the logarithmic arithmetic domain that contains 4 carry propagate adders (CPAs) along with the 4 shifters in E2 stage to provide the arithmetic cores for the logarithmic operations. The logarithmic arithmetic results are converted to floating-point numbers through the 4 ALOGCs in E3 stage. The PACC in E4 stage can be programmed into a 4-way SIMD FLP adder, a 5-input FLP summation tree, or an ALOGC according to the target operation. The 4 FLP adders in E5 stage carry out the final accumulation of the two phase results from the matrix-vector multiplication. This pipeline architecture allows a single-cycle throughput with a maximum five-cycle latency for all the operations supported in Table 1 except the matrix-vector multiplication, which has a throughput of one result every two cycles with a six-cycle latency.
Fig. 1. Overall architecture of multifunction unit.
Fig. 2. HNS number converters (a) logarithmic converter, (b) antilogarithmic converter.
2. Programmable Converter
In previous work [4][4], a programmable multiplier (PMUL) was incorporated in E2 stage to implement the matrix,
vector, and elementary functions on a single arithmetic unit. However, the integration
of the power function requiring a 32b×24b multiplication in the logarithmic domain
as expressed in Eq. (1a) incurred a long propagation delay and a large area overhead in the PMUL since the
multiplication requires a large adder tree across the entire PMUL module as shown
in Fig. 3(a). Therefore, in this work, we propose a novel architecture to eliminate the multiplication
by transforming the expression of power function into Eq. (1b). This transformation eliminates the 32b×24b Booth multiplier in the PMUL, which results
in substantial reductions in delay and area of the module. In this new architecture,
the cascades of the 2 LOGCs and 2 ALOGCs are required instead to implement the transformation
presented in Eq. (1b).
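For reference, a reconstruction of the two forms of the power function consistent with the description above (the exact notation of Eqs. (1a) and (1b) may differ in the original) is:

$x^{y}=2^{y\times\log_{2}x}$ (1a)

$x^{y}=2^{2^{\log_{2}y+\log_{2}\left(\log_{2}x\right)}}$ (1b)

In Eq. (1b), the product $y\times\log_{2}x$ is itself evaluated in the logarithmic domain as $2^{\log_{2}y+\log_{2}\left(\log_{2}x\right)}$, so only cascaded conversions and an addition are required instead of a multiplication.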
Fig. 3. Comparison of programmable multiplier [4] (a) and proposed programmable converter (b).
In addition, the trigonometric functions represented by polynomial expansions like
Taylor series as in Eq. (2) required 32b×6b Booth multipliers in the PMUL[4] to implement the term $k_{i} \times \log _{2} x$ in the logarithmic domain, which
also incurred delay and area overhead.
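A representative form of the expansion, reconstructed from the constraints stated below and given here only for readability, is:

$f(x)\approx c_{0}x^{k_{0}}\oplus_{1}c_{1}x^{k_{1}}\oplus_{2}\cdots\oplus_{n}c_{n}x^{k_{n}},\qquad c_{i}x^{k_{i}}=2^{\log_{2}c_{i}+k_{i}\times\log_{2}x}$ (2)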
where $\oplus_{i} \in\{+,-\}$ , and $c_{i}$ and $k_{i}$ are positive real and integer
constants, respectively, and thus, the $\log _{2} c_{i}$’s are converted offline.
Therefore, in this work we exploit constant multipliers (CMULs) using just a few shifts and additions, as in Eq. (3), instead of the expensive 32b×6b Booth multipliers, since the $k_i$'s are just small integer constants.
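A form consistent with the parameters defined below (assuming the decomposition $k_{i}=2^{p_{i}}+2^{q_{i}}+s_{i}$) is:

$k_{i}\times\log_{2}x=\left(\log_{2}x\ll p_{i}\right)+\left(\log_{2}x\ll q_{i}\right)+s_{i}\times\log_{2}x$ (3)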
where shift amounts $p_{i}$ and $q_{i}$ are 1, 2, or 3 to compose a small integer
$k_{i}$, and $s_{i}$ is set to 0 for even values of $k_{i}$ or to 1 for odd values.
Fig. 4 shows the organization of the CMUL implementing Eq. (3). It consists of an LUT storing $p_{i}$, $q_{i}$ and $s_{i}$ per trigonometric function to produce the shift terms accordingly and an adder tree for accumulating them. This CMUL replaces the 32b×6b Booth multiplier, which brings substantial reductions in the delay and area of the module.
Fig. 4. Constant multiplier (CMUL).
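As an illustration of the CMUL principle, the following sketch multiplies a fixed-point value by a small constant using the two shifts and the conditional add described above; the decomposition $k_{i}=2^{p_{i}}+2^{q_{i}}+s_{i}$ and the variable names are assumptions made for this example, not the actual LUT contents.

```python
def cmul(value: int, p_i: int, q_i: int, s_i: int) -> int:
    """Constant multiplication by k_i = 2**p_i + 2**q_i + s_i via shifts and adds."""
    term0 = value << p_i            # first shift term produced from the LUT entry
    term1 = value << q_i            # second shift term produced from the LUT entry
    term2 = value if s_i else 0     # extra add term for odd constants
    return term0 + term1 + term2    # accumulated by the adder tree

# Example: k_i = 2**2 + 2**1 + 1 = 7, so cmul(v, 2, 1, 1) == 7 * v
assert cmul(100, 2, 1, 1) == 700
```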
The final summation $\oplus_{i}$ of the terms in Eq. (2) can be implemented by programming the PACC in E4 stage into an FLP summation tree, as explained in the next subsection. The first term $c_{0} x^{k_{0}}$ is directly fed to the summation tree through the augmented bias port as it is just a constant or simply $x$.
Eliminating the Booth multipliers for both the powering and trigonometric functions,
we propose a programmable converter (PCNV) in E2 stage as shown in Fig. 3(b) to replace the PMUL from the previous work [4]. The PCNV combines the LOGC, ALOGC, and CMUL by sharing the common adder tree and
thus can be programmed into 4 LOGCs, 4 ALOGCs, or 4 CMULs according to the target operation.
Each lane of the PCNV accommodates an $LUT_{LOGC}$, an $LUT_{ALOGC}$ and an $LUT_{TRG}$
to produce the shift terms for each block and a shared adder tree to accumulate the
terms realizing the LOGC, ALOGC and CMUL together in a lane. Now, the cascade of 2
LOGCs for the HNS term $\log _{2}\left(\log _{2} x\right)$ in Eq. (1b) can be realized with a LOGC in E1 stage together with this PCNV programmed into a
LOGC in E2 stage. The cascade of 2 ALOGCs in Eq. (1b) can be implemented using one ALOGC in E3 stage and the other one in E4 stage by programming
the PACC into an ALOGC, as will be illustrated in the next subsection. The constant multiplications in Eq. (2) can be implemented by programming the PCNV into 4 CMULs. Consequently, the PCNV eliminates
both the 32b×24b and 32b×6b Booth multipliers required for the powering and trigonometric
functions from the module. This PCNV can also be programmed into 4 LOGCs to obtain
the 8 LOGCs required for the vector operations together with the 4 LOGCs in E1 stage
or 4 ALOGCs to get the 8 ALOGCs for matrix-vector multiplication together with the
4 ALOGCs in E3 stage.
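The following behavioral sketch, included only to illustrate the sharing idea, models one PCNV lane as a selection among three LUT outputs followed by the shared adder tree; the LUT contents and the interface are placeholders rather than the actual design.

```python
from typing import Dict, List

def pcnv_lane(mode: str, lut_terms: Dict[str, List[int]]) -> int:
    """One PCNV lane: pick the shift terms of the active block, then accumulate.

    mode is one of "LOGC", "ALOGC", or "CMUL"; lut_terms maps each mode to the
    shift terms its LUT would produce for the current operand (placeholder data).
    """
    terms = lut_terms[mode]   # only one of LUT_LOGC / LUT_ALOGC / LUT_TRG is active
    return sum(terms)         # the shared adder tree accumulates the selected terms

# Programmed into a CMUL: the terms are shifted copies of the operand (k_i = 7 here).
print(pcnv_lane("CMUL", {"LOGC": [], "ALOGC": [], "CMUL": [100 << 2, 100 << 1, 100]}))
```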
3. Programmable Accumulator
In [4], the programmable adder (PADD) in E4 stage exploited the 4 CPAs to configure the PADD into a 4-way SIMD FLP adder or a single 5-input FLP summation tree according to the target operation. This approach incurred a long propagation delay in the PADD, as it takes three CPA delays to complete the 5-input summation, as shown in Fig. 5(a). Moreover, in this work, we need to incorporate an ALOGC into this PADD to realize the cascade of 2 ALOGCs for the power function, as discussed in the previous subsection. Therefore, we instead propose a programmable accumulator (PACC) accommodating a carry save adder (CSA) tree and an $LUT_{ALOGC}$, as shown in Fig. 5(b), to replace the long CPA tree and to incorporate an ALOGC to complete the power function.
Now, the final summation tree for the trigonometric functions and the final ALOGC
for the power function can be implemented together in the PACC by sharing the CSA
tree as depicted in Fig. 5(b). This architecture replaces the long CPA tree with a short CSA tree and thereby results
in a substantial reduction in the propagation delay. The resulting PACC can be programmed
into a 4-way SIMD FLP adder for matrix and vector operations, an FLP summation tree
for trigonometric functions and dot product, or an ALOGC for the power function.
Fig. 5. Comparison of programmable adder [4] (a) and proposed programmable accumulator (b).
The architectures proposed for the PCNV and the PACC in subsections II.2 and II.3 bring reductions in delay. In terms of area, however, the PCNV is smaller than the PMUL while the PACC is larger than the PADD, since eliminating the Booth multiplier from the PCNV required adding an ALOGC to the PACC. In section III, we present an analysis and comparison of the area estimates for the PCNV and PACC along with those for the PMUL and PADD to show that the proposed architectures eventually achieve an area reduction in the multifunction unit as well.
4. Operation Set Implementation
The basic schemes to implement the operation set of this multifunction unit are based on the descriptions presented in [4]. In this subsection, we briefly describe the implementation scheme for each operation to capture the idea behind the unification and the improvements over the previous work.
A. Matrix-Vector Multiplication
The geometry transformation in 3D graphics is computed by a multiplication of a 4×4 matrix with a 4-element vector, as expressed in Eq. (4), which requires 20 LOGCs, 16 adders, 16 ALOGCs, and 12 FLP adders in HNS arithmetic.
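With $m_{ij}$ denoting the matrix coefficients and $x_{j}$ the input vector elements (symbols assumed here for readability), the operation can be written as:

$y_{i}=\sum_{j=0}^{3}m_{ij}x_{j}=\sum_{j=0}^{3}2^{\log_{2}m_{ij}+\log_{2}x_{j}},\quad i\in\{0,1,2,3\}$ (4)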
This HNS matrix-vector multiplication (MAT) can be implemented in two phases on our
four-way arithmetic unit as illustrated in Fig. 6. The 16 coefficients for a transformation matrix can be pre-converted into the logarithmic
domain offline. Thus, the required number of LOGCs is reduced from 20 to 4, needed only for converting the 4-way input vector ($x_{0}$, $x_{1}$, $x_{2}$, $x_{3}$), and just 2 LOGCs are required per phase, which can be obtained from the LOGCs in E1 stage. The required
16 ALOGCs are also implemented in two phases i.e. 8 ALOGCs per phase by programming
the PCNV in E2 stage into 4 ALOGCs together with the 4 ALOGCs in E3 stage. The 16
adders for the multiplications in logarithmic arithmetic are prepared in two phases
as well i.e. 8 adders per phase using the 4 CPAs in E1 stage and the other 4 CPAs
in E3 stage. In the first phase of the MAT, the 4-way outcome of the 4 CPAs in E1
stage goes through the PCNV in E2 stage programmed into 4 ALOGCs and the other 4-way
result from the 4 CPAs in E3 stage goes through the 4 ALOGCs in E3 stage, which produces
two of 4-way FLP multiplication results. These two results are added together in E4
stage by programming the PACC into a 4-way SIMD FLP adder to produce the first phase
result. Repeating this process in the second phase and accumulating the outcome with
the first phase result through the 4-way FLP adder in E5 stage completes the MAT operation.
The 4-way FLP adder in E4 stage involved in these two phases and the final 4-way FLP
adder in E5 stage implement the 12 FLP adders required for the MAT. This two-phase implementation gives the MAT a throughput of one result every two cycles on this unit.
Fig. 6. Two-phase matrix-vector multiplication.
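A minimal behavioral sketch of this two-phase scheme, assuming ideal converters and matrix coefficients pre-converted to the logarithmic domain (phase 0 handling $x_{0}$, $x_{1}$ and phase 1 handling $x_{2}$, $x_{3}$), is shown below; it models only the dataflow, not the pipeline timing.

```python
import math

def mat_vec_hns(log2_m, x):
    """Two-phase HNS matrix-vector multiply: log2_m[i][j] = log2(m[i][j])."""
    log2_x = [math.log2(v) for v in x]                   # LOGCs (2 used per phase)
    y = [0.0] * 4
    for cols in ((0, 1), (2, 3)):                        # phase 0 and phase 1
        for i in range(4):                               # one vector lane per row
            partial = sum(2.0 ** (log2_m[i][j] + log2_x[j]) for j in cols)
            y[i] += partial                              # accumulate the phase results
    return y

m = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
log2_m = [[math.log2(c) for c in row] for row in m]
print(mat_vec_hns(log2_m, [1.0, 2.0, 4.0, 8.0]))         # ≈ [49, 109, 169, 229]
```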
B. Vector Operations
All the vector SIMD operations (VEC), such as vector addition, multiplication, division, multiply-and-add, etc., listed in Table 1 can be expressed as a single generic operation, Eq. (5).
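One form consistent with the parameter ranges stated below (the operand names $x_{i}$, $y_{i}$, $z_{i}$ are assumed here for readability) is:

$\left(x_{i}\otimes y_{i}^{\,p}\right)\oplus z_{i},\quad i\in\{0,1,2,3\}$ (5)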
where $\otimes \in\{\times, \div\}$, $\oplus \in\{+,-\}$, $p \in\{0.5,1\}$, and $q \in\{0,1\}$. The operator $\otimes$ and the raising to the power $p$ are converted into $\oplus$ and a shift by $q$ in the logarithmic domain, respectively.
From Eq. (5), the vector SIMD operations require a pair of 4-way logarithmic conversions, which can
be prepared with the 4 LOGCs in E1 stage and the PCNV in E2 stage programmed into
4 LOGCs. The shift and $\oplus$ operations in the logarithmic domain are implemented
with the 4 shifters in E2 stage and 4 CPAs in E3 stage, respectively. The results
from the logarithmic operations are converted into the floating-point numbers through
the 4 ALOGCs in E3 stage, and the final $\oplus$ is implemented by programming the
PACC in E4 stage into a 4-way SIMD FLP adder.
Vector SIMD lerp (LRP) in Eq. (6) can be realized with the 4 LOGCs in E1 stage and the PCNV in E2 stage programmed
into 4 LOGCs to implement the pair of 4-way logarithmic conversions in Eq. (6). This vector SIMD hardware is augmented with 4 FLP adders in E1 stage to implement
the required 4-way FLP subtraction in the HNS term $\log _{2}\left(z_{i}-y_{i}\right)_{i
\in\{0,1,2,3\}}$. The 4 CPAs in E3 stage implement the 4-way addition in the logarithmic
domain, and the result goes through the 4 ALOGCs in E3 stage realizing the 4-way FLP
multiplication. The final 4-way FLP addition can be carried out with the PACC in E4
stage programmed into a 4-way SIMD FLP adder.
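The lerp operation referenced as Eq. (6), reconstructed from the dataflow described above, is:

$\mathrm{lerp}_{i}=y_{i}+x_{i}\times\left(z_{i}-y_{i}\right),\quad i\in\{0,1,2,3\}$ (6)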
The dot product (DOT) given in Eq. (7) is also implemented with the 4 LOGCs in E1 stage and the PCNV in E2 stage programmed
into 4 LOGCs to realize the pair of 4-way logarithmic conversions in Eq. (7). The 4 CPAs in E3 stage implement the 4-way addition in the logarithmic domain, and
the result goes through the 4 ALOGCs in E3 stage producing 4-way FLP multiplication
result. The PACC in this case is programmed into a single FLP summation tree in E4
stage for the final summation of the 4-way multiplication result.
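The dot product referenced as Eq. (7) takes the usual form, with the multiplications carried out in the logarithmic domain:

$\mathbf{x}\cdot\mathbf{y}=\sum_{i=0}^{3}x_{i}y_{i}=\sum_{i=0}^{3}2^{\log_{2}x_{i}+\log_{2}y_{i}}$ (7)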
Finally, the cross-product (CRS) in Eq. (8) requires 6 LOGCs since 6 different operands (i.e. $x_{0}$, $x_{1}$, $x_{2}$, $y_{0}$,
$y_{1}$ and $y_{2}$) are involved in the 6 multiplications in CRS. It also requires
6 adders and 6 ALOGCs to realize the multiplications in logarithmic arithmetic. These
6 LOGCs and 6 ALOGCs can be obtained from the 4 LOGCs in E1 stage and 4 ALOGCs in
E3 stage together with the PCNV programmed into 2 LOGCs and 2 ALOGCs. The 6 adders
can be obtained from 2 CPAs in E1 stage together with 4 CPAs in E3 stage. The final
3-way FLP subtraction between the products can be realized with the PACC in E4 stage
programmed into a SIMD FLP adder.
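The cross product referenced as Eq. (8), which accounts for the 6 multiplications and the final 3-way subtraction, is the standard:

$\mathbf{x}\times\mathbf{y}=\left(x_{1}y_{2}-x_{2}y_{1},\;x_{2}y_{0}-x_{0}y_{2},\;x_{0}y_{1}-x_{1}y_{0}\right)$ (8)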
C. Elementary Functions
Logarithm with an arbitrary base (LOG) can be realized with the 2 LOGCs in E1 stage
along with the PCNV programmed into 2 LOGCs in E2 stage to make up a pair of cascaded
LOGCs required for the HNS terms $\log _{2}\left(\log _{2} y\right)$ and $\log _{2}\left(\log
_{2} x\right)$ in Eq. (9). The subtraction in the logarithmic domain between these two HNS terms is implemented
with a CPA in E3 stage, and the result goes through an ALOGC in E3 stage completing
the LOG operation.
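The variable-base logarithm referenced as Eq. (9) follows from the change-of-base identity, with the division realized as a subtraction in the logarithmic domain:

$\log_{x}y=\frac{\log_{2}y}{\log_{2}x}=2^{\log_{2}\left(\log_{2}y\right)-\log_{2}\left(\log_{2}x\right)}$ (9)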
Power function (POW) and trigonometric functions (TRG) are also implemented with proper
programming of the PCNV and PACC as described in subsections II.2 and II.3. The number
of blocks required to implement each operation is summarized in Table 2, and the configuration of the multifunction unit for each operation is illustrated in Fig. 7.
Table 2. Block usages for each operation in the multifunction unit

| Operation | LOGC | ALOGC | FLP adder | FLP sum tree | Const. mul. |
|---|---|---|---|---|---|
| Matrix-vector multiplication (MAT) | 2/phase | 8/phase | 0 | 0 | 0 |
| Vector mul, mad, div, sqrt, etc. (VEC) | 8 | 4 | 4 | 0 | 0 |
| Vector lerp (LRP) | 8 | 4 | 8 | 0 | 0 |
| Vector dot-product (DOT) | 8 | 4 | 0 | 1 | 0 |
| Vector cross-product (CRS) | 6 | 6 | 3 | 0 | 0 |
| Trigonometric functions (TRG) | 1 | 4 | 0 | 1 | 4 |
| Power (POW) | 3 | 2 | 0 | 0 | 0 |
| Logarithm with variable base (LOG) | 4 | 1 | 0 | 0 | 0 |

Fig. 7. Configurations of the proposed multifunction unit for each category of operation.
Fig. 7. Configurations of the proposed multifunction unit for each category of operation.
III. EVALUATION RESULTS
The propagation delay of the proposed multifunction unit is evaluated using the technology-independent model proposed in [7], in which delays are expressed in terms of FO4 delays. The delay estimates for the building blocks are listed in Table 3 based on the values presented in [7]. According to these values, the delay estimates for the proposed PCNV and PACC and for the PMUL and PADD from the previous work [4] are evaluated as follows:
Table 3. Delay and area estimates for main building blocks

| Component | Delay ($fo_{4}$) | Area ($fa$) |
|---|---|---|
| 1-bit 2:1 MUX | 1.3 | 0.33 |
| 1-bit shifter | 6.5 | 1.66 |
| 1-bit 3:2 CSA | 3.2 | 1 |
| 1-bit 4:2 CSA | 4.3 | 2 |
| N-bit radix-4 CPA | $1.8 \times\left(2+\left\lceil\log _{4} N\right\rceil\right)$ | $1.2 N+0.6 \sum_{k=0}^{\left\lceil\log _{4} N\right\rceil-1}\left(N-4^{k}\right)$ |
| 1-kbit LUT | 6/4 | 89 |
The delay estimates for the proposed PCNV and PACC in Eq. (10) show 33% and 24% reductions compared with those of the PMUL and PADD in Eq. (11), respectively, assuming the delay estimate for the 512-bit LUT is equivalent to 4.5 $fo_{4}$.
We also use the area model proposed in [8] for a technology-independent comparison, in which areas are presented in the number of full adders ($fa$), as adders are fundamental building blocks in arithmetic units. Area estimates for the individual components are also given in Table 3, and the evaluations for the proposed PCNV and PACC are given in Eq. (12). The areas of the PMUL and PADD from the previous work [4] are also evaluated, in Eq. (13), for comparison.
As discussed in subsection II.3, there is an area reduction in the PCNV compared to the PMUL but an increase in the PACC compared to the PADD, as analyzed in Eqs. (12) and (13). However, comparing the combined area of the PCNV and PACC pair with that of the PMUL and PADD pair, the proposed PCNV and PACC pair shows a 13% reduction in area, assuming the area of the 512-bit LUT is equivalent to 59 $fa$. Therefore, the proposed unification architectures contribute to reductions in both the delay and the area of the multifunction unit.
Accommodating the PCNV and PACC, the proposed multifunction unit was modeled at the structural level in Verilog HDL and synthesized using Synopsys Design Compiler with a 0.11 µm CMOS standard cell library. Synthesis was conducted under 25°C, 1.2 V, typical corner operating conditions. The synthesis results show 93k NAND2 gates with a delay of 13.38 ns for the entire multifunction unit, as summarized in Table 4. The unit from the previous work [4] was also synthesized for comparison, and the results show 10% and 3.5% reductions in the delay and area, respectively, of the proposed multifunction unit compared with those of the previous design [4].
Table 4. Synthesis results for the multifunction unit

| Stage | Delay (ns), this work | Delay (ns), prior work [4] | Area (NAND2), this work | Area (NAND2), prior work [4] |
|---|---|---|---|---|
| E1 | 2.88 | 2.88 | 18.3k | 18.3k |
| E2 | 2.37 | 3.46 | 36.7k | 43.6k |
| E3 | 2.89 | 2.89 | 18.5k | 18.5k |
| E4 | 2.83 | 3.5 | 10.6k | 7.1k |
| E5 | 2.41 | 2.41 | 8.9k | 8.9k |
| Total | 13.38 | 14.94 | 93k | 96.4k |
IV. CONCLUSION
A novel architecture for a wide operation set multifunction unit is presented for
mobile and wearable GPU shaders. Our unit adopts the hybrid number system (HNS) of
the floating-point and logarithmic numbers for an efficient unification of 25 non-trivial
arithmetic operations on a single arithmetic unit. Novel unification architectures
are presented for the delay and area optimization of the programmable modules i.e.
PCNV and PACC in the multifunction unit. Evaluations are conducted on these modules
based on technology-independent models for comparison purposes. Results show that the delays of the PCNV and PACC are reduced by 33% and 24% from those of the PMUL and PADD, respectively. The area estimate for the PCNV and PACC pair shows a 13% reduction from that of the PMUL and PADD pair. Based on these optimizations, the proposed multifunction unit adopting the PCNV and PACC demonstrates 10% and 3.5% reductions in delay and area compared with the previous work. Therefore, we conclude that the proposed unification
architectures optimize the delay and area of the HNS multifunction unit for GPU shaders
in mobile and wearable devices.
ACKNOWLEDGMENTS
This work was supported by research fund of Chungnam National University.
REFERENCES
[1] Khronos Group, OpenGL-ES 2.0, 3.0, http://www.khronos.org
[2] E. Lindholm, M. J. Kilgard, and H. Moreton, "A User-Programmable Vertex Engine," Proc. SIGGRAPH 2001, pp. 149-158, Aug. 2001.
[3] B. G. Nam, H. Kim, and H. J. Yoo, "Power and Area Efficient Unified Computation of Vector and Elementary Functions for Handheld 3D Graphics Systems," IEEE Trans. Computers, Vol. 57, No. 4, pp. 490-504, Apr. 2008.
[4] B. G. Nam and H. J. Yoo, "An Embedded Stream Processor Core Based on Logarithmic Arithmetic for a Low-Power 3D Graphics SoC," IEEE J. Solid-State Circuits, Vol. 44, No. 5, pp. 1554-1570, May 2009.
[5] J. N. Mitchell Jr., "Computer Multiplication and Division Using Binary Logarithms," IRE Trans. Electronic Computers, Vol. 11, pp. 512-517, Aug. 1962.
[6] F. S. Lai and C. F. E. Wu, "A Hybrid Number System Processor with Geometric and Complex Arithmetic Capabilities," IEEE Trans. Computers, Vol. 40, No. 8, pp. 952-962, Aug. 1991.
[7] A. Vazquez and J. D. Bruguera, "Composite Iterative Algorithm and Architecture for q-th Root Calculation," Proc. 20th IEEE Symp. on Computer Arithmetic, pp. 52-61, Apr. 2011.
[8] J. A. Pineiro, S. F. Oberman, J. M. Muller, and J. D. Bruguera, "High-Speed Function Approximation Using a Minimax Quadratic Interpolator," IEEE Trans. Computers, Vol. 54, No. 3, pp. 304-318, Mar. 2005.
Author

Byeong-Gyu Nam received his B.S. degree (summa cum laude) in computer engineering from Kyungpook National University, Daegu, Korea, in 1999, and his M.S. and Ph.D. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2001 and 2007, respectively. His Ph.D. work focused on low-power GPU design for wireless mobile devices. In 2001, he joined the Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea, where he was involved in a network processor design for the InfiniBand™ protocol. From 2007 to 2010, he was with Samsung Electronics, Giheung, Korea, where he worked on the world's first 1-GHz ARM Cortex™ microprocessor design.
Dr. Nam is currently with Chungnam National University, Daejeon, Korea, as an associate
professor.
He is serving as a vice director of the System Design Innovation and Application Research Center (SDIA) at KAIST and as a member of the steering committee of the IC Design Education Center (IDEC), KAIST.
His current interests include mobile GPU, machine learning processor, microprocessor,
low-power SoC and embedded software.
He co-authored the book Mobile 3D Graphics SoC: From Algorithm to Chip (Wiley, 2010)
and presented tutorials on mobile processor design at IEEE ISSCC 2012 and IEEE A-SSCC
2011.
He received the CNU Recognition of Excellent Professors in 2013 and the A-SSCC Distinguished
Design Award in 2016.
He is serving as the Chair of the Digital Architectures and Systems (DAS) subcommittee of the ISSCC and as a member of the TPC for IEEE ISSCC, IEEE A-SSCC, IEEE COOL Chips, VLSI-DAT, ASP-DAC, and ISOCC.
He served as a Guest Editor of the IEEE Journal of Solid-State Circuits (JSSC) and
is an Associate Editor of the IEIE Journal of Semiconductor Technology and Science
(JSTS).