High-performance Sparsity-aware NPU with Reconfigurable Comparator-multiplier Architecture
Sungju Ryu1,*
Jae-Joon Kim1
(Department of System Semiconductor Engineering, Sogang University, Seoul, Korea )
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index Terms
Neural processing unit, sparse matrix, multiplier, weight pruning, hardware accelerator
I. Introduction
Neural networks have grown rapidly in recent years to achieve high accuracy on various tasks, including large language models, diffusion models, and multi-modal AI. However, because of the enormous number of computations they require, various compression techniques have been explored.
Weight pruning is one of the best-known approaches: less important connections between input/output neurons are deleted, typically yielding a 75-90% weight compression ratio. Such pruning reduces the number of multiply-accumulate (MAC) computations at the software level, but sparse matrix computation in hardware requires complex index-matching logic, which limits the achievable performance improvement.
SCNN [1] first convolves the sparse input matrix with sparse kernels without considering the indices of the non-zero items, and then computes the destination addresses of the partial sums (psums). In this stage, stalls frequently occur when multiple psums target the same psum memory bank. SNAP, SparTen, and SPRITE [2-4] match non-zero inputs with the corresponding non-zero weights using index-matching units built from comparators. The matched input/weight pairs are then sent to the MAC array for computation. However, the index-matching units generate a varying number of matched input/weight pairs depending on the matrix density. For example, if the matrix density is low, the comparator array for index matching generates only a few matched pairs; with a higher matrix density, the comparators produce more matched input/weight pairs. As a result, low matrix density leads to under-utilization of the MAC array because only a small number of matched inputs/weights are consumed by the backend computation, while high matrix density wastes comparator resources because most of the matched pairs must wait to be fed into a small number of MAC units.
To relieve the computing bottleneck in the comparator/MAC array across all matrix densities, we propose a high-performance sparsity-aware NPU architecture with a reconfigurable comparator-multiplier scheme. Our contributions in this paper are as follows.
1) Our scheme dynamically adjusts the number of comparators and multipliers depending
on the target matrix density.
2) The proposed reconfigurable comparator/multiplier circuit is fabricated as a test-chip using a 28 nm CMOS technology.
Section II reviews previous sparsity-aware hardware accelerators. Section III introduces the proposed reconfigurable comparator-multiplier architecture. Section IV presents the experimental results, and Section V concludes this paper.
II. Preliminary: Sparsity-aware Hardware Accelerators
SCNN [1] exploits input and weight sparsity using a Cartesian product, thereby compressing the input/weight matrices and reducing both computational complexity and memory footprint. It first finds the non-zero elements of the matrices, and then multiplies all the non-zero inputs by all the non-zero weights. During this multiplication, SCNN does not consider the indices of the inputs, weights, or psums. However, the key requirement in sparse matrix computation is to precisely assign the product of each input/weight pair to the psum located at the corresponding address. Hence, the destination address is computed, and the psum is stored in the buffer at that address. In this stage, however, multiple psums frequently need to be forwarded to the buffers at the same time. If multiple psums target the same memory bank, only one of them can be stored in the buffer; the others must wait until the bank is idle, which leads to traffic congestion in the writeback stage and under-utilization in the preceding multiplication stage.
To mitigate this throughput degradation, inner-product-based approaches [2,3] use an index-matching unit. The channel indices of the non-zero elements are fetched into an index holder, and all input indices are compared with all weight indices. If an input index equals a weight index, the matched non-zero input/weight data are sent to the backend multiplier. Such inner-product-based sparse matrix acceleration approaches do not suffer from backend stalls because index matching is performed in the pre-compute stage. However, the index-matching unit generates an inconsistent number of input/weight pairs depending on the matrix density, so it cannot use the computing bandwidth effectively. When the matrix density is low, the index-matching unit cannot generate enough matched pairs for the multipliers. Conversely, at high matrix density, more matched pairs are forwarded to the multipliers than can be consumed in each clock cycle.
Conventional approaches, in which the number of index-matching comparators and multipliers is fixed at design time, therefore show limited performance across the various matrix densities. There is thus a pressing need for an architecture that adjusts the number of comparators/multipliers depending on the matrix density.
III. Proposed Architecture
This section introduces the proposed architecture. First, we explain the reconfigurable comparator-adder scheme. Using this reconfigurable circuit, we construct the index-matching units and multipliers, whose numbers vary depending on the target layer sparsity. Finally, we describe the details of the test-chip fabricated for verification.
1. Reconfigurable Comparator/Adder
Fig. 1 shows the schematic of the proposed reconfigurable comparator-adder design. For a simple explanation, we assume 4-bit input data and a ripple-carry adder; however, other bit widths and adder styles can obviously be used in this design. The reconfigurable circuit first receives two operands (A[3:0], B[3:0]) and a comparator signal (Comp.). If Comp. is off, the reconfigurable circuit acts as a conventional adder computing the addition result (S[3:0]). On the other hand, if Comp. is on, operand B is bitwise flipped and the adder operates in subtractor mode. When the subtractor output (S[3:0]) is zero, the two operands are equal (A = B). Moreover, the circuit can perform magnitude comparison as well as the equality comparison described above: if the sign bit of the subtraction is `1', A is larger than B (A > B); otherwise, A is smaller than or equal to B (A ≤ B). Using these output signals, we can generate the three comparison results (A < B, A = B, and A > B).
Fig. 1. Proposed reconfigurable comparator-adder circuit.
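To make the mode switching concrete, the following is a minimal behavioral sketch in Python of the reconfigurable comparator/adder described above; the explicit carry-in used to form the two's-complement subtraction and all function/variable names are our own illustrative assumptions, not the fabricated circuit.

```python
def reconfig_comp_adder(a: int, b: int, comp: bool, width: int = 4):
    """Behavioral sketch of the reconfigurable comparator/adder.

    comp == False: plain adder, returns the sum A + B.
    comp == True : B is bitwise inverted so the adder computes A - B
                   (two's complement; carry-in of 1 is assumed here), and
                   the zero/sign information of the result is used as the
                   comparison outcome.
    """
    mask = (1 << width) - 1
    if not comp:
        # Adder mode: S = A + B, truncated to the data width.
        return {"sum": (a + b) & mask}

    # Comparator mode: add the bitwise-flipped B plus carry-in 1,
    # which is equivalent to subtracting B in two's complement.
    diff = (a + ((~b) & mask) + 1) & mask
    return {
        "eq": diff == 0,   # A == B  <=>  subtraction result is zero
        "lt": a < b,       # magnitude results; in hardware these come from
        "gt": a > b,       # the sign/borrow information of the subtraction
    }

# The same datapath answers "3 + 5" in adder mode and
# "is 7 equal to 7?" in comparator mode.
print(reconfig_comp_adder(3, 5, comp=False))  # {'sum': 8}
print(reconfig_comp_adder(7, 7, comp=True))   # {'eq': True, 'lt': False, 'gt': False}
```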
2. Index-matching Unit and Multiplier
So far, we have introduced the reconfigurable comparator/adder scheme; we now extend the design to the implementation of the index-matching unit and the multiplier. Fig. 2 illustrates the reconfigurable index-matching unit and multiplier.
First, the index-matching unit consists of the index holder and comparators. The index holder stores the indices of the non-zero inputs/weights. The comparator array receives the non-zero indices from the index holder and matches inputs with weights. If an input index matches one of the weight indices, the corresponding non-zero input/weight pair is forwarded to the backend multipliers. Note that the number of matched pairs varies depending on the input/weight matrix sparsity. Hence, we change the amount of hardware resource allocated to index matching depending on the sparsity. At low matrix density (Fig. 2(a)), only a few matched input/weight pairs are generated. For instance, when the matrix density is 0.1, 0.1×n input/weight pairs are generated on average (where n is the number of non-zero values for the index comparison), so only 0.1×n multipliers are required for the computation. On the other hand, at a density of 0.9 (Fig. 2(b)), 0.9×n pairs are matched on average, and 0.9×n multipliers are needed, which is more than in the lower-density case.
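The following short Python sketch illustrates the index-matching idea under our own simplified model (a dictionary lookup standing in for the comparator array): only index-coincident non-zero input/weight pairs survive, so the number of pairs produced per cycle scales with the matrix density.

```python
def match_pairs(inputs, weights):
    """Compare non-zero input indices with non-zero weight indices and
    return the (input, weight) value pairs whose indices coincide.

    inputs / weights: lists of (index, value) tuples holding only the
    non-zero elements of one compressed row/column.
    """
    weight_by_index = {idx: val for idx, val in weights}
    matched = []
    for idx, in_val in inputs:
        if idx in weight_by_index:              # comparator array "hit"
            matched.append((in_val, weight_by_index[idx]))
    return matched

# Sparse case: few matches, so the backend multipliers would be under-utilized.
inputs  = [(0, 3), (5, 1), (9, 4)]
weights = [(5, 2), (12, 7)]
print(match_pairs(inputs, weights))   # [(1, 2)]
```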
After the comparators are allocated to the index-matching unit, the remaining reconfigurable comparator/adder circuits are used for the multiplier modules. The basic multiplication method in digital circuits first generates partial products by multiplying the multiplicand by each multiplier bit, and then accumulates the partial products using adders. Based on this approach, we build the multipliers from multiple reconfigurable adders.
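As an illustration of this partial-product approach, the textbook shift-and-add sketch below (our own simplification, not the actual CU datapath) shows how a multiplier can be composed purely from additions, i.e., from reconfigurable circuits configured as adders.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, width: int = 8) -> int:
    """Unsigned shift-and-add multiplication built from additions only.

    For every '1' bit of the multiplier, the suitably shifted multiplicand
    (a partial product) is accumulated by an adder. In the proposed CU,
    such additions would be carried out by reconfigurable comparator/adder
    circuits operating in adder mode.
    """
    product = 0
    for bit in range(width):
        if (multiplier >> bit) & 1:
            product += multiplicand << bit   # accumulate one partial product
    return product

assert shift_add_multiply(13, 11) == 13 * 11
```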
As a result, our reconfigurable circuit can change the number of comparators and multipliers
depending on the target matrix densities (Fig. 3).
Fig. 2. Index-matching unit and multipliers: (a) Low-density case; (b) high-density
case.
Fig. 3. Reconfiguration of proposed circuit: (a) At the lower density, more comparators
are used than multipliers; (b) At the higher density, more adders are used than comparators.
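To make the reconfiguration policy of Fig. 3 concrete, a simple density-proportional allocation rule is sketched below; this rule and its parameters are our own illustration of the idea, not the controller actually implemented, and for simplicity it counts a multiplier slot as one unit even though a real 16-bit multiplier is built from several reconfigurable adders.

```python
def allocate_units(total_units: int, density: float):
    """Split a pool of reconfigurable circuits between index-matching
    comparators and multipliers.

    At density d, roughly d*n of the n compared indices produce matched
    pairs per cycle, so about a fraction d of the pool is configured as
    multipliers and the rest stays in comparator mode.
    """
    multipliers = max(1, round(total_units * density))
    comparators = total_units - multipliers
    return comparators, multipliers

print(allocate_units(32, 0.1))   # sparse case: mostly comparators
print(allocate_units(32, 0.9))   # dense  case: mostly multipliers
```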
3. Top-level Architecture and Test-chip Implementation
The top-level architecture of this design has 8 processing elements (PEs), an input/weight global SRAM buffer, I/O blocks, and a controller (Fig. 4(a)). The inputs are broadcast to all the PEs for data reuse, and the weights are dedicated to each PE.
A PE (Fig. 4(b)) has register files for non-zero inputs and weights, and index-aligning logic assigns incoming non-zero data to the index holder, which contains the non-zero indices. Each PE also has 7 computing units (CUs) and one 16-bit multiplier. Each CU has 32 8-bit reconfigurable comparator/adder circuits, so it can execute 32 8-bit comparisons in parallel or compute 16 16-bit additions, which can in turn be configured into a 16-bit multiplier. When performing dense matrix multiplication, the comparators are not required; in this case, all the reconfigurable circuits are used for 16-bit multiplication without computing any comparison operations.
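As a back-of-the-envelope check of this configuration (our own arithmetic, using only the figures stated above), the per-PE peak resources in the two extreme modes are:

```python
# Per-PE compute resources, derived from the configuration in the text:
# 7 reconfigurable CUs plus one fixed 16-bit multiplier per PE.
CUS_PER_PE        = 7
CIRCUITS_PER_CU   = 32    # 8-bit reconfigurable comparator/adder circuits
FIXED_MULT_PER_PE = 1

max_comparisons_per_cycle = CUS_PER_PE * CIRCUITS_PER_CU        # 224 8-bit comparisons
max_multipliers           = CUS_PER_PE + FIXED_MULT_PER_PE      # 8 16-bit multipliers

print(max_comparisons_per_cycle, max_multipliers)   # 224 8
```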
Fig. 5 shows the chip layout and die photograph of the proposed design. The test-chip was fabricated in a 28 nm CMOS technology and operates at 200 MHz. Synopsys Design Compiler and IC Compiler were used for synthesis and layout, and timing was analyzed using Synopsys PrimeTime. The design has 64 reconfigurable CUs, 36 KB of register files in the PEs, and 202 KB of global buffers, holding 238.5 KB of on-chip memory in total.
Fig. 4. (a) Top-level architecture of our design; (b) a processing element (PE). GLB
stands for global SRAM buffer.
Fig. 5. (a) Chip layout; (b) die photograph.
IV. Results
In this section, we evaluate the performance of the proposed design by comparing it with previous design approaches [2,3]. Considering that the designs have different amounts of hardware resources such as MAC units and index-matching units, we matched the configuration of our design to that of each baseline for each comparison. We first evaluate the throughput over a variety of matrix densities, and then analyze real benchmarks.
Fig. 6(a) compares the throughput of our design and SNAP [2]. In the SNAP architecture, each PE has a small number (3) of multipliers, and 3 PEs share a comparator array. To relieve under-utilization in the low-density region, SNAP uses a 32×32 comparator array, attaining 75% multiplier utilization for a typical workload in the high-density region. However, a neural network usually shows different matrix densities across its layers, so SNAP still has much lower MAC utilization (30-35%) in the very sparse region (density: 0.1) and wastes comparator resources in the dense region (density: 0.6-1.0). Meanwhile, our design can adjust the ratio between comparators and multipliers, so we can achieve the optimal utilization of hardware resources across all matrix densities. As a result, our design shows 1.06-8.00× higher throughput than the SNAP baseline.
In Fig. 6(b), we further compare our design with the SparTen hardware accelerator [3]. Similar to SNAP, SparTen uses a large number (128) of comparators per multiplier to achieve high MAC utilization. However, such an approach incurs a high resource overhead for the comparator array. As explained above, we balance the number of comparators and multipliers to find the optimal computation cost. Thus, our design improves the throughput by 7-17× compared to the SparTen baseline.
In both cases, the performance improvement over the baselines at density = 1.0 is dramatically higher than in the other density cases. For a fully dense matrix, we do not use any comparators, and all the reconfigurable comparator/adder circuits are set to adder mode for multiplication, because all the data are non-zero in this region.
In our analysis, the input/weight densities of AlexNet and VGG-16 are 0.69/0.41 and 0.61/0.33, respectively. The resulting throughput improvements on these benchmarks are 1.2-2.1× over SNAP and 12-14× over SparTen.
We also compare computing area and power with SNAP. Our design has 6%/4% larger area/power than SNAP because SNAP uses fixed multipliers, whereas our reconfigurable multiplier-comparator logic adds area. However, in SNAP, 93% of the area comes from the comparator array and only 7% from the multipliers, so the reconfigurable multiplier does not incur a large area overhead. Similarly, our design has 3% larger area and 2% larger power consumption than the SparTen baseline, in which 3% of the area is occupied by the multipliers.
Fig. 6. Improvement of throughput over (a) SNAP; (b) SparTen.
V. Conclusion
In this paper, we introduced a reconfigurable comparator-multiplier architecture to build a high-performance sparsity-aware NPU that performs well across all matrix densities. We adjust the number of comparators/multipliers depending on the workload, thereby maintaining high hardware resource utilization in sparse neural network computation. We fabricated our hardware in a 28 nm CMOS technology, and experimental results show that the throughput improves by 1.06-17.00× compared to the baselines.
ACKNOWLEDGMENTS
This work was supported by the National Research Foundation of Korea (NRF) grant
funded by the Korea government (MSIT) (NRF-2022R1F1A1070414, 100%). The chip fabrication
and EDA tools were supported by the IC Design Education Center (IDEC), Korea (Samsung
28nm CMOS).
References
[1] A. Parashar et al., "SCNN: An accelerator for compressed-sparse convolutional neural networks," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 27-40.
[2] J.-F. Zhang et al., "SNAP: An efficient sparse neural acceleration processor for unstructured sparse deep neural network inference," IEEE Journal of Solid-State Circuits, vol. 56, no. 2, pp. 636-647, 2020.
[3] A. Gondimalla et al., "SparTen: A sparse tensor accelerator for convolutional neural networks," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 151-165.
[4] S. Ryu et al., "SPRITE: Sparsity-aware neural processing unit with constant probability of index-matching," in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2021, pp. 663-666.
Sungju Ryu is currently an Assistant Professor at Sogang University. He received
the B.S. degree from Pusan National University, in 2015, and the Ph.D. degree from
POSTECH in 2021. His current research interests include energy-efficient NPU and processing-in-memory.
Jae-Joon Kim is currently a professor at Seoul National University, Seoul, South
Korea. He received the B.S. and M.S. degrees in electronics engineering from Seoul
National University, Seoul, South Korea, in 1994 and 1998, respectively, and the Ph.D.
degree from the School of Electrical and Computer Engineering, Purdue University,
West Lafayette, IN, USA, in 2004. From 2004 to 2013, he was a Research Staff Member
with the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. He was a
Professor at the Pohang University of Science and Technology, Pohang, South Korea
from 2013 to 2021. His current research interests include the design of deep learning
hardware accelerators, neuromorphic processors, hardware security circuits, and circuits
for exploratory devices.