
  1. (Department of System Semiconductor Engineering, Sogang University, Seoul, Korea)



Keywords: Neural processing unit, sparse matrix, multiplier, weight pruning, hardware accelerator

I. Introduction

Neural networks have grown rapidly over recent years to achieve high accuracy on various tasks, including large language models, diffusion models, and multi-modal AI. However, because they require a significantly large number of computations, various compression techniques have been explored.

Weight pruning is one of the well-known approaches, in which less important connections between input/output neurons are deleted, typically achieving a 75-90% weight compression ratio. Such an approach reduces the number of multiply-accumulate (MAC) computations at the software level, but sparse matrix computation in hardware requires complex index-matching logic that degrades the performance improvement. SCNN [1] first convolves the sparse input matrix with sparse kernels without considering the indices of the non-zero items, and then computes the destination addresses for the partial sums (psums). In this stage, stalls frequently occur to handle multiple psums targeting the same psum memory bank. SNAP, SparTen, and SPRITE [2-4] match the non-zero inputs with the corresponding non-zero weights using index-matching units built from comparators. Afterwards, the matched input/weight pairs are sent to the MAC array for computation. However, the index-matching units generate a different number of matched input/weight pairs depending on the matrix density. For example, if the matrix density is low, the comparator array for index-matching generates only a few matched pairs. On the other hand, with a higher matrix density, the comparators produce more matched input/weight pairs. As a result, low matrix density leads to under-utilization of the MAC array because only a small number of matched inputs/weights are consumed by the backend computation, while high matrix density wastes comparator resources, considering that most of the matched pairs must wait to be fed to a small number of MAC units.

To relieve the computing bottleneck in the comparator/MAC array across all matrix densities, we propose a high-performance sparsity-aware NPU architecture with a reconfigurable comparator-multiplier scheme. Our contributions in this paper are as follows.

1) Our scheme dynamically adjusts the number of comparators and multipliers depending on the target matrix density.

2) The proposed reconfigurable comparator/multiplier circuit is fabricated as a test chip in a 28 nm CMOS technology.

Section II reviews previous sparsity-aware hardware accelerators. We introduce the proposed reconfigurable comparator-multiplier architecture in Section III. Section IV explains the experimental results, and we conclude this paper in Section V.

II. Preliminary: Sparsity-aware Hardware Accelerators

SCNN [1] exploited input and weight sparsity using the Cartesian product, thereby compressing the input/weight matrices and saving computational complexity and memory footprint. It first finds the non-zero elements in the matrices, and all the non-zero inputs are multiplied by all the non-zero weights. During the multiplication, SCNN does not consider the indices of the inputs/weights/psums. However, the most important aspect of sparse matrix computation is to precisely assign the result of multiplying each input/weight pair to the psum located at the corresponding address. Hence, the destination address is computed, and the psum is stored in the buffer at that address. However, multiple psums are frequently forwarded to the buffers at the same time in this stage. If multiple psums target the same memory bank, only one of them can be stored in the buffer; the other psums must wait until the bank is idle, which leads to traffic congestion in the writeback stage and under-utilization in the preceding multiplication stage.
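To make the flow concrete, the following Python sketch mimics this Cartesian-product style: all non-zero inputs are multiplied by all non-zero weights first, and the destination addresses are resolved only afterwards. The address and bank mapping used here are placeholders for illustration, not SCNN's actual scheme.

```python
from collections import Counter

def scnn_style_products(nz_inputs, nz_weights, num_banks=8):
    """Behavioral sketch (not SCNN's exact dataflow): multiply every non-zero
    input by every non-zero weight, then derive a destination (psum) address.
    Products mapping to the same bank must be written back serially."""
    products = []
    for in_idx, x in nz_inputs:           # Cartesian product of non-zero elements
        for w_idx, w in nz_weights:
            psum_addr = in_idx + w_idx    # placeholder destination address
            products.append((psum_addr, x * w))
    per_bank = Counter(addr % num_banks for addr, _ in products)
    extra_writeback_cycles = sum(c - 1 for c in per_bank.values() if c > 1)
    return products, extra_writeback_cycles

nz_inputs  = [(0, 2), (3, -1), (5, 4)]
nz_weights = [(1, 3), (4, 2)]
prods, stalls = scnn_style_products(nz_inputs, nz_weights)
print(len(prods), "products,", stalls, "serialized writebacks in the worst case")
```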

To mitigate this throughput degradation, inner-product-based approaches [2,3] used an index-matching unit. The channel indices of the non-zero elements are fetched into the index holder, and all the input indices are compared against all the weight indices. If an input index equals a weight index, the matched non-zero input/weight data are sent to the backend multiplier. Such inner-product-based sparse matrix acceleration approaches do not suffer from backend stalls because the index-matching is performed in the pre-compute stage. However, the index-matching unit generates an inconsistent number of input/weight pairs depending on the matrix density, so it cannot effectively use the computing bandwidth. When the matrix density is low, the index-matching unit cannot generate a sufficient number of matched pairs for the multipliers. On the other hand, at high matrix density, a larger number of matched pairs is forwarded to the multipliers than the number of multiplications consumed in each clock cycle.
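The index-matching step can be summarized by the minimal Python sketch below, assuming compressed (channel index, value) pairs for the non-zero inputs and weights; the dictionary lookup stands in for the parallel comparator array.

```python
def index_match(nz_inputs, nz_weights):
    """Behavioral sketch of an inner-product index-matching unit.

    `nz_inputs` and `nz_weights` are lists of (channel_index, value) pairs for
    the non-zero elements; only pairs whose channel indices agree are forwarded
    to the backend multipliers."""
    weight_by_idx = {idx: w for idx, w in nz_weights}   # hardware: parallel comparators
    return [(x, weight_by_idx[idx]) for idx, x in nz_inputs if idx in weight_by_idx]

# Example: two matched pairs out of three non-zero inputs.
nz_inputs  = [(0, 3), (4, -1), (7, 2)]
nz_weights = [(4, 5), (6, 1), (7, -2)]
print(index_match(nz_inputs, nz_weights))   # [(-1, 5), (2, -2)]
```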

Conventional approaches, in which the number of index-matching comparators/multipliers is fixed at design time, show limited performance across various matrix densities. Hence, there is a pressing need for an architecture that adjusts the number of comparators/multipliers depending on the matrix density.

III. Proposed Architecture

This section introduces the proposed architecture. First, we explain the reconfigurable comparator-adder scheme. Using the reconfigurable circuit, we construct the index-matching units and multipliers, and the number of each module varies depending on the target layer sparsity. Furthermore, we fabricated a test chip for verification, and the details of the chip implementation are explained.

1. Reconfigurable Comparator/Adder

Fig. 1 shows the schematic of the proposed reconfigurable comparator-adder design. For simplicity of explanation, we assume 4-bit input data and a ripple-carry adder; however, other bit widths and adder styles can obviously be used in this design. The reconfigurable circuit first receives two operands (A[3:0], B[3:0]) and a comparator signal (Comp.). If Comp. is off, the reconfigurable circuit acts as a conventional adder computing the addition result (S[3:0]). On the other hand, if Comp. is on, the operand B is bitwise flipped, and the adder operates in subtractor mode. When the subtractor output (S[3:0]) is zero, the two operands are equal (A = B). Moreover, we can perform magnitude comparison as well as the equality comparison described above. If the sign bit of the subtraction is `1', A is larger than B (A > B); otherwise, A is smaller than or equal to B (A ${\leq}$ B). Using these output signals, we can generate the three comparison results (A < B, A = B, and A > B).

Fig. 1. Proposed reconfigurable comparator-adder circuit.

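As an informal reference model, the Python sketch below mimics the behavior described above for 4-bit operands. The assumption that the Comp. signal also drives the adder carry-in (so that compare mode computes a two's-complement subtraction A - B) is ours for illustration and is not taken from Fig. 1.

```python
def reconf_comp_adder(a: int, b: int, comp: bool, width: int = 4):
    """Behavioral model of the reconfigurable comparator/adder.

    Assumption (ours): Comp. both flips operand B bitwise and sets the
    carry-in, so compare mode performs the subtraction A - B."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    b_eff = (~b & mask) if comp else b   # bitwise flip of B in compare mode
    cin = 1 if comp else 0               # carry-in assumed tied to Comp.
    total = a + b_eff + cin
    s = total & mask                     # sum output S[width-1:0]
    cout = (total >> width) & 1          # carry-out of the ripple-carry chain

    if not comp:
        return {"mode": "add", "sum": s, "carry_out": cout}
    # Compare mode: S == 0 means A == B; for unsigned operands the carry-out is 1
    # exactly when A >= B (the actual circuit derives A > B / A < B from the
    # flags shown in Fig. 1).
    return {"mode": "compare", "equal": s == 0, "a_ge_b": bool(cout), "a_lt_b": not cout}

print(reconf_comp_adder(0b0101, 0b0011, comp=False))  # add mode: 5 + 3 = 8
print(reconf_comp_adder(0b0101, 0b0101, comp=True))   # compare mode: A == B
print(reconf_comp_adder(0b0010, 0b0111, comp=True))   # compare mode: A < B
```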

2. Index-matching Unit and Multiplier

So far, we have introduced the reconfigurable comparator/adder scheme, and we now expand the design to the implementation of the index-matching unit and the multiplier. Fig. 2 illustrates the reconfigurable index-matching unit and multiplier.

First, the index-matching unit consists of the index holder and comparators. The index holder keeps the indices of the non-zero inputs/weights. The comparator array receives the non-zero indices from the index holder and matches the inputs with the weights. If an input index matches one of the weight indices, the corresponding non-zero input/weight pair is forwarded to the backend multipliers. Note that the number of matched pairs varies depending on the input/weight matrix sparsity. Hence, we change the amount of hardware resource allocated to index-matching depending on the sparsity. At low matrix density (Fig. 2(a)), only a few matched input/weight pairs are generated. For instance, when the matrix density is 0.1, 0.1${\times}$n input/weight pairs are generated on average (n is the number of non-zero values for the index comparison), so only 0.1${\times}$n multipliers are required for the computation. On the other hand, at a density of 0.9 (Fig. 2(b)), 0.9${\times}$n pairs are matched on average, and 0.9${\times}$n multipliers are needed, which is larger than in the lower-density case.
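This resource split can be summarized with a small back-of-the-envelope sketch; the proportional rule below is a hypothetical illustration of the idea, not the exact allocation policy used in the design.

```python
def split_reconfigurable_units(n_units: int, density: float):
    """Hypothetical allocation rule: with matrix density d, roughly d*n of the n
    compared positions yield matched pairs per cycle, so about that many units
    are configured for multiplication and the rest remain comparators.
    (In the actual design a multiplier is built from several adder-mode circuits.)"""
    multipliers = max(1, round(n_units * density))
    comparators = n_units - multipliers
    return comparators, multipliers

for d in (0.1, 0.5, 0.9):
    comps, mults = split_reconfigurable_units(32, d)
    print(f"density {d}: {comps} comparators, {mults} multiplier units")
```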

After the comparators are allocated to the index-matching unit, the remaining reconfigurable comparator/adder circuits are used for the multiplier module. The basic multiplication method in digital circuits first generates partial products by multiplying the multiplicand by each multiplier bit, and then accumulates the partial products using adders. Based on this approach, we build the multipliers from multiple reconfigurable adders, as sketched below.
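A minimal software model of this multiplier mode is shown below: partial products are formed by gating the multiplicand with each multiplier bit, and each accumulation corresponds to a reconfigurable circuit operating as a plain adder.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, width: int = 8) -> int:
    """Shift-and-add multiplication: generate partial products from the
    multiplier bits and accumulate them with adders (each '+' below maps onto
    a reconfigurable circuit in adder mode)."""
    product = 0
    for i in range(width):
        if (multiplier >> i) & 1:                    # partial product for bit i
            product = product + (multiplicand << i)  # accumulate with an adder
    return product

assert shift_add_multiply(0x5A, 0x3C) == 0x5A * 0x3C   # 90 * 60 = 5400
```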

As a result, our reconfigurable circuit can change the number of comparators and multipliers depending on the target matrix densities (Fig. 3).

Fig. 2. Index-matching unit and multipliers: (a) Low-density case; (b) high-density case.


Fig. 3. Reconfiguration of the proposed circuit: (a) at lower density, more comparators are used than multipliers; (b) at higher density, more adders are used than comparators.


3. Top-level Architecture and Test-chip Implementation

The top-level architecture of this design has 8 processing elements (PEs), an input/weight global SRAM buffer, I/O blocks, and a controller (Fig. 4(a)). The inputs are broadcast to all the PEs for data reuse, and weights are dedicated to each PE.

A PE (Fig. 4(b)) has register files for non-zero inputs and weights, and index-aligning logic assigns incoming non-zero data to the index holder, which stores the non-zero indices. Each PE contains 7 computing units (CUs) and one 16-bit multiplier. Each CU has 32 8-bit reconfigurable comparator/adder circuits, so a CU can execute 32 8-bit comparisons in parallel or compute 16 16-bit additions, which are in turn configured into a 16-bit multiplier. When performing dense matrix multiplication, we do not require the comparators; in this case, all the reconfigurable circuits are used for 16-bit multiplication without computing any comparison operations.
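As a rough resource model, the sketch below follows the CU configuration described above (32 8-bit reconfigurable circuits per CU); the assumption that two 8-bit circuits chain into one 16-bit adder is ours for illustration.

```python
RECONF_CIRCUITS_PER_CU = 32   # 8-bit reconfigurable comparator/adder circuits per CU

def configure_cu(comparators_8b: int):
    """Split one CU between 8-bit index comparators and 16-bit adders that feed
    the multiplier datapath. Chaining two 8-bit circuits into one 16-bit adder
    is an illustrative assumption, not a verified implementation detail."""
    assert 0 <= comparators_8b <= RECONF_CIRCUITS_PER_CU
    adders_16b = (RECONF_CIRCUITS_PER_CU - comparators_8b) // 2
    return {"comparators_8b": comparators_8b, "adders_16b": adders_16b}

print(configure_cu(0))    # dense mode: no comparators, 16 x 16-bit additions for multiplication
print(configure_cu(32))   # sparse mode: 32 parallel 8-bit index comparisons
```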

Fig. 5 shows the chip layout and die photograph of the proposed design. The test chip was fabricated in a 28 nm CMOS technology and operates at 200 MHz. Synopsys Design Compiler and IC Compiler were used for synthesis and place-and-route, and timing was analyzed using Synopsys PrimeTime. The design has 64 reconfigurable CUs, 36 KB of register files in the PEs, and 202 KB of global buffers, for 238.5 KB of on-chip memory in total.

Fig. 4. (a) Top-level architecture of our design; (b) a processing element (PE). GLB stands for global SRAM buffer.


Fig. 5. (a) Chip layout; (b) die photograph.


IV. Results

In this section, we evaluate the performance of the proposed design by comparing it with previous design approaches [2,3]. Considering that the designs have different amounts of hardware resources, such as MAC units and index-matching units, we adopted the hardware configuration of each baseline in our design for each comparison. We first evaluate the throughput across a variety of matrix densities, and then analyze real benchmarks.

Fig. 6(a) shows the throughput comparison between our design and SNAP [2]. In the SNAP architecture, each PE has a small number (3) of multipliers, and 3 PEs share a comparator array. To relieve under-utilization in the low-matrix-density region, SNAP used a 32${\times}$32 comparator array, thereby attaining 75% multiplier utilization for typical workloads in the high-density region. However, neural networks usually show different matrix densities across layers, so SNAP still suffers from much lower MAC utilization (30-35%) in the highly sparse region (density: 0.1) and wastes comparator resources in the dense region (density: 0.6-1.0). Meanwhile, our design can adjust the ratio between comparators and multipliers, so we can achieve optimal utilization of hardware resources across all matrix densities. As a result, our design shows 1.06-8.00${\times}$ higher throughput than the SNAP baseline.

In Fig. 6(b), we further compare our design to the SparTen hardware accelerator [3]. Similar to SNAP, SparTen uses a large number (128) of comparators for each multiplier to achieve high MAC utilization. However, such an approach leads to a high resource overhead for implementing the comparator array. As explained so far, we balance the number of comparators and multipliers to find the optimal computation cost. Thus, our design improves the throughput by 7-17${\times}$ compared to the SparTen baseline.

In both cases, the performance improvement over the baselines at density = 1.0 is dramatically higher than in the other density cases. For a fully dense matrix, we do not use any comparators, and all the reconfigurable comparator/adder circuits are set to adder mode for multiplication, because all the data are non-zero in this region.

In our analysis, the input/weight densities of AlexNet and VGG-16 are 0.69/0.41 and 0.61/0.33, respectively. The resulting throughput improvements on these benchmarks are 1.2-2.1${\times}$ and 12-14${\times}$ over the SNAP and SparTen baselines, respectively.

We also compare the computing area and power with SNAP. Our design shows 6%/4% larger area/power than SNAP because SNAP uses fixed multipliers, whereas our design adds area for the reconfigurable multiplier-comparator logic. However, in SNAP, 93% of the area comes from the comparator array and only 7% is for the multipliers, so the reconfigurable multiplier does not lead to a large area overhead. Similarly, our design has 3% larger area and 2% larger power consumption than the SparTen baseline, in which 3% of the area is occupied by the multipliers.

Fig. 6. Improvement of throughput over (a) SNAP; (b) SparTen.


V. Conclusion

In this paper, we introduced a reconfigurable comparator-multiplier architecture to design a high-performance sparsity-aware NPU across all matrix densities. We adjusted the number of comparators/multipliers depending on the workload, thereby maintaining high hardware-resource utilization in sparse neural network computation. We fabricated our hardware in a 28 nm CMOS technology, and experimental results show that we improved the throughput by 1.06-17.00${\times}$ compared to the baselines.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2022R1F1A1070414, 100%). The chip fabrication and EDA tool were supported by the IC Design Education Center (IDEC), Korea (Samsung 28nm CMOS).

References

[1] A. Parashar et al., "SCNN: An accelerator for compressed-sparse convolutional neural networks," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017, pp. 27-40.

[2] J.-F. Zhang et al., "SNAP: An efficient sparse neural acceleration processor for unstructured sparse deep neural network inference," IEEE Journal of Solid-State Circuits, vol. 56, no. 2, pp. 636-647, 2020.

[3] A. Gondimalla et al., "SparTen: A sparse tensor accelerator for convolutional neural networks," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019, pp. 151-165.

[4] S. Ryu et al., "SPRITE: Sparsity-aware neural processing unit with constant probability of index-matching," in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2021, pp. 663-666.
Sungju Ryu

Sungju Ryu is currently an Assistant Professor at Sogang University. He received the B.S. degree from Pusan National University, in 2015, and the Ph.D. degree from POSTECH in 2021. His current research interests include energy-efficient NPU and processing-in-memory.

Jae-Joon Kim

Jae-Joon Kim is currently a professor at Seoul National University, Seoul, South Korea. He received the B.S. and M.S. degrees in electronics engineering from Seoul National University, Seoul, South Korea, in 1994 and 1998, respectively, and the Ph.D. degree from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA, in 2004. From 2004 to 2013, he was a Research Staff Member with the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. He was a Professor at the Pohang University of Science and Technology, Pohang, South Korea from 2013 to 2021. His current research interests include the design of deep learning hardware accelerators, neuromorphic processors, hardware security circuits, and circuits for exploratory devices.