High-performance Sparsity-aware NPU with Reconfigurable Comparator-multiplier Architecture
            
Sungju Ryu1,*
                     Jae-Joon Kim1
               
                        (Department of System Semiconductor Engineering, Sogang University, Seoul, Korea)
                        
 
            
            
            Copyright © The Institute of Electronics and Information Engineers (IEIE)
            
            
            
            
            
               
                  
Index Terms
               
                Neural processing unit, sparse matrix, multiplier, weight pruning, hardware accelerator
             
            
          
         
            
                  I. Introduction	
               Neural networks have grown rapidly in recent years to achieve high accuracy
                  on various tasks, including large language models, diffusion models, and multi-modal
                  AI. However, because they require a significantly large number of computations, various
                  compression techniques have been explored.
               
               Weight pruning is one of the well-known approaches: less important connections
                  between input/output neurons are deleted, typically yielding a weight compression
                  ratio of 75-90%. Such an approach reduces the number of multiply-accumulate (MAC)
                  computations at the software level, but sparse matrix computation in hardware
                  requires complex index-matching logic, which degrades the performance improvement.
                  SCNN [1] first convolves the sparse input matrix with sparse kernels without any consideration
                  of the indices of non-zero items, and then computes the destination addresses for the partial
                  sums (psums). In this stage, stalls frequently occur when multiple psums
                  target the same psum memory bank. SNAP, SparTen, and SPRITE [2-4] match non-zero inputs with the corresponding non-zero weights using index-matching
                  units built from comparators. Afterwards, the matched input/weight pairs are sent to
                  the MAC array for computation. However, the index-matching units generate
                  a different number of matched input/weight pairs depending on the matrix density.
                  For example, if the matrix density is low, the comparator array for index matching
                  generates only a few matched pairs. On the other hand, with higher matrix density,
                  the comparators produce more matched input/weight pairs. As a result, low matrix density
                  leads to under-utilization of the MAC array because only a small number of matched
                  inputs/weights are consumed by the backend computation, and high matrix density wastes
                  comparator resources because most of the matched pairs must wait to be fed to a small
                  number of MAC units.
               
               To relieve the computing bottleneck in the comparator/MAC array across all matrix
                  densities, we propose a high-performance sparsity-aware NPU architecture with a reconfigurable
                  comparator-multiplier scheme. Our contributions in this paper are as follows.
               
               1) Our scheme dynamically adjusts the number of comparators and multipliers depending
                  on the target matrix density.
               
               2) The proposed reconfigurable comparator/multiplier circuit is fabricated as a test chip
                  using a 28 nm CMOS technology.
               
               Section II reviews previous sparsity-aware hardware accelerators. We introduce
                  the proposed reconfigurable comparator-multiplier architecture in Section III. Section
                  IV presents the experimental results, and Section V concludes this paper.
               
             
            
                  II. Preliminary: Sparsity-aware Hardware Accelerators	
               SCNN [1] exploited input and weight sparsity using a Cartesian product, thereby compressing
                  the input/weight matrices and reducing the computational complexity and memory footprint.
                  It first finds the non-zero elements of the matrices, and all non-zero inputs
                  are multiplied by all non-zero weights. During the multiplication, SCNN does not
                  consider the indices of the inputs, weights, or psums. However, the most important requirement in
                  sparse matrix computation is to precisely assign the product of each input/weight
                  pair to the psum located at the corresponding address. Hence, the destination address
                  is computed afterwards, and the psum is stored in the buffer at the computed
                  address. In this stage, however, multiple psums frequently need to be forwarded to the buffers
                  at once. If multiple psums target the same memory bank, only one of
                  them can be stored in the buffer; the others must wait until the bank is idle,
                  which leads to traffic congestion in the writeback stage and under-utilization in
                  the preceding multiplication stage.
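               To make the writeback bottleneck concrete, the following toy Python model (our illustration only; the bank count, the destination-address rule, and the data are assumptions, not SCNN's actual parameters) multiplies all non-zero inputs by all non-zero weights and counts how many products contend for the same psum bank in one cycle.

```python
from collections import Counter

# Toy model of a Cartesian-product dataflow with psum bank conflicts.
# The bank count and the address rule are illustrative assumptions.
NUM_BANKS = 8

def cartesian_product_cycle(nz_inputs, nz_weights):
    """nz_inputs / nz_weights: lists of (index, value) non-zero elements."""
    products = []
    for in_idx, in_val in nz_inputs:
        for w_idx, w_val in nz_weights:
            psum_addr = in_idx + w_idx        # toy destination-address rule
            products.append((psum_addr, in_val * w_val))

    # Psums that target the same bank in the same cycle must be serialized,
    # so every extra hit on a bank costs one stall cycle in this toy model.
    bank_hits = Counter(addr % NUM_BANKS for addr, _ in products)
    stalls = sum(hits - 1 for hits in bank_hits.values() if hits > 1)
    return products, stalls

inputs  = [(0, 3), (2, -1), (5, 4)]           # (channel index, value)
weights = [(1, 2), (2, 7), (9, -3)]
_, stall_cycles = cartesian_product_cycle(inputs, weights)
print("extra writeback cycles from bank conflicts:", stall_cycles)
```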
               
               To mitigate the throughput degradation, inner-product-based approaches [2,3] use an index-matching unit. The channel indices of the non-zero elements are fetched
                  into the index holder, and all input indices are compared with all weight indices.
                  If an input index matches a weight index, the corresponding non-zero input/weight
                  data are sent to the backend multiplier. Such inner-product-based sparse matrix acceleration
                  approaches do not exhibit backend stalls, because the index matching is performed
                  in the pre-compute stage. However, the index-matching unit generates an inconsistent
                  number of input/weight pairs depending on the matrix density, so it cannot effectively
                  use the computing bandwidth. When the matrix density is low, the index-matching unit cannot
                  generate a sufficient number of matched pairs for the multipliers. On the other hand,
                  at high matrix density, a larger number of matched pairs is forwarded to the multipliers
                  than the number of multiplications that can be consumed in each clock cycle.
               
               Conventional approaches, in which the number of index-matching comparators and multipliers
                  is fixed at design time, show limited performance across varying matrix densities.
                  There is therefore a pressing need for an architecture that adjusts the number of
                  comparators and multipliers depending on the matrix density.
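               This trade-off can be seen with a back-of-the-envelope model. The sketch below is our own approximation, with placeholder comparator and multiplier counts; it estimates multiplier utilization and wasted matches when both resource counts are fixed at design time.

```python
# Rough utilization model for a fixed comparator/multiplier partition.
# The comparator and multiplier counts are placeholder assumptions.
def fixed_partition_utilization(density, num_comparisons=32, num_multipliers=16):
    # Expected matched input/weight pairs produced per cycle by the
    # index-matching unit (approximately density * number of comparisons).
    matched_pairs = density * num_comparisons

    if matched_pairs < num_multipliers:
        # Low density: the multipliers starve.
        mult_util = matched_pairs / num_multipliers
        wasted_matches = 0.0
    else:
        # High density: comparators produce more pairs than can be consumed.
        mult_util = 1.0
        wasted_matches = (matched_pairs - num_multipliers) / matched_pairs
    return mult_util, wasted_matches

for d in (0.1, 0.5, 0.9):
    util, waste = fixed_partition_utilization(d)
    print(f"density={d:.1f}  multiplier util={util:.2f}  wasted matches={waste:.2f}")
```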
               
             
            
                  III. Proposed Architecture	
               This section introduces the proposed architecture. First, we explain a reconfigurable
                  comparator-adder scheme. Using this reconfigurable circuit, we construct the index-matching
                  units and multipliers, and the number of each module varies depending on the target layer
                  sparsity. Furthermore, we fabricated a test chip for verification, and the details
                  of the chip implementation are explained.
               
               
                     1. Reconfigurable Comparator/Adder
                  Fig. 1 shows the schematic of the proposed reconfigurable comparator-adder design. For simplicity,
                     we assume 4-bit input data and a ripple-carry adder; however, other bit widths and
                     adder styles can obviously be used in this design. The reconfigurable circuit first
                     receives two operands (A[3:0], B[3:0]) and a comparator signal (Comp.). If Comp. is off,
                     the reconfigurable circuit acts as a conventional adder and computes the addition
                     result (S[3:0]). On the other hand, if Comp. is on, the operand B is bitwise flipped
                     and the adder operates in subtractor mode. When the subtractor output (S[3:0]) is zero,
                     the two operands are equal (A = B). Moreover, the circuit can perform magnitude comparison
                     as well as the equality comparison described above. If the sign bit of the subtraction is `1',
                     A is larger than B (A > B); otherwise, A is smaller than or equal to B (A ${\leq}$
                     B). Using these output signals, we can generate three comparison results (A < B,
                     A = B, and A > B).
                  
                  
                        
                        
Fig. 1. Proposed reconfigurable comparator-adder circuit.
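                  A minimal behavioral sketch of the two modes is given below, assuming an unsigned 4-bit datapath with a carry-in of 1 in comparator mode; the fabricated circuit derives the magnitude result from the sign information of the subtraction and may differ in these details.

```python
# Behavioral model of the reconfigurable comparator/adder (4-bit, unsigned).
# Assumption: comparator mode flips B and injects a carry-in of 1, so the
# adder effectively computes A - B; the silicon implementation may differ.
def reconfig_comp_adder(a, b, comp):
    a &= 0xF
    b &= 0xF
    if not comp:
        # Adder mode: plain 4-bit addition.
        return {"sum": (a + b) & 0xF}

    # Comparator mode: A + ~B + 1 = A - B (mod 16), carry-out set iff A >= B.
    raw = a + ((~b) & 0xF) + 1
    diff = raw & 0xF
    a_ge_b = bool(raw >> 4)
    return {
        "equal": diff == 0,                 # A = B
        "greater": a_ge_b and diff != 0,    # A > B
        "less": not a_ge_b,                 # A < B
    }

print(reconfig_comp_adder(5, 5, comp=True))   # equality detected
print(reconfig_comp_adder(9, 3, comp=False))  # {'sum': 12}
```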
                      
                
               
                     2. Index-matching Unit and Multiplier
                  So far, we have introduced the reconfigurable comparator/adder scheme; we now expand
                     the design to the implementation of the index-matching unit and the multiplier. Fig. 2 illustrates the reconfigurable index-matching unit and multiplier.
                  
                  First, the index-matching unit consists of the index holder and comparators. The index
                     holder stores the indices of the non-zero inputs/weights. The comparator array receives
                     the non-zero indices from the index holder and matches inputs with weights.
                     If an input index matches one of the weight indices, the corresponding non-zero
                     input/weight pair is forwarded to the backend multipliers. Note that the number
                     of matched pairs varies depending on the input/weight matrix sparsity. Hence,
                     we change the amount of hardware resource allocated to index matching
                     depending on the sparsity. At low matrix density (Fig. 2(a)), only a few matched input/weight pairs are generated. For instance, when the matrix
                     density is 0.1, 0.1${\times}$n input/weight pairs are generated on average (n is the
                     number of non-zero values used for the index comparison), so only 0.1${\times}$n multipliers
                     are required for the computation. On the other hand, at a density of 0.9 (Fig. 2(b)), 0.9${\times}$n pairs are matched on average, and 0.9${\times}$n multipliers
                     are needed, which is more than in the lower-density case.
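                  A minimal sketch of the index-matching behavior follows; the data layout (dictionaries keyed by channel index) is our assumption for illustration, and the pair count shows how the required number of multipliers tracks the density.

```python
# Sketch of the index-matching unit: pair non-zero inputs with non-zero
# weights that share the same channel index. The data layout is assumed.
def index_match(nz_inputs, nz_weights):
    """nz_inputs / nz_weights: dicts mapping channel index -> non-zero value."""
    return [(nz_inputs[i], nz_weights[i], i)
            for i in nz_inputs.keys() & nz_weights.keys()]

nz_inputs  = {0: 3, 2: -1, 5: 4, 7: 2}   # contents depend on matrix density
nz_weights = {2: 7, 5: -2, 6: 1}
pairs = index_match(nz_inputs, nz_weights)
print("matched (input, weight, index) pairs:", pairs)
print("multipliers needed this cycle:", len(pairs))
```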
                  
                  After the comparators are allocated to the index-matching unit, the rest of the reconfigurable
                     comparator/adder circuits are used for the multiplier module. The basic multiplication
                     method in digital circuits first generates partial products by multiplying the
                     multiplicand by the multiplier bits, and then accumulates the partial products using
                     adders. Based on this approach, we build the multipliers from multiple reconfigurable
                     adders.
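                  The sketch below restates this shift-and-add scheme in Python for unsigned 8-bit operands; each accumulation corresponds to one reconfigurable circuit operating in adder mode.

```python
# Shift-and-add multiplication built only from additions, mirroring how the
# reconfigurable adders can be chained into a multiplier. Unsigned 8-bit
# operands are assumed for illustration.
def shift_add_multiply(multiplicand, multiplier, bits=8):
    product = 0
    for i in range(bits):
        if (multiplier >> i) & 1:
            # Partial product: the multiplicand shifted to the bit position,
            # accumulated with an adder.
            product += multiplicand << i
    return product

assert shift_add_multiply(13, 11) == 13 * 11
print(shift_add_multiply(13, 11))  # 143
```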
                  
                  As a result, our reconfigurable circuit can change the number of comparators and multipliers
                     depending on the target matrix densities (Fig. 3).
                  
                  
                        
                        
Fig. 2. Index-matching unit and multipliers: (a) Low-density case; (b) high-density
                           case.
                        
                      
                  
                        
                        
Fig. 3. Reconfiguration of proposed circuit: (a) At the lower density, more comparators
                           are used than multipliers; (b) At the higher density, more adders are used than comparators.
                        
                      
                
               
                     3. Top-level Architecture and Test-chip Implementation
                  The top-level architecture of this design has 8 processing elements (PEs), input/weight
                     global SRAM buffers, I/O blocks, and a controller (Fig. 4(a)). The inputs are broadcast to all PEs for data reuse, and weights are dedicated
                     to each PE.
                  
                  A PE (Fig. 4(b)) has register files for non-zero inputs and weights, and index-aligning logic is
                     used to appropriately assign incoming non-zero data to the index holder, which contains
                     the non-zero indices. There are 7 computing units (CUs) and one 16-bit multiplier.
                     Each CU has 32 8-bit reconfigurable comparator/adder circuits, so it can
                     execute 32 8-bit comparisons in parallel or compute 16 16-bit additions, which can
                     in turn be configured into a 16-bit multiplier. When performing dense matrix multiplication,
                     no comparators are required; in this case, all the reconfigurable circuits are
                     used for 16-bit multiplication without computing any comparison operations.
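                  Restating the CU configuration as a small descriptive model (the class and field names are ours, and the per-cycle throughput figures are simplifying assumptions), each CU offers either 32 parallel 8-bit comparisons or 16 16-bit additions that together realize one 16-bit multiplier:

```python
# Descriptive model of one computing unit (CU); names and per-cycle
# throughput figures are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CUConfig:
    mode: str                     # "compare" or "multiply"
    reconfig_circuits: int = 32   # 8-bit reconfigurable comparator/adder slices

    def throughput_per_cycle(self):
        if self.mode == "compare":
            # All 32 slices act as 8-bit index comparators in parallel.
            return {"comparisons_8b": self.reconfig_circuits}
        # Pairs of 8-bit slices form 16 16-bit adders, which together
        # realize one 16-bit multiplier.
        return {"adders_16b": self.reconfig_circuits // 2, "multiplies_16b": 1}

print(CUConfig("compare").throughput_per_cycle())
print(CUConfig("multiply").throughput_per_cycle())
```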
                  
                  Fig. 5 shows the chip layout and die photograph of the proposed design. The test chip
                     was fabricated in a 28 nm CMOS technology and operates at 200 MHz. Synopsys Design Compiler
                     and IC Compiler were used for synthesis and place-and-route, and timing was analyzed using
                     Synopsys PrimeTime. The design has 64 reconfigurable CUs, 36 KB of register files
                     in the PEs, and 202 KB of global buffers, holding 238.5 KB of on-chip memory
                     in total.
                  
                  
                        
                        
Fig. 4. (a) Top-level architecture of our design; (b) a processing element (PE). GLB
                           stands for global SRAM buffer.
                        
                      
                  
                        
                        
Fig. 5. (a) Chip layout; (b) die photograph.
                      
                
             
            
                  IV. Results	
               In this section, we evaluate the performance of the proposed design by comparing
                  it with previous design approaches [2,3]. Because the designs have different amounts of hardware resources, such
                  as MAC units and index-matching units, we matched the configuration of our design
                  to each baseline for each comparison. We first evaluate the throughput over a range
                  of matrix densities, and then analyze real benchmarks.
               
               Fig. 6(a) compares the throughput of our design with SNAP [2]. In the SNAP architecture, each PE has a small number (3) of multipliers, and 3 PEs
                  share a comparator array. To relieve under-utilization in the low-density
                  region, SNAP uses a 32${\times}$32 comparator array, attaining
                  75% multiplier utilization for typical workloads in the high-density region.
                  However, a neural network usually exhibits different matrix densities across its layers,
                  so SNAP still suffers from much lower MAC utilization (30-35%) in the highly sparse region (density:
                  0.1) and wastes comparator resources in the dense region (density:
                  0.6-1.0). Meanwhile, our design can adjust the ratio between comparators and multipliers,
                  so we can achieve the optimal utilization of hardware resources across all matrix
                  densities. As a result, our design shows 1.06-8.00${\times}$ higher throughput than the SNAP
                  baseline.
               
               In Fig. 6(b), we further compare our design with the SparTen hardware accelerator [3]. Similar to SNAP, SparTen uses a large number of comparators (128) per multiplier
                  to achieve high MAC utilization. However, such an approach incurs a high resource
                  overhead to implement the comparator array. As explained above, we balance the number
                  of comparators and multipliers to find the optimal computation cost. Thus, our design
                  improves the throughput by 7-17${\times}$ compared to the SparTen baseline.
               
               In both cases, the performance improvement over the baselines at density = 1.0
                  is dramatically higher than at other densities. For a fully dense matrix, all the
                  data are non-zero, so we do not use any comparators and all the reconfigurable
                  comparator/adder circuits are set to adder mode for multiplication.
               
               In our analysis, the input/weight densities of AlexNet and VGG-16 are 0.69/0.41 and
                  0.61/0.33, respectively. Accordingly, the throughput improvements on these benchmarks are 1.2-2.1${\times}$ and 12-14${\times}$.
               
               We also compare the computing area and power with SNAP. Our design shows 6% larger
                  area and 4% higher power than SNAP, because SNAP uses fixed multipliers whereas our design
                  adds area for the reconfigurable multiplier-comparator logic. However, in SNAP,
                  93% of the area comes from the comparator array and only 7% is for the multipliers,
                  so the reconfigurable multiplier does not lead to a large area overhead. Similarly,
                  our design has 3% larger area and 2% higher power consumption than the SparTen baseline,
                  in which 3% of the area is occupied by the multipliers.
               
               
                     
                     
Fig. 6. Improvement of throughput over (a) SNAP; (b) SparTen.
                   
             
            
                  V. Conclusion	
               In this paper, we introduced a reconfigurable comparator-multiplier architecture to
                  design a high-performance sparsity-aware NPU across all matrix densities. We adjust
                  the number of comparators and multipliers depending on the workload, thereby maintaining
                  high utilization of the hardware resources in sparse neural network computation. We fabricated our
                  hardware in a 28 nm CMOS technology, and experimental results show that the throughput
                  is improved by 1.06-17.00${\times}$ compared to the baselines.
               
             
          
         
            
                  ACKNOWLEDGMENTS
               
                  				This work was supported by the National Research Foundation of Korea (NRF) grant
                  funded by the Korea government (MSIT) (NRF-2022R1F1A1070414, 100%). The chip fabrication
                  and EDA tool were supported by the IC Design Education Center (IDEC), Korea (Samsung
                  28nm CMOS).
                  			
               
             
            
                  
                     References
                  
                     
                        
                        A. Parashar et al., “SCNN: An accelerator for compressed-sparse convolutional neural
                           networks,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture
                           (ISCA). IEEE, 2017, pp. 27-40.

 
                     
                        
                        J.-F. Zhang et al., “SNAP: An efficient sparse neural acceleration processor for unstructured
                           sparse deep neural network inference,” IEEE Journal of Solid-State Circuits, vol.
                           56, no. 2, pp. 636-647, 2020.

 
                     
                        
                        A. Gondimalla et al., “SparTen: A sparse tensor accelerator for convolutional neural
                           networks,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture,
                           2019, pp. 151-165.

 
                     
                        
                        S. Ryu et al., “SPRITE: Sparsity-aware neural processing unit with constant probability
                           of index-matching,” in 2021 Design, Automation & Test in Europe Conference & Exhibition
                           (DATE). IEEE, 2021, pp. 663-666.

 
                   
                
             
            
            
               			Sungju Ryu is currently an Assistant Professor at Sogang University. He received
               the B.S. degree from Pusan National University, in 2015, and the Ph.D. degree from
               POSTECH in 2021. His current research interests include energy-efficient NPU and processing-in-memory.
               		
            
            
            
               			Jae-Joon Kim is currently a professor at Seoul National University, Seoul, South
               Korea. He received the B.S. and M.S. degrees in electronics engineering from Seoul
               National University, Seoul, South Korea, in 1994 and 1998, respectively, and the Ph.D.
               degree from the School of Electrical and Computer Engineering, Purdue University,
               West Lafayette, IN, USA, in 2004. From 2004 to 2013, he was a Research Staff Member
               with the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. He was a
               Professor at the Pohang University of Science and Technology, Pohang, South Korea
               from 2013 to 2021. His current research interests include the design of deep learning
               hardware accelerators, neuromorphic processors, hardware security circuits, and circuits
               for exploratory devices.