
Sungju Ryu is with the School of Electronic Engineering, Soongsil University, Seoul, Korea.



Keywords: Hardware accelerator, MAC unit, neural processing unit, quantized neural networks, variable bit-precision

I. INTRODUCTION

The model complexity of neural networks has been rapidly increasing to meet the target accuracy of neural network applications. However, edge devices usually have limited computing capability due to power constraints, so meeting the real-time latency targets required by modern complex network models is challenging.

Various methods for making deep neural networks compact have been explored to ease the burden of real-time computing, including quantization, weight pruning, and separable convolution. Among them, quantization makes deep neural networks lighter by expressing the inputs and weights in fewer bits. A disadvantage of approximating the network parameters and activations via quantization is the loss of inference accuracy. As a result, many quantization techniques have been proposed to reduce the approximation error.

Recently, open-source machine learning frameworks such as PyTorch [1] and TensorFlow [2] have started to provide quantization APIs that make the quantization of neural networks easier for researchers, thereby reducing service development time. As a result, quantization as a means of compressing neural networks is becoming more popular.

Meanwhile, mapping quantized neural networks onto conventional fixed bit-width hardware cannot maximize computational efficiency. For example, an 8-bit input/weight multiplication on a 32-bit multiplier circuit has the same throughput and similar energy efficiency as a 32-bit input/weight multiplication. To maximize the performance of quantized neural networks, previous works have proposed variable-bit multiply-accumulate (MAC) units. However, these variable-bit MAC microarchitectures were implemented under different experimental conditions, making it difficult to select the most suitable scheme for a target design space. Previous work [3] studied precision-scalable MACs, but it evaluated only ideal workloads rather than real benchmarks.

Our contributions in analyzing and comparing these variable-bit MAC units are as follows.

1) We review variable bit-precision MAC microarchitectures, including subword-parallel and one-/two-sided bit-width flexible MAC arrays.

2) We synthesize the MAC arrays using a 28 nm standard cell library. Area, energy consumption, and throughput are analyzed using real neural network benchmarks.

II. REVIEW OF PRECISION-SCALABLE MAC MICROARCHITECTURES

1. One-sided Flexible Bit-width Designs

1) Stripes

In neural networks, the required bit-precision of neurons varies across layers. The main concept of Stripes [4] is that performance can be improved linearly if the computation time scales with the bit-width of the neurons. Fig. 1(a) shows the baseline fixed bit-width MAC array. In the baseline design, inputs and weights are first stored in the input/weight buffers. Inputs are multiplied by weights after being loaded from the buffers. Then, the partial sums are added and accumulated until an output number is constructed. The Stripes accelerator proposed a serial inner product (SIP) unit, which includes input/weight buffers, AND gates, an adder tree, an accumulator, and bit-shift logic. Considering that multipliers usually generate partial products using AND gates, the SIP multiplies the weights by the input bits using AND logic. Fig. 1(b) shows a 2-bit multiplication example. First, the 2-bit weights are AND-ed with the LSBs of the 2-bit input numbers, and the two partial sums are added and accumulated in the buffer. Second, the 2-bit weights are AND-ed with the second bits of the 2-bit inputs. After these AND-and-accumulate operations, the SIP unit finishes the 2-bit dot product computation, and the numerical result is exactly the same as the baseline. If the SIP has the same number of AND gates as the baseline multiplier, the throughput of the SIP equals that of the baseline inner product unit. For example, if the baseline inner product unit includes two 2-bit multipliers, the SIP can have 8 AND gates, thereby achieving the same throughput as the baseline design.
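The following Python sketch illustrates the SIP idea under simplifying assumptions (unsigned operands, one input bit processed per simulated clock cycle); the function and variable names are ours, not from the Stripes design.

    # Bit-serial dot product in the spirit of Stripes' SIP unit [4].
    # Unsigned operands are assumed for simplicity.
    def sip_dot_product(inputs, weights, input_bits):
        acc = 0
        for b in range(input_bits):              # one clock cycle per input bit
            partial = 0
            for x, w in zip(inputs, weights):
                bit = (x >> b) & 1               # current input bit
                partial += w * bit               # AND gates: weight AND-ed by the bit
            acc += partial << b                  # shift by bit position and accumulate
        return acc

    assert sip_dot_product([2, 3], [1, 2], input_bits=2) == 2*1 + 3*2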

Fig. 1. (a) Baseline fixed bit-width MAC unit; (b) serial inner product unit of Stripes [4].

2) UNPU

The processing engine in UNPU [5] deals with fully variable weight bits from 1-bit to 16-bit precision (Fig. 2). An input number is stored in the buffer and AND-ed with the weight bits for W clock cycles (W: number of weight bits). After the processing engine finishes the multiplications between the input/weight pairs, the results are sent to the adder/subtractor tree. Furthermore, lookup table (LUT)-based bit-serial computation is adopted for energy-efficient matrix multiplication. Possible partial products are pre-stored in the partial product table; if the same bit pattern is repeated, the partial product is simply fetched from the table, thereby maximizing energy efficiency.
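As a hedged sketch of this LUT idea (the group size, table layout, and names below are our assumptions, not UNPU's actual design), all possible partial sums for a weight bit-pattern can be precomputed once and then fetched for each weight bit-slice:

    # LUT-based bit-serial dot product in the spirit of UNPU [5].
    def make_partial_product_table(inputs):
        # Precompute the sum of selected inputs for every possible bit pattern.
        n = len(inputs)
        return [sum(x for i, x in enumerate(inputs) if (p >> i) & 1)
                for p in range(1 << n)]

    def lut_dot_product(inputs, weights, weight_bits):
        table = make_partial_product_table(inputs)   # built once, reused every cycle
        acc = 0
        for b in range(weight_bits):                 # one cycle per weight bit
            pattern = 0
            for i, w in enumerate(weights):
                pattern |= ((w >> b) & 1) << i       # bit-slice across all weights
            acc += table[pattern] << b               # table fetch instead of adds
        return acc

    assert lut_dot_product([2, 3], [1, 2], weight_bits=2) == 2*1 + 3*2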

Fig. 2. Processing engine with fully variable weight bits in UNPU [5].

2. Two-sided Flexible Bit-width Designs

1) Envision

Envision [6] introduced a subword-parallel MAC design scheme. The MAC unit consists of 16 submultipliers. In the high bit-precision mode (16-bit, Fig. 3(a)), all the submultipliers are turned on to construct the high-bit multiplication result. On the other hand, when targeting low bit-widths, some submultipliers are turned off by masking the input signals of that part of the MAC unit. To improve the throughput and energy efficiency of the MAC, the scalable arithmetic unit reuses the inactive submultiplier cells. In the 8-bit precision mode (Fig. 3(b)), four 4x4 submultipliers are used for each 8-bit multiplication. In this case, two 8-bit multiplications are performed in parallel, so 8 out of 16 submultipliers are used in total. Moreover, when targeting 4-bit precision (Fig. 3(c)), only one 4x4 submultiplier is used per multiplication, and four 4-bit multiplications are handled at the same time; hence, 4 out of 16 submultipliers are used in this case.
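The following sketch shows, assuming unsigned operands, how an 8x8 product is assembled from four 4x4 submultiplier outputs; the function name is illustrative, not Envision's.

    # Subword-parallel 8x8 multiplication from four 4x4 submultipliers,
    # as in the Envision scheme [6].
    def mul8x8_from_4x4(x, w):
        x_hi, x_lo = x >> 4, x & 0xF
        w_hi, w_lo = w >> 4, w & 0xF
        # Four 4x4 submultipliers, each shifted to its bit position.
        return (((x_hi * w_hi) << 8)
                + ((x_hi * w_lo) << 4)
                + ((x_lo * w_hi) << 4)
                + (x_lo * w_lo))

    assert mul8x8_from_4x4(200, 123) == 200 * 123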

When the bit-width is scaled down, the critical-path delay is shortened (Fig. 3(d)). By combining the subword-parallel MAC microarchitecture with voltage scaling, the precision-scaled arithmetic blocks achieve much higher energy efficiency while maintaining the same throughput as the high bit-precision mode.

Fig. 3. Subword-parallel MAC engine proposed in Envision [6]: (a) 16-bit; (b) 8-bit; (c) 4-bit multiplication modes; (d) critical paths at different bit-precision modes.

2) Bit Fusion

Bit Fusion [7] proposed a bit-level dynamically composable MAC unit called a fusion unit (Fig. 4(a)). Bit Fusion performs 2-dimensional physical grouping of its submultipliers, called BitBricks. The grouped BitBricks become a fused processing engine (fused-PE) that executes a multiplication at the required bit-width. Depending on the target bit-precision, the fusion unit can form various numbers of fused-PEs. When an 8x8 multiplication is performed (Fig. 4(b)), all the BitBricks in the fusion unit constitute one fused-PE. For an 8x4 multiplication (Fig. 4(c)), 8 BitBricks are required; considering that a fusion unit consists of 16 BitBricks, two 8x4 multiplications are performed in parallel in the fusion unit. In the case of a 2x2 multiplication (Fig. 4(d)), only one BitBrick is used per multiplication, so 16 2x2 multiplications are computed in a clock cycle by the fusion unit. After the 2-bit multiplications in the BitBricks, the partial multiplication results are shifted depending on the target bit-precision. For example, to construct an 8x8 multiplication, the 2-bit multiplication results from the 16 BitBricks are shifted by 0 to 12 bits depending on their bit positions. In the same manner, for an 8x4 multiplication, outputs from the 8 BitBricks are shifted by 0 to 8 bits. No shift operations are needed for a 2x2 multiplication, because a BitBrick can fully express the 2-bit multiplication by itself. Once the shift operations are finished, the results are added through the adder tree to complete the dot product computation.
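A minimal sketch of this decomposition, assuming unsigned operands (the helper names are ours, not Bit Fusion's):

    # Fused-PE style 8x8 multiplication from 16 2x2 BitBrick products,
    # in the spirit of Bit Fusion [7]. Each product gets its own shift (0..12).
    def split2(v, bits):
        # Split an unsigned value into 2-bit digits, LSB first.
        return [(v >> s) & 0x3 for s in range(0, bits, 2)]

    def fused_mul(x, w, x_bits=8, w_bits=8):
        acc = 0
        for i, xd in enumerate(split2(x, x_bits)):       # 4 input digits
            for j, wd in enumerate(split2(w, w_bits)):   # 4 weight digits
                acc += (xd * wd) << (2 * (i + j))        # per-BitBrick variable shift
        return acc

    assert fused_mul(173, 91) == 173 * 91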

Fig. 4. (a) Dynamically composable fusion unit of Bit Fusion [7]; (b) 8x8 multiplication; (c) 8x4 multiplications (2x parallelism); (d) 2x2 multiplications (16x parallelism).

3) BitBlade

To enable bit-precision flexibility, each BitBrick in the Bit Fusion fusion unit requires dedicated variable bit-shift logic. However, this variable bit-shift logic leads to a large area overhead. To mitigate the logic complexity, the BitBlade [8] architecture proposed a bitwise summation method. When a dot product computation is performed, the inputs and weights are first divided into 2-bit numbers. The divided 2-bit input/weight pairs with the same index position from the different input/weight numbers are grouped, and the grouped pairs always share the same bit-shift parameters. When each processing element is dedicated to one group, each processing element needs only one variable shift logic block. As a result, the area overhead of realizing the variable-bit MAC unit is largely mitigated compared to the Bit Fusion architecture, where each BitBrick requires its own shift logic.

Fig. 5(a) and (b) illustrate how the bitwise summation method works. For simplicity, assume a PE includes 4 BitBricks. In the 4x4 case (Fig. 5(a)), the 4-bit numbers are divided into 2-bit partial numbers. The 2-bit partial numbers from the same index position of the different input/weight numbers are grouped and placed in the same PE. Then, the 2-bit partial inputs are multiplied by the 2-bit partial weight numbers, and the multiplication results are added using the intra-PE adder. The added numbers are shifted depending on the bit positions in each PE, and together they form a dot product result. Considering that 16 BitBricks are used across the 4 PEs in the example, four 4x4 multiplications are performed in parallel. In the same manner, the PE array achieves 8x parallelism in the 4x2 multiplication mode.
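The following sketch contrasts the orderings: products that share a shift amount are summed first (the intra-PE adder) and shifted once per group, instead of once per BitBrick. Unsigned 4-bit operands and illustrative names are assumed.

    # Bitwise summation in the spirit of BitBlade [8]: add first, shift once.
    def bitblade_dot(xs, ws, bits=4):
        digit_positions = range(bits // 2)               # 2-bit digit indices
        acc = 0
        for i in digit_positions:                        # input digit position
            for j in digit_positions:                    # weight digit position
                # One PE: all pairs sharing shift 2*(i+j) are added first...
                group_sum = sum(((x >> 2*i) & 0x3) * ((w >> 2*j) & 0x3)
                                for x, w in zip(xs, ws))
                acc += group_sum << (2 * (i + j))        # ...then shifted once
        return acc

    xs, ws = [5, 9, 14, 3], [7, 2, 11, 6]
    assert bitblade_dot(xs, ws) == sum(x * w for x, w in zip(xs, ws))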

Fig. 5. Bitwise summation scheme proposed in BitBlade [8]. For a simple explanation, it is assumed that a PE consists of 4 BitBricks. Examples of (a) 4x4 multiplication; (b) 4x2 multiplication.

III. ANALYSIS ON VARIABLE BIT-PRECISION MAC ARRAYS

In this section, we analyze the precision-scalable MAC microarchitectures. One-sided and two-sided flexible bit-width designs, the utilization of the submultipliers, and the variable bit-shift logic are compared.

1. Under-utilization of Submultipliers

1) Two-sided Bit-width Scaling on One-sided Flexible Bit-width Designs

Stripes and UNPU only support bit-width flexibility for either inputs or weights. However, most recent quantized neural networks require bit-width scaling for both inputs and weights. When low bit-widths are used for both operands, a large portion of the multiplier logic remains idle. Fig. 6 shows an example of a 2x2 multiplication on the UNPU hardware. Considering that one operand of the UNPU is expressed in 16 bits, the 16-bit accumulation is repeated for 2 clock cycles for the 2x2 multiplication. During the computation, 14 out of the 16 bit positions are not used, so a large part of the MAC unit remains idle.

Fig. 6. Two-sided low-bit quantized neural network on a one-sided flexible bit-width design [5]. 14 out of 16 AND gates are not used.

2) Performance Loss at Low-bit Precision

The subword-parallel multiplier proposed in Envision turns its submultiplier blocks on or off depending on the target bit-width. For a 16-bit multiplication (Fig. 3(a)), 16 out of 16 submultipliers are turned on. For an 8-bit multiplication (Fig. 3(b)), 4 out of 16 submultipliers are required per product; to perform two 8-bit operations in parallel, 8 submultipliers are used and the other 8 remain idle. When a 4-bit multiplication is computed (Fig. 3(c)), only 1 out of 16 submultipliers is required per product; to maximize the throughput of the MAC unit, four 4-bit multiplications are performed in parallel, so 4 submultipliers are used. Thus, all the submultipliers are fully used at 16-bit, half of them at 8-bit, and only a quarter at 4-bit, with the remaining three quarters idle. As the bit-precision is scaled down, the subword-parallel multiplier of Envision linearly loses throughput due to the under-utilization of its submultipliers.

3) Asymmetric Bit-width Between Operands

Envision supports only a limited set of input/weight precisions: the bit-width of the inputs must equal the bit-width of the weights, e.g., 4(input)/4(weight)-bit, 8/8-bit, or 16/16-bit. However, the optimal bit-width varies depending on the target accuracy of the neural network application. When the target neural network requires 8x4 multiplications (Fig. 7(a)), both the 8-bit and 4-bit operands are mapped to the 8x8 multiplication mode. The MAC performance of the 8x4 multiplication then equals that of the 8x8 multiplication, which leads to under-utilization of the submultipliers and a 2x performance degradation compared with the ideal case. In the same manner, when a 16x4 multiplication is necessary (Fig. 7(b)), it is mapped to the 16x16 multiplication mode, which leads to a 4x under-utilization of arithmetic resources.

Fig. 7. Asymmetric bit-width between operands on the subword-parallel MAC of Envision: (a) 8x4 MULs in 8x8 computation mode; (b) 16x4 MULs in 16x16 computation mode.

2. Logic Complexity of Bit-shift Logic

The fusion unit of the Bit Fusion architecture can handle 2-bit to 8-bit configurations for both inputs and weights. To implement such a dynamically composable scheme, dynamic bit-shift logic must be dedicated to each BitBrick. As a simple example, if 4 BitBricks are included in a fusion unit and 4 fusion units are used to perform a dot product, 4 variable bit-shift blocks are required in each fusion unit and 16 shift blocks are used in total. On the other hand, the BitBlade design groups the BitBricks with the same variable-shift parameter from different input/weight pairs into one processing element. By doing so, each processing element requires one bit-shift block and 4 shift blocks are used in total, which is only 1/4 of the Bit Fusion design.
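A back-of-the-envelope count of the shift blocks under the example's assumptions (4 BitBricks per fusion unit/PE, 4 units cooperating on one dot product):

    # Shifter count comparison for the example above.
    units, bricks_per_unit = 4, 4
    bitfusion_shifters = units * bricks_per_unit   # one shifter per BitBrick -> 16
    bitblade_shifters = units                      # one shifter per PE       -> 4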

IV. EXPERIMENTAL RESULTS

Simulation Setup: We compare the variable-bit MAC microarchitectures in this section. For a fair comparison, we fixed the bit-width of the submultipliers to 2-bit. We assumed that 16384 dot product units (= 4096 2-bit submultipliers) were used in the designs. All the microarchitectures were synthesized using a 28 nm standard cell library targeting a clock frequency of 500 MHz. We did not consider voltage scaling on the subword-parallel MAC array. For the evaluation (Fig. 8), we first extracted the area and power consumption of each design. Depending on the bit-precision, the MAC array consumes different switching power; therefore, we performed the power simulation for all the bit-width modes and stored the results in a look-up table (LUT). Our simulator can read PyTorch-based [1] model definitions, so we directly utilized the model definition classes for the analysis. For the low-bit quantization models, the first and last layers still used 8-bit precision, and low bit-widths were applied to the remaining layers.
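A hypothetical sketch of this per-layer rule (the layer names and helper below are ours, not the simulator's actual code):

    # Per-layer bit-width assignment for the low-bit quantization models:
    # the first and last layers stay at 8-bit, the rest use the low target width.
    def assign_bitwidths(layer_names, low_bits):
        widths = {}
        for idx, name in enumerate(layer_names):
            first_or_last = idx in (0, len(layer_names) - 1)
            widths[name] = 8 if first_or_last else low_bits
        return widths

    # Example: a 4-layer model with a 2-bit target precision.
    print(assign_bitwidths(["conv1", "conv2", "conv3", "fc"], low_bits=2))
    # -> {'conv1': 8, 'conv2': 2, 'conv3': 2, 'fc': 2}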

We performed the analysis using the weight-stationary dataflow. Depending on the dataflow, loops over tiled matrices show different performance; however, we focused on the MAC array microarchitecture, which is orthogonal to the dataflow, so we did not use other dataflows in this work. Both Stripes and UNPU target one-sided bit-width flexibility, so we analyzed only the Stripes design.

Area: Fig. 9 shows the area comparison between the variable-bit MAC microarchitectures. Envision and Bit Fusion show a large area for the bit-shift and accumulation logic needed to implement variable-bit MAC units. Envision supports a smaller number of bit-width modes, but its subword-parallel MAC scheme leads to a larger area for accumulators. BitBlade introduced the bitwise summation scheme, thereby reducing the number of bit-shift circuits per processing element. Meanwhile, Stripes uses a bit-serial computing method, which is typically adopted in area-constrained small chip designs. Stripes shows the smallest logic area, but it cannot achieve the maximum performance due to its one-sided bit-width flexibility, as discussed in the throughput and energy analyses below.

Energy Consumption: Fig. 10 shows the energy consumption of the MAC designs. The shift-add-accumulate logic that handles the variable-bit cases accounts for the largest part of the energy consumption. The optimized versions of Bit Fusion and BitBlade (BitFusion_opt and BitBlade_opt) reduce the switching power of the unused input buffers at high precisions by gating clock signals; they also show reduced energy consumption in the low-bit (2-bit) mode because 8-bit precision is still used in the first and last layers. The reconfigurable logic of BitBlade is much smaller than that of Bit Fusion thanks to the bitwise summation scheme. Stripes achieves energy efficiency comparable to BitBlade in the 8-bit mode because its one-sided flexibility keeps the reconfigurable logic light, but it becomes energy-inefficient at low bit-precisions (especially at low weight bit-widths) because it always operates in the 8-bit mode for the weight numbers.

Throughput: Fig. 11 compares the throughput/area of the MAC units. The optimized versions of Bit Fusion and BitBlade (BitFusion_opt and BitBlade_opt) show no throughput improvement over the original Bit Fusion and BitBlade designs, because the clock gating technique does not affect throughput/area efficiency. BitBlade shows higher throughput/area than Bit Fusion and Envision. Stripes supports variable bit-precision only for inputs, so it cannot maximize performance when low bit-widths are used for weights. Envision suffers from processing element-level under-utilization at low precisions, so its throughput/area is lower than that of the other schemes. Bit-serial computing with one-sided bit-flexibility shows energy efficiency similar to BitBlade_opt at extremely asymmetric bit-widths, but its performance degrades in the other modes because it cannot support low weight bit-widths, so the MAC units always operate in the 8-bit weight mode.

Selection of Microarchitecture: When a chip must be designed under a very tight area constraint (Fig. 9), Stripes can be an attractive solution, with a 27-57% smaller area than the other microarchitectures thanks to its bit-serial computation. Furthermore, in the extremely asymmetric bit-width case (2x8b), Stripes shows higher throughput/area (1.37-4.46x, Fig. 11) than the others. In terms of energy consumption (Fig. 10), Stripes at 2x8b outperforms the other microarchitectures by 14-83%, while remaining comparable to BitBlade_opt. On the other hand, BitBlade shows the highest performance for typical workloads thanks to the light circuit overhead of the variable-shift logic under the bitwise summation method.

Fig. 8. Experimental setup.
Fig. 9. Area comparison between variable-bit MAC microarchitectures.
Fig. 10. Energy consumption of variable-bit MACs. Symmetric bit-width cases (left) and asymmetric bit-width cases (right).
Fig. 11. Comparison of throughput/area. Symmetric bit-width cases (top) and asymmetric bit-width cases (bottom).

V. CONCLUSION

In this paper, we reviewed and analyzed various variable bit-precision MAC units, including the subword-parallel scheme and one-/two-sided flexible bit-width designs. These designs were originally implemented under different experimental conditions, which makes them difficult to compare. We synthesized the MAC designs under the same design conditions and constraints, and analyzed their area, effective throughput, and energy consumption. Our main contribution is to help researchers choose the most suitable microarchitecture for a given set of design conditions.

ACKNOWLEDGMENTS

This work was supported by the Soongsil University Research Fund (New Professor Support Research) of 2021 (100%). The EDA tool was supported by the IC Design Education Center (IDEC), Korea.

References

[1] PyTorch, https://pytorch.org
[2] TensorFlow, https://www.tensorflow.org
[3] E. M. Ibrahim, et al., 2022, Taxonomy and benchmarking of precision-scalable MAC arrays under enhanced DNN dataflow representation, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 69, No. 5, pp. 2013-2024.
[4] P. Judd, et al., 2016, Stripes: Bit-serial deep neural network computing, in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE, pp. 1-12.
[5] J. Lee, et al., 2018, UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision, IEEE Journal of Solid-State Circuits, Vol. 54, No. 1, pp. 173-185.
[6] B. Moons, et al., 2017, Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI, in 2017 IEEE International Solid-State Circuits Conference (ISSCC), IEEE, pp. 246-247.
[7] H. Sharma, et al., 2018, Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural network, in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), IEEE, pp. 764-775.
[8] S. Ryu, et al., 2019, BitBlade: Area and energy-efficient precision-scalable neural network accelerator with bitwise summation, in Proceedings of the 56th Annual Design Automation Conference (DAC), pp. 1-6.
Sungju Ryu

Sungju Ryu is an assistant professor at Soongsil University, Seoul, Korea. He was a Staff Researcher at the Samsung Advanced Institute of Technology (SAIT), where he focused on high-performance computer architecture design. He received his B.S. degree from Pusan National University in 2015 and his Ph.D. degree from POSTECH in 2021. His current research interests include energy-efficient hardware accelerators for deep neural networks.