
Sungju Ryu (Sogang University, Seoul, Korea)



Keywords: FPGA, processing-in-memory, hardware accelerator, neural network, deep learning, SRAM

I. INTRODUCTION

Several SRAM-based PIM architectures [1,2] have been presented to mitigate the von Neumann bottleneck. One of the well-known approaches is to perform tensor multiplications inside the memory array. A tensor multiplication in the array consists of two parts: 1) element-wise multiplication between an input and a weight and 2) summation of the partial products. Because an element-wise multiplication can be decomposed into several binary multiplications realized with AND operations, it can be performed directly in the memory cells. A popular approach for SRAM PIM is therefore to activate multiple wordlines of the SRAM array simultaneously.

Meanwhile, one of the popular ways to verify design models before expensive chip fabrication is to use FPGA chips. However, in typical SRAM models, only one wordline can be activated in a single clock cycle, which differs from the SRAM array model used for PIM. In an FPGA, Block RAMs (BRAMs) take the place of the SRAMs. The BRAMs are provided as built-in hard blocks, and designers cannot modify their behavior. Hence, the PIM array cannot be verified directly on the FPGA due to the fixed behavior of the memory.

Our contribution in this work is to analyze approaches for evaluating SRAM PIM accelerators on FPGAs. To the best of our knowledge, this is the first work to analyze PIM mapping methods on an FPGA.

We analyze the following three approaches: 1) Weight mapping on a BRAM row. 2) Weight mapping on flip-flops. 3) Input enumeration-based dot product. We further extend the three mapping schemes to the multi-FPGA evaluation case. The evaluation methods are validated on a real neural network benchmark.

II. PRELIMINARIES

1. Design Approach of Digital SRAM PIM

Fig. 1 shows the design method of digital SRAM PIM arrays, comparing it with the read operation of conventional SRAM arrays. In the SRAM array (Fig. 1(a)), a wordline is shared by SRAM cells located in multiple array columns, and a bitline is shared by SRAM cells located in multiple array rows. When a wordline is activated, we can simultaneously read all the memory cells attached to the wordline through the multiple bitlines at the array columns. Considering that only 1 wordline row can be accessed in the conventional SRAM array, `N' clock cycles are spent reading all the bits in the entire array when the array includes `N' wordline rows.

On the other hand, digital SRAM PIM schemes simultaneously activate multiple wordlines, so multiple memory cells attached to a bitline can be read at once (Fig. 1(b)). Using this concept, the dot product computation that generates a partial sum (psum) consists of the following three steps. 1) Activate multiple wordlines: It is widely known that a binary multiplication (e.g., XNOR, AND) can be performed in an SRAM cell [1,2]. If we assume that a weight is pre-stored in a cell and an input is provided on a wordline, the binary multiplication result is generated. 2) Accumulate partial products: The partial products from the SRAM cells are binary multiplication results, and they are added together in the backend adder tree. The resulting psum is usually accumulated and finally becomes the output number. 3) Bit-serial/parallel computing for multiple bit-widths: Because a read operation of a memory cell only supports 1-bit multiplication, multiple memory cells are needed to construct a multi-bit multiplication result. The multi-bit result can be obtained from multiple cells distributed over the spatial domain (bit-parallel computing) [3]. Alternatively, we can select bit-serial computing by reusing the memory cell over the time domain.

Fig. 1. (a) Read operation of conventional SRAM/BRAM models; (b) Dot product in digital SRAM PIM array.
../../Resources/ieie/JSTS.2024.24.3.218/fig1.png
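To make the three-step dot product concrete, the following is a minimal behavioral sketch in Python. The 4x2 array, the 4-bit bit-serial schedule, and all names are illustrative assumptions for explanation, not the exact datapath of any cited macro.

```python
# Behavioral sketch of the dot product in Fig. 1(b) (illustrative assumptions only).

def pim_dot_product(weights, inputs, input_bits=4):
    """weights: n x m array of 0/1 cells (one column produces one psum).
    inputs: n unsigned integers fed bit-serially on the n wordlines."""
    n, m = len(weights), len(weights[0])
    psums = [0] * m
    # Step 3: bit-serial computing -- one input bit per clock, LSB first.
    for b in range(input_bits):
        in_bits = [(x >> b) & 1 for x in inputs]  # bits driven on the n wordlines
        for col in range(m):
            # Step 1: all n wordlines are active at once, so every cell of the
            # column produces a binary product (AND of input bit and stored weight).
            partial_products = [in_bits[row] & weights[row][col] for row in range(n)]
            # Step 2: the backend adder tree reduces the n partial products;
            # the result is weighted by the bit position and accumulated.
            psums[col] += sum(partial_products) << b
    return psums

w = [[1, 0], [1, 1], [0, 1], [1, 0]]   # 4 rows x 2 columns of weight bits
x = [3, 5, 2, 7]                       # 4-bit inputs
print(pim_dot_product(w, x))           # [15, 7]: col0 = 3+5+7, col1 = 5+2
```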

2. Limitation of SRAM PIM Evaluation on FPGA

FPGA typically realizes the data/control paths using LUT-based configurable logic blocks and the memory arrays using BRAMs. However, to make the memory behave as a PIM array, the circuit designers have to define and generate a new layout model for the PIM block. When targeting an application-specific integrated circuit (ASIC) design, it is possible to make a custom layout model, but we cannot modify the layout of any component in the FPGA. As a result, the activation of multiple wordlines, the major feature of the PIM array, cannot be implemented on the built-in BRAMs of the FPGA. Therefore, we aim to mitigate such a limitation of SRAM PIM evaluation on FPGA and help designers verify the functional correctness of the PIM array using the FPGA. Meanwhile, NullaNet [4] realized an input enumeration-based neural network computation for the FPGA, but it did not target the evaluation of PIM array operation across the various mapping methods.

III. FUNCTIONAL VERIFICATION ON FPGA FOR DIGITAL SRAM PIM

In this section, we analyze three possible approaches to evaluate digital SRAM PIM array models on FPGA: 1) Weight mapping on a BRAM row. 2) Weight mapping on flip-flops. 3) Input enumeration-based dot product. These approaches are compared with each other and verified using a real neural network benchmark. We also present a method to evaluate a large-sized PIM SoC on a multi-FPGA framework and a top-level evaluation flow.

1. Weight Mapping on a BRAM Row

The first method for the PIM array mapping is to use built-in BRAM blocks (Fig. 2). The PIM array consists of memory cells in `n' rows and `m' columns. As explained in Section 2.1, memory cells located in a single column contribute to generating a psum. On the other hand, a BRAM cannot accumulate the partial products by activating multiple wordline rows (Section 2.2). Hence, the PIM array is split into independent rows, and the rows are distributed across multiple BRAM arrays (Fig. 2). First, weights are pre-stored in the BRAM arrays. The AND operations are replaced by read operations of the BRAM arrays, and inputs are fed to each wordline row, just as in the SRAM PIM operation. The partial products generated through the AND operations are accumulated outside the array. The partial products from the first index of the different BRAM arrays are sent to the adder with the first index, which constructs `Psum[0]'. Next, the partial products from the second index of the different BRAM arrays are sent to the second adder, and `Psum[1]' is generated. Finally, the partial products from the last (m-1) index of the different BRAM arrays are sent to the last adder, and `Psum[m-1]' is generated. Such a method can access the m${\times}$n cells simultaneously, thereby mimicking the behavior of the PIM array. However, this scheme with separated BRAM arrays can only activate a single row of each BRAM array, which reduces the BRAM utilization and requires a large number of BRAM arrays.

Fig. 2. PIM mapping method 1: Weight mapping on a BRAM row.
../../Resources/ieie/JSTS.2024.24.3.218/fig2.png
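The sketch below models this first mapping method behaviorally. The class name and the one-PIM-row-per-BRAM packing are illustrative assumptions; the sketch only shows why BRAM utilization drops (a single address is ever used) while all n${\times}$m cells stay accessible in one cycle.

```python
# Behavioral sketch of mapping method 1 (Fig. 2), under illustrative assumptions.

class OneRowBram:
    """Models a BRAM that stores a single PIM row of m weight bits."""
    def __init__(self, row_bits):
        self.mem = {0: list(row_bits)}   # only address 0 is ever used -> low utilization

    def read(self, enable):
        # The AND of input and weight is folded into the read: an inactive
        # wordline input (enable = 0) contributes an all-zero word.
        return self.mem[0] if enable else [0] * len(self.mem[0])

def bram_row_dot_product(weights, input_bits):
    """weights: n x m binary PIM array; input_bits: the n wordline inputs."""
    n, m = len(weights), len(weights[0])
    brams = [OneRowBram(weights[r]) for r in range(n)]        # one BRAM per PIM row
    words = [brams[r].read(input_bits[r]) for r in range(n)]  # read all BRAMs in parallel
    # Adder `c' gathers the c-th bit of every BRAM word and produces Psum[c].
    return [sum(words[r][c] for r in range(n)) for c in range(m)]

w = [[1, 0], [1, 1], [0, 1], [1, 0]]
print(bram_row_dot_product(w, [1, 1, 0, 1]))   # [3, 1]
```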

2. Weight Mapping on Flip-flops

The second method for the PIM array mapping is to use flip-flops (Fig. 3). The `n' weights located in each PIM array column are mapped to a flip-flop vector. If the size of the PIM array is `n' (rows) ${\times}$ `m' (columns), `m' flip-flop vectors, each of size `n', are required. The `n'${\times}$`m' weights are pre-stored in the flip-flop array, and an AND gate is dedicated to each flip-flop. After the weights stored in the flip-flops are AND-ed with the input bits, the partial products are accumulated in a backend adder. In the same manner as a PIM array, where inputs fed to the wordline rows are broadcast to all the columns, `Input[n-1:0]' is shared by all the flip-flop vectors. Such a method can fully utilize all the instantiated flip-flops, but using flip-flops for weight storage is a resource burden, considering that sequential elements typically require much larger hardware resources than dense memory cells.

Fig. 3. PIM mapping method 2: Weight mapping on flip-flops.
../../Resources/ieie/JSTS.2024.24.3.218/fig3.png
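A compact behavioral sketch of this flip-flop mapping, with illustrative names, is given below: each column keeps its `n' weights in a flip-flop vector, the input bits are broadcast to every column, and each column's AND results are reduced by its own adder.

```python
# Compact sketch of mapping method 2 (Fig. 3), with illustrative names.

def ff_dot_product(weight_columns, input_bits):
    """weight_columns: m vectors of n flip-flop bits; input_bits: n broadcast bits."""
    return [sum(w & x for w, x in zip(col, input_bits)) for col in weight_columns]

cols = [[1, 1, 0, 1], [0, 1, 1, 0]]            # same 4 x 2 weights, stored column-wise
print(ff_dot_product(cols, [1, 1, 0, 1]))      # [3, 1]

# Rough resource intuition for an n x m array: n*m weight flip-flops and n*m AND
# gates, which is consistent with the register counts reported later in Table 2.
```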

3. Input Enumeration-based Dot Product

The third method for the PIM array mapping is to perform an input enumeration-based dot product (Fig. 4). The motivation for this method is the following observation: contrary to an ASIC, whose layout is fixed once fabrication is finished, an FPGA is programmable. We cannot modify the schematic of an ASIC after fabrication, but we can change both the combinational logic on the LUT-based configurable logic blocks and the routing information on the switch blocks of an FPGA.

As described in Fig. 4, the memory cells in the PIM array hold weights. In Step 1, we substitute the read operation of the cells with an AND operation. After the weights are AND-ed with the inputs, the partial products are summed by the adder tree, thereby generating the psum value. Because the FPGA is programmable and the weights can be re-mapped for other inference tasks, the weights can be fixed during inference; the AND operations are then replaced by input enumeration, as described in Step 2. If the weight value is `1', the corresponding input value is enumerated and passed to the backend adder tree. Otherwise, if the weight value is `0', the AND operation for the binary multiplication by zero can be eliminated, so the corresponding input value can be ignored. Afterwards, the psum from the enumerated inputs is generated by the adder tree.

Fig. 4. PIM mapping method 3: Input enumeration-based dot product.
../../Resources/ieie/JSTS.2024.24.3.218/fig4.png
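The sketch below, with illustrative names, shows the input enumeration in Python: because the weights are frozen when the FPGA is programmed, each column only needs to enumerate the input positions whose weight is `1' and sum them, so the AND gates and the weight storage disappear and the adder tree shrinks to that popcount width.

```python
# Minimal sketch of mapping method 3 (Fig. 4); names are illustrative.

def build_ie_column(weight_column):
    ones = [i for i, w in enumerate(weight_column) if w == 1]   # fixed at FPGA compile time
    def column_psum(input_bits):
        return sum(input_bits[i] for i in ones)                 # reduced-width reduction
    return column_psum

cols = [[1, 1, 0, 1], [0, 1, 1, 0]]
columns = [build_ie_column(c) for c in cols]
print([psum([1, 1, 0, 1]) for psum in columns])   # [3, 1], same result with no AND stage
```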

If the target neural network fits in the area of the PIM system-on-chip (SoC), all the weights are stored in the PIM arrays of the chip and the weight parameters do not need to be modified. Therefore, all the weights with value `1' can be mapped to the configurable logic blocks on the FPGA, and the input enumeration-based dot product can be performed seamlessly at a low cost in FPGA resources. However, if the target neural network does not fit in the chip area and the weight size is larger than the capacity of the PIM arrays, the weights have to be modified during the computation. To evaluate such a condition on the FPGA, the FPGA would have to be reprogrammed; modifying the configurable logic blocks and the switch blocks in real time is not possible due to the large reconfiguration latency. As a result, the input enumeration-based dot product method can be used only when we aim to evaluate the condition where all the weights can be uploaded to the PIM SoC and do not need to be updated.

4. Evaluation of Large-sized PIM SoC

If the PIM SoC does not fit in a single FPGA chip, multiple FPGA chips are required for the evaluation. Moreover, as modern custom hardware chips become larger, evaluation platforms using a large number of FPGAs must be considered. We first adopted a mapping method of a neural network on an FPGA cluster [5] (Fig. 5(a)). We use TC-ResNet8 as an example. Among the 8 layers, the first 6 layers (Layer#0-5) are mapped on FPGA#0, and the remaining layers (Layer#6-8) are computed on FPGA#1 because all the weights cannot be stored in a single FPGA chip. This is a simple example, and other networks with various mapping methods are possible because the implementation approaches of neural network tiling and its mapping on the FPGA are orthogonal to mimicking the PIM array using FPGA resources.

In addition to the clustering, we added extra components to each FPGA for the PIM array evaluation. Each PIM array consumes and produces wide data words when communicating with other PIM arrays. If the PIM SoC datapath is clustered into several parts and distributed across multiple FPGAs, the communication bandwidth becomes limited. Therefore, the number of bits from/to the PIM array must be reduced for the off-chip communication. A parallel-to-serial converter is a well-known, efficient way to reduce/restore the number of data bits (Fig. 5(b)). FPGA chips usually provide many input/output ports, and hence multiple parallel-to-serial converters can be used simultaneously. Furthermore, the limited communication bandwidth leads to an imbalance between the computation and communication performance. If the number of communication data bits is much larger than the width of the input/output interface ports, the PIM array datapath is stalled. To analyze the operation of the datapath, the number of active clock cycles, excluding stalled cycles, needs to be checked. Hence, each FPGA includes a clock (CLK) counter to measure the effective computing clock cycles by eliminating the effect of the inter-FPGA communication. The latency degradation due to the stalls does not cause any problem, because the purpose of this multi-FPGA system is not to realize real-time computation but rather to simulate the PIM SoC during the active clock cycles, thereby verifying the functional correctness of the PIM SoC. Additionally, the CLK counter and the parallel-to-serial converters occupy only a small part of the FPGA resources, so the logic overhead is negligible.

Fig. 5. Evaluation of large-sized PIM SoC: (a) Mapping example of a neural network (TC-ResNet8) on an FPGA cluster; (b) Extra components for the PIM array evaluation.
../../Resources/ieie/JSTS.2024.24.3.218/fig5.png
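The following behavioral sketch illustrates the two extra components in Fig. 5(b). The link width and the stall model are illustrative assumptions; the point is only that wide PIM outputs are serialized onto a narrow inter-FPGA link while a counter tracks the effective (non-stalled) compute cycles.

```python
# Sketch of the parallel-to-serial converter and the CLK counter (assumed parameters).

def serialize(word, word_bits, link_bits):
    """Parallel-to-serial converter: split a word_bits-wide value into link_bits chunks."""
    mask = (1 << link_bits) - 1
    return [(word >> i) & mask for i in range(0, word_bits, link_bits)]

active_cycles = 0
stall_cycles = 0
for psum_word in (0xABCD1234, 0x00FF00FF):          # wide results leaving a PIM array
    chunks = serialize(psum_word, word_bits=32, link_bits=8)
    active_cycles += 1                               # one effective compute cycle per result
    stall_cycles += len(chunks) - 1                  # datapath waits while the chunks drain
print(active_cycles, stall_cycles)                   # CLK counter reports only active cycles
```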

5. Top-level Evaluation Flow

This subsection describes the top-level evaluation flow of the SRAM PIM SoC. The evaluation system consists of four parts: 1) Target neural networks are analyzed and trained using widely used machine learning frameworks such as PyTorch and TensorFlow, and graphs indicating the sizes/shapes of tensors, the connections between layers, and the computation types for the inference tasks are extracted. 2) The source PIM SoC is characterized; the information on the PIM arrays, clock frequency, and peripheral circuits is analyzed. 3) The resource information of the target FPGA is analyzed, for example, the CLB flip-flops, the CLB LUTs, the BRAM capacity, the DSP slices, and the I/O widths/types. 4) The information extracted from the above three parts is fed to a custom scheduler, which finally generates the PIM array model information for the FPGA evaluation. The behavioral model (RTL) and the multi-FPGA mapping information are then mapped to the target FPGA chip(s). Afterwards, the dataset and trained parameters are applied to the target FPGA(s) for the PIM SoC evaluation.
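The sketch below summarizes the information such a flow would collect, expressed as plain Python data. All field names and the placeholder scheduling heuristic are hypothetical; only the three information sources and the XCVU190 figures in Table 1 come from the text.

```python
# Hypothetical sketch of the scheduler inputs in the top-level evaluation flow.

network_graph = {
    "framework": "PyTorch",
    "layers": [{"name": "conv0", "type": "conv", "weight_shape": (64, 3, 3, 3)}],
}
pim_soc_spec = {
    "array_rows": 256, "array_cols": 256, "num_arrays": 14,
    "clock_mhz": 100, "peripherals": ["adder_tree", "accumulator"],
}
fpga_resources = {"lut": 1074240, "register": 2148480, "bram_tiles": 3780}

def schedule(graph, soc, fpga):
    """Placeholder scheduler: pick a mapping method and a number of FPGA chips."""
    cells = soc["num_arrays"] * soc["array_rows"] * soc["array_cols"]
    fits_one_chip = cells <= fpga["lut"]              # crude capacity check only
    return {
        "layers": len(graph["layers"]),
        "mapping": "IE" if fits_one_chip else "FF",
        "num_fpgas": 1 if fits_one_chip else 2,
    }

print(schedule(network_graph, pim_soc_spec, fpga_resources))
```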

IV. RESULTS

1. Experimental Setup

In this section, we analyze the various FPGA evaluation methods for SRAM PIM SoCs. For the comparison between the mapping methods, we synthesized the gate-level logic and compiled the BRAMs using the Xilinx Vivado tool. The target clock frequency for synthesis is 100 MHz. A higher clock frequency could be applied, but our main goal is not to realize a high-throughput PIM array but only to verify the functional correctness of the PIM SoC. We first analyze the three FPGA mapping methods: 1) BRAM-based approach: Weight mapping on a BRAM row (`BRAM' in the figures and tables), 2) FF-based approach: Weight mapping on flip-flops (`FF' in the figures and tables), and 3) IE-based approach: Input enumeration-based dot product (`IE' in the figures and tables). Then, we extend these approaches to evaluate a large-scale PIM array using multiple FPGA chips.

When using the Xilinx Vivado tool, we simply select the specific FPGA board containing the FPGA chip, which avoids complex configuration steps for setting up the experimental conditions. For the analysis, we used the Xilinx VCU110 evaluation board (Table 1). In this FPGA chip, a BRAM tile is 36 Kb and is up to 72 bits wide.

Table 1. Resources on FPGA chip

FPGA (Board)     | LUTs    | Registers | BRAM Tiles
XCVU190 (VCU110) | 1074240 | 2148480   | 3780

2. Results

Resource Breakdown on Mapping Methods: Table 2 analyzes the utilization of FPGA resources depending on 3 PIM array sizes and the 3 mapping methods. The BRAM-based approach (`BRAM') consumes a significantly large number of BRAM tiles, because only a single row of each BRAM array can be used for the parallel access of all the PIM array cells (Fig. 2). In this approach, the memory read already performs the binary multiplication, and hence LUTs for the AND gates are not required. On the other hand, the FF-based approach (`FF') replaces the memory cells with flip-flops (Fig. 3). Weights stored in the flip-flops are AND-ed with inputs for the binary multiplication, so slice LUTs for the AND gates are utilized. The BRAM-/FF-based approaches sum the partial products generated by the AND gates using the adder tree. In contrast, the IE-based approach (`IE') replaces the multiplication with selective input enumeration (Fig. 4). Such a method eliminates the flip-flops for the weight storage by fixing the weight status (0/1), and it reduces the width of the adder tree by eliminating a number of input/weight pairs. Therefore, the IE method does not use registers or AND gates; it only requires the adder tree with a reduced popcount width. In the IE method, we assumed that the density of `1' in the weight tensor is equal to 0.5.

Table 2. Breakdown of used FPGA resources depending on PIM array size. Methods – ‘BRAM’: Weight mapping on a BRAM row (Fig. 2). ‘FF’: Weight mapping on flip-flops (Fig. 3). ‘IE’: Input enumeration-based dot product (Fig. 4).

Size    | Method | Slice LUTs (AND) | Slice LUTs (Adder Tree) | Slice Registers | BRAM Tiles
128×128 | BRAM   | -                | 19584                   | -               | 256
128×128 | FF     | 16384            | 19584                   | 16384           | -
128×128 | IE     | -                | 9984                    | -               | -
256×256 | BRAM   | -                | 81664                   | -               | 1024
256×256 | FF     | 65536            | 81664                   | 65536           | -
256×256 | IE     | -                | 39168                   | -               | -
512×512 | BRAM   | -                | 358400                  | -               | 4096
512×512 | FF     | 262144           | 358400                  | 262144          | -
512×512 | IE     | -                | 166328                  | -               | -

Input Enumeration-based Dot Product: Fig. 6 analyzes the utilization of the slice LUTs on the FPGA chip. As explained in Section 3.3, the IE-based approach enumerates only the input values that are multiplied by weight `1', which eliminates the AND-gate-based multiplication and minimizes the reduction width of the backend adder tree. The number of aggregated inputs is equal to the number of corresponding non-zero weights; in other words, the reduction width depends on the density of `1' in the weight tensor. To study the resource utilization depending on the density of the non-zero weights (Fig. 6), we used a 256${\times}$256 PIM array. The adder tree is implemented on the FPGA chip (Section 4.1) using slice LUTs only, so the number of LUTs increases linearly with the density of `1' in the weight tensor.

Fig. 6. The number of Slice LUTs for a 256${\times}$256 PIM array with input enumeration-based dot product on FPGA. X-axis indicates the density of `1's in weights. Method - `IE': Input enumeration-based dot product (Fig. 4).
../../Resources/ieie/JSTS.2024.24.3.218/fig6.png
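A small sketch of this scaling is given below, under the assumption that the adder-tree cost is proportional to its reduction width: the number of enumerated inputs per column equals the weight density times the number of rows, so the LUT count grows linearly with the density.

```python
# Why the IE-based LUT count grows linearly with weight density (illustrative model).

def enumerated_inputs_per_column(n_rows, density):
    return int(n_rows * density)        # expected popcount width of one column's adder tree

for d in (0.1, 0.3, 0.5, 0.7, 0.9):
    width = enumerated_inputs_per_column(256, d)
    print(f"density {d:.1f}: ~{width} inputs/column, ~{width * 256} inputs per 256x256 array")
```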

Real Benchmark Evaluation: We perform the PIM array evaluation with the simple binary CNN benchmark described in [6]. The network consists of 9 layers, but we assumed that the first and the last layers are performed on the host CPU because the binary quantization [7,8] is usually applied to the middle layers. The accelerator in [6] used analog computation with a capacitor-based accumulator near the memory array, but we slightly modified it to a PIM array design where the MAC computation is performed in 256${\times}$256 PIM arrays. As a result, 14 256${\times}$256 PIM arrays, which can hold all the weights of the network, are used for the computation. We applied the IE-based method to the 14 PIM arrays, which account for only 49% of the LUTs on the XCVU190 FPGA chip.

Multi-FPGA Benchmark: In the previous subsection, we assumed that the PIM SoC targets a small and simple neural network inference task, which is indeed a well-known target application of in-memory computing. However, recent PIM SoCs also target complex neural networks, and hence PIM chips include a larger number of PIM arrays than previous PIM SoC architectures.

We evaluated a PE of the PIMCA architecture [9], which consists of 18 256${\times}$128 PIM arrays. We used 3 XCVU190 FPGA chips with the FF-based method, because the PE size is much larger than the resources of a single FPGA chip. Table 3 shows the implementation result for the multi-FPGA evaluation case. Each FPGA chip includes 6 PIM arrays, a 513 Kb global buffer (GLB) for activations, and peripherals and interfaces (Fig. 5). The inter-PE adder tree sums the psums from the different PIM array bunches and FPGAs. The psums are first sent to FPGA chip #1, so only chip #1 needs the inter-PE adder tree; the other chips do not require it.

Table 3. Evaluation of multi-FPGA scenario with 18 PIM arrays

FPGA | Module          | Units | LUTs [%] | Registers [%] | BRAM Tiles [%]
1    | PIM 256×128     | 6     | 73.3     | 18.3          | -
1    | GLB             | -     | -        | -             | 0.0
1    | Peri.+Interface | -     | 0.0      | -             | -
1    | Inter-PE Adder  | 256   | 0.0      | -             | -
2    | PIM 256×128     | 6     | 73.3     | 18.3          | -
2    | GLB             | -     | -        | -             | 0.0
2    | Peri.+Interface | -     | 0.0      | -             | -
3    | PIM 256×128     | 6     | 73.3     | 18.3          | -
3    | GLB             | -     | -        | -             | 0.0
3    | Peri.+Interface | -     | 0.0      | -             | -

V. CONCLUSIONS

In this paper, we analyzed methods to evaluate digital SRAM processing-in-memory hardware accelerators on FPGA. Based on the three mapping schemes, 1) the BRAM-based method, 2) the FF-based method, and 3) the IE-based method, we analyzed the resource utilization on an FPGA chip. We further extended the mapping methods to a larger PIM SoC case using multiple FPGA chips. Considering that the SRAM PIM array cannot be implemented directly in the BRAM tiles on FPGAs, our main contribution is to mimic the SRAM PIM array using FPGA resources, thereby enabling verification of the functional correctness of PIM SoCs on FPGA chips before the expensive fabrication steps.

ACKNOWLEDGMENTS

This work was supported by the Sogang University Research Grant of 2023 (202310030.01) (10%) and partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2022R1F1A1070414, 90%). The EDA tool was supported by the IC Design Education Center (IDEC), Korea.

References

[1] Y.-D. Chih et al., “An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In Memory Macro in 22nm for Machine-Learning Edge Applications,” IEEE International Solid-State Circuits Conference, 2021, p. 252.
[2] H. Fujiwara et al., “A 5-nm 254-TOPS/W 221-TOPS/mm2 Fully-Digital Computing-in-Memory Macro Supporting Wide-Range Dynamic-Voltage-Frequency Scaling and Simultaneous MAC and Write,” IEEE International Solid-State Circuits Conference, 2022, p. 186.
[3] S. Ryu et al., “BitBlade: Energy-efficient variable bit-precision hardware accelerator for quantized neural networks,” IEEE Journal of Solid-State Circuits, 2022.
[4] M. Nazemi et al., “NullaNet: Training deep neural networks for reduced-memory-access inference,” arXiv, 2018.
[5] S. Biookaghazadeh et al., “Toward multi-FPGA acceleration of the neural networks,” ACM Journal on Emerging Technologies in Computing Systems, 2021, p. 1.
[6] D. Bankman et al., “An Always-On 3.8 μJ/86% CIFAR-10 mixed-signal binary CNN accelerator with all memory on chip in 28-nm CMOS,” IEEE Journal of Solid-State Circuits, 2018, p. 158.
[7] M. Rastegari et al., “XNOR-Net: ImageNet classification using binary convolutional neural networks,” European Conference on Computer Vision, 2016.
[8] A. Bulat et al., “XNOR-Net++: Improved Binary Neural Networks,” British Machine Vision Conference, 2019.
[9] B. Zhang et al., “PIMCA: A Programmable In-Memory Computing Accelerator for Energy-Efficient DNN Inference,” IEEE Journal of Solid-State Circuits, 2022, p. 1436.
Sungju Ryu
../../Resources/ieie/JSTS.2024.24.3.218/au1.png

Sungju Ryu is currently an Assistant Professor in the Department of System Semiconductor Engineering at Sogang University, Seoul, Republic of Korea. Before joining Sogang, he was an Assistant Professor in the School of Electronic Engineering and the Department of Next-Generation Semiconductor at Soongsil University from 2021 to 2023. In 2021, he was a Staff Researcher in the AI&SW Research Center of Samsung Advanced Institute of Technology (SAIT), Suwon, Republic of Korea. At SAIT, he focused on computer architecture design. He received the B.S. degree in Electrical Engineering from Pusan National University, Busan, Republic of Korea, in 2015, and the Ph.D. degree in Creative IT Engineering from Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea, in 2021. His current research interests include energy-efficient neural processing units and processing-in-memory.
