Mobile QR Code QR CODE

  1. (Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, Korea)
  2. (Department of Electronic Engineering, Hanyang University, Seoul, Korea)



Sparse matrix-vector multiplication (SpMV), FPGA accelerator, bandwidth utilization, load balancing, offline pre-processing

I. INTRODUCTION

Sparse matrix-vector multiplication (SpMV) is a core computational kernel that appears in scientific computing, machine learning, and graph analytics [1- 3]. While efficient hardware utilization is a universal challenge across diverse domain-specific accelerators, including matrix multiplication [4] and image processing engines [5], SpMV presents unique difficulties: its memory-intensive nature and irregular data access patterns make it difficult to fully exploit available memory bandwidth [6]. To address this, several FPGA-based SpMV accelerators have been proposed. Their performance is commonly evaluated using bandwidth utilization (BU), defined as the ratio of computational throughput to peak available memory bandwidth [6], as illustrated in Fig. 1. To intuitively compare how effectively each accelerator exploits its available bandwidth, the achieved BU can be expressed as a percentage of the theoretical maximum.

Under this metric, Fowers et al. [7] achieved approximately 50% on a custom FPGA PCIe card with DDR3. Subsequent works have improved this ratio through domain-specific architectures [8], matrix reordering [9], and vector duplication [10], reaching up to 77%. However, most of these accelerators were designed for specific hardware configurations with limited scalability to increasing DRAM bandwidth, and their achieved BU degrades significantly–by up to 50%–when processing large sparse matrices [12].

In our prior work [12], we presented an FPGA-based DRAM bandwidth-scalable SpMV accelerator that scales processing element (PE) lines proportionally with off-chip memory bandwidth. Through offline pre-processing, it eliminates bank conflicts and read-after-write (RAW) dependencies, achieving an average BU reaching 89% of the theoretical peak. However, evaluating a larger set of benchmark matrices revealed a previously unaddressed bottleneck: the column-index-based equal partitioning scheme. This scheme divides the column range equally among PE lines without considering the actual nonzero distribution. When nonzero elements are concentrated in a specific column range, some PE lines receive significantly more computation. Since a block completes only when all PE lines finish, the slowest PE line dictates the overall block execution time, leaving faster PE lines idle and wasting bandwidth. This problem is further amplified as the hardware scales, making the partitioning more susceptible to localized nonzero concentration.

To address this issue, this paper first analyzes PE line load imbalance caused by column-equal partitioning using 29 SuiteSparse benchmark matrices [11], demonstrating that non-uniform column-wise nonzero distributions lead to idle PE line cycles and measurable BU degradation. Based on this analysis, we propose a nonzero-aware PE line partitioning strategy that assigns column ranges based on the actual number of nonzero elements to equalize workloads across PE lines. While the offline pre-processing in [12] focused on the rearrangement stage to eliminate bank conflicts and RAW dependencies, this work targets the preceding partitioning stage, which was not optimized in [12]. Consequently, the proposed method requires no modifications to the on-chip hardware and is fully compatible with the existing rearrangement scheme. Cycle-accurate simulation on 29 matrices across 2, 4, and 8 PE line configurations shows average BU improvements of 17.2%, 65.4%, and 113.7%, respectively, with zero degradation across all cases, confirming that the method preserves the bandwidth scalability of the baseline accelerator.

Fig. 1. Background on SpMV: (a) SpMV and its memory-bound nature, (b) BU: the key metric for SpMV accelerators.

../../Resources/ieie/JSTS.2026.26.3.228/fig1.png

II. BASELINE SPMV ACCELERATOR ARCHITECTURE

1. Overall Architecture

Fig. 2 illustrates the architecture of the baseline SpMV accelerator. It consists of a data distributor that reads matrix and vector data from off-chip DRAM, multiple PE lines that independently perform multiply-accumulate (MAC) operations, and an output updater that collects partial sums (PSUM) into a dedicated buffer. Each PE line contains a matrix buffer, a two-bank vector buffer, and two parallel MAC units, and the number of PE lines scales proportionally with the available off-chip memory bandwidth (e.g., 2, 4, and 8 PE lines for 64, 128, and 256 B/cycle interfaces, respectively).

2. Offline Pre-processing

The accelerator relies on an offline pre-processing stage to transform the input sparse matrix into a format that maximizes hardware utilization during online execution. The pre-processing consists of two steps: matrix partitioning and matrix rearrangement.

In the matrix partitioning step, the input matrix is divided into blocks whose dimensions are constrained by the on-chip vector buffer and PSUM buffer sizes. Each block is then assigned to the PE lines based on its column range. In the baseline design, this assignment follows a column-equal partitioning scheme: the column range of each block is divided equally among the available PE lines, so that each PE line is responsible for a contiguous and equally sized column range, as shown in Fig. 3.

In the matrix rearrangement step, the nonzero elements within each PE line's assigned portion are rearranged to avoid RAW dependencies and bank conflicts. The algorithm prioritizes vector element reuse between the two MAC units to maximize PE utilization, ensuring that each pair accesses different banks in the multi-bank vector buffer.

Fig. 2. Overall architecture of the baseline SpMV accelerator.

../../Resources/ieie/JSTS.2026.26.3.228/fig2.png

Fig. 3. Column-equal PE line partitioning and matrix rearrangement example.

../../Resources/ieie/JSTS.2026.26.3.228/fig3.png

3. Bandwidth Utilization Formulation

The performance of the accelerator is evaluated using bandwidth utilization (BU), defined as the ratio of computational throughput to off-chip memory bandwidth (GB/s) [4, 10]. Since each nonzero element undergoes two floating-point operations, an SpMV task with $NNZ$ nonzero elements yields $2 \times NNZ$ floating-point operations. The BU is calculated as

(1)
$BU = \frac{2 \times NNZ}{Bandwidth \times C_{total}},$

where $Bandwidth$ is the off-chip memory bandwidth in bytes per cycle and $C_{total}$ is the total execution cycle count. The accelerator processes each SpMV task in three phases–vector loading, block execution, and output storing–so the total execution cycle count is decomposed as

(2)
$C_{total} = C_{ldv} + C_{exe} + C_{sty} + C_{PIPE},$

where $C_{ldv}$, $C_{exe}$, and $C_{sty}$ represent the cycles consumed for loading vector elements, processing all matrix blocks, and storing output results, respectively. $C_{PIPE}$ accounts for pipeline latency. In the vector loading phase, $N_v$ vector elements of $B_v$ bytes each are transferred from off-chip memory

(3)
$C_{ldv} = \frac{N_v \times B_v}{Bandwidth}.$

In the block execution phase, each block is processed in parallel across PE lines, and the block completes when the slowest PE line finishes. Therefore $C_{exe}$ is modeled as the sum of the critical paths across all blocks

(4)
$C_{exe} = \sum_{i=0}^{N_{block}-1} \max_{0 \le j \le (P-1)} (C_{i,j}),$

where $N_{block}$ is the total number of matrix blocks, $P$ is the total number of PE lines, and $C_{i,j}$ denotes the execution cycles of the $j$-th PE line in the $i$-th block. In the output storing phase, the results for $Block_{height}$ output rows are written back

(5)
$C_{sty} = \frac{Block_{height} \times B_v}{Bandwidth}.$

Ideally, when the workload is perfectly balanced, the accelerator minimizes the $C_{exe}$ term in the equation, thereby achieving the theoretical maximum BU. For example, given a 64 B/cycle interface and 4 MAC units, the minimum execution cycle is $C_{total} = NNZ/4$, yielding a maximum BU of 0.125 FLOP/Byte. However, any load imbalance among PE lines causes the slowest PE line to increase the maximum execution cycle disproportionately, directly inflating $C_{exe}$ and degrading the BU. This architectural vulnerability motivates the in-depth PE line imbalance analysis and the proposed optimization detailed in the following section.

III. PE LINE IMBALANCE ANALYSIS AND NONZERO-AWARE PARTITIONING

1. PE Line Imbalance under Column-equal Partitioning

To quantify the impact of PE line load imbalance on BU, we define a weighted PE line imbalance metric. Since the architecture operates on a synchronized block-level execution model, the difference between the maximum and minimum workload among PE lines directly translates to the hardware idle time experienced by the faster PE lines. For each matrix block $i$, the per-block imbalance ratio $IR_i$ is formulated as

(6)
$IR_i = \frac{\max_{0 \le j \le (P-1)} (NNZ_{i,j}) - \min_{0 \le j \le (P-1)} (NNZ_{i,j})}{\max_{0 \le j \le (P-1)} (NNZ_{i,j})},$

where $NNZ_{i,j}$ denotes the number of nonzero elements assigned to PE line $j$ in block $i$. For an empty block ($NNZ_i = 0$), $IR_i$ is defined as 0. An $IR_i$ of 0.0 indicates a perfectly balanced workload. To evaluate the overall imbalance of the entire matrix, these per-block ratios are aggregated into a weighted average, giving higher weight to blocks with larger nonzero counts since they dominate the total execution time

(7)
$IR_{weighted} = \sum_{i=0}^{N_{block}-1} \frac{NNZ_i}{NNZ} \times IR_i,$

where $NNZ_i$ is the total number of nonzero elements in block $i$.

As shown in Fig. 4, evaluating 29 SuiteSparse benchmark matrices [11] (detailed in Section IV-1) under column-equal partitioning reveals a clear negative correlation: as imbalance increases, BU decreases consistently. This confirms that PE line load imbalance is a primary source of BU degradation. Notably, most matrices exhibit more than 20% BU degradation from the theoretical maximum, indicating that this is not a marginal issue but a widespread problem across diverse matrix structures. Moreover, this imbalance worsens as the hardware scales. For the same matrix, the 8 PE line configuration generally exhibits higher imbalance and lower BU than the 2 PE line configuration. Architecturally, this occurs because increasing the PE line count narrows the column ranges assigned to each PE line, making the system highly susceptible to localized nonzero concentration. These observations collectively motivate the need for a workload-aware partitioning strategy that can maintain high BU across all PE line configurations.

Fig. 4. Weighted PE line imbalance ratio versus bandwidth utilization for 29 benchmark matrices under column-equal partitioning with 2, 4, and 8 PE line configurations.

../../Resources/ieie/JSTS.2026.26.3.228/fig4.png

2. Nonzero-aware PE Line Partitioning

To address this imbalance, we propose a nonzero-aware PE line partitioning strategy that sets partition boundaries based on the cumulative nonzero distribution within each block. The procedure constructs a column-wise nonzero histogram, computes its cumulative sum, and places the partition boundary at the column where the cumulative count reaches $NNZ_i/P$, ensuring that each PE line receives approximately the same number of nonzero elements. The boundary is then aligned to the memory access granularity of 8 elements to avoid misaligned vector accesses, and monotonicity is enforced to prevent degenerate partitions. Finally, each PE line's assigned column range is checked against the vector buffer limit (VBL), and boundaries are clamped if necessary to satisfy the constraint.

Fig. 5 illustrates the difference between column-equal and nonzero-aware partitioning for a representative matrix block. Column-equal partitioning yields an imbalance ratio of 0.91 (23 vs. 2 nonzero elements), whereas nonzero-aware partitioning reduces it to 0.08 (12 vs. 13). The proposed strategy modifies only the block partitioning step within the offline pre-processing stage, requiring no changes to the hardware architecture.

Fig. 5. Comparison of column-equal and nonzero-aware partitioning: NNZ distribution, PE line assignment, and execution timeline.

../../Resources/ieie/JSTS.2026.26.3.228/fig5.png

IV. EVALUATION RESULTS

1. Experimental Setup

The proposed strategy is evaluated using a cycle-accurate simulator modeling the accelerator, configured with a total buffer size of 512KB. All computations use FP64 values and INT32 indices in COO format. We evaluate three configurations with 2, 4, and 8 PE lines to examine the bandwidth scalability. Table 1 lists the 29 benchmark matrices from the SuiteSparse collection [11] used in the evaluation, covering diverse application domains and sparsity patterns.

Table 1. Benchmark matrices from the suite sparse matrix collection [11].

Matrix Rows Cols NNZ Kind
ASIC_680k 682,862 682,862 3,871,773 Circuit
audikw_1 943,695 943,695 77,651,847 Structural
bbmat 38,744 38,744 1,771,722 CFD
cant 62,451 62,451 4,007,383 2D/3D
Chebyshev4 68,121 68,121 5,377,761 Structural
com-Youtube 1,134,890 1,134,890 5,975,248 Graph
consph 83,334 83,334 6,010,480 2D/3D
cop20k_A 121,192 121,192 2,624,331 2D/3D
crankseg_2 63,838 63,838 14,148,858 Structural
eu-2005 862,664 862,664 19,235,140 Graph
gupta3 16,783 16,783 9,323,427 Optimization
human_gene1 22,283 22,283 24,669,643 Graph
in-2004 1,382,908 1,382,908 16,917,053 Graph
ldoor 952,203 952,203 46,522,475 Structural
mip1 66,463 66,463 10,352,819 Optimization
pdb1HYS 36,417 36,417 4,344,765 Graph
PR02R 161,070 161,070 8,185,136 CFD
radiation 223,104 223,104 7,637,688 Semiconductor
raefsky1 3,242 3,242 294,276 CFD
rail4284 4,284 1,096,894 11,284,032 LP
rajat30 643,994 643,994 6,175,377 Circuit
rma10 46,835 46,835 2,374,001 CFD
scircuit 170,998 170,998 958,936 Circuit
Stanford_Berkeley 683,446 683,446 7,583,376 Graph
thermomech_dK 204,316 204,316 2,846,228 Thermal
TSOPF_RS_b2383 38,120 38,120 16,171,169 Power Network
TSOPF_RS_b300_c3 42,138 42,138 4,413,449 Power Network
webbase-1M 1,000,005 1,000,005 3,105,536 Graph
wikipedia-20051105 1,634,989 1,634,989 19,753,078 Graph

2. BU Improvement

Fig. 6 presents the BU before and after applying the proposed nonzero-aware partitioning for (a) 2, (b) 4, (c) 8 PE line configurations. The average BU improves by 17.2% with 2 PE lines, 65.4% with 4 PE lines, and 113.7% with 8 PE lines. The improvement scales with PE line count because the baseline column-equal partitioning produces increasingly severe imbalance as the column range per PE line shrinks, amplifying the benefit of nonzero-aware partitioning. Notably, no matrix exhibits BU degradation under the proposed strategy in any configuration.

Fig. 6. BU comparison before and after applying nonzero-aware partitioning for (a) 2, (b) 4, and (c) 8 PE line configurations.

../../Resources/ieie/JSTS.2026.26.3.228/fig6.png

3. Relationship between Imbalance Reduction and BU Improvement

Fig. 7 plots the imbalance reduction (defined as $Imbalance_{before} - Imbalance_{after}$) against BU improvement for each matrix across all three PE line configurations. A positive correlation is observed: matrices with larger imbalance reduction achieve greater BU gains, and the trend steepens with more PE lines.

Notable outliers deviate from the general trend for distinct structural reasons. For TSOPF_RS_b2383 with 8 PE lines, the baseline imbalance ratio reaches 0.9997 and the baseline BU is only 0.024, so even a moderate imbalance reduction yields a BU gain exceeding 300%, due to the low baseline. In contrast, com-Youtube exhibits a large imbalance reduction (0.69) but only a 43% BU gain with 8 PE lines, because its large dimension (1.1M $\times$ 1.1M) and extreme sparsity ($\sim$5.3 nonzeros per row) produce many blocks with few nonzeros each. Since each block loads vector elements up to the vector buffer capacity regardless of its nonzero count, $C_{ldv}$ dominates the total execution time and limits the BU impact of reducing $C_{exe}$. For mip1, the weighted PE line imbalance remains high (0.695) even after nonzero-aware partitioning despite a baseline imbalance of 0.990, because its highly skewed nonzero distribution forces the partition boundaries to shift beyond the VBL. The algorithm clamps these boundaries to satisfy the hardware constraint, preventing further imbalance reduction.

Fig. 7. Relationship between imbalance reduction and BU improvement across 2, 4, and 8 PE line configurations.

../../Resources/ieie/JSTS.2026.26.3.228/fig7.png

4. Discussion

The BU improvement originates from reduced PE line idle time: balancing the nonzero count across PE lines decreases $C_{exe}$ while the vector loading cycles ($C_{ldv}$) remain nearly unchanged. The VBL constraint similarly limits improvement for eu-2005, whose large column dimension (862K) causes partition boundaries to be clamped at the VBL. These results suggest that further BU gains for VBL-constrained matrices would require either a larger on-chip vector buffer or a partitioning strategy that subdivides the column range into multiple passes per PE line.

V. CONCLUSIONS

This paper identified PE line imbalance caused by column-equal partitioning as a key factor in bandwidth utilization degradation for the bandwidth-scalable SpMV accelerator presented in [12]. The proposed nonzero-aware PE line partitioning strategy complements [12] by optimizing the partitioning stage of the offline pre-processing, which was not addressed in the original design, and achieves average BU improvements of 17.2%, 65.4%, and 113.7% for 2, 4, and 8 PE line configurations, respectively, with no degradation cases across 29 benchmark matrices. The growing benefit at higher PE line counts confirms that the method preserves and enhances the bandwidth scalability of the baseline architecture without requiring any on-chip hardware modifications.

ACKNOWLEDGEMENTS

This work was partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2025-02214649, Development of Chiplet-Based Interface Technology for Enhancing Data Processing Efficiency in AI Semiconductors and Verification Technology for Large-Scale AI Chiplet Semiconductor Modules) and partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2025-25442469, Development of Edge AI Server Technology with High-Performance and High-Reliability). The EDA tool was supported by the IC Design Education Center.

REFERENCES

1 
M. Manguoglu, A. H. Sameh, O. Schenk, 2011, A domain-decomposing parallel sparse linear system solver, Journal of Computational and Applied Mathematics, Vol. 236, No. 3, pp. 319-325DOI
2 
W. L. Hamilton, 2020, Graph Representation Learning, Morgan & ClaypoolGoogle Search
3 
L. Page, S. Brin, R. Motwani, T. Winograd, 1999, The PageRank citation ranking: Bringing order to the web, Stanford InfoLabGoogle Search
4 
S. Park, H. Kim, J. Kim, 2025, A fault-tolerant GEMM accelerator with online error detection and correction based on systolic array architecture, Journal of Semiconductor Technology and Science, Vol. 25, No. 3DOI
5 
S. Choi, K. J. Lee, Y. Kim, H.-J. Yoo, 2020, A 9.52 ms latency, and low-power streaming depth-estimation processor with shifter-based pipelined architecture for smart mobile devices, Journal of Semiconductor Technology and Science, Vol. 20, No. 3, pp. 255-266DOI
6 
B. Lu, X. Qiao, J. Bhatt, 2019, Redesk: A reconfigurable dataflow engine for sparse kernels on heterogeneous platforms, Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 1-8DOI
7 
J. Fowers, G. Brown, P. Cooke, G. Stitt, 2014, A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication, Proceedings of the IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 36-43DOI
8 
A. K. Jain, H. Omidian, H. Fraisse, M. Benipal, L. Liu, D. Gaitonde, 2020, A domain-specific architecture for accelerating sparse matrix vector multiplication on FPGAs, Proceedings of the 30th International Conference on Field-Programmable Logic and Applications, pp. 127-132DOI
9 
S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, Y. Xie, 2021, Optimized data reuse via reordering for sparse matrix-vector multiplication on FPGAs, Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 1-9DOI
10 
C. Liu, H. Yin, C. Cheng, X. Li, 2023, Towards high-bandwidth-utilization SpMV on FPGAs via partial vector duplication, Proceedings of the 28th Asia and South Pacific Design Automation Conference, pp. 198-203Google Search
11 
T. A. Davis, Y. Hu, 2011, The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, Vol. 38, No. 1, pp. 1-25DOI
12 
H. Kim, E. Ham, S. Park, H. Kim, J. Kim, 2023, A DRAM bandwidth-scalable sparse matrix-vector multiplication accelerator with 89% bandwidth utilization efficiency for large sparse matrix, Proceedings of the IEEE Asian Solid-State Circuits Conference, pp. 1-3DOI
Hyunji Kim
../../Resources/ieie/JSTS.2026.26.3.228/au1.png

Hyunji Kim received her B.S. and M.S. degrees in electronic and electrical engineering from Ewha Womans University, Seoul, South Korea, in 2019 and 2021, respectively. She is currently pursuing a Ph.D. degree at the same university with Digital System Architecture Lab. Her current research interests include domain-specific SoC architecture.

Ji-Hoon Kim
../../Resources/ieie/JSTS.2026.26.3.228/au2.png

Ji-Hoon Kim received his B.S. (summa cum laude) and Ph.D. degrees in electrical engineering and computer science from KAIST, Daejeon, South Korea, in 2004 and 2009, respectively. In 2009, he joined Samsung Electronics, Suwon, South Korea, as a Senior Engineer, and worked on next-generation architecture for 4G communication modem system-on-chip (SoC). From 2018 to 2025, he was a professor in the Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, South Korea. Since 2025, he has been with the Department of Electronic Engineering, Hanyang University, Seoul, South Korea. His current research interests include CPU microarchitecture, domain-specific SoC, and deep neural network accelerators. Dr. Kim served on the Technical Program Committee and Organizing Committee for various international conferences, including the IEEE International Conference on Computer Design (ICCD), the IEEE Asian Solid-State Circuits Conference (A-SSCC), and the IEEE International Solid-State Circuits Conference (ISSCC). He was a co-recipient of the Distinguished Design Award at the 2019 IEEE A-SSCC, and a recipient of the Best Design Award at 2007 Dongbu HiTek IP Design Contest, the First Place Award at 2008 International SoC Design Conference (ISOCC) Chip Design Contest, and the IEEE/IEIE Joint Award for Young Scientist and Engineer. He also serves as an Associate Editor for the IEEE Transactions on Circuits and Systems II: Express Briefs.