Nonzero-aware PE Line Partitioning for Improved Bandwidth Utilization in DRAM Bandwidth-scalable
SpMV Accelerators
Hyunji Kim1
Ji-Hoon Kim2,*
-
(Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul,
Korea)
-
(Department of Electronic Engineering, Hanyang University, Seoul, Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index Terms
Sparse matrix-vector multiplication (SpMV), FPGA accelerator, bandwidth utilization, load balancing, offline pre-processing
I. INTRODUCTION
Sparse matrix-vector multiplication (SpMV) is a core computational kernel that appears
in scientific computing, machine learning, and graph analytics [1-
3]. While efficient hardware utilization is a universal challenge across diverse domain-specific
accelerators, including matrix multiplication [4] and image processing engines [5], SpMV presents unique difficulties: its memory-intensive nature and irregular data
access patterns make it difficult to fully exploit available memory bandwidth [6]. To address this, several FPGA-based SpMV accelerators have been proposed. Their
performance is commonly evaluated using bandwidth utilization (BU), defined as the
ratio of computational throughput to peak available memory bandwidth [6], as illustrated in Fig. 1. To intuitively compare how effectively each accelerator exploits its available bandwidth,
the achieved BU can be expressed as a percentage of the theoretical maximum.
Under this metric, Fowers et al. [7] achieved approximately 50% on a custom FPGA PCIe card with DDR3. Subsequent works
have improved this ratio through domain-specific architectures [8], matrix reordering [9], and vector duplication [10], reaching up to 77%. However, most of these accelerators were designed for specific
hardware configurations with limited scalability to increasing DRAM bandwidth, and
their achieved BU degrades significantly–by up to 50%–when processing large sparse
matrices [12].
In our prior work [12], we presented an FPGA-based DRAM bandwidth-scalable SpMV accelerator that scales
processing element (PE) lines proportionally with off-chip memory bandwidth. Through
offline pre-processing, it eliminates bank conflicts and read-after-write (RAW) dependencies,
achieving an average BU reaching 89% of the theoretical peak. However, evaluating
a larger set of benchmark matrices revealed a previously unaddressed bottleneck: the
column-index-based equal partitioning scheme. This scheme divides the column range
equally among PE lines without considering the actual nonzero distribution. When nonzero
elements are concentrated in a specific column range, some PE lines receive significantly
more computation. Since a block completes only when all PE lines finish, the slowest
PE line dictates the overall block execution time, leaving faster PE lines idle and
wasting bandwidth. This problem is further amplified as the hardware scales, making
the partitioning more susceptible to localized nonzero concentration.
To address this issue, this paper first analyzes PE line load imbalance caused by
column-equal partitioning using 29 SuiteSparse benchmark matrices [11], demonstrating that non-uniform column-wise nonzero distributions lead to idle PE
line cycles and measurable BU degradation. Based on this analysis, we propose a nonzero-aware
PE line partitioning strategy that assigns column ranges based on the actual number
of nonzero elements to equalize workloads across PE lines. While the offline pre-processing
in [12] focused on the rearrangement stage to eliminate bank conflicts and RAW dependencies,
this work targets the preceding partitioning stage, which was not optimized in [12]. Consequently, the proposed method requires no modifications to the on-chip hardware
and is fully compatible with the existing rearrangement scheme. Cycle-accurate simulation
on 29 matrices across 2, 4, and 8 PE line configurations shows average BU improvements
of 17.2%, 65.4%, and 113.7%, respectively, with zero degradation across all cases,
confirming that the method preserves the bandwidth scalability of the baseline accelerator.
Fig. 1. Background on SpMV: (a) SpMV and its memory-bound nature, (b) BU: the key
metric for SpMV accelerators.
II. BASELINE SPMV ACCELERATOR ARCHITECTURE
1. Overall Architecture
Fig. 2 illustrates the architecture of the baseline SpMV accelerator. It consists of a data
distributor that reads matrix and vector data from off-chip DRAM, multiple PE lines
that independently perform multiply-accumulate (MAC) operations, and an output updater
that collects partial sums (PSUM) into a dedicated buffer. Each PE line contains a
matrix buffer, a two-bank vector buffer, and two parallel MAC units, and the number
of PE lines scales proportionally with the available off-chip memory bandwidth (e.g.,
2, 4, and 8 PE lines for 64, 128, and 256 B/cycle interfaces, respectively).
2. Offline Pre-processing
The accelerator relies on an offline pre-processing stage to transform the input sparse
matrix into a format that maximizes hardware utilization during online execution.
The pre-processing consists of two steps: matrix partitioning and matrix rearrangement.
In the matrix partitioning step, the input matrix is divided into blocks whose dimensions
are constrained by the on-chip vector buffer and PSUM buffer sizes. Each block is
then assigned to the PE lines based on its column range. In the baseline design, this
assignment follows a column-equal partitioning scheme: the column range of each block
is divided equally among the available PE lines, so that each PE line is responsible
for a contiguous and equally sized column range, as shown in Fig. 3.
In the matrix rearrangement step, the nonzero elements within each PE line's assigned
portion are rearranged to avoid RAW dependencies and bank conflicts. The algorithm
prioritizes vector element reuse between the two MAC units to maximize PE utilization,
ensuring that each pair accesses different banks in the multi-bank vector buffer.
Fig. 2. Overall architecture of the baseline SpMV accelerator.
Fig. 3. Column-equal PE line partitioning and matrix rearrangement example.
3. Bandwidth Utilization Formulation
The performance of the accelerator is evaluated using bandwidth utilization (BU),
defined as the ratio of computational throughput to off-chip memory bandwidth (GB/s)
[4,
10]. Since each nonzero element undergoes two floating-point operations, an SpMV task
with $NNZ$ nonzero elements yields $2 \times NNZ$ floating-point operations. The BU
is calculated as
where $Bandwidth$ is the off-chip memory bandwidth in bytes per cycle and $C_{total}$
is the total execution cycle count. The accelerator processes each SpMV task in three
phases–vector loading, block execution, and output storing–so the total execution
cycle count is decomposed as
where $C_{ldv}$, $C_{exe}$, and $C_{sty}$ represent the cycles consumed for loading
vector elements, processing all matrix blocks, and storing output results, respectively.
$C_{PIPE}$ accounts for pipeline latency. In the vector loading phase, $N_v$ vector
elements of $B_v$ bytes each are transferred from off-chip memory
In the block execution phase, each block is processed in parallel across PE lines,
and the block completes when the slowest PE line finishes. Therefore $C_{exe}$ is
modeled as the sum of the critical paths across all blocks
where $N_{block}$ is the total number of matrix blocks, $P$ is the total number of
PE lines, and $C_{i,j}$ denotes the execution cycles of the $j$-th PE line in the
$i$-th block. In the output storing phase, the results for $Block_{height}$ output
rows are written back
Ideally, when the workload is perfectly balanced, the accelerator minimizes the $C_{exe}$
term in the equation, thereby achieving the theoretical maximum BU. For example, given
a 64 B/cycle interface and 4 MAC units, the minimum execution cycle is $C_{total}
= NNZ/4$, yielding a maximum BU of 0.125 FLOP/Byte. However, any load imbalance among
PE lines causes the slowest PE line to increase the maximum execution cycle disproportionately,
directly inflating $C_{exe}$ and degrading the BU. This architectural vulnerability
motivates the in-depth PE line imbalance analysis and the proposed optimization detailed
in the following section.
III. PE LINE IMBALANCE ANALYSIS AND NONZERO-AWARE PARTITIONING
1. PE Line Imbalance under Column-equal Partitioning
To quantify the impact of PE line load imbalance on BU, we define a weighted PE line
imbalance metric. Since the architecture operates on a synchronized block-level execution
model, the difference between the maximum and minimum workload among PE lines directly
translates to the hardware idle time experienced by the faster PE lines. For each
matrix block $i$, the per-block imbalance ratio $IR_i$ is formulated as
where $NNZ_{i,j}$ denotes the number of nonzero elements assigned to PE line $j$ in
block $i$. For an empty block ($NNZ_i = 0$), $IR_i$ is defined as 0. An $IR_i$ of
0.0 indicates a perfectly balanced workload. To evaluate the overall imbalance of
the entire matrix, these per-block ratios are aggregated into a weighted average,
giving higher weight to blocks with larger nonzero counts since they dominate the
total execution time
where $NNZ_i$ is the total number of nonzero elements in block $i$.
As shown in Fig. 4, evaluating 29 SuiteSparse benchmark matrices [11] (detailed in Section IV-1) under column-equal partitioning reveals a clear negative
correlation: as imbalance increases, BU decreases consistently. This confirms that
PE line load imbalance is a primary source of BU degradation. Notably, most matrices
exhibit more than 20% BU degradation from the theoretical maximum, indicating that
this is not a marginal issue but a widespread problem across diverse matrix structures.
Moreover, this imbalance worsens as the hardware scales. For the same matrix, the
8 PE line configuration generally exhibits higher imbalance and lower BU than the
2 PE line configuration. Architecturally, this occurs because increasing the PE line
count narrows the column ranges assigned to each PE line, making the system highly
susceptible to localized nonzero concentration. These observations collectively motivate
the need for a workload-aware partitioning strategy that can maintain high BU across
all PE line configurations.
Fig. 4. Weighted PE line imbalance ratio versus bandwidth utilization for 29 benchmark
matrices under column-equal partitioning with 2, 4, and 8 PE line configurations.
2. Nonzero-aware PE Line Partitioning
To address this imbalance, we propose a nonzero-aware PE line partitioning strategy
that sets partition boundaries based on the cumulative nonzero distribution within
each block. The procedure constructs a column-wise nonzero histogram, computes its
cumulative sum, and places the partition boundary at the column where the cumulative
count reaches $NNZ_i/P$, ensuring that each PE line receives approximately the same
number of nonzero elements. The boundary is then aligned to the memory access granularity
of 8 elements to avoid misaligned vector accesses, and monotonicity is enforced to
prevent degenerate partitions. Finally, each PE line's assigned column range is checked
against the vector buffer limit (VBL), and boundaries are clamped if necessary to
satisfy the constraint.
Fig. 5 illustrates the difference between column-equal and nonzero-aware partitioning for
a representative matrix block. Column-equal partitioning yields an imbalance ratio
of 0.91 (23 vs. 2 nonzero elements), whereas nonzero-aware partitioning reduces it
to 0.08 (12 vs. 13). The proposed strategy modifies only the block partitioning step
within the offline pre-processing stage, requiring no changes to the hardware architecture.
Fig. 5. Comparison of column-equal and nonzero-aware partitioning: NNZ distribution,
PE line assignment, and execution timeline.
IV. EVALUATION RESULTS
1. Experimental Setup
The proposed strategy is evaluated using a cycle-accurate simulator modeling the accelerator,
configured with a total buffer size of 512KB. All computations use FP64 values and
INT32 indices in COO format. We evaluate three configurations with 2, 4, and 8 PE
lines to examine the bandwidth scalability. Table 1 lists the 29 benchmark matrices from the SuiteSparse collection [11] used in the evaluation, covering diverse application domains and sparsity patterns.
Table 1. Benchmark matrices from the suite sparse matrix collection [11].
|
Matrix
|
Rows
|
Cols
|
NNZ
|
Kind
|
|
ASIC_680k
|
682,862
|
682,862
|
3,871,773
|
Circuit
|
|
audikw_1
|
943,695
|
943,695
|
77,651,847
|
Structural
|
|
bbmat
|
38,744
|
38,744
|
1,771,722
|
CFD
|
|
cant
|
62,451
|
62,451
|
4,007,383
|
2D/3D
|
|
Chebyshev4
|
68,121
|
68,121
|
5,377,761
|
Structural
|
|
com-Youtube
|
1,134,890
|
1,134,890
|
5,975,248
|
Graph
|
|
consph
|
83,334
|
83,334
|
6,010,480
|
2D/3D
|
|
cop20k_A
|
121,192
|
121,192
|
2,624,331
|
2D/3D
|
|
crankseg_2
|
63,838
|
63,838
|
14,148,858
|
Structural
|
|
eu-2005
|
862,664
|
862,664
|
19,235,140
|
Graph
|
|
gupta3
|
16,783
|
16,783
|
9,323,427
|
Optimization
|
|
human_gene1
|
22,283
|
22,283
|
24,669,643
|
Graph
|
|
in-2004
|
1,382,908
|
1,382,908
|
16,917,053
|
Graph
|
|
ldoor
|
952,203
|
952,203
|
46,522,475
|
Structural
|
|
mip1
|
66,463
|
66,463
|
10,352,819
|
Optimization
|
|
pdb1HYS
|
36,417
|
36,417
|
4,344,765
|
Graph
|
|
PR02R
|
161,070
|
161,070
|
8,185,136
|
CFD
|
|
radiation
|
223,104
|
223,104
|
7,637,688
|
Semiconductor
|
|
raefsky1
|
3,242
|
3,242
|
294,276
|
CFD
|
|
rail4284
|
4,284
|
1,096,894
|
11,284,032
|
LP
|
|
rajat30
|
643,994
|
643,994
|
6,175,377
|
Circuit
|
|
rma10
|
46,835
|
46,835
|
2,374,001
|
CFD
|
|
scircuit
|
170,998
|
170,998
|
958,936
|
Circuit
|
|
Stanford_Berkeley
|
683,446
|
683,446
|
7,583,376
|
Graph
|
|
thermomech_dK
|
204,316
|
204,316
|
2,846,228
|
Thermal
|
|
TSOPF_RS_b2383
|
38,120
|
38,120
|
16,171,169
|
Power Network
|
|
TSOPF_RS_b300_c3
|
42,138
|
42,138
|
4,413,449
|
Power Network
|
|
webbase-1M
|
1,000,005
|
1,000,005
|
3,105,536
|
Graph
|
|
wikipedia-20051105
|
1,634,989
|
1,634,989
|
19,753,078
|
Graph
|
2. BU Improvement
Fig. 6 presents the BU before and after applying the proposed nonzero-aware partitioning
for (a) 2, (b) 4, (c) 8 PE line configurations. The average BU improves by 17.2% with
2 PE lines, 65.4% with 4 PE lines, and 113.7% with 8 PE lines. The improvement scales
with PE line count because the baseline column-equal partitioning produces increasingly
severe imbalance as the column range per PE line shrinks, amplifying the benefit of
nonzero-aware partitioning. Notably, no matrix exhibits BU degradation under the proposed
strategy in any configuration.
Fig. 6. BU comparison before and after applying nonzero-aware partitioning for (a)
2, (b) 4, and (c) 8 PE line configurations.
3. Relationship between Imbalance Reduction and BU Improvement
Fig. 7 plots the imbalance reduction (defined as $Imbalance_{before} - Imbalance_{after}$)
against BU improvement for each matrix across all three PE line configurations. A
positive correlation is observed: matrices with larger imbalance reduction achieve
greater BU gains, and the trend steepens with more PE lines.
Notable outliers deviate from the general trend for distinct structural reasons. For
TSOPF_RS_b2383 with 8 PE lines, the baseline imbalance ratio reaches 0.9997 and the
baseline BU is only 0.024, so even a moderate imbalance reduction yields a BU gain
exceeding 300%, due to the low baseline. In contrast, com-Youtube exhibits a large
imbalance reduction (0.69) but only a 43% BU gain with 8 PE lines, because its large
dimension (1.1M $\times$ 1.1M) and extreme sparsity ($\sim$5.3 nonzeros per row) produce
many blocks with few nonzeros each. Since each block loads vector elements up to the
vector buffer capacity regardless of its nonzero count, $C_{ldv}$ dominates the total
execution time and limits the BU impact of reducing $C_{exe}$. For mip1, the weighted
PE line imbalance remains high (0.695) even after nonzero-aware partitioning despite
a baseline imbalance of 0.990, because its highly skewed nonzero distribution forces
the partition boundaries to shift beyond the VBL. The algorithm clamps these boundaries
to satisfy the hardware constraint, preventing further imbalance reduction.
Fig. 7. Relationship between imbalance reduction and BU improvement across 2, 4, and
8 PE line configurations.
4. Discussion
The BU improvement originates from reduced PE line idle time: balancing the nonzero
count across PE lines decreases $C_{exe}$ while the vector loading cycles ($C_{ldv}$)
remain nearly unchanged. The VBL constraint similarly limits improvement for eu-2005,
whose large column dimension (862K) causes partition boundaries to be clamped at the
VBL. These results suggest that further BU gains for VBL-constrained matrices would
require either a larger on-chip vector buffer or a partitioning strategy that subdivides
the column range into multiple passes per PE line.
V. CONCLUSIONS
This paper identified PE line imbalance caused by column-equal partitioning as a key
factor in bandwidth utilization degradation for the bandwidth-scalable SpMV accelerator
presented in [12]. The proposed nonzero-aware PE line partitioning strategy complements [12] by optimizing the partitioning stage of the offline pre-processing, which was not
addressed in the original design, and achieves average BU improvements of 17.2%, 65.4%,
and 113.7% for 2, 4, and 8 PE line configurations, respectively, with no degradation
cases across 29 benchmark matrices. The growing benefit at higher PE line counts confirms
that the method preserves and enhances the bandwidth scalability of the baseline architecture
without requiring any on-chip hardware modifications.
ACKNOWLEDGEMENTS
This work was partly supported by Institute of Information & communications Technology
Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2025-02214649,
Development of Chiplet-Based Interface Technology for Enhancing Data Processing Efficiency
in AI Semiconductors and Verification Technology for Large-Scale AI Chiplet Semiconductor
Modules) and partly supported by Institute of Information & communications Technology
Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2025-25442469,
Development of Edge AI Server Technology with High-Performance and High-Reliability).
The EDA tool was supported by the IC Design Education Center.
REFERENCES
M. Manguoglu, A. H. Sameh, O. Schenk, 2011, A domain-decomposing parallel sparse
linear system solver, Journal of Computational and Applied Mathematics, Vol. 236,
No. 3, pp. 319-325

W. L. Hamilton, 2020, Graph Representation Learning, Morgan & Claypool

L. Page, S. Brin, R. Motwani, T. Winograd, 1999, The PageRank citation ranking:
Bringing order to the web, Stanford InfoLab

S. Park, H. Kim, J. Kim, 2025, A fault-tolerant GEMM accelerator with online
error detection and correction based on systolic array architecture, Journal of Semiconductor
Technology and Science, Vol. 25, No. 3

S. Choi, K. J. Lee, Y. Kim, H.-J. Yoo, 2020, A 9.52 ms latency, and low-power
streaming depth-estimation processor with shifter-based pipelined architecture for
smart mobile devices, Journal of Semiconductor Technology and Science, Vol. 20, No.
3, pp. 255-266

B. Lu, X. Qiao, J. Bhatt, 2019, Redesk: A reconfigurable dataflow engine for
sparse kernels on heterogeneous platforms, Proceedings of the IEEE/ACM International
Conference on Computer-Aided Design, pp. 1-8

J. Fowers, G. Brown, P. Cooke, G. Stitt, 2014, A high memory bandwidth FPGA
accelerator for sparse matrix-vector multiplication, Proceedings of the IEEE 22nd
Annual International Symposium on Field-Programmable Custom Computing Machines, pp.
36-43

A. K. Jain, H. Omidian, H. Fraisse, M. Benipal, L. Liu, D. Gaitonde, 2020,
A domain-specific architecture for accelerating sparse matrix vector multiplication
on FPGAs, Proceedings of the 30th International Conference on Field-Programmable Logic
and Applications, pp. 127-132

S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, Y. Xie, 2021, Optimized
data reuse via reordering for sparse matrix-vector multiplication on FPGAs, Proceedings
of the IEEE/ACM International Conference on Computer-Aided Design, pp. 1-9

C. Liu, H. Yin, C. Cheng, X. Li, 2023, Towards high-bandwidth-utilization
SpMV on FPGAs via partial vector duplication, Proceedings of the 28th Asia and South
Pacific Design Automation Conference, pp. 198-203

T. A. Davis, Y. Hu, 2011, The University of Florida sparse matrix collection,
ACM Transactions on Mathematical Software, Vol. 38, No. 1, pp. 1-25

H. Kim, E. Ham, S. Park, H. Kim, J. Kim, 2023, A DRAM bandwidth-scalable
sparse matrix-vector multiplication accelerator with 89% bandwidth utilization efficiency
for large sparse matrix, Proceedings of the IEEE Asian Solid-State Circuits Conference,
pp. 1-3

Hyunji Kim received her B.S. and M.S. degrees in electronic and electrical engineering
from Ewha Womans University, Seoul, South Korea, in 2019 and 2021, respectively. She
is currently pursuing a Ph.D. degree at the same university with Digital System Architecture
Lab. Her current research interests include domain-specific SoC architecture.
Ji-Hoon Kim received his B.S. (summa cum laude) and Ph.D. degrees in electrical engineering
and computer science from KAIST, Daejeon, South Korea, in 2004 and 2009, respectively.
In 2009, he joined Samsung Electronics, Suwon, South Korea, as a Senior Engineer,
and worked on next-generation architecture for 4G communication modem system-on-chip
(SoC). From 2018 to 2025, he was a professor in the Department of Electronic and Electrical
Engineering, Ewha Womans University, Seoul, South Korea. Since 2025, he has been with
the Department of Electronic Engineering, Hanyang University, Seoul, South Korea.
His current research interests include CPU microarchitecture, domain-specific SoC,
and deep neural network accelerators. Dr. Kim served on the Technical Program Committee
and Organizing Committee for various international conferences, including the IEEE
International Conference on Computer Design (ICCD), the IEEE Asian Solid-State Circuits
Conference (A-SSCC), and the IEEE International Solid-State Circuits Conference (ISSCC).
He was a co-recipient of the Distinguished Design Award at the 2019 IEEE A-SSCC, and
a recipient of the Best Design Award at 2007 Dongbu HiTek IP Design Contest, the First
Place Award at 2008 International SoC Design Conference (ISOCC) Chip Design Contest,
and the IEEE/IEIE Joint Award for Young Scientist and Engineer. He also serves as
an Associate Editor for the IEEE Transactions on Circuits and Systems II: Express
Briefs.