Mobile QR Code QR CODE

Main Menu

The Journal of Semiconductor Technology and Science (JSTS) is an international, peer-reviewed, and open-access journal that is published bimonthly.
- Scope: semiconductor processes, devices, circuits, and MEMS.
- Editor-in-Chief: Prof. Woo Young Choi (ECE, Seoul National University)
- Indexed within Science Citation Index Expanded (SCIE), SCOPUS, Korea Citation Index (KCI), and other databases.

Journal Search

[

Research article

]

JSTS(Journal of Semiconductor Technology and Science)

IEIE Vol. 26, No. 3, p.228-234

ISSN (print) :

1598-1657

ISSN (online) :

2233-4866

Received : 7 Apr. 2026Revised : 29 Apr. 2026Accepted : 29 Apr. 2026

DOI :

https://doi.org/10.5573/JSTS.2026.26.3.228

Nonzero-aware PE Line Partitioning for Improved Bandwidth Utilization in DRAM Bandwidth-scalable SpMV Accelerators

Hyunji Kim¹ Ji-Hoon Kim^2,^*

(Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, Korea)
(Department of Electronic Engineering, Hanyang University, Seoul, Korea)

^* E-mail: jhoonkim@hanyang.ac.kr

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.(www.theieie.org).

Abstract

Sparse matrix-vector multiplication (SpMV) is a memory-bound kernel widely used in scientific computing, machine learning, and graph analytics, where bandwidth utilization (BU) serves as a key performance metric for hardware accelerators. In our prior work, we presented a DRAM bandwidth-scalable SpMV accelerator whose processing element (PE) line count scales with DRAM bandwidth, combined with offline pre-processing to eliminate bank conflicts and data dependencies, achieving an average BU of 89% of the theoretical maximum. However, its column-index-based equal partitioning can cause significant PE line load imbalance when nonzero elements are unevenly distributed across columns, leaving faster PE lines idle and degrading BU–an effect that intensifies as PE line count increases with higher off-chip bandwidth. In this paper, we analyze the relationship between column-wise nonzero distribution and PE line imbalance using 29 SuiteSparse benchmark matrices and propose a nonzero-aware partitioning strategy that assigns column ranges based on the actual number of nonzero elements to equalize workloads across PE lines. While the offline pre-processing in our prior work focused on conflict-free data rearrangement, the proposed method targets the preceding partitioning stage, requiring no modification to the on-chip accelerator hardware. Cycle-accurate simulation on 29 matrices across three configurations (2, 4, and 8 PE lines) shows average BU improvements of 17.2%, 65.4%, and 113.7% for 2, 4, and 8 PE line configurations, respectively, with zero degradation across all matrices and configurations, confirming that the method preserves the bandwidth scalability of the baseline accelerator.

Index Terms

Sparse matrix-vector multiplication (SpMV), FPGA accelerator, bandwidth utilization, load balancing, offline pre-processing

I. INTRODUCTION

Sparse matrix-vector multiplication (SpMV) is a core computational kernel that appears in scientific computing, machine learning, and graph analytics ^[1- ^3]. While efficient hardware utilization is a universal challenge across diverse domain-specific accelerators, including matrix multiplication ^[4] and image processing engines ^[5], SpMV presents unique difficulties: its memory-intensive nature and irregular data access patterns make it difficult to fully exploit available memory bandwidth ^[6]. To address this, several FPGA-based SpMV accelerators have been proposed. Their performance is commonly evaluated using bandwidth utilization (BU), defined as the ratio of computational throughput to peak available memory bandwidth ^[6], as illustrated in Fig. 1. To intuitively compare how effectively each accelerator exploits its available bandwidth, the achieved BU can be expressed as a percentage of the theoretical maximum.

Under this metric, Fowers et al. ^[7] achieved approximately 50% on a custom FPGA PCIe card with DDR3. Subsequent works have improved this ratio through domain-specific architectures ^[8], matrix reordering ^[9], and vector duplication ^[10], reaching up to 77%. However, most of these accelerators were designed for specific hardware configurations with limited scalability to increasing DRAM bandwidth, and their achieved BU degrades significantly–by up to 50%–when processing large sparse matrices ^[12].

In our prior work ^[12], we presented an FPGA-based DRAM bandwidth-scalable SpMV accelerator that scales processing element (PE) lines proportionally with off-chip memory bandwidth. Through offline pre-processing, it eliminates bank conflicts and read-after-write (RAW) dependencies, achieving an average BU reaching 89% of the theoretical peak. However, evaluating a larger set of benchmark matrices revealed a previously unaddressed bottleneck: the column-index-based equal partitioning scheme. This scheme divides the column range equally among PE lines without considering the actual nonzero distribution. When nonzero elements are concentrated in a specific column range, some PE lines receive significantly more computation. Since a block completes only when all PE lines finish, the slowest PE line dictates the overall block execution time, leaving faster PE lines idle and wasting bandwidth. This problem is further amplified as the hardware scales, making the partitioning more susceptible to localized nonzero concentration.

To address this issue, this paper first analyzes PE line load imbalance caused by column-equal partitioning using 29 SuiteSparse benchmark matrices ^[11], demonstrating that non-uniform column-wise nonzero distributions lead to idle PE line cycles and measurable BU degradation. Based on this analysis, we propose a nonzero-aware PE line partitioning strategy that assigns column ranges based on the actual number of nonzero elements to equalize workloads across PE lines. While the offline pre-processing in ^[12] focused on the rearrangement stage to eliminate bank conflicts and RAW dependencies, this work targets the preceding partitioning stage, which was not optimized in ^[12]. Consequently, the proposed method requires no modifications to the on-chip hardware and is fully compatible with the existing rearrangement scheme. Cycle-accurate simulation on 29 matrices across 2, 4, and 8 PE line configurations shows average BU improvements of 17.2%, 65.4%, and 113.7%, respectively, with zero degradation across all cases, confirming that the method preserves the bandwidth scalability of the baseline accelerator.

Fig. 1. Background on SpMV: (a) SpMV and its memory-bound nature, (b) BU: the key metric for SpMV accelerators.

II. BASELINE SPMV ACCELERATOR ARCHITECTURE

1. Overall Architecture

Fig. 2 illustrates the architecture of the baseline SpMV accelerator. It consists of a data distributor that reads matrix and vector data from off-chip DRAM, multiple PE lines that independently perform multiply-accumulate (MAC) operations, and an output updater that collects partial sums (PSUM) into a dedicated buffer. Each PE line contains a matrix buffer, a two-bank vector buffer, and two parallel MAC units, and the number of PE lines scales proportionally with the available off-chip memory bandwidth (e.g., 2, 4, and 8 PE lines for 64, 128, and 256 B/cycle interfaces, respectively).

2. Offline Pre-processing

The accelerator relies on an offline pre-processing stage to transform the input sparse matrix into a format that maximizes hardware utilization during online execution. The pre-processing consists of two steps: matrix partitioning and matrix rearrangement.

In the matrix partitioning step, the input matrix is divided into blocks whose dimensions are constrained by the on-chip vector buffer and PSUM buffer sizes. Each block is then assigned to the PE lines based on its column range. In the baseline design, this assignment follows a column-equal partitioning scheme: the column range of each block is divided equally among the available PE lines, so that each PE line is responsible for a contiguous and equally sized column range, as shown in Fig. 3.

In the matrix rearrangement step, the nonzero elements within each PE line's assigned portion are rearranged to avoid RAW dependencies and bank conflicts. The algorithm prioritizes vector element reuse between the two MAC units to maximize PE utilization, ensuring that each pair accesses different banks in the multi-bank vector buffer.

Fig. 2. Overall architecture of the baseline SpMV accelerator.

Fig. 3. Column-equal PE line partitioning and matrix rearrangement example.

3. Bandwidth Utilization Formulation

The performance of the accelerator is evaluated using bandwidth utilization (BU), defined as the ratio of computational throughput to off-chip memory bandwidth (GB/s) ^[4, ^10]. Since each nonzero element undergoes two floating-point operations, an SpMV task with $NNZ$ nonzero elements yields $2 \times NNZ$ floating-point operations. The BU is calculated as

(1)

$BU = \frac{2 \times NNZ}{Bandwidth \times C_{total}},$

where $Bandwidth$ is the off-chip memory bandwidth in bytes per cycle and $C_{total}$ is the total execution cycle count. The accelerator processes each SpMV task in three phases–vector loading, block execution, and output storing–so the total execution cycle count is decomposed as

(2)

$C_{total} = C_{ldv} + C_{exe} + C_{sty} + C_{PIPE},$

where $C_{ldv}$, $C_{exe}$, and $C_{sty}$ represent the cycles consumed for loading vector elements, processing all matrix blocks, and storing output results, respectively. $C_{PIPE}$ accounts for pipeline latency. In the vector loading phase, $N_v$ vector elements of $B_v$ bytes each are transferred from off-chip memory

(3)

$C_{ldv} = \frac{N_v \times B_v}{Bandwidth}.$

In the block execution phase, each block is processed in parallel across PE lines, and the block completes when the slowest PE line finishes. Therefore $C_{exe}$ is modeled as the sum of the critical paths across all blocks

(4)

$C_{exe} = \sum_{i=0}^{N_{block}-1} \max_{0 \le j \le (P-1)} (C_{i,j}),$

where $N_{block}$ is the total number of matrix blocks, $P$ is the total number of PE lines, and $C_{i,j}$ denotes the execution cycles of the $j$-th PE line in the $i$-th block. In the output storing phase, the results for $Block_{height}$ output rows are written back

(5)

$C_{sty} = \frac{Block_{height} \times B_v}{Bandwidth}.$

Ideally, when the workload is perfectly balanced, the accelerator minimizes the $C_{exe}$ term in the equation, thereby achieving the theoretical maximum BU. For example, given a 64 B/cycle interface and 4 MAC units, the minimum execution cycle is $C_{total} = NNZ/4$, yielding a maximum BU of 0.125 FLOP/Byte. However, any load imbalance among PE lines causes the slowest PE line to increase the maximum execution cycle disproportionately, directly inflating $C_{exe}$ and degrading the BU. This architectural vulnerability motivates the in-depth PE line imbalance analysis and the proposed optimization detailed in the following section.

III. PE LINE IMBALANCE ANALYSIS AND NONZERO-AWARE PARTITIONING

1. PE Line Imbalance under Column-equal Partitioning

To quantify the impact of PE line load imbalance on BU, we define a weighted PE line imbalance metric. Since the architecture operates on a synchronized block-level execution model, the difference between the maximum and minimum workload among PE lines directly translates to the hardware idle time experienced by the faster PE lines. For each matrix block $i$, the per-block imbalance ratio $IR_i$ is formulated as

(6)

$IR_i = \frac{\max_{0 \le j \le (P-1)} (NNZ_{i,j}) - \min_{0 \le j \le (P-1)} (NNZ_{i,j})}{\max_{0 \le j \le (P-1)} (NNZ_{i,j})},$

where $NNZ_{i,j}$ denotes the number of nonzero elements assigned to PE line $j$ in block $i$. For an empty block ($NNZ_i = 0$), $IR_i$ is defined as 0. An $IR_i$ of 0.0 indicates a perfectly balanced workload. To evaluate the overall imbalance of the entire matrix, these per-block ratios are aggregated into a weighted average, giving higher weight to blocks with larger nonzero counts since they dominate the total execution time

(7)

$IR_{weighted} = \sum_{i=0}^{N_{block}-1} \frac{NNZ_i}{NNZ} \times IR_i,$

where $NNZ_i$ is the total number of nonzero elements in block $i$.

As shown in Fig. 4, evaluating 29 SuiteSparse benchmark matrices ^[11] (detailed in Section IV-1) under column-equal partitioning reveals a clear negative correlation: as imbalance increases, BU decreases consistently. This confirms that PE line load imbalance is a primary source of BU degradation. Notably, most matrices exhibit more than 20% BU degradation from the theoretical maximum, indicating that this is not a marginal issue but a widespread problem across diverse matrix structures. Moreover, this imbalance worsens as the hardware scales. For the same matrix, the 8 PE line configuration generally exhibits higher imbalance and lower BU than the 2 PE line configuration. Architecturally, this occurs because increasing the PE line count narrows the column ranges assigned to each PE line, making the system highly susceptible to localized nonzero concentration. These observations collectively motivate the need for a workload-aware partitioning strategy that can maintain high BU across all PE line configurations.

Fig. 4. Weighted PE line imbalance ratio versus bandwidth utilization for 29 benchmark matrices under column-equal partitioning with 2, 4, and 8 PE line configurations.

2. Nonzero-aware PE Line Partitioning

To address this imbalance, we propose a nonzero-aware PE line partitioning strategy that sets partition boundaries based on the cumulative nonzero distribution within each block. The procedure constructs a column-wise nonzero histogram, computes its cumulative sum, and places the partition boundary at the column where the cumulative count reaches $NNZ_i/P$, ensuring that each PE line receives approximately the same number of nonzero elements. The boundary is then aligned to the memory access granularity of 8 elements to avoid misaligned vector accesses, and monotonicity is enforced to prevent degenerate partitions. Finally, each PE line's assigned column range is checked against the vector buffer limit (VBL), and boundaries are clamped if necessary to satisfy the constraint.

Fig. 5 illustrates the difference between column-equal and nonzero-aware partitioning for a representative matrix block. Column-equal partitioning yields an imbalance ratio of 0.91 (23 vs. 2 nonzero elements), whereas nonzero-aware partitioning reduces it to 0.08 (12 vs. 13). The proposed strategy modifies only the block partitioning step within the offline pre-processing stage, requiring no changes to the hardware architecture.

Fig. 5. Comparison of column-equal and nonzero-aware partitioning: NNZ distribution, PE line assignment, and execution timeline.

IV. EVALUATION RESULTS

1. Experimental Setup

The proposed strategy is evaluated using a cycle-accurate simulator modeling the accelerator, configured with a total buffer size of 512KB. All computations use FP64 values and INT32 indices in COO format. We evaluate three configurations with 2, 4, and 8 PE lines to examine the bandwidth scalability. Table 1 lists the 29 benchmark matrices from the SuiteSparse collection ^[11] used in the evaluation, covering diverse application domains and sparsity patterns.

Table 1. Benchmark matrices from the suite sparse matrix collection ^[11].

Matrix	Rows	Cols	NNZ	Kind
ASIC_680k	682,862	682,862	3,871,773	Circuit
audikw_1	943,695	943,695	77,651,847	Structural
bbmat	38,744	38,744	1,771,722	CFD
cant	62,451	62,451	4,007,383	2D/3D
Chebyshev4	68,121	68,121	5,377,761	Structural
com-Youtube	1,134,890	1,134,890	5,975,248	Graph
consph	83,334	83,334	6,010,480	2D/3D
cop20k_A	121,192	121,192	2,624,331	2D/3D
crankseg_2	63,838	63,838	14,148,858	Structural
eu-2005	862,664	862,664	19,235,140	Graph
gupta3	16,783	16,783	9,323,427	Optimization
human_gene1	22,283	22,283	24,669,643	Graph
in-2004	1,382,908	1,382,908	16,917,053	Graph
ldoor	952,203	952,203	46,522,475	Structural
mip1	66,463	66,463	10,352,819	Optimization
pdb1HYS	36,417	36,417	4,344,765	Graph
PR02R	161,070	161,070	8,185,136	CFD
radiation	223,104	223,104	7,637,688	Semiconductor
raefsky1	3,242	3,242	294,276	CFD
rail4284	4,284	1,096,894	11,284,032	LP
rajat30	643,994	643,994	6,175,377	Circuit
rma10	46,835	46,835	2,374,001	CFD
scircuit	170,998	170,998	958,936	Circuit
Stanford_Berkeley	683,446	683,446	7,583,376	Graph
thermomech_dK	204,316	204,316	2,846,228	Thermal
TSOPF_RS_b2383	38,120	38,120	16,171,169	Power Network
TSOPF_RS_b300_c3	42,138	42,138	4,413,449	Power Network
webbase-1M	1,000,005	1,000,005	3,105,536	Graph
wikipedia-20051105	1,634,989	1,634,989	19,753,078	Graph

2. BU Improvement

Fig. 6 presents the BU before and after applying the proposed nonzero-aware partitioning for (a) 2, (b) 4, (c) 8 PE line configurations. The average BU improves by 17.2% with 2 PE lines, 65.4% with 4 PE lines, and 113.7% with 8 PE lines. The improvement scales with PE line count because the baseline column-equal partitioning produces increasingly severe imbalance as the column range per PE line shrinks, amplifying the benefit of nonzero-aware partitioning. Notably, no matrix exhibits BU degradation under the proposed strategy in any configuration.

Fig. 6. BU comparison before and after applying nonzero-aware partitioning for (a) 2, (b) 4, and (c) 8 PE line configurations.

3. Relationship between Imbalance Reduction and BU Improvement

Fig. 7 plots the imbalance reduction (defined as $Imbalance_{before} - Imbalance_{after}$) against BU improvement for each matrix across all three PE line configurations. A positive correlation is observed: matrices with larger imbalance reduction achieve greater BU gains, and the trend steepens with more PE lines.

Notable outliers deviate from the general trend for distinct structural reasons. For TSOPF_RS_b2383 with 8 PE lines, the baseline imbalance ratio reaches 0.9997 and the baseline BU is only 0.024, so even a moderate imbalance reduction yields a BU gain exceeding 300%, due to the low baseline. In contrast, com-Youtube exhibits a large imbalance reduction (0.69) but only a 43% BU gain with 8 PE lines, because its large dimension (1.1M $\times$ 1.1M) and extreme sparsity ($\sim$5.3 nonzeros per row) produce many blocks with few nonzeros each. Since each block loads vector elements up to the vector buffer capacity regardless of its nonzero count, $C_{ldv}$ dominates the total execution time and limits the BU impact of reducing $C_{exe}$. For mip1, the weighted PE line imbalance remains high (0.695) even after nonzero-aware partitioning despite a baseline imbalance of 0.990, because its highly skewed nonzero distribution forces the partition boundaries to shift beyond the VBL. The algorithm clamps these boundaries to satisfy the hardware constraint, preventing further imbalance reduction.

Fig. 7. Relationship between imbalance reduction and BU improvement across 2, 4, and 8 PE line configurations.

4. Discussion

The BU improvement originates from reduced PE line idle time: balancing the nonzero count across PE lines decreases $C_{exe}$ while the vector loading cycles ($C_{ldv}$) remain nearly unchanged. The VBL constraint similarly limits improvement for eu-2005, whose large column dimension (862K) causes partition boundaries to be clamped at the VBL. These results suggest that further BU gains for VBL-constrained matrices would require either a larger on-chip vector buffer or a partitioning strategy that subdivides the column range into multiple passes per PE line.

V. CONCLUSIONS

This paper identified PE line imbalance caused by column-equal partitioning as a key factor in bandwidth utilization degradation for the bandwidth-scalable SpMV accelerator presented in ^[12]. The proposed nonzero-aware PE line partitioning strategy complements ^[12] by optimizing the partitioning stage of the offline pre-processing, which was not addressed in the original design, and achieves average BU improvements of 17.2%, 65.4%, and 113.7% for 2, 4, and 8 PE line configurations, respectively, with no degradation cases across 29 benchmark matrices. The growing benefit at higher PE line counts confirms that the method preserves and enhances the bandwidth scalability of the baseline architecture without requiring any on-chip hardware modifications.

ACKNOWLEDGEMENTS

This work was partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2025-02214649, Development of Chiplet-Based Interface Technology for Enhancing Data Processing Efficiency in AI Semiconductors and Verification Technology for Large-Scale AI Chiplet Semiconductor Modules) and partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2025-25442469, Development of Edge AI Server Technology with High-Performance and High-Reliability). The EDA tool was supported by the IC Design Education Center.

REFERENCES

M. Manguoglu, A. H. Sameh, O. Schenk, 2011, A domain-decomposing parallel sparse linear system solver, Journal of Computational and Applied Mathematics, Vol. 236, No. 3, pp. 319-325

W. L. Hamilton, 2020, Graph Representation Learning, Morgan & Claypool

L. Page, S. Brin, R. Motwani, T. Winograd, 1999, The PageRank citation ranking: Bringing order to the web, Stanford InfoLab

S. Park, H. Kim, J. Kim, 2025, A fault-tolerant GEMM accelerator with online error detection and correction based on systolic array architecture, Journal of Semiconductor Technology and Science, Vol. 25, No. 3

S. Choi, K. J. Lee, Y. Kim, H.-J. Yoo, 2020, A 9.52 ms latency, and low-power streaming depth-estimation processor with shifter-based pipelined architecture for smart mobile devices, Journal of Semiconductor Technology and Science, Vol. 20, No. 3, pp. 255-266

B. Lu, X. Qiao, J. Bhatt, 2019, Redesk: A reconfigurable dataflow engine for sparse kernels on heterogeneous platforms, Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 1-8

J. Fowers, G. Brown, P. Cooke, G. Stitt, 2014, A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication, Proceedings of the IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 36-43

A. K. Jain, H. Omidian, H. Fraisse, M. Benipal, L. Liu, D. Gaitonde, 2020, A domain-specific architecture for accelerating sparse matrix vector multiplication on FPGAs, Proceedings of the 30th International Conference on Field-Programmable Logic and Applications, pp. 127-132

S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, Y. Xie, 2021, Optimized data reuse via reordering for sparse matrix-vector multiplication on FPGAs, Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 1-9

C. Liu, H. Yin, C. Cheng, X. Li, 2023, Towards high-bandwidth-utilization SpMV on FPGAs via partial vector duplication, Proceedings of the 28th Asia and South Pacific Design Automation Conference, pp. 198-203

T. A. Davis, Y. Hu, 2011, The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, Vol. 38, No. 1, pp. 1-25

H. Kim, E. Ham, S. Park, H. Kim, J. Kim, 2023, A DRAM bandwidth-scalable sparse matrix-vector multiplication accelerator with 89% bandwidth utilization efficiency for large sparse matrix, Proceedings of the IEEE Asian Solid-State Circuits Conference, pp. 1-3

Hyunji Kim

Hyunji Kim received her B.S. and M.S. degrees in electronic and electrical engineering from Ewha Womans University, Seoul, South Korea, in 2019 and 2021, respectively. She is currently pursuing a Ph.D. degree at the same university with Digital System Architecture Lab. Her current research interests include domain-specific SoC architecture.

Ji-Hoon Kim

Ji-Hoon Kim received his B.S. (summa cum laude) and Ph.D. degrees in electrical engineering and computer science from KAIST, Daejeon, South Korea, in 2004 and 2009, respectively. In 2009, he joined Samsung Electronics, Suwon, South Korea, as a Senior Engineer, and worked on next-generation architecture for 4G communication modem system-on-chip (SoC). From 2018 to 2025, he was a professor in the Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, South Korea. Since 2025, he has been with the Department of Electronic Engineering, Hanyang University, Seoul, South Korea. His current research interests include CPU microarchitecture, domain-specific SoC, and deep neural network accelerators. Dr. Kim served on the Technical Program Committee and Organizing Committee for various international conferences, including the IEEE International Conference on Computer Design (ICCD), the IEEE Asian Solid-State Circuits Conference (A-SSCC), and the IEEE International Solid-State Circuits Conference (ISSCC). He was a co-recipient of the Distinguished Design Award at the 2019 IEEE A-SSCC, and a recipient of the Best Design Award at 2007 Dongbu HiTek IP Design Contest, the First Place Award at 2008 International SoC Design Conference (ISOCC) Chip Design Contest, and the IEEE/IEIE Joint Award for Young Scientist and Engineer. He also serves as an Associate Editor for the IEEE Transactions on Circuits and Systems II: Express Briefs.