
  1. (Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Republic of Korea)
  2. (School of Electrical Engineering, Korea University, Seoul 02841, Republic of Korea)



Keywords: GPU, register file, register compression, entropy.

I. INTRODUCTION

The number of registers in a Graphics Processing Unit (GPU) has continuously grown over the last decade. One driving force of this growth is that while the number of cores must increase for better performance, the number of registers per core must be maintained to provide sufficient thread parallelism for hiding memory latency, which is generally considered challenging even with better process technology in the near future. Another is that as GPUs cover broader areas from graphics to artificial intelligence, applications will perform more complex work in a thread and will thus likely demand more registers.

GPU register data should be resident on chip as much as possible to enable fast context switching of threads and to feed the massively parallel execution units. As a result, GPU register files consume substantial energy as well as on-chip storage. For instance, an NVIDIA GV100 GPU has a 256 KB register file per Streaming Multiprocessor (SM) and 80 SMs in total, which corresponds to 20 MB of registers [1]. According to previous studies, the register file consumes about 15 percent of SM energy [2]. Deep learning workloads, which are widely run on recent GPUs, show similar tendencies [3]: the register file energy for various deep learning algorithms is between 15 and 20 percent of SM energy.

A recent idea to improve both the energy and storage efficiency of the GPU register file is to compress register data [4,5]. Unique aspects of GPUs make register compression a feasible option beyond widely adopted memory traffic compression [1,6]. A typical GPU executes a group of 32 threads, collectively called a warp, on a vector execution unit in which each scalar lane runs one thread. A vector register used in the vector unit is called a warp register. Previous studies have observed that the thread registers within a warp register have strong value similarity with each other [5,7]. Because of this similarity, simple but efficient compression algorithms such as base-delta-immediate (BDI) [5] are expected to be quite effective for compressing a warp register as a whole. Furthermore, GPUs' great latency tolerance allows additional compression stages in the register access path while minimizing the performance overhead of the increased pipeline depth. Compressed registers consume less energy for data transfer and require fewer bits to be stored in the register file, which provides power gating opportunities for register banks.

Although previous studies have explored register compression to some degree, a limit study is required to investigate its potential gain, given that the compression ratio varies significantly with the choice of algorithm. As an indicator of how efficient a register compression algorithm is, this paper proposes an analytical model to calculate the ideal compression ratio, which is the mathematical upper bound according to entropy theory. Our evaluation shows a theoretical maximum compression ratio of 3.99x, significantly higher than the 2.33x achieved by a previously proposed BDI-based register compression algorithm [5].

II. RELATED WORK

Previous papers have proposed several data compression techniques. Caches and memory systems have been known to be good locations to compress data for many reasons, such as energy reduction [8], throughput improvement [9], or better cache or memory efficiency [10]. Several papers leverage compression to improve efficiency or performance in the context of GPUs. For instance, Pekhimenko et al. proposed toggle-aware compression to improve memory bandwidth efficiency in GPUs [11]. Lal et al. utilized Huffman coding for GPU memory compression [12].

Lee et al. [5] proposed compressing register data to improve GPU energy efficiency. The proposed compression uses a variant of base-delta-immediate (BDI) compression, which was originally designed for cache compression [13]. GPU register compression is motivated by the similarity between threads within a warp, which leads to the observation that BDI-like algorithms are well suited to exploit value similarity across thread lanes. Compressing registers reduces the number of register banks needed to store or access a warp register, thus saving dynamic energy for register fetching as well as static energy when unused register banks are power gated.
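To make the mechanism concrete, the following is a minimal sketch of a BDI-style compressor for a single warp register, assuming 32 lanes of 32-bit values and a single fixed delta width; the actual scheme in [5] selects among multiple delta widths and handles additional cases such as immediate values.

```python
def bdi_compress(warp_reg, delta_bits=8):
    """Encode 32 thread registers as one 32-bit base plus narrow deltas.

    Returns the compressed size in bits, or None when some delta does not
    fit in delta_bits (the warp register stays uncompressed). Simplified:
    real BDI tries several delta widths and picks the smallest encoding.
    """
    base = warp_reg[0]
    deltas = [v - base for v in warp_reg]
    limit = 1 << (delta_bits - 1)
    if all(-limit <= d < limit for d in deltas):
        return 32 + delta_bits * len(warp_reg)  # base + per-lane deltas
    return None

# Per-lane values with a small stride, as when threads index consecutive
# array elements, compress from 32*32 = 1024 bits to 32 + 8*32 = 288 bits.
lanes = [0x10010000 + 4 * i for i in range(32)]
print(bdi_compress(lanes))  # 288
```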

There have also been studies using STT-MRAM to improve energy efficiency. Li et al. [14] proposed an STT-MRAM register file architecture with an SRAM buffer. Jeon et al. [15] proposed an STT-MRAM register file with SRAM organized as a cache. GPU register file implementations using STT-MRAM can be combined with data compression techniques, allowing further energy efficiency improvements.

III. ANALYSIS OF REGISTER DATA COMPRESSION ON GPUS

1. Lossless Compression

According to Shannon's source coding theorem [16], the compression ratio of lossless data compression is closely related to the entropy of the data, which can be derived from how likely each source symbol is to appear. The theorem states that, for a vector consisting of $N$ elements drawn from the source symbol set $E$, if the occurrences of source symbols in the vector are independent and identically distributed and the entropy of the information in the vector is $H(X)$ bits, then the lower bound of the data size when the vector is losslessly compressed is $N \cdot H(X)$ bits as $N \to \infty$.

Fig. 1 illustrates how this principle can be applied to obtain an ideal compression ratio for register accesses. The first row shows an access stream consisting of seven 2-bit thread register accesses; 14 bits are needed to store the entire trace without compression. The second row shows that only 13 bits are needed when the trace is compressed with a hypothetical algorithm that maps the source symbols $E = \{0$, $1$, $2$, $3\}$ to the code words $c(X) = \{$``0'': 110, ``1'': 0, ``2'': 10, ``3'': 111$\}$. The lower bound of code-word-based compression can be obtained from entropy theory. Assuming the entropy $H(X)$ is given as prior knowledge, the upper bound of the compression ratio, $C(X)$, can be derived from $E$ and $H(X)$ as follows:

(1)
$ C(X) \le \frac{\log_2 \left|E\right|}{H(X)} $

where $\left| E \right|$ represents the number of elements in the set $E$. The last row of Fig. 1 shows that the lower bound of the code word length calculated from the given $H(X)$ is 12.897 bits, which corresponds to a compression ratio of 1.086.

Fig. 1. Example of the lower bound of the code word length.

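To make the arithmetic of Fig. 1 concrete, the sketch below encodes a hypothetical seven-access stream whose symbol counts (one 0, three 1s, two 2s, and one 3; the exact order is an assumption) reproduce the 13-bit compressed length and the 12.897-bit entropy lower bound:

```python
import math
from collections import Counter

stream = [1, 2, 1, 0, 1, 2, 3]                # assumed 2-bit access trace
code = {0: "110", 1: "0", 2: "10", 3: "111"}  # code words from Fig. 1

encoded_bits = sum(len(code[s]) for s in stream)
counts = Counter(stream)
n = len(stream)
entropy = -sum(c / n * math.log2(c / n) for c in counts.values())

print(encoded_bits)  # 13 bits with the example code (vs. 14 uncompressed)
print(n * entropy)   # 12.897... bits, the entropy lower bound
print(2 / entropy)   # 1.086..., the upper bound of the compression ratio
```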

2. Entropy Model

The key to our model is how to obtain $H(X)$ for register values. We first need assumptions about the compression granularity; the choice in this paper is to view each thread register value as a symbol. The insight behind this decision is that there is structural value similarity between thread registers within a warp register [5], so this granularity can capture such similarity while benefiting from optimal entropy coding. Further, we assume each thread register is 32 bits wide, which is a common choice as most GPUs are optimized for 32-bit operations. Regardless of these choices, we believe the presented methodology is general enough to work with other parameters.

Assume that $E$ is the set of all possible register values, thus $E=\{0$, $1$, $2$, $\dots$, $2^{32}-1 \}$ for a 32-bit wide register. We also define the sample space $\Omega$, which is a trace of register values observed during program execution. $\Omega$ can be constructed by collecting the register values written back to the register file while running applications on a simulator.

A discrete random variable $X$ for a register value observation event can be defined as $X :$ $\Omega \to E$. Since register values are discrete by nature, we can define the probability mass function (PMF) $f_X(x)$ of $X$ as

(2)
$ f_X(x)=P(\{\omega \in \Omega :\ X(\omega)=x\}) $

$f_X(x)$ can be obtained by counting the number of occurrences of each register value $e \in E$ in $\Omega$. To derive $H(X)$, we first define the information content $I(x)$ as follows:

(3)
$ I(x)=-{\log}_2 f_X(x) $

$I(x)$ is the amount of information, in bits, obtained when a particular register value $x$ is observed. Finally, the entropy $H(X)$ can be calculated from $I(x)$ as follows.

(4)
$ H(X)=E[I(X)]=\sum^{2^{32}-1}_{x=0}{f_X(x)I(x)} $
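Eqs. (2)-(4) translate directly into a frequency-counting procedure over the collected trace. The following is a minimal sketch, assuming the trace is simply a list of 32-bit register values collected from the simulator:

```python
import math
from collections import Counter

def register_entropy(trace):
    """Empirical entropy H(X), in bits, of a register value trace.

    f_X(x) is estimated by counting occurrences (Eq. (2)); the information
    content I(x) and its expectation follow Eqs. (3) and (4). Values that
    never occur in the trace contribute nothing to the sum.
    """
    counts = Counter(trace)
    n = len(trace)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy write-back trace; real traces come from the simulator.
trace = [0x0, 0x0, 0x4, 0x0, 0x3F800000, 0x4, 0x0, 0x0]
h = register_entropy(trace)
print(h)       # entropy in bits per register value
print(32 / h)  # ideal compression ratio bound from Eq. (1)
```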

3. Optimizing for Application and Data Types

Eqs. (1) and (4) imply that the ideal compression efficiency is determined by $f_X(x)$: as the distribution $f_X(x)$ gets closer to a uniform distribution, the entropy $H(X)$ increases and the maximum achievable compression ratio $C(X)$ approaches 1.0. In the extreme case where $f_X(x)$ is a discrete uniform distribution, all register values occur equally likely. In this case, the entropy is calculated as follows:

(5)
$ H(X)=-\sum^{2^{32}-1}_{x=0}{\frac{1}{2^{32}}{\log}_2 \frac{1}{2^{32}}}=32 $

This is identical to the register width; thus, no benefit is expected even from a hypothetical algorithm with ideal entropy coding. In contrast, if $f_X(x)$ is a binomial distribution with $p = 0.5$, the entropy $H(X)$ becomes

(6)

$ H(X)=-\sum^{2^{32}-1}_{x=0}{\binom{2^{32}}{x}2^{-2^{32}}{\log}_2\!\left[\binom{2^{32}}{x}2^{-2^{32}}\right]} \cong 17.047 $

In this case, the ideal compression ratio is $32/17.047 = 1.87$. Because the entropy $H(X)$ varies with $f_X(x)$, and $f_X(x)$ varies with application characteristics, a different $f_X(x)$ is needed for each application to better model the ideal performance. Our model therefore uses a separate $f_X(x)$ per application by collecting the register access trace of each benchmark individually.
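Evaluating Eq. (6) term by term over all $2^{32}$ values is impractical, but for large $n$ the entropy of a binomial distribution with $p=0.5$ is closely approximated by the Gaussian closed form $\frac{1}{2}\log_2(\pi e n / 2)$; the sketch below reproduces the 17.047-bit value:

```python
import math

def binomial_entropy_bits(n):
    # Gaussian approximation of the entropy of Binomial(n, 0.5)
    return 0.5 * math.log2(math.pi * math.e * n / 2)

h = binomial_entropy_bits(2 ** 32)
print(h)       # ~17.047 bits
print(32 / h)  # ~1.87, the ideal compression ratio in this case
```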

For similar reasons, our model can be further improved by accounting for data types, leveraging the fact that $f_X(x)$ also varies with registers' data types. Fig. 2 shows the portion of register values for each of four data types: signed integer (S32), unsigned integer (U32), floating point (F32), and untyped bit (B32). Registers with different data types have different value patterns, so optimizing the compression algorithm for each data type will likely lead to a higher compression ratio. This result also shows that the most frequent data type differs across applications, which further supports the argument that an optimal algorithm has to consider the variation of $f_X(x)$ across applications.

Our model uses four conditional entropy values, $H(X\mid S)$, $H(X\mid U)$, $H(X\mid F)$, and $H(X\mid B)$, where $T \in \{S, U, F, B\}$ and the elements of $T$ denote S32, U32, F32, and B32, respectively. The aggregated entropy $H(X)$ can be derived from the conditional entropy values and the portion of register accesses of each data type as follows:

(7)

$ H(X)=-\sum_T{\sum_x{f_{X|T}(x)\,P(T)\,{\log}_2\!\left[f_{X|T}(x)\,P(T)\right]}} $

$ =-\sum_T{P(T)\left[-H\left(X\mid T\right)+{\log}_2 P(T)\sum_x{f_{X|T}(x)}\right]} $

$ =\sum_T{P(T)\,H(X\mid T)}-\sum_T{P(T)\,{\log}_2 P(T)} $
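The decomposition in Eq. (7) can be checked numerically. The sketch below uses two hypothetical data types with made-up value distributions and confirms that the double sum equals $\sum_T P(T)H(X\mid T)+H(T)$:

```python
import math
from collections import Counter

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical per-type value distributions f_{X|T}(x) and type mix P(T)
per_type = {
    "S32": {0x0: 0.7, 0x1: 0.2, 0x4: 0.1},
    "F32": {0x3F800000: 0.5, 0x40000000: 0.25, 0x3F000000: 0.25},
}
p_type = {"S32": 0.6, "F32": 0.4}

# Left-hand side: entropy of the joint distribution f_{X|T}(x) P(T)
joint = Counter()
for t, dist in per_type.items():
    for x, p in dist.items():
        joint[(t, x)] += p * p_type[t]
lhs = entropy(joint.values())

# Right-hand side: sum_T P(T) H(X|T) + H(T)
rhs = (sum(p_type[t] * entropy(per_type[t].values()) for t in per_type)
       + entropy(p_type.values()))
print(lhs, rhs)  # both sides agree
```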

Fig. 2. Ratio of register value data types (a) Fermi configuration, (b) Volta configuration.


4. Latency Model

A program's characteristics can be classified as compute-bound or memory-bound by analyzing the average stall cycles of an SM [2]. During compute-bound phases, the average stall cycles of a core decrease, since the execution units of the SM are fully utilized; in such phases, the capability of the SM directly determines overall performance. On the contrary, the average stall cycles of an SM increase when the GPU encounters a memory-bound phase of an application, because the pipeline must wait until memory accesses finish or the data for computation is ready.

Fig. 3 shows the ratio of average stall cycles while running the heartwall application from the Rodinia benchmark suite [17]. As shown in the figure, the execution can be divided into two phase types, a highly stalled phase and a nearly non-stalled phase, which alternate during the execution of the application. To classify an application as compute-bound or memory-bound, we use the measured stall cycle ratios: when the average compute intensity is larger than the memory intensity, we classify the phase as compute-bound.
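A simple sketch of this classification, assuming sampled stall ratios and an arbitrary cutoff of 0.5 (our assumption for illustration) to separate the two phase types and estimate the compute-bound fraction $p$ used in Eq. (8) below:

```python
def compute_bound_fraction(stall_ratios, threshold=0.5):
    """Estimate p: the fraction of sampled intervals that are compute-bound,
    i.e., whose stall ratio falls below an assumed threshold."""
    compute = sum(1 for r in stall_ratios if r < threshold)
    return compute / len(stall_ratios)

# Alternating highly stalled / nearly non-stalled samples, as in Fig. 3
samples = [0.9, 0.85, 0.1, 0.05, 0.92, 0.08, 0.1, 0.88]
print(compute_bound_fraction(samples))  # 0.5
```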

Register compression introduces additional latency when fetching values from registers in the compute-bound phase. In the memory-bound phase, on the other hand, memory is the bottleneck, so that phase is not influenced by register compression. Eq. (8) shows the relative performance of the GPU when Amdahl's law [18] is applied to the compute-bound and memory-bound phases.

(8)
$ Perf_{Relative}(p)=\frac{1}{(1-p)+p\,(1+l)} $

where $p$ is the portion of the compute-bound phase ($0 \le p \le 1$) and $l$ is the additional relative latency of register compression.
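A minimal sketch of the latency model, under the assumption stated above that only the compute-bound fraction $p$ of execution is stretched by the factor $(1+l)$:

```python
def relative_performance(p, l):
    """Eq. (8): relative GPU performance with register compression.

    p: portion of execution in the compute-bound phase (0 <= p <= 1)
    l: additional relative latency introduced by register compression
    """
    return 1.0 / ((1.0 - p) + p * (1.0 + l))

# A mostly memory-bound application (p = 0.2) with 10% extra register
# access latency loses only about 2% of its performance.
print(relative_performance(0.2, 0.1))  # ~0.980
```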

Fig. 3. Average stall cycle.


IV. EVALUATION

1. Methodology

We used 16 GPU applications selected from the Rodinia [17], Parboil [19], and GPGPU-Sim [20] benchmark suites. We captured register access traces from GPGPU-Sim [20] and ran a post-simulation on the data. The post-simulation includes our compression model as well as an implementation of BDI-based register compression [5]. Our simulation approximates the NVIDIA Fermi architecture [21] and Volta architecture [1] with the microarchitectural parameters listed in Table 1.

Table 1. GPU microarchitectural parameters.

Configuration              | Fermi              | Volta
---------------------------|--------------------|-------------------
Clock Frequency            | 1.4 GHz            | 1.2 GHz
SMs / GPU                  | 15                 | 80
Warp Scheduling Policy     | Greedy-Then-Oldest | Greedy-Then-Oldest
SIMT Lane Width            | 32                 | 32
Max # Warps / SM           | 48                 | 64
Max # Threads / SM         | 1536               | 2048
Register File Size / SM    | 128 KB             | 256 KB
# of 32-bit Registers / SM | 32768              | 65536
# of Register Banks        | 32                 | 32

2. Entropy Values and Upper Bound of Compression Ratio

Fig. 4 shows the entropy values of the 32-bit thread register values measured from the 16 applications; the y-axis is the entropy value in bits. We first measured the entropy values for each of the four data types explained in Section III-3. The modeled upper bound is then derived from Eqs. (1) and (7) together with the collected per-type entropy values and $f_X(x)$ data. The entropy value can be read as the number of bits required to store a register.

The entropy values, globally averaged over all data types, are 9.45 and 9.08 bits on the Fermi and Volta configurations, respectively, with significant application variation: the minima are 5.22 (LPS) and 5.40 (hot) and the maxima are 19.59 (sgemm) and 17.14 (sgemm), respectively. The entropy values of the integer data types and the untyped bit type are mostly around 5 bits, roughly half that of the floating point data type. The average entropy values of S32, U32, B32, and F32 are 5.30, 8.76, 4.59, and 11.09 bits on the Fermi configuration and 7.32, 5.31, 5.16, and 14.6 bits on the Volta configuration, respectively. Because a typical use case of 32-bit floating point data is processing rational or large numbers, the entropy of F32 is relatively higher than that of the other data types, which are commonly used for control or for indexing data structures and are thus expected to show more regular value patterns. However, the data also shows that this trend is difficult to generalize; for instance, in LPS the integer registers have higher entropy than the float registers. Fig. 5 shows the per-type upper bounds of compression efficiency derived from Eq. (1). The global averages are 3.39x and 3.52x, respectively. This bound is inversely related to the entropy values in Fig. 4; for instance, the minima are 1.63x (sgemm) and 1.87x (sgemm) and the maxima are 6.13x (LPS) and 5.9x (back), respectively.

Fig. 4. Measured entropy value of the GPU register (a) Fermi configuration, (b) Volta configuration.


Fig. 5. Upper bound for the register value compression (a) Fermi configuration, (b) Volta configuration.


3. Compression Ratio with Existing Technique

Fig. 6 compares the compression ratios for the integer type registers of the evaluated applications, using the GPU microarchitectural parameters from the BDI compression work [5]. The bar labeled BDI shows the compression ratio when the BDI compression algorithm is used for the register file [5]. Across all benchmarks, the calculated ideal compression ratio is $3.99\times$, significantly higher than the $2.33\times$ of the compared BDI register compression. LIB is the most notable case, where BDI efficiency is close to the ideal. On closer inspection of the actual register values, we observed that the majority of the register values are zero, the case in which BDI compression achieves its highest efficiency, close to ideal. In most other applications, the efficiency of BDI is between 21% and 57% of the upper bound from our ideal compression model. A hypothetical algorithm may therefore perform 2% to 623% better than BDI compression, which implies headroom for further improvement.

Fig. 6. Compression ratio comparison for the integer data type with existing GPU register compression method.


4. Energy Efficiency

Fig. 7 shows the dynamic energy breakdown of individual hardware components for the LIB application, obtained from GPUWattch [2] with the microarchitectural parameters of the NVIDIA Fermi architecture [21]. Dynamic energy was measured for each kernel within the application. The percentage of energy consumed by the register file under the ideal compression ratio was measured to be 1.27% and 0.9% for the two measured kernels, respectively. Since the register file consumes about 15 percent of SM energy, the energy efficiency improvement is bounded by the energy consumed by the register file.

Fig. 7. Breakdown of average dynamic energy consumption.


V. CONCLUSIONS

This paper proposed an entropy-based compression model to study the theoretical limits of GPU register compression algorithms. We provided a method to obtain the entropy function from empirically profiled register value traces, and the model was further refined to account for application and data type variations. Experimental results show that an existing algorithm still underperforms relative to our model; a hypothetical algorithm may achieve a higher compression ratio, which would lead to more energy efficient GPUs.

ACKNOWLEDGMENTS

This work has been supported by the Global Education and Research Center for Convergence Technology, a Brain Korea 21 FOUR program, Yonsei University.

References

[1] NVIDIA, NVIDIA Tesla V100 GPU Architecture, 2017.
[2] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, ``GPUWattch: Enabling energy optimizations in GPGPUs,'' ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 487-498, 2013.
[3] V. Kandiah, S. Peverelle, M. Khairy, J. Pan, A. Manjunath, T. G. Rogers, T. M. Aamodt, and N. Hardavellas, ``AccelWattch: A power modeling framework for modern GPUs,'' Proc. of the 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-54), pp. 738-753, 2021.
[4] N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun, C. Das, M. Kandemir, T. C. Mowry, and O. Mutlu, ``A case for core-assisted bottleneck acceleration in GPUs: Enabling flexible data compression with assist warps,'' Proc. of the 42nd Annual International Symposium on Computer Architecture, pp. 41-53, 2015.
[5] S. Lee, K. Kim, G. Koo, H. Jeon, W. W. Ro, and M. Annavaram, ``Warped-compression: Enabling power efficient GPUs through register compression,'' Proc. of the 42nd Annual International Symposium on Computer Architecture, pp. 502-514, 2015.
[6] AMD, The Polaris Architecture, 2016.
[7] D. Wong, N. S. Kim, and M. Annavaram, ``Approximating warps with intra-warp operand value similarity,'' Proc. of 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016.
[8] S. Sardashti and D. A. Wood, ``Decoupled compressed cache: Exploiting spatial locality for energy-optimized compressed caching,'' Proc. of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 62-73, 2013.
[9] T. M. Nguyen and D. Wentzlaff, ``MORC: A manycore-oriented compressed cache,'' Proc. of the 48th International Symposium on Microarchitecture, pp. 76-88, 2015.
[10] S. Hong, B. Abali, A. Buyuktosunoglu, M. B. Healy, and P. J. Nair, ``Touché: Towards ideal and efficient cache compression by mitigating tag area overheads,'' Proc. of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 453-465, 2019.
[11] G. Pekhimenko, E. Bolotin, N. Vijaykumar, O. Mutlu, T. C. Mowry, and S. W. Keckler, ``A case for toggle-aware compression for GPU systems,'' Proc. of 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016.
[12] S. Lal, J. Lucas, and B. Juurlink, ``E$^2$MC: Entropy encoding based memory compression for GPUs,'' Proc. of 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017.
[13] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, ``Base-delta-immediate compression: Practical data compression for on-chip caches,'' Proc. of the 21st International Conference on Parallel Architectures and Compilation Techniques, pp. 377-388, 2012.
[14] G. Li, X. Chen, G. Sun, H. Hoffmann, Y. Liu, Y. Wang, and H. Yang, ``A STT-RAM-based low-power hybrid register file for GPGPUs,'' Proc. of the 52nd Annual Design Automation Conference, pp. 1-6, 2015.
[15] W. Jeon, J. H. Park, Y. Kim, G. Koo, and W. W. Ro, ``Hi-End: Hierarchical, endurance-aware STT-MRAM-based register file for energy-efficient GPUs,'' IEEE Access, vol. 8, pp. 127768-127780, 2020.
[16] C. E. Shannon, ``A mathematical theory of communication,'' Bell System Technical Journal, vol. 27, no. 3, July 1948.
[17] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, ``Rodinia: A benchmark suite for heterogeneous computing,'' Proc. of 2009 IEEE International Symposium on Workload Characterization (IISWC), 2009.
[18] G. M. Amdahl, ``Validity of the single-processor approach to achieving large scale computing capabilities,'' Proc. of Spring Joint Computer Conference, pp. 483-485, 1967.
[19] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu, ``Parboil: A revised benchmark suite for scientific and commercial throughput computing,'' Center for Reliable and High-Performance Computing, vol. 127, 2012.
[20] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, ``Analyzing CUDA workloads using a detailed GPU simulator,'' Proc. of 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
[21] NVIDIA, NVIDIA's Fermi: The First Complete GPU Computing Architecture, 2009.
Minsik Kim

Minsik Kim received his B.S. degree in electrical and electronic engineering from Yonsei University, Seoul, South Korea, in 2013, and his M.S. and Ph.D. degrees in electrical and electronic engineering from Yonsei University in 2019 through the integrated M.S./Ph.D. program. He was a Senior Researcher with the Supercomputing Infrastructure Center, Korea Institute of Science and Technology Information, Daejeon, South Korea. He is currently a Research Professor with the Global Education and Research Center for Convergence Technology, Yonsei University. His research interests include GPU microarchitectures, high performance computing, and quantum computing.

Yunho Oh

Yunho Oh received his B.S., M.S., and Ph.D. degrees from the School of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea, in 2009, 2011, and 2018, respectively. From 2011 to 2014, he was a Software Engineer in the mobile communications business with Samsung Electronics. From 2019 to 2021, he was a Postdoctoral Researcher with the Parallel Systems Architecture Laboratory (PARSA), EPFL, Switzerland. He is currently an Assistant Professor with the School of Electrical Engineering, Korea University. Prior to joining Korea University, he was an Assistant Professor with Sungkyunkwan University. His research interests include hardware and software architectures for energy-efficient data centers, processor architectures (CPUs, GPUs, and neural network accelerators), in-storage processing, memory systems, and high-performance computing.

Won Woo Ro

Won Woo Ro received his B.S. degree in electrical engineering from Yonsei University, Seoul, Korea, in 1996, and his M.S. and Ph.D. degrees in electrical engineering from the University of Southern California, in 1999 and 2004, respectively. He worked as a research scientist with the Electrical Engineering and Computer Science Department, University of California, Irvine. He currently works as a professor with the School of Electrical and Electronic Engineering, Yonsei University. Prior to joining Yonsei University, he worked as an assistant professor with the Department of Electrical and Computer Engineering, California State University, Northridge. His industry experience includes a college internship with Apple Computer, Inc. and work as a contract software engineer with ARM, Inc. His current research interests include high-performance microprocessor design, GPU microarchitectures, neural network accelerators, and memory hierarchy design. He is a senior member of the IEEE.