
  1. (School of Electronic and Electrical Engineering, Hongik University, 72-1, Sangsu-Dong, Mapo-gu, Seoul, 04066, Korea)
  2. (Samsung Display, 1, Samsung-ro, Kiheung-gu, Yongin-si, Kyunggi-do, Korea)



Keywords: LUT compression, LUT input pixel partitioning, error LUT, image data compression, ROM count

I. INTRODUCTION

Image processing generally requires high computational complexity and high throughput, and often demands real-time execution. A lookup table (LUT), implemented in ROM, is highly advantageous because it provides precomputed results for all possible input values, achieving high throughput with low hardware complexity, which makes it widely applicable. With growing demands for improved image quality, LUTs are increasingly employed for various image processing tasks such as gamma correction, light distribution, gain adjustment, and super-resolution [1-3]. Meanwhile, as color depth continues to increase, the required LUT size also grows, emphasizing the importance of compressing the LUT ROM size. Methodologies for minimizing LUT ROM size are effective not only for conventional CMOS-based display systems but even more so in systems such as system-on-panel (SOP) implementations [4-6], where low processing yield inherently limits the amount of realizable hardware. Accordingly, various approaches have been proposed to reduce LUT ROM size [7-10]. For instance, one common method stores only a portion of the computed results in the LUT and handles the remaining computations through interpolation [7, 8]. However, this approach requires additional interpolators, which introduce computation errors and reduce accuracy. Another conventional technique divides LUT data into a quantized ROM and an error pattern ROM [9, 10]. Nevertheless, this method is applicable only in the limited cases where the target function exhibits repetitive error patterns, and it suffers from increased hardware complexity due to the use of multiple multiplexers.

In this paper, methodologies are presented to reduce the LUT ROM count without incurring large computation errors, along with a LUT design approach that enables efficient image processing with image memory compression in display systems. Section 2 introduces a technique to reduce the LUT ROM size by partitioning the input pixels. In Section 3, a methodology is presented to further reduce the size of the error LUT, which compensates for the errors that arise with nonlinear functions. Section 4 describes a scheme that integrates the proposed LUT with compressed image memory, while Section 5 details the hardware implementation of the proposed design. In Section 6, the operation of the proposed LUT architecture is verified using actual images. Finally, Section 7 discusses a comprehensive performance evaluation.

II. LUT COMPRESSION VIA INPUT BIT PARTITIONING

Let the LUT ROM count be denoted by LUTC, which is determined by the bit widths of the input and output data as in Eq. (1):

(1)
$LUTC = 2^{\text{input bit width}} \times \text{output bit width}.$

As shown in Eq. (1), LUTC increases exponentially with the input bit width. Thus, reducing the input bit width can significantly decrease the LUT ROM count. One way to realize this input bit reduction is input pixel partitioning; the resulting structure will be called a partitioned bit LUT. Let $x$ be an input pixel with a depth of $m$ bits. If partitioning is done at the $n$th bit position ($0 \le n \le m - 1$), the upper bit group is denoted by $x_{upper}$ and the lower bit group by $x_{lower}$. These are defined in Eqs. (2) and (3), and the partition structure of the pixel $x$ is shown in Fig. 1.

(2)
$x_{upper} = x [m-1 : n],$
(3)
$x_{lower} = x[n-1 : 0].$

Fig. 1. Pixel partitioning.

../../Resources/ieie/JSTS.2026.26.1.36/fig1.png

As shown in Fig. 1, through partitioning, $x_{upper}$ is positioned $n$ bits higher than $x_{lower}$. Therefore, their relationship to the original pixel $x$ can be expressed as Eq. (4). When the characteristic function representing the input-output relation of the original (non-partitioned) LUT is denoted by $f(x)$, and if $f(x)$ is linear, it satisfies Eq. (5).

(4)
$x = x_{upper} \times 2^n + x_{lower},$
(5)
$f(x) = f(x_{upper} \times 2^n + x_{lower}) \\ = f(x_{upper} \times 2^n) + f(x_{lower}).$
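The partitioning in Eqs. (2)-(4) can be sketched as follows; the function names and the 8-bit example value are illustrative, not part of the paper's design.

```python
# Sketch of pixel partitioning per Eqs. (2)-(4): split an m-bit pixel x
# at bit position n into x_upper = x[m-1:n] and x_lower = x[n-1:0].

def partition(x: int, m: int, n: int) -> tuple[int, int]:
    """Return (x_upper, x_lower) for an m-bit pixel split at bit n."""
    x_upper = x >> n                   # upper m - n bits
    x_lower = x & ((1 << n) - 1)       # lower n bits
    return x_upper, x_lower

def reconstruct(x_upper: int, x_lower: int, n: int) -> int:
    """Eq. (4): x = x_upper * 2^n + x_lower."""
    return (x_upper << n) | x_lower

x = 0b10110111                         # 183, an 8-bit example pixel
xu, xl = partition(x, m=8, n=4)        # xu = 0b1011, xl = 0b0111
assert reconstruct(xu, xl, n=4) == x
```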

Since the partitioned pixels $x_{upper}$ and $x_{lower}$ can be processed by sharing a single partitioned bit LUT, let the unified input form and the characteristic function of the partitioned bit LUT be denoted by $x_{partitioned}$ and $f_{pbl} (x_{partitioned})$, respectively. Since the partitioned bit LUT receives the upper and lower partitioned bits as inputs, the characteristic function can be defined as either $f(x_{upper} \times 2^n)$, the computation result of the upper bit group, or $f(x_{lower})$, that of the lower bit group. If the characteristic function is given by $f(x_{upper} \times 2^n)$, then Eq. (6) holds. By substituting Eq. (6) into Eq. (5), the relationship between the partitioned bit LUT characteristic function and the original LUT characteristic function $f(x)$ can be expressed as Eq. (7).

(6)
$f_{pbl}(x_{partitioned}) = f(x_{partitioned} \times 2^n),$
(7)
$f(x) = f_{pbl}(x_{upper}) + f_{pbl}(x_{lower})/2^n.$

From Eq. (7), if the LUT function $f(x)$ is linear, the same operation as the original LUT can be achieved using a partitioned bit LUT whose characteristic function represents the computation of the upper bit group. Likewise, if the characteristic function of the partitioned bit LUT is given by $f(x_{lower})$, the relation can be described by Eqs. (8) and (9). From Eq. (9), it follows that when $f(x)$ is linear, the same computation as the original LUT can be performed using a partitioned bit LUT whose characteristic function represents the computation of the lower bit group. Thus, as shown by Eqs. (7) and (9), when the LUT characteristic function is linear, any partitioned bit LUT, regardless of whether it uses the upper or lower partitioned pixel function as its characteristic function, can execute the computation without error.

(8)
$f_{pbl}(x_{partitioned}) = f(x_{partitioned}),$
(9)
$f(x) = f_{pbl}(x_{upper}) \times 2^n + f_{pbl}(x_{lower}).$
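The error-free property of Eq. (9) can be checked numerically. In this sketch the linear characteristic function $f(x) = 3x$ is a hypothetical example; a single 16-entry partitioned bit LUT (per Eq. (8)) reproduces the full 256-entry original LUT exactly.

```python
# Check Eq. (9): for a linear f, one shared 4-bit-input partitioned bit
# LUT gives the same result as the original 8-bit-input LUT, with no error.

m, n = 8, 4
f = lambda x: 3 * x                       # hypothetical linear function

# Partitioned bit LUT with characteristic function f(x_lower), Eq. (8):
# a single table of 2^n entries shared by both bit groups.
pbl = [f(v) for v in range(1 << n)]

for x in range(1 << m):
    xu, xl = x >> n, x & ((1 << n) - 1)
    # Eq. (9): f(x) = f_pbl(x_upper) * 2^n + f_pbl(x_lower)
    assert pbl[xu] * (1 << n) + pbl[xl] == f(x)
```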

When implementing a LUT in hardware, if the bit width of the input pixel differs from the LUT's input bit width, the operation cannot be computed directly. When the bit widths of $x_{upper}$ and $x_{lower}$ differ, both are made equal to that of the wider partitioned pixel: a fractional portion is appended below the least significant bits of the narrower partitioned pixel to match the bit width. If the fractional value is zero, the expanded partitioned pixel retains the same value as the original, thereby ensuring that the computation results remain unchanged. In this manner, the bit widths of the partitioned bits are made identical; let these adjusted partitioned pixels be denoted as $X_{upper}$ and $X_{lower}$. The relationship between $x_{upper}$, $x_{lower}$, and their aligned forms is shown in detail in Fig. 2.

Fig. 2. Input bit width expansion. (a) in case of $n < \frac{m}{2}$, (b) in case of $n > \frac{m}{2}$.

../../Resources/ieie/JSTS.2026.26.1.36/fig2.png

With the pixel partitioning as shown in Fig. 1, the bit widths of $x_{upper}$ and $x_{lower}$ become $m - n$ bits and $n$ bits, respectively, resulting in a difference of $|m - 2n|$ bits. As shown in Fig. 2(a), when $x_{lower}$ has a smaller bit width, a fractional extension $ext_l$ is padded below the least significant bits by $m - 2n$ bits to create $X_{lower}$, with the same bit width as $X_{upper}$. In this case, the relationship between $x_{lower}$ and the extended $X_{lower}$ is given by Eq. (10), and since $X_{upper}$ is not extended, it remains identical to $x_{upper}$.

(10)
$X_{lower} = x_{lower} + ext_l \times 2^{2n-m}.$

Similarly, as shown in Fig. 2(b), when $x_{upper}$ has a smaller bit width, a fractional extension $ext_u$ is padded below the least significant bits by $2n - m$ bits to generate $X_{upper}$ with the same bit width as $X_{lower}$. The relationship between $x_{upper}$ and the extended $X_{upper}$ is expressed by Eq. (11).

(11)
$X_{upper} = x_{upper} + ext_u \times 2^{m-2n}.$

If the value of the padded fraction is zero for either of the extended partitioned pixels, the relationships $X_{upper} = x_{upper}$ and $X_{lower} = x_{lower}$ hold. Furthermore, as shown in Fig. 2, $X_{upper}$ is always positioned $(m - n)$ bits higher than $X_{lower}$, satisfying the relationship in Eq. (12). Eqs. (7) and (9), which show that computations using a partitioned bit LUT can be performed without error, can be restated for the extended partitioned pixels $X_{upper}$ and $X_{lower}$ as Eqs. (13) and (14), respectively.

(12)
$x = X_{upper} \times 2^{m-n} + X_{lower},$
(13)
$f(x) = f_{pbl} (X_{upper}) + f_{pbl} (X_{lower})/2^{m-n},$
(14)
$f(x) = f_{pbl} (X_{upper}) \times 2^{m-n} + f_{pbl}(X_{lower}).$

Eqs. (13) and (14) correspond to the cases where the characteristic function of the partitioned bit LUT represents the computation of $X_{upper}$ and of $X_{lower}$, respectively. By utilizing the bit width extension method proposed in Fig. 2, the hardware limitations of the LUT can be effectively overcome. Moreover, by the relationships in Eqs. (7) and (9), linear computations can be performed without error using the partitioned bit LUT.
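A fixed-point view of the extension in Eq. (10) can be sketched as follows (case $n < m/2$, with $m = 8$, $n = 3$ as an illustrative choice); the function name is hypothetical.

```python
# Eq. (10), fixed-point view: the narrower lower bit group is padded
# below its LSB with a fractional extension ext_l of m - 2n bits.
# With ext_l = 0 the extended pixel keeps the original value.

m, n = 8, 3              # x_upper: 5 bits, x_lower: 3 bits
frac = m - 2 * n         # 2 fractional bits appended to x_lower

def extend_lower(x_lower: int, ext_l: int) -> float:
    # X_lower = x_lower + ext_l * 2^(2n - m)
    return x_lower + ext_l / (1 << frac)

assert extend_lower(5, 0) == 5        # zero padding preserves the value
assert extend_lower(5, 0b10) == 5.5   # nonzero ext_l adds only a fraction
```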

III. ERROR REDUCTION FOR MINIMIZING ERROR LUT ROM SIZE

As previously discussed in Eqs. (13) and (14), error-free computation using the proposed partitioned bit LUT is limited to cases where the LUT characteristic function is linear. When computing nonlinear functions with a partitioned bit LUT, errors inevitably occur, so a separate error LUT is introduced for compensation. Additionally, a method is proposed to reduce the magnitude of the errors, thereby decreasing the size of the error LUT. Let the error range be the difference between the minimum and maximum errors obtained by comparing the results of the partitioned bit LUT with those of the original full LUT. The size of the error LUT is then given by Eq. (15).

(15)
$LUTC_{error} = 2^{\text{input bit width}} \times \lceil \log_2(\text{error range}) \rceil.$

Since the error LUT must compensate for errors corresponding to all input bit values, the input bit width in Eq. (15) is identical to that of the non-partitioned original pixel. The error range represents the range of errors to be compensated, which tends to increase as the deviation from a linear LUT characteristic function grows, i.e., the more nonlinear the function is, the larger the error range becomes. Therefore, to reduce the size of the error LUT, it is essential to minimize the error range that arises from computations using the partitioned bit LUT. This can be achieved by appropriately selecting the characteristic function of the partitioned bit LUT. As shown in Eqs. (13) and (14), when the LUT characteristic function is linear, either the computation result of the upper or lower bit group can be used as the characteristic function of the partitioned bit LUT without introducing any errors. In contrast, for a nonlinear function, the computation result that minimizes the error range must be selected as the characteristic function. Since $X_{upper}$ corresponds to the MSBs of pixel $x$, it is more representative of the entire input range than $X_{lower}$, and the computation result $f(X_{upper} \times 2^{m-n})$ tends to approximate $f(x)$ more closely than $f(X_{lower})$. Therefore, the computation result of $X_{upper}$ is used as the characteristic function of the partitioned bit LUT for both $X_{upper}$ and $X_{lower}$. Let $f_{error}(x)$ denote the characteristic function of the error LUT. Then, the equation for computing a nonlinear function with error compensation is obtained by extending Eq. (13) to Eq. (16) as follows:

(16)
$f(x) = f_{pbl} (X_{upper}) + f_{pbl}(X_{lower})/2^{m-n} + f_{error}(x).$

When the partitioned bit LUT uses the computation result of the upper bit group as its characteristic function, it cannot accurately compute the lower bit group. As a result, the term $f_{pbl}(X_{lower})/2^{m-n}$ in Eq. (16) introduces error. To mitigate this error, a scale factor is proposed for the lower bit group output. Denoting the proposed scale factor as $i$, the lower bit group can be computed as $f_{pbl}(X_{lower})/2^i$. By optimizing the value of $i$, the error can be minimized, thereby improving the overall accuracy of the partitioned bit LUT for nonlinear function computation.

Fig. 3 shows the differences in computation results when employing a scale factor in the computation of a partitioned bit LUT. When the same scale factor is applied uniformly to all input values, as in Eq. (16), it is referred to as a common scale factor, corresponding to Figs. 3(a) and 3(b). Because the same scaling by $1/2^3$ is applied across all input values, the dynamic range of the lower bit group computation, $f_{pbl}(X_{lower})/2^3$, remains constant throughout the entire range. As shown in Fig. 3(a), when the LUT characteristic function is linear, the step size of the upper bit group result, $f_{pbl}(X_{upper})$ is identical for all input values and matches the lower bit group computation results (i.e. $f_{pbl}(X_{lower})/2^3$) in dynamic range, which introduces no error in their summation. However, when dealing with a nonlinear function, as shown in Fig. 3(b), the step size of $f_{pbl}(X_{upper})$ varies depending on the inputs, and it does not match the dynamic range of $f_{pbl}(X_{lower})/2^3$. Consequently, computation of a nonlinear function in the same manner as the linear function case leads to errors. A closer examination of Fig. 3(b) reveals that large errors occur when the step size of $f_{pbl}(X_{upper})$ differs from the dynamic range of $f_{pbl}(X_{lower})$ for the same pixel input value. Ultimately, by adjusting the scale factor so that the step size of $f_{pbl}(X_{upper})$ matches the dynamic range of $f_{pbl}(X_{lower})/2^i$, the error can be significantly reduced, as shown in Fig. 3(c). This approach, where different scale factors are applied to $f_{pbl}(X_{lower})$ depending on the step size of $f_{pbl}(X_{upper})$, will be referred to as a stepwise scale factor. Unlike the case of a common scale factor, where the same scaling is used irrespective of input, the partitioned bit LUT using the stepwise scale factor can achieve computation results that are highly consistent with the original LUT curve, as shown in Fig. 3(c). 
The stepwise scale factor is used to match the dynamic range of the computation result $f_{pbl}(X_{lower})$ to the step size of $f_{pbl}(X_{upper})$. Conversely, given the step size of $f_{pbl}(X_{upper})$ and the dynamic range of $f_{pbl} (X_{partitioned})$, an optimal scale factor can be determined for each pixel input. Let $step\_size(f_{pbl}(X_{upper}))$ denote the step size of $f_{pbl}(X_{upper})$, which can be expressed as Eq. (17). Defining $DR(g(x))$, as the dynamic range of a function $g(x)$, the optimal stepwise scale factor, $SF_{stepwise}$, is given by Eq. (18).

Fig. 3. Characteristic function curve of partitioned bit LUT according to the scale factor. (a) Linear function using the common scale factor, (b) Non-linear function using the common scale factor, (c) Non-linear function using the stepwise scale factor.

../../Resources/ieie/JSTS.2026.26.1.36/fig3.png
(17)
$step\_size(f_{pbl}(X_{upper})) \\ = | f_{pbl}(X_{upper}) - f_{pbl}(X_{upper} + 1)|,$
(18)
$SF_{stepwise} = \log_2 \frac{DR(f_{pbl} (X_{partitioned}))}{step\_size (f_{pbl} (X_{upper}))}.$

For hardware implementation efficiency, the scale factor is converted into integer form as shown in Eq. (19). Although about 10% of the error range reduction is sacrificed in this case, substantial savings are obtained in the scale factor LUT size, and simple shifters can be used for efficient implementation, making the overall tradeoff highly advantageous. The stepwise scale factor, denoted by $k$, is stored in the scale factor LUT, which takes $X_{upper}$ as its input. Since $m$-bit data can be shifted from 0 to $m$ bits, the size of the scale factor LUT is given by Eq. (20).

(19)
$k = \lceil SF_{stepwise} \rceil,$
(20)
$LUTC_{sf} = 2^{\text{upper bit width}} \times \lceil \log_2(m+1) \rceil.$

This structure, where a stepwise scale factor is applied to the partitioned bit LUT, is hereafter referred to as a step-size-scaled partitioned bit LUT. The computation process of the step-size-scaled partitioned bit LUT, incorporating both the stepwise scale factor and error compensation via the error LUT, is expressed in Eq. (21).

(21)
$f(x) = f_{pbl} (X_{upper}) + f_{pbl} (X_{lower})/2^k + f_{error}(x).$

As shown in Eq. (21), the proposed computation scheme requires a single partitioned bit LUT, an error LUT, a shifter, and a scale factor LUT. Examining the computation process of the step-size-scaled partitioned bit LUT shown in Fig. 4, the input pixel is first partitioned into $X_{upper}$ and $X_{lower}$ which are then processed sequentially through the same partitioned bit LUT. The scale factor LUT takes the upper bit group $X_{upper}$ as input and outputs the stepwise scale factor $k$. The shifter then shifts the computation result $f_{pbl}(X_{lower})$ by $k$ bits, generating the scaled lower bit computation result $f_{pbl}(X_{lower})/2^k$. Finally, the computation results from the error LUT, $f_{error}(x)$, the upper bit group, $f_{pbl}(X_{upper})$, and the scaled lower bit group, $f_{pbl}(X_{lower})/2^k$, are summed to produce the final output.
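The full computation of Eq. (21) can be sketched end to end. The nonlinear function $f(x) = \mathrm{round}(255\sqrt{x/255})$ and the half-pixel split ($m = 8$, $n = 4$) are illustrative assumptions; table names (`pbl`, `sf_lut`, `err_lut`) are likewise hypothetical. By construction, adding the error LUT output reproduces $f(x)$ exactly for every input.

```python
import math

# End-to-end sketch of Eq. (21) for a hypothetical nonlinear function
# with half-pixel partitioning (m = 8, n = 4).

m, n = 8, 4
W = m - n                                   # upper/lower group width
f = lambda x: round(255 * math.sqrt(x / 255))

# Partitioned bit LUT: characteristic function of the upper bit group,
# f_pbl(v) = f(v * 2^(m-n)), per Eq. (6) with the extended pixels.
pbl = [f(v << W) for v in range(1 << W)]

# Stepwise scale factor LUT (Eqs. (17)-(19)), indexed by X_upper.
DR = max(pbl) - min(pbl)                    # dynamic range of f_pbl
def k_of(u: int) -> int:
    step = abs(pbl[min(u + 1, (1 << W) - 1)] - pbl[u]) or 1
    return max(0, math.ceil(math.log2(DR / step)))
sf_lut = [k_of(u) for u in range(1 << W)]

# Intermediate result: f_pbl(X_upper) + f_pbl(X_lower) >> k.
def intermediate(x: int) -> int:
    xu, xl = x >> n, x & ((1 << n) - 1)
    return pbl[xu] + (pbl[xl] >> sf_lut[xu])

# Error LUT compensates the residual of the intermediate result.
err_lut = [f(x) - intermediate(x) for x in range(1 << m)]

# Eq. (21): exact reconstruction for every 8-bit input.
assert all(intermediate(x) + err_lut[x] == f(x) for x in range(1 << m))
```

The smaller the entries of `err_lut`, the fewer bits the error LUT needs, which is exactly what the stepwise scale factor targets.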

Fig. 4. Processes of step size scaled partitioned bit LUT.

../../Resources/ieie/JSTS.2026.26.1.36/fig4.png

The total LUT ROM count of the proposed step-size-scaled partitioned bit LUT is equal to the sum of the partitioned bit LUT count given in Eq. (1), the error LUT count in Eq. (15), and the scale factor LUT count in Eq. (20), expressed together in Eq. (22). Here, the bit width of the input pixel $x$ is $m$ bits, and the bit width of the partitioned pixel is denoted by $W_p$.

(22)
$LUTC_{total} = LUTC_{pbl} + LUTC_{error} + LUTC_{sf} \\ = 2^{W_p} \times m + 2^m \times \lceil \log_2 (\text{error range}) \rceil \\ + 2^{W_p} \times \lceil \log_2 (m+1) \rceil.$

The values of $LUTC_{pbl}$ and $LUTC_{sf}$ depend on the bit width of the partitioned pixel, whereas $LUTC_{error}$ increases as the error range widens or as the function deviates further from a linear form. Based on these characteristics, the average LUT ROM reduction rate is evaluated using six different arithmetic functions introduced in [9], under various partitioned pixel bit widths.

As shown in Figs. 5(a) and 5(b), when the partitioned bit LUT and error LUT are used, the total LUT ROM count becomes $LUTC_{pbl} + LUTC_{error}$. The proposed scheme shown in Fig. 5(c) additionally uses a scale factor LUT, resulting in a total LUT ROM count as given by Eq. (22). As shown in Fig. 5, the stepwise scale factor achieves the highest reduction rate. This is because the use of the stepwise scale factor significantly reduces the size of the error LUT. Moreover, when the bit width of $X_{upper}$ is set to 5 bits during pixel partitioning, the reduction rate reaches its maximum at 35.8%, demonstrating the most efficient configuration.

Fig. 5. LUT size according to $X_{upper}$ bit width.

../../Resources/ieie/JSTS.2026.26.1.36/fig5.png

IV. STEP SIZE SCALED PARTITIONED BIT LUT WITH COMPRESSED IMAGE MEMORY

In many image processing applications, image data compression algorithms are used to prevent increased costs arising from the growing amount of image memory. Following this trend, the authors of this paper have previously proposed an image memory compression algorithm [11]. The Partial Pixel Compression (PPC) image memory compression algorithm [11] performs compression based on the similarity between neighboring pixels.

Fig. 6 shows a simplified illustration of the compression and decompression processes of PPC. During compression, each pixel is partitioned into an upper bit group and a lower bit group, and the upper bit groups of adjacent pixels are compared. If the upper bit groups are identical, only the lower bit group is stored; otherwise, the entire pixel is stored. A flag is appended to the least significant bit to distinguish these cases.
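The compression and decompression steps above can be sketched as follows. This is a simplified per-pixel variant of the PPC idea, assuming 8-bit pixels split into 4-bit upper/lower groups; the exact packing in [11] (two lower groups per compressed word) is not reproduced here.

```python
# Simplified PPC sketch: if a pixel's upper bit group matches the
# previous pixel's, store only the lower group with flag = 1;
# otherwise store the full pixel with flag = 0 (flag in the LSB).

N = 4  # lower bit group width (half of an 8-bit pixel)

def compress(pixels):
    out, prev_upper = [], None
    for p in pixels:
        upper, lower = p >> N, p & ((1 << N) - 1)
        if upper == prev_upper:
            out.append((lower << 1) | 1)   # flag = 1: lower group only
        else:
            out.append((p << 1) | 0)       # flag = 0: full pixel
        prev_upper = upper
    return out

def decompress(words):
    out, prev_upper = [], None
    for w in words:
        if w & 1:                          # flag = 1: reuse stored upper
            out.append((prev_upper << N) | (w >> 1))
        else:                              # flag = 0: full pixel stored
            p = w >> 1
            out.append(p)
            prev_upper = p >> N
    return out

data = [0x42, 0x47, 0x4A, 0x91]            # neighbors often share uppers
assert decompress(compress(data)) == data  # lossless round trip
```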

Fig. 6. Partial pixel compression.

../../Resources/ieie/JSTS.2026.26.1.36/fig6.png

In this way, image memory compression algorithms partition pixels into upper and lower bit groups for comparison and storage. Similarly, the proposed LUT ROM reduction method also uses pixel partitioning. Utilizing these algorithmic similarities, the pixel partitioning and the image memory compression can be seamlessly integrated to achieve synergistic reductions in both memory access rates and efficient LUT-based computation.

Fig. 7 conceptually illustrates the seamless integration of the compressed image memory and the step-size-scaled partitioned bit LUT based on the common pixel partitioning. Importantly, since the compressed pixel is inherently composed of a combination of partitioned pixels, the computations can be performed using the partitioned bit LUT without any additional decompression process required in other works.

Fig. 7. Interoperation methodologies of compressed image memory and step size scaled partitioned bit LUT.

../../Resources/ieie/JSTS.2026.26.1.36/fig7.png

Fig. 8 illustrates the integration of the compressed image memory with the proposed step-size-scaled partitioned bit LUT. To enable this integration, a partitioned bit aligner is added, while the remaining structure and computational process remain identical to those described in Fig. 4. The partitioned bit aligner generates the partial pixels $X_{upper}$ and $X_{lower}$ for the computation, as well as the full pixel $x$, from the compressed pixels. It identifies the data type using the flag located at the least significant bit of the compressed pixel. A flag of 0 indicates that the compressed pixel contains a full pixel; thus, both the upper and lower bit groups are output and processed through the partitioned bit LUT. Conversely, if the flag is 1, the compressed pixel consists only of the lower bits of two pixels that share the same upper bits with a previously processed pixel. In this case, the computation leverages the upper bit result previously stored in the upper bit result latch. This approach not only ensures that the computation yields the same result as for an uncompressed pixel but also reduces redundant computations of identical upper bits across neighboring pixels.

Fig. 8. Step size scaled partitioned bit LUT with image memory.

../../Resources/ieie/JSTS.2026.26.1.36/fig8.png

Fig. 9 shows the compression ratios of the PPC algorithm according to different pixel partitioning schemes. This allows for selecting an appropriate bit width for the partitioned pixel when integrating with the step-size-scaled partitioned bit LUT. As shown in Fig. 9, partitioning the upper bit group into 4 bits (equivalent to half of the pixel) yields the highest compression ratio.

Fig. 9. Compression rate according to the upper bit group size.

../../Resources/ieie/JSTS.2026.26.1.36/fig9.png

However, as shown in Fig. 5, the step-size-scaled partitioned bit LUT achieves its highest reduction rate when the upper bit group is 5 bits. Generally, since the volume of image memory is substantially larger than the LUT ROM size, partitioning into half pixels, which offers a high memory compression ratio, is more effective. In this case, the reduction rate of the step-size-scaled partitioned bit LUT still remains relatively high at 35.4%.

Moreover, because the bit widths of $x_{upper}$ and $x_{lower}$ are identical using half-pixel partitioning, there is no need for additional bit width expansion as described in Fig. 2, simplifying the overall computation.

V. HARDWARE ARCHITECTURE

Based on the integration methods discussed in Section 4, Fig. 10 shows the hardware architecture of the compressed image memory combined with the proposed step-size-scaled partitioned bit LUT. The compressed pixel output from the compressed image memory is fed directly into the partitioned bit aligner without passing through a decompression unit. At this stage, the partitioned bit aligner identifies the data type by examining the flag located at the least significant bit of the compressed pixel. If the flag is 0, indicating that partitioned pixel 1 corresponds to the upper bit group, it is first stored in the upper latch. Then, partitioned pixels 1 and 2 are output sequentially for the sequential partitioned bit LUT computations. The computation result of the upper bit group is stored in the upper bit result latch so that it can subsequently be combined with the computation result of the lower bit group. Conversely, if the flag is 1, both partitioned pixels 1 and 2 belong to the lower bit group; thus, they are sequentially processed by the partitioned bit aligner without being stored in the upper latch. Meanwhile, the error LUT receives the combined full pixel of the upper and lower bit groups and computes the error value. The scale factor LUT takes the upper bit group as input and produces a scale factor, which is then used by the shifter to divide the computation result of the lower bit group. Finally, the outputs of the partitioned bit LUT and the error LUT are added to generate the final output.

Fig. 10. Architecture of step size scaled partitioned bit LUT.

../../Resources/ieie/JSTS.2026.26.1.36/fig10.png

VI. SIMULATION RESULTS

Based on the hardware architecture shown in Fig. 10, RTL-level simulations were performed using actual image data. The results from these simulations are subsequently evaluated in Section 7 through eye tests and PSNR measurements.

Fig. 11 shows the simulation results of the step-size-scaled partitioned bit LUT operation using compressed pixels processed by the PPC compression algorithm. The compressed pixel, read from the compressed image memory, is fed into the partitioned bit LUT through cdata_in. Since the least significant bit, flag, is 0, it can be identified as an uncompressed type, with the upper and lower bit groups being 6 and 15, respectively. Accordingly, the upper bit group value, 6, is first input to the partitioned bit LUT through part_bit, generating an output of 47 on pb_out, which is then stored in up_latch. Next, the lower bit group value, 15, is input through part_bit, generating an output of 255 on pb_out. The scale factor LUT outputs 3 through sft_out, which makes the shifter shift the data by 3 bits, resulting in an output of 31 on shter_out. The error LUT receives 111 through full_bit, and the error value, 3, is output on err_out. The upper bit group result, 47, the scaled lower bit group result, 31, and the error value, 3, are then summed to generate a final computation result of 81 on final_data. These operations are repeated to compute each compressed pixel. The above simulations verify the partitioned bit LUT operation using compressed pixels and the seamless integration of the compressed image memory with the partitioned bit LUT.
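The final output reported on the waveform can be cross-checked against Eq. (21) using the signal values above:

```python
# Arithmetic check of the Fig. 11 waveform values against Eq. (21).

pb_upper = 47     # pb_out for the upper bit group input 6
pb_lower = 255    # pb_out for the lower bit group input 15
k = 3             # sft_out from the scale factor LUT
err = 3           # err_out from the error LUT

# f(x) = f_pbl(X_upper) + f_pbl(X_lower) / 2^k + f_error(x)
final_data = pb_upper + (pb_lower >> k) + err
assert final_data == 81   # matches final_data in the waveform
```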

Fig. 11. Logic simulation waveforms.

../../Resources/ieie/JSTS.2026.26.1.36/fig11.png

VII. PERFORMANCE EVALUATIONS

In this section, the three cases of the proposed partitioned bit LUT in Fig. 5 are first evaluated to determine the optimal case. Subsequently, the selected case is compared with REP-ROM [9] in terms of chip area, throughput, and power consumption.

Fig. 12 shows comparisons of the intermediate results, final results, and LUT counts for three partitioned bit LUT configurations: without a scale factor, with a common scale factor, and with the proposed stepwise scale factor. The comparison is performed for histogram equalization on the Kodak image dataset, which is widely accepted and commonly used in image processing performance evaluations. Fig. 12(a) presents the results without applying a scale factor. In this case, large errors are generated during the intermediate computation stage prior to compensation by the error LUT, as discussed in Section 3, which consequently leads to a significant increase in the size of the error LUT. Fig. 12(b) shows the results with a common scale factor. Although the error range is reduced compared to Fig. 12(a), a substantial deviation from the original curve is still observed. In contrast, Fig. 12(c) shows the results with the proposed stepwise scale factor. Even before applying error compensation, the intermediate computation closely follows the original curve, resulting in a significant reduction of $LUTC_{error}$ as defined in Eq. (15).

Fig. 12. LUT reduction amounts according to scale factor methods for Kodak image dataset, (a) No scale factor, (b) Common scale factor, (c) Stepwise scale factor.

../../Resources/ieie/JSTS.2026.26.1.36/fig12.png

Among the three partitioned-bit LUT approaches, the proposed method with the stepwise scale factor achieves the highest reduction in LUTC.

Moreover, even when considering only the intermediate computation results without the error compensation, the proposed method achieves an average PSNR of approximately 32.8 dB, which is substantially higher than that of the no scale factor case in Fig. 12(a) with 22.0 dB and the common scale factor case in Fig. 12(b) with 27.2 dB. These results show that the proposed stepwise scale factor method provides superior image quality with smaller error ranges. Overall, the proposed method achieves an average LUTC reduction rate of 23.4% as shown in Fig. 12.

The proposed partitioned bit LUT architecture with the stepwise scale factor is further evaluated by comparing its performance with REP-ROM [9] in terms of chip area. In addition to the LUT compression ratio of the proposed method, the chip area required for the additional hardware should be considered relative to the non-compressed case. The total chip area ratio is evaluated in Eq. (23) as the sum of two terms: the compressed LUT ROM area relative to the original LUT ROM area, and the chip area overhead of the required additional hardware components relative to the original LUT ROM area.

(23)
$\text{Total chip area ratio} (\%) \\ = \left( \frac{LUTC \cdot A_{ROMcell}}{2^A \cdot L \cdot A_{ROMcell}} \right) \times 100\% \\ + \left( \frac{A_{add}}{2^A \cdot L \cdot A_{ROMcell}} \right) \times 100\%.$

In Eq. (23), $A$ denotes the input bit width and $L$ denotes the output bit width. $A_{ROMcell}$ represents the typical chip area of a single ROM cell, and $A_{add}$ represents the typical chip area of the additional hardware required in addition to the ROM. $2^A \cdot L \cdot A_{ROMcell}$ therefore represents the typical chip area of an original LUT with an $A$-bit input and $L$-bit output. Since 8-bit pixels are commonly used in image processing applications, the case of an 8-bit input and 8-bit output is assumed. In this case, the original LUT contains $2^8 \times 8 = 2,048$ ROM cells. Based on Eq. (23), the total chip area reduction ratio of the proposed method relative to REP-ROM is evaluated using Eq. (24).

(24)
$\text{Total chip area reduction ratio} (\%) \\ = \left( \frac{\text{Total chip area reduction}_{proposed}}{\text{Total chip area reduction}_{REP\text{-}ROM}} \right) \times 100\%.$

To evaluate the additional hardware chip area required by both methods relative to the chip area of a ROM cell, the typical chip areas of a ROM cell and the additional required hardware components of each method are summarized in Table 1.

Table 1. Typical chip area of the modules used.

Category | Area [$\lambda^2$]
ROM cell [12] | $16 \times 8$
SRAM cell [12] | $26 \times 45$
Barrel shifter cell [12] | $16 \times 18$
1-bit adder [13] | 1224
4:1 MUX [14] | $100 \times 128$

In addition to the partitioned bit LUT, error LUT, and scale factor LUT required for the computation, the proposed step-size-scaled partitioned bit LUT also needs modules such as a partitioned bit aligner, adder, shifter, and upper bit result latch, as shown in Fig. 10.

The largest area within the partitioned bit aligner is occupied by the upper bit latch, which can be implemented using SRAM. Likewise, the upper bit result latch can also be realized with SRAM. Given an input pixel with a bit width of $m$, the upper bit latch requires storage of $0.5m$ bits, while the upper bit result latch requires $m$ bits, amounting to a total storage requirement of $1.5m$ bits. When evaluated in terms of LUT ROM, this is expressed as Eq. (25).

(25)
$\frac{SRAM \;size}{ROM\; cell\; size} = \frac{26\lambda \times 45\lambda \times 1.5 \times m}{16\lambda \times 8\lambda}.$

If the bit width is 8, the increased latch size corresponds to approximately 110 LUT ROM cells. Next, as shown in Fig. 10, the proposed shifter performs only right shift operations, allowing it to be implemented at half the size of a conventional barrel shifter. Consequently, the size of the shifter, when converted to an equivalent LUT ROM count, can be expressed as Eq. (26).

(26)
$\frac{Shifter \;size}{ROM \;cell \;size} = \frac{16\lambda \times 18\lambda \times m^2}{16\lambda \times 8\lambda} \times \frac{1}{2}.$

According to Eq. (26), the shifter required to process pixels with an 8-bit width corresponds to approximately 72 LUT ROM cells. An 8-bit adder, based on Table 1, occupies a layout area of approximately 9,792$\lambda^2$, which is equivalent to about 77 LUT ROM cells. Therefore, the additional hardware introduced by the proposed step-size-scaled partitioned bit LUT to handle an 8-bit pixel width amounts to a total of 72 + 110 + 77 = 259 LUT ROM cells. Given that a conventional 8-bit LUT contains approximately 2,048 ROM cells, this represents an increase of only about 12.6%, which is relatively small compared to the 35.4% reduction in LUT ROM size that can be achieved by the proposed LUT compression algorithm. Moreover, since this estimate is based on module sizes evaluated using standard cells, the additional hardware overhead is expected to be even smaller when custom-designed circuits are used.
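
As a numerical cross-check, the conversions of Eqs. (25) and (26) and the adder estimate above can be reproduced with a short script (an illustrative sketch, not the authors' code; rounding up to whole ROM cells is an assumption made here to match the quoted figures):

```python
import math

# Sketch: converts the module areas of Table 1 into equivalent LUT ROM cell
# counts for the proposed method's extra hardware, following Eqs. (25)-(26),
# for an m = 8 bit pixel width. Areas are in units of lambda^2.
ROM_CELL = 16 * 8          # ROM cell area, Table 1
SRAM_CELL = 26 * 45        # SRAM cell area
SHIFTER_CELL = 16 * 18     # barrel shifter cell area
ADDER_1BIT = 1224          # 1-bit adder area

m = 8  # pixel bit width

# Eq. (25): the two latches store 1.5m SRAM bits in total.
latch_cells = math.ceil(SRAM_CELL * 1.5 * m / ROM_CELL)
# Eq. (26): right-shift-only shifter, half of a full barrel shifter.
shifter_cells = math.ceil(SHIFTER_CELL * m * m / ROM_CELL / 2)
# m-bit adder built from 1-bit adders.
adder_cells = math.ceil(ADDER_1BIT * m / ROM_CELL)

total_cells = latch_cells + shifter_cells + adder_cells
print(latch_cells, shifter_cells, adder_cells, total_cells)  # 110 72 77 259
print(f"{100 * total_cells / (2 ** 8 * 8):.1f}% of the 2,048-cell LUT")
```

Running the sketch reproduces the 110-, 72-, and 77-cell estimates and the roughly 12.6% overall overhead.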

In REP-ROM [9], the tone curve is partitioned into $2^{(A-R)}$ regions based on the selected value of $R$. Accordingly, a $2^{(A-R)} : 1$ MUX is required for the region selection, and an additional adder is required for error compensation [9]. For example, when $R = 7$, $R = 6$, and $R = 5$, the required region-selection MUX becomes 2:1, 4:1, and 8:1 MUX, respectively. For the case of $R = 6$, as shown in Table 1, the area of a 4:1 MUX is $100\lambda \times 128\lambda$ [14]. Thus, the equivalent additional hardware chip area relative to the LUT ROM cell chip area can be expressed as Eq. (27).

(27)
$\frac{4:1 \;MUX \;size}{ROM \;cell\; size} = \frac{100\lambda \times 128\lambda \times m}{16\lambda \times 8\lambda}.$

According to Eq. (27), the total additional hardware in REP-ROM [9], including both the required 4:1 MUX and the adder, is equivalent to 77 + 800 = 877 ROM cells for an 8-bit pixel width. Compared to the conventional 8-bit LUT, which requires 2,048 ROM cells, this corresponds to 42.8%, representing a substantial additional hardware overhead. A 2:1 MUX for the case of $R = 7$ contains one-third the number of transistors of a 4:1 MUX [15], while an 8:1 MUX for the case of $R = 5$ can be implemented by combining two 4:1 MUXs and one 2:1 MUX [16]. Hence, the areas of 2:1, 4:1, and 8:1 MUXs can be estimated in a ratio of 1:3:7. Accordingly, the additional hardware overhead of REP-ROM [9], including the adder, is estimated as 16.7%, 42.8%, and 94.9% for $R = 7$, $6$, and $5$, respectively.
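
The three overhead figures can be reproduced with a short sketch (illustrative, not the paper's code; flooring the scaled MUX cell count and rounding the adder up to whole cells are assumptions made here to match the quoted values):

```python
import math

# Sketch: estimates the REP-ROM [9] additional hardware overhead for an
# 8-bit pixel width, using the 1:3:7 area ratio of 2:1, 4:1, and 8:1 MUXs
# and the module areas of Table 1 (in lambda^2).
ROM_CELL = 16 * 8            # ROM cell area
MUX41 = 100 * 128            # 4:1 MUX area
ADDER_1BIT = 1224            # 1-bit adder area

def rep_rom_overhead_pct(R, m=8, A=8):
    mux_ratio = {7: 1 / 3, 6: 1.0, 5: 7 / 3}[R]   # area relative to 4:1 MUX
    # Eq. (27) scaled by the MUX area ratio, floored to whole cells.
    mux_cells = math.floor(MUX41 * m / ROM_CELL * mux_ratio)
    adder_cells = math.ceil(ADDER_1BIT * m / ROM_CELL)
    lut_cells = 2 ** A * m                        # 2,048 for the 8-bit LUT
    return 100 * (mux_cells + adder_cells) / lut_cells

for R in (7, 6, 5):
    print(R, f"{rep_rom_overhead_pct(R):.1f}%")   # 16.7%, 42.8%, 94.9%
```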

For the total area estimation in Eq. (23), the LUTC of the proposed method is compared with that of REP-ROM [9]. The LUTC of REP-ROM [9] is given by $\sum_{i=1}^{A-R} (M_i + I) \cdot 2^S + 2^I \cdot (2^{R-S} \cdot E)$, where $A$ denotes the input bit width, and $R$ and $S$ are parameters chosen to minimize the resulting LUTC. $I$ (the number of bits required to index the distinct error patterns) and $E$ (the dynamic range of an error pattern) are derived from the error patterns corresponding to the given tone curve and the selected $R$ and $S$. For irregular tone curves in which the error shape varies across the segments spaced by $2^{(R-S)}$, the stored error patterns cannot be reused, so a new error pattern must be stored in the error LUT for each segment. This increases $I$, the number of bits required to represent the error pattern index, which in turn increases the overall LUTC. In contrast, the LUTC of the proposed method in Eq. (22) is independent of error pattern reuse. Hence, the compression efficiency of REP-ROM [9] decreases as the error patterns of the segments become more irregular, whereas the proposed method remains unaffected. Moreover, regardless of such irregularities, as discussed in Eq. (6), the size of the error LUT required in the proposed method may decrease as the tone curve approaches a globally linear shape, even if local nonlinearities remain.

Such irregular tone curves are widely used in modern image processing applications that require image-specific optimization. Applications such as low-light image enhancement [17], HDR image tone mapping [18], and AI-based content-adaptive image enhancement [19] demand optimal nonlinear tone curves that vary with the scenes and image content. In these applications, simple monotonic functions or predefined canonical curves are often insufficient to achieve the desired image quality and visual representation. For generality, in addition to the regular curves used for evaluation in REP-ROM [9], evaluations are also conducted on irregular curves in which the error patterns differ for every segment. The curves used in the evaluation are shown in Fig. 13. Figs. 13(a)-13(c) represent $\sin(x)$, $x^{1/2}$, and $x^2$, respectively, which are used in REP-ROM [9]. Figs. 13(d)-13(f) represent irregularly shaped tone curves based on real datasets used for low-light image enhancement [17] as an example.

Fig. 13. Regular and irregular tone curves.


Based on the evaluations, the LUTC reduction, additional hardware reduction, and total chip area reduction in Eq. (23) were evaluated for the proposed method and REP-ROM [9] under both regular and irregular tone curves. The results, together with the total chip area reduction ratio between the proposed method and REP-ROM [9] defined in Eq. (24), are summarized in Table 2.

Table 2. Comparison of chip area between REP-ROM and the proposed method.

                          REP-ROM [9]                Proposed                   Total chip area
Curve      (R, S, I, E)   LUTC    Add. HW   Total    LUTC    Add. HW   Total    reduction ratio (%)
sin(x)     (6, 4, 4, 3)   43      42.8      85.8     58.6    12.6      71.2     82.9
x^(1/2)    (6, 4, 4, 5)   50      42.8      92.8     58.6    12.6      71.2     76.7
x^2        (6, 4, 4, 3)   41.4    42.8      84.2     58.6    12.6      71.2     84.5
curve A    (6, 5, 3, 5)   63.3    42.8      106.1    71.1    12.6      83.7     78.8
curve B    (6, 5, 3, 3)   64.9    42.8      107.7    71.1    12.6      83.7     77.7
curve C    (6, 4, 5, 4)   61.8    42.8      104.6    58.6    12.6      71.2     68.0
Average    -              54.1    42.8      96.9     62.8    12.6      75.4     77.8

(LUTC, Add. HW, and Total denote the LUTC reduction, additional hardware reduction, and total chip area reduction, respectively, all in %.)

Since an 8-bit pixel width is commonly used in image processing applications, all evaluations are performed assuming an 8-bit pixel depth. For REP-ROM [9], smaller values of $R$ generally lead to better LUTC reduction [9]. However, as discussed above, when $R = 5$, an 8:1 MUX is required, which increases the additional hardware overhead to as high as 94.9%. Therefore, comparisons were conducted only for cases with $R > 5$, where the hardware overhead remains within a reasonable range.

As shown in Table 2, for all six tone curves in Fig. 13, REP-ROM [9] achieves its minimum LUTC when $R = 6$. In this configuration, REP-ROM [9] exhibits an additional hardware reduction of 42.8% for all six tone curves, as described in Table 1, whereas the proposed method requires only 12.6%. As summarized in Table 2, the proposed method achieves an average total chip area reduction of 75.4% for both regular and irregular curves, corresponding to a 24.9% reduction compared to the uncompressed LUT size. In contrast, REP-ROM [9] shows an average total chip area reduction of 96.9% for the same set of curves. According to the total chip area reduction ratio defined in Eq. (24), the proposed method requires 77.8% of the total chip area of REP-ROM [9] on average, as summarized in Table 2.
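
The summary arithmetic of Table 2 can be cross-checked from the per-curve totals (a small illustrative sketch using only the values quoted in the table):

```python
# Sketch: per-curve total chip areas (% of the uncompressed LUT area) from
# Table 2, as (REP-ROM total, proposed total) pairs.
totals = {
    "sin(x)":  (85.8, 71.2),
    "x^(1/2)": (92.8, 71.2),
    "x^2":     (84.2, 71.2),
    "curve A": (106.1, 83.7),
    "curve B": (107.7, 83.7),
    "curve C": (104.6, 71.2),
}

# Per-curve ratio of Eq. (24): proposed / REP-ROM x 100.
for name, (rep, prop) in totals.items():
    print(name, f"{100 * prop / rep:.1f}%")

rep_avg = sum(r for r, _ in totals.values()) / len(totals)    # ~96.9
prop_avg = sum(p for _, p in totals.values()) / len(totals)   # ~75.4
print(f"{100 * prop_avg / rep_avg:.1f}%")                     # ~77.8
```

Note that the reported 77.8% average is the ratio of the averaged totals (75.4 / 96.9), which differs slightly from the mean of the per-curve ratios.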

The power consumption and throughput of the proposed method are also evaluated and compared with those of REP-ROM [9]. For a fair comparison, REP-ROM [9] is configured with $R = 7$, which minimizes power consumption by minimizing the hardware overhead of the required MUXs. Both designs are synthesized for the Ultra96v2 Evaluation Board using Vivado 2022.1 under identical synthesis options. Power consumption is evaluated using post-synthesis power analysis, and throughput is evaluated based on the maximum operating frequency obtained from post-synthesis simulations. To provide a more comprehensive evaluation of energy efficiency, the power-delay product (PDP) is also analyzed. According to the evaluation results, the proposed method achieves a total power consumption of 0.28 W, whereas REP-ROM [9] shows a total power consumption of 7.53 W. The proposed method yields PDP$_{proposed}$ = 0.28 W $\times \left( \frac{1}{4.9 \text{ Gpixels/s}} \right)$ = $5.71 \times 10^{-11}$ J, while REP-ROM [9] yields PDP$_{REP-ROM}$ = 7.53 W $\times \left( \frac{1}{8.33 \text{ Gpixels/s}} \right)$ = $9.04 \times 10^{-10}$ J. These results show that the proposed method achieves a 15.8 times lower PDP than REP-ROM [9], demonstrating superior energy efficiency. Furthermore, REP-ROM [9] requires finding suitable parameters $R$ and $S$ that minimize the LUT count for each tone curve. In contrast, the proposed method does not require such hardware reconfiguration, making it more suitable for programmable LUT implementations.
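
The PDP figures follow directly from the measured power and throughput values (an illustrative sketch of the arithmetic only):

```python
# Sketch: recomputes the power-delay product (PDP) figures quoted above.
def pdp_joules(power_w, throughput_pixels_per_s):
    """PDP = power x delay per pixel (J per pixel)."""
    return power_w / throughput_pixels_per_s

pdp_proposed = pdp_joules(0.28, 4.9e9)   # ~5.71e-11 J
pdp_rep_rom = pdp_joules(7.53, 8.33e9)   # ~9.04e-10 J
print(pdp_proposed, pdp_rep_rom)
print(f"{pdp_rep_rom / pdp_proposed:.1f}x")  # ~15.8x lower PDP
```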

The proposed step-size-scaled partitioned bit LUT not only reduces the LUT ROM size but also decreases the computational load through the integration with the compressed image memory. The number of computations performed by the proposed LUT equals the number of compressed pixels, allowing a reduction in computation that can be evaluated by comparing the number of original pixels to the number of compressed pixels. According to [11], the PPC algorithm achieves an average compression ratio of 22%, with each compressed pixel having a size of 9 bits. Letting $PC_{org}$ denote the total number of original pixels and $PC_{comp}$ the total number of compressed pixels after compression, their relationship can be expressed as Eq. (28).

(28)
$\frac{8PC_{org} - 9PC_{comp}}{8PC_{org}} \times 100 = 22\%.$

Rearranging Eq. (28) yields $PC_{comp} = 0.69PC_{org}$, indicating that the number of compressed pixels to be stored is reduced by 31% compared to the original. Consequently, this compression process, along with the step-size-scaled partitioned bit LUT, results in a 31% reduction in image memory access rate as well as a 31% reduction in the computation load within the proposed LUT.
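
The rearrangement of Eq. (28) can be verified with a one-liner (an illustrative sketch; the 8-bit original, 9-bit compressed, and 22% figures are taken from [11] as quoted above):

```python
# Sketch: solves Eq. (28) for the compressed pixel count.
# (8 * PC_org - 9 * PC_comp) / (8 * PC_org) = 0.22
def compressed_pixel_ratio(orig_bits=8, comp_bits=9, compression=0.22):
    """Returns PC_comp / PC_org implied by the bit-level compression ratio."""
    return orig_bits * (1 - compression) / comp_bits

ratio = compressed_pixel_ratio()
print(f"{ratio:.2f}")                 # ~0.69
print(f"{100 * (1 - ratio):.0f}%")    # ~31% fewer pixels to store and process
```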

VIII. CONCLUSION

LUT compression algorithms and hardware design methodologies are proposed, which can be seamlessly integrated with image memory data compression. By LUT input pixel partitioning, the LUT ROM count can be reduced. By the proposed stepwise scale factor, which adjusts the dynamic range of the lower bit group computations based on the step size of each upper bit group, the errors can be compensated with a minimal amount of error LUT. Simulation results using actual images show that the proposed LUT compression approach achieves approximately a 23.4% reduction in LUT ROM count without compromising image quality for the Kodak image dataset.

For both arithmetic functions and irregular tone curves, the total chip area reduction ratio, accounting for both the LUTC and the additional hardware reduction, is 77.8% on average relative to REP-ROM [9]. Furthermore, by integrating the proposed LUT with compressed image memory, the computational load and memory access rate can be reduced by 31% compared with the uncompressed image memory case.

ACKNOWLEDGMENTS

This work was supported by the 2025 Hongik University Innovation Support Program Fund. We also express our gratitude to Donghun Cho, who made great contributions to the experimental result analysis and manuscript organization.

REFERENCES

[1] De Greef P., Hulze H. G., 2007, 39.1: Adaptive dimming and boosting backlight for LCD-TV systems, Proc. of SID Symposium Digest of Technical Papers, Vol. 38, No. 1, pp. 1332-1335.
[2] Zhao X., Hu Z., Chang L., 2024, USR-LUT: A high-efficient universal super resolution accelerator with lookup table, Proc. of the IEEE International Symposium on Circuits and Systems, pp. 1-5.
[3] Yan L., Cengiz K., Sharma A., 2021, An improved image processing algorithm for automatic defect inspection in TFT-LCD TCON, Nonlinear Engineering, Vol. 10, No. 1, pp. 293-303.
[4] Matsueda Y., Park Y.-S., Choi S.-M., Chung H.-K., 2005, Trend of system on panel, Proc. of the Korean Information Display Society Annual Meeting, pp. 841-844.
[5] He L.-F., Zhou Y., Wang Z., Wang J., Guo Y., Wang H., 2017, Light-erasable embedded charge-trapping memory based on MoS2 for system-on-panel applications, Applied Physics Letters, Vol. 111, No. 22.
[6] Zhang K., Peng D., Lau K. M., Liu Z., 2017, Fully integrated active matrix programmable UV and blue micro-LED display system-on-panel (SoP), Journal of the Society for Information Display, Vol. 25, No. 4, pp. 240-248.
[7] Han D., 2004, Real-time color gamut mapping method for digital TV display quality enhancement, IEEE Transactions on Consumer Electronics, Vol. 50, No. 2, pp. 691-698.
[8] Han W., Cho H., Kim D., Kim J.-Y., 2025, SAL-PIM: A subarray-level processing-in-memory architecture with LUT-based linear interpolation for transformer-based text generation, IEEE Transactions on Computers, Vol. 74, No. 9, pp. 2909-2922.
[9] Yang B.-D., Kim L.-S., 2004, An error pattern ROM compression method for continuous data, Proc. of the IEEE International Symposium on Circuits and Systems, Vol. 2, pp. II-845-II-848.
[10] Xie Y., Chen J., Xu X., Wang Y., Yang H., 2020, A twofold lookup table architecture for efficient approximation of activation functions, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 28, No. 12, pp. 2540-2550.
[11] Lim S., Cho D., You J., 2024, Image data compression and decompression unit considering brightness sensitivity, Journal of Semiconductor Technology and Science, Vol. 24, No. 5, pp. 459-472.
[12] Weste N. H. E., Harris D., 2015, CMOS VLSI Design: A Circuits and Systems Perspective.
[13] Vesterbacka M., 1999, A 14-transistor CMOS full adder with full voltage-swing nodes, Proc. of the IEEE Workshop on Signal Processing Systems, pp. 713-722.
[14] Transistor count, Wikipedia.
[15] Multiplexers in digital logic, GeeksforGeeks.
[16] Heinbuch D. V., 1988, CMOS3 Cell Library.
[17] Chen W., Wang W., Yang W., Liu J., 2018, Deep retinex decomposition for low-light enhancement, Proc. of the British Machine Vision Conference, pp. 1-12.
[18] Duan J., Bian Z., Li S., Zhang X., 2010, Tone-mapping high dynamic range images by novel histogram adjustment, Pattern Recognition, Vol. 43, No. 5, pp. 1847-1862.
[19] Zhao L., Abdelhamed A., Brown M. S., 2022, Learning tone curves for local image enhancement, IEEE Access, Vol. 10, pp. 60099-60111.
Jeonghun Lee

Jeonghun Lee received the B.S. degree in electronic and electrical engineering from Hongik University, Seoul, Korea, in 2023, and has been pursuing an M.S. degree in electronic and electrical engineering at Hongik University since 2023. His research interests include image signal processing and integrated display system design.

Sanghyun Lim

Sanghyun Lim received the B.S. and M.S. degrees in electronic and electrical engineering from Hongik University, Seoul, Korea, in 2007 and 2009, respectively. In 2010, he joined Samsung Display Inc., where he has been working in the area of display electronics development. His current interests include touch sensors, stylus driving technologies, sensor structures, and high dynamic range image processing.

Seongjo Youn

Seongjo Youn received the B.S. degree in electronic and electrical engineering from Hongik University, Seoul, Korea, in 2025, and has been pursuing an M.S. degree in electronic and electrical engineering at Hongik University since 2025. His research interests include image signal processing and system semiconductor design.

Jaehee You

Jaehee You received his B.S. degree in electronics engineering from Seoul National University, Seoul, Korea, in 1985. He received his M.S. and Ph.D. degrees in electrical engineering from Cornell University, Ithaca, NY, in 1987 and 1990, respectively. In 1990, he joined Texas Instruments, Dallas, TX, as a Member of Technical Staff. In 1991, he joined the faculty of the School of Electrical Engineering, Hongik University, Seoul, Korea, where he now supervises the Semiconductor Integrated System Laboratory. He is currently Vice President of the Institute of Semiconductor Engineers and has served as an Executive Director of the Drive Technology and System research group of the Korean Information Display Society. He received the KEIT Chairman Award for Excellence from the Korean Ministry of Strategy and Finance in 2011. He has worked as a technical consultant for many companies, such as Samsung Semiconductor, SK Hynix, Global Communication Technologies, P&K, Penta Micro, Nexia device, and Primenet. His current research interests include integrated system design for display image signal processing and perceptual image quality enhancement.