I. INTRODUCTION
Image processing generally requires high computational complexity and high throughput,
and often demands real-time execution. A lookup table (LUT), implemented with ROM, is
highly advantageous because it provides precomputed results for all possible input
values, achieving high throughput with low hardware complexity, which makes it widely applicable.
With growing demands for improved image quality, LUTs are increasingly employed for
various image processing tasks such as gamma correction, light distribution, gain adjustment,
and super-resolution [1-3]. Meanwhile, as color depth continues to increase, the required LUT size also grows,
emphasizing the importance of compressing LUT ROM size. Methodologies for minimizing
LUT ROM size are effective not only for conventional CMOS-based display systems but
even more so for systems such as system-on-panel (SOP) implementations
[4-6], where low processing yield inherently limits the amount of realizable hardware.
Accordingly, various approaches have been proposed to reduce LUT ROM size [7-10].
For instance, one common method stores only a portion of the computed results in
the LUT and handles the remaining computations through interpolation [7,8]. However,
this approach requires additional interpolators, which introduce computation
errors and reduce accuracy. Another conventional technique divides the LUT data into
a quantized ROM and an error pattern ROM [9,10]. Nevertheless, this method is
applicable only in the limited cases where the target function exhibits repetitive
error patterns, and it suffers from increased hardware complexity
due to the use of multiple multiplexers.
In this paper, methodologies are presented to reduce the LUT ROM count without incurring
large computation errors, along with a LUT design approach that enables efficient
image processing with image memory compression in display systems. Section 2 introduces
a technique to reduce the LUT ROM size by partitioning the input pixels. Section
3 presents a methodology to further reduce the size of the error LUT, which compensates
for the errors arising with nonlinear functions. Section 4 describes
a scheme that integrates the proposed LUT with compressed image memory, while Section
5 details the hardware implementation of the proposed design. In Section 6, the operation
of the proposed LUT architecture is verified using actual images. Finally, Section
7 discusses a comprehensive performance evaluation.
II. LUT COMPRESSION VIA INPUT BIT PARTITIONING
Let the LUT ROM count be denoted by LUTC, which is determined by the bit widths of
the input and output data as in Eq. (1):
As shown in Eq. (1), LUTC increases exponentially with the input bit width. Thus, reducing the input
bit width can significantly decrease the LUT ROM count. One way to realize this
input bit reduction is input pixel partitioning; the resulting structure will be called a partitioned
bit LUT. Let $x$ be an input pixel with a depth of $m$ bits. If partitioning is done
at the $n$th bit position ($0 \le n \le m - 1$), the upper bit group is denoted by
$x_{upper}$ and the lower bit group by $x_{lower}$. These are defined in Eqs. (2) and (3), and the partition structure of the pixel $x$ is shown in Fig. 1.
Fig. 1. Pixel partitioning.
As shown in Fig. 1, through partitioning, $x_{upper}$ is positioned $n$ bits higher than $x_{lower}$.
Therefore, their relationship to the original pixel $x$ can be expressed as Eq. (4). When the characteristic function representing the input-output relation of the original
(non-partitioned) LUT is denoted by $f(x)$, and if $f(x)$ is linear, it satisfies
Eq. (5).
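The partitioning relations of Eqs. (2)-(5) can be sketched in Python; the linear characteristic function $f(x) = 3x$ here is an illustrative assumption, not a function from the paper:

```python
# Sketch of Eqs. (2)-(5): partition an m-bit pixel x at bit position n and
# check that a linear characteristic function distributes over the parts.

def partition(x, m, n):
    """Split x into its upper (m - n) bits and lower n bits."""
    x_upper = x >> n                    # Eq. (2): upper bit group
    x_lower = x & ((1 << n) - 1)        # Eq. (3): lower bit group
    return x_upper, x_lower

def f(x):                               # hypothetical linear LUT function
    return 3 * x

m, n = 8, 4
for x in range(1 << m):
    x_upper, x_lower = partition(x, m, n)
    assert x == x_upper * 2 ** n + x_lower            # Eq. (4)
    assert f(x) == f(x_upper * 2 ** n) + f(x_lower)   # Eq. (5): linearity
```

For a nonlinear $f$, the second assertion fails; that residual is exactly what the error LUT of Section 3 compensates.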
Since the partitioned pixels $x_{upper}$ and $x_{lower}$ can be processed by sharing
a single partitioned bit LUT, let the unified input form and the characteristic
function of the partitioned bit LUT be denoted by $x_{partitioned}$ and $f_{pbl}(x_{partitioned})$,
respectively. Because the partitioned bit LUT receives the upper and lower partitioned
bits as inputs, its characteristic function can be defined either as $f(x_{upper} \times 2^n)$,
the computation result of the upper bit group, or as $f(x_{lower})$, that of the
lower bit group. If the characteristic function is given by $f(x_{upper} \times 2^n)$,
then Eq. (6) holds. By substituting Eq. (6) into Eq. (5), the relationship between the partitioned bit LUT characteristic function and the
original LUT characteristic function $f(x)$ can be expressed as Eq. (7).
From Eq. (7), if the LUT function $f(x)$ is linear, the same operation as the original LUT can
be achieved using a partitioned bit LUT whose characteristic function represents the
computation of the upper bit group. Likewise, if the characteristic function of the
partitioned bit LUT is given by $f(x_{lower})$, the relationship can be described by
Eqs. (8) and (9). From Eq. (9), it follows that when $f(x)$ is linear, the same computation as the original LUT
can be performed using a partitioned bit LUT whose characteristic function represents
the computation of the lower bit group. Thus, as shown by Eqs. (7) and (9), when the LUT characteristic function is linear, any partitioned bit LUT, regardless
of whether it uses the upper or lower partitioned pixel function as its characteristic
function, can execute the computation without error.
When implementing a LUT in hardware, if the bit width of the input pixel differs from
the LUT’s input bit width, the operation cannot be directly computed. When the bit
widths of $x_{upper}$ and $x_{lower}$ differ, both of their bit widths are made equal
to that of the wider partitioned pixel. A fractional portion is appended at the position
of the least significant bits of the narrower partitioned pixel to match the bit width.
If the fractional value is zero, the expanded partitioned pixel retains the same value
as the original, thereby ensuring that the computation results remain unchanged. In
this manner, the bit widths of the partitioned bits are made identical; these adjusted
partitioned pixels are denoted as $X_{upper}$ and $X_{lower}$. The relationship
between $x_{upper}$, $x_{lower}$, and their aligned forms is shown in detail in Fig. 2.
Fig. 2. Input bit width expansion. (a) in case of $n < \frac{m}{2}$, (b) in case of
$n > \frac{m}{2}$.
With the pixel partitioning as shown in Fig. 1, the bit widths of $x_{upper}$ and $x_{lower}$ become $m - n$ bits and $n$ bits,
respectively, resulting in a difference of $|m - 2n|$ bits. As shown in Fig. 2(a), when $x_{lower}$ has a smaller bit width, a fractional extension $ext_l$ is padded
below the least significant bits by $m - 2n$ bits to create $X_{lower}$, with the
same bit width as $X_{upper}$. In this case, the relationship between $x_{lower}$
and the extended $X_{lower}$ is given by Eq. (10), and since $X_{upper}$ is not extended, it remains identical to $x_{upper}$.
Similarly, as shown in Fig. 2(b), when $x_{upper}$ has a smaller bit width, a fractional extension $ext_u$ is padded
below the least significant bits by $2n - m$ bits to generate $X_{upper}$ with the
same bit width as $X_{lower}$. The relationship between $x_{upper}$ and the extended
$X_{upper}$ is expressed by Eq. (11).
If the value of the padded fraction is zero for either of the extended partitioned
pixels, the relationships $X_{upper} = x_{upper}$ and $X_{lower} = x_{lower}$ hold.
Furthermore, as shown in Fig. 2, $X_{upper}$ is always positioned $(m - n)$ bits higher than $X_{lower}$, satisfying
the relationship in Eq. (12). Eqs. (7) and (9), which show that computations using a partitioned bit LUT can be performed
without error, can be restated for the extended partitioned pixels $X_{upper}$ and $X_{lower}$
as Eqs. (13) and (14), respectively.
Eqs. (13) and (14) correspond to the cases where the characteristic function of the
partitioned bit LUT represents the computation of $X_{upper}$ and of $X_{lower}$, respectively. By utilizing the bit width
extension method proposed in Fig. 2, the hardware limitations of the LUT can be effectively overcome. Moreover, by the
relationships in Eqs. (7) and (9), the linear computations can be performed without error using the partitioned bit
LUT.
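The alignment of Fig. 2 can be sketched as follows; the zero fractional extensions of Eqs. (10) and (11) are assumed, and the closing assertion corresponds to the positional relation of Eq. (12):

```python
# Sketch of the bit-width alignment in Fig. 2: the narrower partitioned
# pixel is zero-extended at its LSB side so both groups share the width of
# the wider one, and the aligned pair still recombines to the pixel value.

def align(x, m, n):
    """Partition an m-bit pixel at bit n and zero-extend the narrower group."""
    x_upper = x >> n
    x_lower = x & ((1 << n) - 1)
    if n <= m - n:                        # Fig. 2(a): lower group is narrower
        X_upper = x_upper                 # already m - n bits wide
        X_lower = x_lower << (m - 2 * n)  # Eq. (10) with zero fraction ext_l
    else:                                 # Fig. 2(b): upper group is narrower
        X_upper = x_upper << (2 * n - m)  # Eq. (11) with zero fraction ext_u
        X_lower = x_lower                 # already n bits wide
    return X_upper, X_lower

m = 8
for n in (3, 5):
    for x in range(1 << m):
        X_upper, X_lower = align(x, m, n)
        # Eq. (12): X_upper sits m - n bits above X_lower; with zero padding
        # the aligned pair equals x scaled by the lower-group extension.
        assert X_upper * 2 ** (m - n) + X_lower == x << max(0, m - 2 * n)
```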
III. ERROR REDUCTION FOR MINIMIZING ERROR LUT ROM SIZE
As previously discussed in Eqs. (13) and (14), the error-free computation using the proposed partitioned bit LUT is limited to
cases where the LUT characteristic function is linear. When computing nonlinear functions
with a partitioned bit LUT, errors inevitably occur. A separate error LUT is introduced
for compensation. Additionally, a method is proposed to reduce the magnitude of the
errors, thereby decreasing the size of the error LUT. Let the error range be the difference
between the minimum and maximum errors obtained by comparing the results of the partitioned
bit LUT with those of the original full LUT. The size of the error LUT is then given
by Eq. (15).
Since the error LUT must compensate for errors corresponding to all input bit values,
the input bit width in Eq. (15) is identical to that of the non-partitioned original pixel. The error range represents
the range of errors to be compensated, which tends to increase as the characteristic
function deviates further from linearity; i.e., the more nonlinear the function,
the larger the error range becomes. Therefore, to reduce the size of the error
LUT, it is essential to minimize the error range that arises from computations using
the partitioned bit LUT. This can be achieved by appropriately selecting the characteristic
function of the partitioned bit LUT to reduce errors. As shown in Eqs. (13) and (14), when the LUT characteristic function is linear, either the computation result of
the upper or lower bit group can be used as the characteristic function of the partitioned
bit LUT without introducing any errors. In contrast, for a nonlinear function, it
is required to select the computation result that minimizes the error range of the
characteristic function. Since $X_{upper}$ corresponds to the MSBs of pixel $x$, it
has more representative value across the entire input range compared to $X_{lower}$,
and the computation result $f(X_{upper} \times 2^{m-n})$ tends to approximate $f(x)$
more closely than $f(X_{lower})$, the computation result of $X_{upper}$ is used for
characteristic function for the partitioned-bit LUT for both $X_{upper}$ and $X_{lower}$.
Let $f_{error}(x)$ denote the characteristic function of the error LUT. Then, the
equation for computing a nonlinear function with error compensation can be shown by
extending Eq. (13) to Eq. (16) as follows:
When the partitioned bit LUT uses the computation result of the upper bit group as
its characteristic function, it cannot accurately compute the lower bit group. As
a result, $f_{pbl}(X_{lower})/2^{m-n}$ in Eq. (16) introduces error. To mitigate this error, a scale factor is proposed for the lower
bit group output. Denoting the proposed scale factor as $i$, the lower bit group can
be computed as $f_{pbl}(X_{lower})/2^i$. By optimizing the value of $i$, the error
can be minimized, thereby improving the overall accuracy of the partitioned bit LUT
for nonlinear function computation.
Fig. 3 shows the differences in computation results when employing a scale factor in the
computation of a partitioned bit LUT. When the same scale factor is applied uniformly
to all input values, as in Eq. (16), it is referred to as a common scale factor, corresponding to Figs. 3(a) and 3(b). Because the same scaling by $1/2^3$ is applied across all input values, the dynamic
range of the lower bit group computation, $f_{pbl}(X_{lower})/2^3$, remains constant
throughout the entire range. As shown in Fig. 3(a), when the LUT characteristic function is linear, the step size of the upper bit group
result, $f_{pbl}(X_{upper})$, is identical for all input values and matches the lower
bit group computation results (i.e., $f_{pbl}(X_{lower})/2^3$) in dynamic range, which
introduces no error in their summation. However, when dealing with a nonlinear function,
as shown in Fig. 3(b), the step size of $f_{pbl}(X_{upper})$ varies depending on the inputs, and it does
not match the dynamic range of $f_{pbl}(X_{lower})/2^3$. Consequently, computation
of a nonlinear function in the same manner as the linear function case leads to errors.
A closer examination of Fig. 3(b) reveals that large errors occur when the step size of $f_{pbl}(X_{upper})$ differs
from the dynamic range of $f_{pbl}(X_{lower})$ for the same pixel input value. Ultimately,
by adjusting the scale factor so that the step size of $f_{pbl}(X_{upper})$ matches
the dynamic range of $f_{pbl}(X_{lower})/2^i$, the error can be significantly reduced,
as shown in Fig. 3(c). This approach, where different scale factors are applied to $f_{pbl}(X_{lower})$
depending on the step size of $f_{pbl}(X_{upper})$, will be referred to as a stepwise
scale factor. Unlike the case of a common scale factor, where the same scaling is
used irrespective of input, the partitioned bit LUT using the stepwise scale factor
can achieve computation results that are highly consistent with the original LUT curve,
as shown in Fig. 3(c). The stepwise scale factor is used to match the dynamic range of the computation
result $f_{pbl}(X_{lower})$ to the step size of $f_{pbl}(X_{upper})$. Conversely,
given the step size of $f_{pbl}(X_{upper})$ and the dynamic range of $f_{pbl} (X_{partitioned})$,
an optimal scale factor can be determined for each pixel input. Let $step\_size(f_{pbl}(X_{upper}))$
denote the step size of $f_{pbl}(X_{upper})$, which can be expressed as Eq. (17). Defining $DR(g(x))$ as the dynamic range of a function $g(x)$, the optimal stepwise
scale factor, $SF_{stepwise}$, is given by Eq. (18).
Fig. 3. Characteristic function curve of partitioned bit LUT according to the scale
factor. (a) Linear function using the common scale factor, (b) Non-linear function
using the common scale factor, (c) Non-linear function using the stepwise scale factor.
For hardware implementation efficiency, the scale factor is converted into an integer
form as shown in Eq. (19). Although about 10% of the error range reduction is sacrificed
in this case, substantial savings are obtained in the scale factor LUT size, and simple
shifters can be used for an efficient implementation, which makes the overall tradeoff highly advantageous.
The stepwise scale factor, denoted by $k$, is stored in the scale factor LUT, which
takes $X_{upper}$ as its input. Since $m$-bit data can be shifted by 0 to $m$
bits, the size of the scale factor LUT is given by Eq. (20).
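Eqs. (17)-(19) can be sketched under stated assumptions: $m = 8$, a half-pixel split ($n = 4$), and a hypothetical gamma-2.2 tone curve standing in for the nonlinear LUT function:

```python
import math

# Sketch of the stepwise scale factor: the integer shift k matches the
# dynamic range of the lower-group LUT output to one step of the
# upper-group output. The tone curve f is an illustrative assumption.

m, n = 8, 4

def f(x):                                   # assumed nonlinear tone curve
    return round(255 * (x / 255) ** 2.2)

def f_pbl(xp):                              # partitioned-bit LUT contents:
    return f(xp << (m - n))                 # f(X_upper * 2^(m-n))

def stepwise_k(x_upper):
    step = f_pbl(x_upper + 1) - f_pbl(x_upper)    # Eq. (17): step size
    dr = f_pbl((1 << (m - n)) - 1) - f_pbl(0)     # dynamic range of LUT output
    sf = dr / max(step, 1)                        # Eq. (18): optimal scale factor
    return max(0, round(math.log2(sf)))           # Eq. (19): integer shift k

# Near the flat dark end of the curve k is large; near the steep bright
# end a small shift suffices.
k_lut = [stepwise_k(u) for u in range(1 << (m - n))]
```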
This structure, where a stepwise scale factor is applied to the partitioned bit LUT,
is hereafter referred to as a step-size-scaled partitioned bit LUT. The computation
process of the step-size-scaled partitioned bit LUT, incorporating both the stepwise
scale factor and error compensation via the error LUT, is expressed in Eq. (21).
As shown in Eq. (21), the proposed computation scheme requires a single partitioned bit LUT, an error
LUT, a shifter, and a scale factor LUT. Examining the computation process of the step-size-scaled
partitioned bit LUT shown in Fig. 4, the input pixel is first partitioned into $X_{upper}$ and $X_{lower}$ which are
then processed sequentially through the same partitioned bit LUT. The scale factor
LUT takes the upper bit group $X_{upper}$ as input and outputs the stepwise scale
factor $k$. The shifter then shifts the computation result $f_{pbl}(X_{lower})$ by
$k$ bits, generating the scaled lower bit computation result $f_{pbl}(X_{lower})/2^k$.
Finally, the computation results from the error LUT, $f_{error}(x)$, the upper bit
group, $f_{pbl}(X_{upper})$, and the scaled lower bit group, $f_{pbl}(X_{lower})/2^k$,
are summed to produce the final output.
Fig. 4. Processes of step size scaled partitioned bit LUT.
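The Fig. 4 data path of Eq. (21) can be sketched end to end; $m = 8$, $n = 4$, and the gamma-2.2 tone curve are assumptions, and the shift per upper group is chosen here by brute force rather than by Eqs. (17)-(19):

```python
# Sketch of the Eq. (21) data path: one shared partitioned-bit LUT, a
# scale-factor LUT, a shifter, and an error LUT holding the residual so
# that the sum reproduces the original curve exactly.

m, n = 8, 4
f = lambda x: round(255 * (x / 255) ** 2.2)          # assumed tone curve

pbl = [f(u << n) for u in range(1 << n)]             # shared partitioned-bit LUT

def best_k(xu):                                      # per-X_upper shift that
    return min(range(m + 1), key=lambda k: max(      # minimizes the error range
        abs(f((xu << n) | xl) - pbl[xu] - (pbl[xl] >> k))
        for xl in range(1 << n)))

sf_lut = [best_k(u) for u in range(1 << n)]          # scale-factor LUT
err_lut = [f(x) - pbl[x >> n] - (pbl[x & ((1 << n) - 1)] >> sf_lut[x >> n])
           for x in range(1 << m)]                   # error LUT residuals

def lut_eval(x):                                     # Eq. (21)
    xu, xl = x >> n, x & ((1 << n) - 1)
    return pbl[xu] + (pbl[xl] >> sf_lut[xu]) + err_lut[x]

assert all(lut_eval(x) == f(x) for x in range(1 << m))
# The error range (max - min of err_lut) sets the error-LUT word width
# of Eq. (15); the stepwise shift keeps that range small.
```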
The total LUT ROM count of the proposed step-size-scaled partitioned bit LUT is the
sum of the partitioned bit LUT count given in Eq. (1), the error LUT count in Eq. (15),
and the scale factor LUT count in Eq. (20), as expressed in Eq. (22). Here, the bit
width of the input pixel $x$ is $m$ bits, and the bit width of the
partitioned pixel is denoted by $W_p$.
The values of $LUTC_{pbl}$ and $LUTC_{sf}$ depend on the bit width of the partitioned
pixel, whereas $LUTC_{error}$ increases as the error range widens or as the function
deviates further from a linear form. Based on these characteristics, the average LUT
ROM reduction rate is evaluated using six different arithmetic functions introduced
in [9] under various partitioned pixel bit widths.
As shown in Figs. 5(a) and 5(b), when the partitioned bit LUT and error LUT are used, the total LUT ROM count becomes
$LUTC_{pbl} + LUTC_{error}$. The proposed scheme shown in Fig. 5(c) additionally uses a scale factor LUT, resulting in a total LUT ROM count as given
by Eq. (22). As shown in Fig. 5, the stepwise scale factor achieves the highest reduction rate. This is because the
use of the stepwise scale factor significantly reduces the size of the error LUT.
Moreover, when the bit width of $X_{upper}$ is set to 5 bits during pixel partitioning,
the reduction rate reaches its maximum at 35.8%, demonstrating the most efficient
configuration.
Fig. 5. LUT size according to $X_{upper}$ bit width.
IV. STEP SIZE SCALED PARTITIONED BIT LUT WITH COMPRESSED IMAGE MEMORY
In many image processing applications, image data compression algorithms are used
to prevent increased costs arising from the growing amount of image memory. Following
this trend, the authors of this paper have previously proposed an image memory compression
algorithm [11]. The Partial Pixel Compression (PPC) image memory compression algorithm [11] performs compression based on the similarity between neighboring pixels.
Fig. 6 shows a simplified illustration of the compression and decompression processes of
PPC. During compression, each pixel is partitioned into an upper bit group and a lower
bit group, and the upper bit groups of adjacent pixels are compared. If the upper
bit groups are identical, only the lower bit group is stored; otherwise, the entire
pixel is stored. A flag is appended to the least significant bit to distinguish these
cases.
Fig. 6. Partial pixel compression.
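A simplified sketch of the PPC idea follows; the exact word layout of [11] is not specified here, so the packing below (two lower groups plus a flag bit) is an illustrative assumption:

```python
# Sketch of PPC: two adjacent pixels whose upper bit group matches the
# last stored full pixel are packed as two lower groups with flag = 1;
# otherwise the full pixel is stored with flag = 0 at the LSB.

m, n = 8, 4
LO = (1 << n) - 1

def ppc_compress(pixels):
    words, i, last_upper = [], 0, None
    while i < len(pixels):
        if (last_upper is not None and i + 1 < len(pixels)
                and pixels[i] >> n == last_upper
                and pixels[i + 1] >> n == last_upper):
            lo1, lo2 = pixels[i] & LO, pixels[i + 1] & LO
            words.append((lo1 << (n + 1)) | (lo2 << 1) | 1)   # flag = 1
            i += 2
        else:
            words.append((pixels[i] << 1) | 0)                # flag = 0: full pixel
            last_upper = pixels[i] >> n
            i += 1
    return words

def ppc_decompress(words):
    pixels, last_upper = [], None
    for w in words:
        if w & 1:                                             # two lower groups
            hi = last_upper << n
            pixels += [hi | ((w >> (n + 1)) & LO), hi | ((w >> 1) & LO)]
        else:                                                 # full pixel
            pixels.append(w >> 1)
            last_upper = pixels[-1] >> n
    return pixels

data = [0x47, 0x42, 0x4F, 0x91, 0x95, 0x9C]
assert ppc_decompress(ppc_compress(data)) == data
assert len(ppc_compress(data)) == 4                           # 6 pixels in 4 words
```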
In this way, the image memory compression algorithm partitions pixels into upper and lower
bit groups for comparison and storage. Similarly, the proposed LUT ROM reduction method
also relies on pixel partitioning. By exploiting this algorithmic similarity, the pixel
partitioning and the image memory compression can be seamlessly integrated, reducing
memory access rates while enabling efficient LUT-based computation.
Fig. 7 conceptually illustrates the seamless integration of the compressed image memory
and the step-size-scaled partitioned bit LUT based on the common pixel partitioning.
Importantly, since the compressed pixel is inherently composed of a combination of
partitioned pixels, the computations can be performed using the partitioned bit LUT
without the additional decompression process required in other works.
Fig. 7. Interoperation methodologies of compressed image memory and step size scaled
partitioned bit LUT.
Fig. 8 illustrates the integration of the compressed image memory with the proposed step-size-scaled
partitioned bit LUT. To enable this integration, a partitioned bit aligner is added,
while the remaining structure and computational process remain identical to those
described in Fig. 4. The partitioned bit aligner generates the partial pixels $X_{upper}$ and $X_{lower}$,
for the computation, as well as the full pixel $x$, using the compressed pixels. It
identifies the data type using the flag located at the least significant bit of the
compressed pixel. A flag of 0 indicates that the compressed pixel contains a full
pixel; thus, both the upper and lower bit groups are output and processed through
the partitioned bit LUT. Conversely, if the flag is 1, the compressed pixel consists
only of the lower bits of two pixels that share the same upper bits with a previously
processed pixel. In this case, the computation leverages the previously stored upper
bit result in the upper bit result latch. This approach not only ensures that the
computation yields the same result as an uncompressed pixel but also reduces redundant
computations of identical upper bits across neighboring pixels.
Fig. 8. Step size scaled partitioned bit LUT with image memory.
Fig. 9 shows the compression ratios of the PPC algorithm according to different pixel partitioning
schemes. This allows for selecting an appropriate bit width for the partitioned pixel
when integrating with the step-size-scaled partitioned bit LUT. As shown in Fig. 9, partitioning the upper bit group into 4 bits (equivalent to half of the pixel) yields
the highest compression ratio.
Fig. 9. Compression rate according to the upper bit group size.
However, as shown in Fig. 5, the step-size-scaled partitioned bit LUT achieves its highest reduction rate when
the upper bit group is 5 bits. Generally, since the volume of image memory is substantially
larger compared to the size of the LUT ROM, partitioning into half pixels, which offers
a high memory compression ratio, is more effective. In this case, the reduction rate
of the step-size-scaled partitioned bit LUT also remains relatively high at 35.4%.
Moreover, because the bit widths of $x_{upper}$ and $x_{lower}$ are identical with
half-pixel partitioning, there is no need for additional bit width expansion as described
in Fig. 2, simplifying the overall computation.
V. HARDWARE ARCHITECTURE
Based on the integration methods discussed in Section 4, Fig. 10 shows the hardware architecture of the compressed image memory combined with the
proposed step-size-scaled partitioned bit LUT. The compressed pixel output from the
compressed image memory is fed directly into the partitioned bit aligner without passing
through a decompression unit. At this stage, the partitioned bit aligner identifies
the data type by examining the flag located at the least significant bit of the compressed
pixel. If the flag is 0, indicating that partitioned pixel 1 corresponds to the upper
bit group, it is first stored in the upper latch. Then, the partitioned pixels 1 and
2 are sequentially output for the sequential partitioned bit LUT computations. The
computation result of the upper bit group is stored in the upper bit result latch
so that it can be subsequently combined with the computation result of the lower bit
group. Conversely, if the flag is 1, both partitioned pixels 1 and 2 belong to the
lower bit group, thus, they are sequentially processed by the partitioned bit aligner
without being stored in the upper latch. Meanwhile, the error LUT receives the combined
full pixel of the upper and lower bit groups and computes the error value. The scale
factor LUT takes the upper bit group as input and produces a scale factor, which is
then used by the shifter to divide the computation result of the lower bit group.
Finally, the outputs of the partitioned bit LUT and the error LUT are added to generate
the final outputs.
Fig. 10. Architecture of step size scaled partitioned bit LUT.
VI. SIMULATION RESULTS
Based on the hardware architecture shown in Fig. 10, RTL-level simulations were performed using actual image data. The results from these
simulations are subsequently evaluated in Section 7 through eye tests and PSNR measurements.
Fig. 11 shows the simulation results of the step-size-scaled partitioned bit LUT operation
using compressed pixels processed by the PPC compression algorithm. The compressed
pixel, read from the compressed image memory, is fed into the partitioned bit LUT
through cdata_in. Since the least significant bit (the flag) is 0, the word is identified
as an uncompressed type, with the upper and lower bit groups being 6 and 15, respectively.
Accordingly, the upper bit group value 6 is first input to the partitioned bit LUT
through part_bit, generating an output of 47 on pb_out, which is then stored in the
up_latch. Next, the lower bit group value, 15, is input through the part_bit, generating
an output of 255 on pb_out. The scale factor LUT outputs 3 through sft_out, which
makes the shifter shift the data by 3 bits, resulting in a final output of 31 on shter_out.
The error LUT receives 111 through full_bit, and the error value, 3, is output on err_out.
The upper bit group result, 47, and the scaled lower bit group, 31, are then combined
to generate a final computation result of 81 on final_data. These operations are repeated
to compute each compressed pixel. The simulations verify the partitioned bit LUT
operation using compressed pixels and the seamless integration of the compressed
image memory with the partitioned bit LUT.
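The quoted waveform values can be checked arithmetically:

```python
# Check of the waveform values quoted above: the lower-group LUT output
# 255 shifted right by the scale factor 3 gives 31 on shter_out, and the
# final sum of upper result, scaled lower result, and error is final_data.
pb_upper, pb_lower, k, err = 47, 255, 3, 3
shter_out = pb_lower >> k              # 255 >> 3 = 31
final_data = pb_upper + shter_out + err
assert (shter_out, final_data) == (31, 81)
```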
Fig. 11. Logic simulation waveforms.
VII. PERFORMANCE EVALUATIONS
In this section, three cases of the proposed partitioned-bit LUT in Fig. 5 are first evaluated to determine the optimal case. Subsequently, the selected
case is compared with REP-ROM [9] in terms of chip area, throughput, and power consumption.
Fig. 12 shows the comparisons among the intermediate, final results, and LUT counts for three
partitioned-bit LUT configuration cases: without a scale factor, with a common scale
factor, and with the proposed stepwise scale factor. The comparison is performed for
histogram equalization on the Kodak image dataset, which is widely used in image
processing performance evaluations. Fig. 12(a) presents the results without applying a scale factor. In this case, large errors
are generated during the intermediate computation stage prior to the compensation
by the error LUT, as discussed in Section 3, which consequently leads to a significant
increase in the size of the error LUT. Fig. 12(b) shows the results with a common scale factor. Although the error range is reduced
compared to Fig. 12(a), a substantial deviation from the original curve is still observed. In contrast,
Fig. 12(c) shows the results with the proposed stepwise scale factor. Even before applying error
compensation, the intermediate computation closely follows the original curve, resulting
in a significant reduction of $LUTC_{error}$ as defined in Eq. (15).
Fig. 12. LUT reduction amounts according to scale factor methods for Kodak image dataset,
(a) No scale factor, (b) Common scale factor, (c) Stepwise scale factor.
Among the three partitioned-bit LUT approaches, the proposed method with the stepwise
scale factor achieves the highest reduction in LUTC.
Moreover, even when considering only the intermediate computation results without
the error compensation, the proposed method achieves an average PSNR of approximately
32.8 dB, which is substantially higher than that of the no scale factor case in Fig. 12(a) with 22.0 dB and the common scale factor case in Fig. 12(b) with 27.2 dB. These results show that the proposed stepwise scale factor method provides
superior image quality with smaller error ranges. Overall, the proposed method achieves
an average LUTC reduction rate of 23.4% as shown in Fig. 12.
The proposed partitioned bit LUT architecture with the stepwise scale factor is further
evaluated by comparing its performance with REP-ROM [9] in terms of chip area. In
addition to the LUT compression ratio of the proposed method, the chip area required
for the additional hardware must be considered relative to the non-compressed case.
The total chip area reduction is evaluated in Eq. (23) as the sum of the LUTC reduction,
which represents the reduction in ROM area achieved by the compression relative to
the original LUT ROM area, and the additional hardware overhead, which represents
the chip area of the required hardware components relative to the original LUT ROM
area, as defined below.
In Eq. (23), $A$ denotes the input bit width and $L$ denotes the output bit width. $A_{ROMcell}$
represents the typical chip area of a single ROM cell, and $A_{add}$ represents the
typical chip area of the additional hardware beyond the ROM. $2^A \cdot L \cdot
A_{ROMcell}$ represents the typical chip area of an original LUT with an $A$-bit
input and $L$-bit output. Since 8-bit pixels are commonly used in image processing applications,
8-bit input and 8-bit output case is assumed. In this case, the original LUT contains
$2^8 \times 8 = 2,048$ ROM cells. Based on Eq. (23), the total chip area reduction ratio of the proposed method relative to REP-ROM is
evaluated using Eq. (24).
To evaluate the additional hardware chip area required by both methods relative to
the chip area of a ROM cell, the typical chip areas of a ROM cell and the additional
required hardware components of each method are summarized in Table 1.
Table 1. Typical chip area of the modules used.

Category                 | Area [$\lambda^2$]
ROM cell [12]            | $16 \times 8$
SRAM cell [12]           | $26 \times 45$
Barrel shifter cell [12] | $16 \times 18$
1-bit adder [13]         | 1224
4:1 MUX [14]             | $100 \times 128$
In addition to the partitioned bit LUT, error LUT, and scale factor LUT required for
the computation, the proposed step-size-scaled partitioned bit LUT also needs modules
such as a partitioned bit aligner, adder, shifter, and upper bit result latch, as
shown in Fig. 10.
The largest area within the partitioned bit aligner is occupied by the upper bit latch,
which can be implemented using SRAM. Likewise, the upper bit result latch can also
be realized with SRAM. Given an input pixel with a bit width of $m$, the upper bit
latch requires storage of $0.5m$ bits, while the upper bit result latch requires $m$
bits, amounting to a total storage requirement of $1.5m$ bits. When evaluated in terms
of LUT ROM, this is expressed as Eq. (25).
If the bit width is 8, the increased latch size corresponds to approximately 110 LUT
ROM cells. Next, as shown in Fig. 10, the proposed shifter performs only right shift operations, allowing it to be implemented
at half the size of a conventional barrel shifter. Consequently, the size of the shifter,
when converted to an equivalent LUT ROM count, can be expressed as Eq. (26).
According to Eq. (26), the shifter required to process 8-bit pixels corresponds
to approximately 72 LUT ROM cells. An 8-bit adder, based on Table 1, occupies a layout area of approximately 9,792$\lambda^2$, which is equivalent to
about 77 LUT ROM cells. Therefore, the additional hardware introduced by the proposed
step-size-scaled partitioned bit LUT to handle 8-bit pixel width amounts to a total
of 72 + 110 + 77 = 259 LUT ROM cells. Given that a conventional 8-bit LUT ROM count
is approximately 2,048 cells, this represents an increase of only about 12%, which
is relatively small compared to the 35.4% reduction in LUT ROM size that can be achieved
by the proposed LUT compression algorithm. Moreover, since this estimate is based
on module sizes evaluated using standard cells, the additional hardware overhead is
expected to be even smaller when custom-designed circuits are used.
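The cell-count conversions above can be verified with the Table 1 areas; the 8 × 8 barrel-shifter cell array halved for right-shift-only operation follows the text:

```python
import math

# Arithmetic check of the area estimates above, using the Table 1 cell
# areas (in lambda^2) and converting to equivalent ROM cells by dividing
# by the 16 x 8 lambda^2 ROM cell.

ROM_CELL    = 16 * 8          # 128
SRAM_CELL   = 26 * 45         # 1170
BSHIFT_CELL = 16 * 18         # 288
ADDER_1BIT  = 1224

m = 8
latch_cells   = math.ceil(1.5 * m * SRAM_CELL / ROM_CELL)        # Eq. (25): 110
shifter_cells = math.ceil((m * m // 2) * BSHIFT_CELL / ROM_CELL) # Eq. (26): 72
adder_cells   = math.ceil(m * ADDER_1BIT / ROM_CELL)             # 77

total = shifter_cells + latch_cells + adder_cells                # 72+110+77 = 259
assert (latch_cells, shifter_cells, adder_cells, total) == (110, 72, 77, 259)
overhead = total / (2 ** 8 * 8)   # about 0.126, the ~12% quoted above
```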
In REP-ROM [9], the tone curve is partitioned into $2^{(A-R)}$ regions based on the selected value
of $R$. Accordingly, a $2^{(A-R)} : 1$ MUX is required for the region selection, and
an additional adder is required for error compensation [9]. For example, when $R = 7$, $R = 6$, and $R = 5$, the required region-selection MUX
becomes 2:1, 4:1, and 8:1 MUX, respectively. For the case of $R = 6$, as shown in
Table 1, the area of a 4:1 MUX is $100\lambda \times 128\lambda$ [14]. Thus, the equivalent additional hardware chip area relative to the LUT ROM cell
chip area can be expressed as Eq. (27).
According to Eq. (27), the total additional hardware in REP-ROM [9], including both the required 4:1 MUX and the adder, is equivalent to 77 + 800 = 877
ROM cells for an 8-bit pixel width. Compared to conventional 8-bit LUT, which requires
2,048 ROM cells, this corresponds to 42.8%, representing a substantial additional
hardware overhead. 2:1 MUX for the case of $R = 7$ contains one-third the number of
transistors of a 4:1 MUX [15], while 8:1 MUX for the case of $R = 5$ can be implemented by combining two 4:1 MUXs
and one 2:1 MUX [16]. Therefore, the areas of 2:1, 4:1, and 8:1 MUXs can be evaluated by a ratio of 1:3:7.
Therefore, the additional hardware overhead of REP-ROM [9], including the adder, can be estimated as 16.7%, 42.8%, and 94.9% for each $R = 7,
6, 5$, respectively.
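The per-$R$ overhead estimate can be reproduced with a short sketch. The MUX cell counts are scaled from the 4:1 MUX figure (800 equivalent ROM cells) using the 1:3:7 area ratio; the small difference at $R = 7$ (16.8% here vs. the 16.7% quoted above) comes from rounding the 2:1 MUX cell count.

```python
# Sketch of the REP-ROM [9] additional-hardware estimate: the 4:1 MUX
# (100 x 128 lambda^2, Table 1) is equivalent to ~800 LUT ROM cells, and
# the 2:1/4:1/8:1 MUX areas scale roughly as 1:3:7.

MUX4_CELLS = 800              # 4:1 MUX expressed in equivalent LUT ROM cells
ADDER_CELLS = 77              # 8-bit adder, from Table 1
LUT_CELLS = 2_048             # conventional 8-bit LUT ROM count

# Relative MUX areas (2:1 : 4:1 : 8:1 = 1 : 3 : 7), indexed by R.
MUX_RATIO = {7: 1, 6: 3, 5: 7}

def rep_rom_overhead(R: int) -> float:
    """Additional-hardware overhead of REP-ROM for a given R, as a fraction."""
    mux_cells = MUX4_CELLS / 3 * MUX_RATIO[R]   # scale from the 4:1 MUX
    return (mux_cells + ADDER_CELLS) / LUT_CELLS

for R in (7, 6, 5):
    print(R, f"{rep_rom_overhead(R):.1%}")   # 16.8%, 42.8%, 94.9%
```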
For the total area estimation in Eq. (23), the LUTC of the proposed method is compared with that of REP-ROM [9]. The LUTC of REP-ROM [9] is given by $\sum_{i=1}^{A-R} (M_i + I) \cdot 2^S + 2^I \cdot (2^{R-S} \cdot E)$,
where $A$ denotes the input bit width, and $R$ and $S$ are the parameters chosen
to minimize the resulting LUTC. $I$ (the number of distinct error patterns) and $E$ (the
dynamic range of an error pattern) are derived from the error patterns corresponding
to the given tone curve and the selected $R$ and $S$. For irregular tone curves in
which the error shape varies across the segments spaced by $2^{(R-S)}$, the stored
error patterns cannot be reused, so a new error pattern must be stored in the error
LUT for each segment. This increases the number of bits, $I$, required to represent
the error pattern index, which in turn increases the overall LUTC. In contrast, the
LUTC of the proposed method in Eq. (22) is independent of error pattern reuse. Hence, the compression efficiency of REP-ROM
[9] decreases as the error patterns of the segments become more irregular, whereas the
proposed method remains unaffected. Moreover, regardless of such irregularities,
as discussed in Eq. (6), the size of the error LUT required in the proposed method may decrease as the tone
curve approaches a globally linear shape, even if local nonlinearities remain.
Such irregular tone curves are widely used in modern image processing applications
that require image-specific optimization. Applications such as low-light image enhancement
[17], HDR image tone mapping [18], and AI-based content-adaptive image enhancement [19] demand optimal nonlinear tone curves that vary with the scenes and image contents.
In these applications, simple monotonic functions or predefined canonical curves are
often ineffective in achieving the desired image quality and visual representation.
For generality, in addition to the regular curves used for evaluation in REP-ROM [9], evaluations are also conducted on irregular curves in which the error patterns differ
for every segment. The curves used in the evaluation are shown in Fig. 13. Figs. 13(a)-13(c) represent $\sin(x)$, $x^{1/2}$, and $x^2$, respectively, which are used in REP-ROM [9]. Figs. 13(d)-13(f) represent irregularly shaped tone curves based on real datasets used for low-light
image enhancement [17].
Fig. 13. Regular and irregular tone curves.
Based on the evaluations, the LUTC reduction, additional hardware reduction, and total
chip area reduction in Eq. (23) were evaluated for the proposed method and REP-ROM [9] under both regular and irregular tone curves. The results, along with the total chip
area reduction ratio between the proposed method and REP-ROM [9] in Eq. (24), are summarized in Table 2.
Table 2. Comparison of chip area between REP-ROM and the proposed method.
| Curve | (R, S, I, E) | REP-ROM [9] LUTC reduction (%) | REP-ROM [9] additional hardware reduction (%) | REP-ROM [9] total chip area reduction (%) | Proposed LUTC reduction (%) | Proposed additional hardware reduction (%) | Proposed total chip area reduction (%) | Total chip area reduction ratio (%) |
|---|---|---|---|---|---|---|---|---|
| $\sin(x)$ | (6, 4, 4, 3) | 43 | 42.8 | 85.8 | 58.6 | 12.6 | 71.2 | 82.9 |
| $x^{1/2}$ | (6, 4, 4, 5) | 50 | 42.8 | 92.8 | 58.6 | 12.6 | 71.2 | 76.7 |
| $x^2$ | (6, 4, 4, 3) | 41.4 | 42.8 | 84.2 | 58.6 | 12.6 | 71.2 | 84.5 |
| curve A | (6, 5, 3, 5) | 63.3 | 42.8 | 106.1 | 71.1 | 12.6 | 83.7 | 78.8 |
| curve B | (6, 5, 3, 3) | 64.9 | 42.8 | 107.7 | 71.1 | 12.6 | 83.7 | 77.7 |
| curve C | (6, 4, 5, 4) | 61.8 | 42.8 | 104.6 | 58.6 | 12.6 | 71.2 | 68 |
| Average | - | 54.1 | 42.8 | 96.9 | 62.8 | 12.6 | 75.4 | 77.8 |
Since an 8-bit pixel width is commonly used in image processing applications, all
evaluations are performed assuming an 8-bit pixel depth. For REP-ROM [9], smaller values of $R$ generally lead to better LUTC reduction [9]. However, as discussed in Table 1, when $R = 5$, an 8:1 MUX is required, which increases the additional hardware overhead
to as high as 94.9%. Therefore, comparisons were conducted only for cases with $R
> 5$, where the hardware overhead remains within a reasonable range.
As shown in Table 2, for all six tone curves in Fig. 13, REP-ROM [9] achieves its minimum LUTC when $R = 6$. In this configuration, REP-ROM [9] exhibits an additional hardware reduction of 42.8% for all six tone curves as described
in Table 1, whereas the proposed method requires only 12.6%. As summarized in Table 2, the proposed method achieves an average total chip area reduction of 75.4% for both
regular and irregular curves, corresponding to a 24.9% reduction compared to the uncompressed
LUT size. In contrast, REP-ROM [9] shows an average total chip area reduction of 96.9% for the same set of curves. According
to the total chip area ratio defined in Eq. (24), the proposed method requires, on average, 77.8% of the total chip area of REP-ROM [9], as summarized in Table 2.
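The Table 2 averages can be cross-checked with a short sketch. Note that the 77.8% figure is the ratio of the averaged totals (75.4/96.9), not the mean of the per-curve ratio column (which would be about 78.1%).

```python
# Cross-check of the Table 2 averages: per-curve total chip area figures
# for REP-ROM [9] and the proposed method, in row order
# (sin(x), x^(1/2), x^2, curve A, curve B, curve C).

rep_rom_total = [85.8, 92.8, 84.2, 106.1, 107.7, 104.6]
proposed_total = [71.2, 71.2, 71.2, 83.7, 83.7, 71.2]

rep_avg = sum(rep_rom_total) / len(rep_rom_total)       # ~96.9
prop_avg = sum(proposed_total) / len(proposed_total)    # ~75.4
ratio = prop_avg / rep_avg                              # ~0.778

print(f"{rep_avg:.1f} {prop_avg:.1f} {ratio:.1%}")      # 96.9 75.4 77.8%
```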
The power consumption and throughput of the proposed method are also evaluated and
compared with those of REP-ROM [9]. For a fair comparison, REP-ROM [9] is configured with $R = 7$, which minimizes power consumption by minimizing hardware
overhead for required MUXs. Both are synthesized with the Ultra96v2 Evaluation Board
using Vivado 2022.1, under identical synthesis options. Power consumption is evaluated
using post-synthesis power analysis, and throughput is evaluated based on the maximum
operating frequency obtained from post-synthesis simulations. To provide a more comprehensive
evaluation of energy efficiency, the power-delay product (PDP) is also analyzed. According
to the evaluation results, the proposed method achieves a total power consumption
of 0.28 W, whereas REP-ROM [9] shows a total power consumption of 7.53 W. The proposed method yields PDP$_{proposed}$
= 0.28 W $\times \left( \frac{1}{4.9\text{Gpixels/s}} \right)$ = $5.71 \times 10^{-11}$
J, while REP-ROM [9] yields PDP$_{REP-ROM}$ = 7.53 W $\times \left( \frac{1}{8.33\text{Gpixels/s}} \right)$
= $9.04 \times 10^{-10}$ J. These results indicate that the proposed method achieves
a 15.8 times lower PDP, demonstrating superior energy efficiency compared to REP-ROM
[9]. Furthermore, REP-ROM [9] requires finding suitable parameters $R$ and $S$ that minimize the LUT count for each
tone curve. In contrast, the proposed method does not require such hardware reconfiguration,
which makes it more suitable for programmable LUT implementations.
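The PDP comparison above reduces to power multiplied by the per-pixel processing time (the reciprocal of throughput), using the post-synthesis figures quoted in the text:

```python
# Power-delay product (PDP) comparison from the quoted post-synthesis
# results: PDP = power * (1 / throughput) = energy per pixel.

def pdp(power_w: float, throughput_pixels_per_s: float) -> float:
    """Energy per pixel in joules."""
    return power_w / throughput_pixels_per_s

pdp_proposed = pdp(0.28, 4.9e9)    # ~5.71e-11 J
pdp_rep_rom = pdp(7.53, 8.33e9)    # ~9.04e-10 J

print(f"{pdp_rep_rom / pdp_proposed:.1f}x")   # 15.8x lower PDP for the proposed method
```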
The proposed step-size-scaled partitioned bit LUT not only reduces the LUT ROM size
but also decreases the computational load through integration with the compressed
image memory. The number of computations performed by the proposed LUT equals the
number of compressed pixels, allowing a reduction in computation that can be evaluated
by comparing the number of original pixels to the number of compressed pixels. According
to [11], the PPC algorithm achieves an average compression ratio of 22%, with each compressed
pixel having a size of 9 bits. Letting $PC_{org}$ denote the total number of original
pixels and $PC_{comp}$ the total number of compressed pixels after compression, their
relationship can be expressed as Eq. (28).
Rearranging Eq. (28) yields $PC_{comp} = 0.69PC_{org}$, indicating that the number of compressed pixels
to be stored is reduced by 31% compared to the original. Consequently, this compression
process, along with the step-size-scaled partitioned bit LUT, results in a 31% reduction
in image memory access rate as well as a 31% reduction in the computation load within
the proposed LUT.
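The Eq. (28) relationship can be sketched as follows, assuming the 22% average compression ratio of PPC [11] refers to the reduction in total stored bits, so that $9 \cdot PC_{comp} = (1 - 0.22) \cdot 8 \cdot PC_{org}$:

```python
# Sketch of Eq. (28): relating the compressed pixel count to the original
# pixel count, under the assumption that PPC's 22% average compression
# ratio is a reduction in total stored bits.

ORIG_PIXEL_BITS = 8
COMP_PIXEL_BITS = 9       # each compressed pixel is 9 bits [11]
COMPRESSION = 0.22        # average bit-count reduction achieved by PPC

def compressed_pixel_count(pc_org: float) -> float:
    """Compressed pixel count for pc_org original pixels, per Eq. (28)."""
    return (1 - COMPRESSION) * ORIG_PIXEL_BITS * pc_org / COMP_PIXEL_BITS

ratio = compressed_pixel_count(1)   # ~0.69: ~31% fewer pixels to store and process
print(f"{ratio:.2f}")               # 0.69
```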
VIII. CONCLUSION
LUT compression algorithms and hardware design methodologies are proposed, which can
be seamlessly integrated with image memory data compression. By LUT input pixel partitioning,
the LUT ROM count can be reduced. By the proposed stepwise scale factor, which adjusts
the dynamic range of the lower bit group computations based on the step size of each
upper bit group, the errors can be compensated with a minimal amount of error LUT.
Simulation results using actual images show that the proposed LUT compression approach
achieves an approximately 23.4% reduction in LUT ROM count without compromising image
quality on the Kodak image dataset.
For both arithmetic functions and irregular tone curves, the total chip area reduction
ratio, accounting for both the LUTC and the additional hardware reduction, is 77.8%
on average relative to REP-ROM. Furthermore, by integrating the proposed
LUT with compressed image memory, the computation amount and memory access rate can
be reduced by 31% compared with the uncompressed image memory case.
ACKNOWLEDGMENTS
This work was supported by 2025 Hongik University Innovation Support Program Fund.
Furthermore, we express our gratitude to Donghun Cho, who made great contributions
to experimental result analysis and manuscript organization.
REFERENCES
De Greef P. , Hulze H. G. , 2007, 39.1: Adaptive dimming and boosting backlight for
LCD-TV systems, Proc. of SID Symposium Digest of Technical Papers, Vol. 38, No. 1,
pp. 1332-1335

Zhao X. , Hu Z. , Chang L. , 2024, USR-LUT: A high-efficient universal super resolution
accelerator with lookup table, Proc. of the IEEE International Symposium on Circuits
and Systems, pp. 1-5

Yan L. , Cengiz K. , Sharma A. , 2021, An improved image processing algorithm for
automatic defect inspection in TFT-LCD TCON, Nonlinear Engineering, Vol. 10, No. 1,
pp. 293-303

Matsueda Y. , Park Y.-S. , Choi S.-M. , Chung H.-K. , 2005, Trend of system on panel,
Proc. of the Korean Information Display Society Annual Meeting, pp. 841-844

He L.-F. , Zhou Y. , Wang Z. , Wang J. , Guo Y. , Wang H. , 2017, Light-erasable embedded
charge-trapping memory based on MoS2 for system-on-panel applications, Applied Physics
Letters, Vol. 111, No. 22

Zhang K. , Peng D. , Lau K. M. , Liu Z. , 2017, Fully integrated active matrix programmable
UV and blue micro-LED display system-on-panel (SoP), Journal of the Society for Information
Display, Vol. 25, No. 4, pp. 240-248

Han D. , 2004, Real-time color gamut mapping method for digital TV display quality
enhancement, IEEE Transactions on Consumer Electronics, Vol. 50, No. 2, pp. 691-698

Han W. , Cho H. , Kim D. , Kim J.-Y. , 2025, SAL-PIM: A subarray-level processing-in-memory
architecture with LUT-based linear interpolation for transformer-based text generation,
IEEE Transactions on Computers, Vol. 74, No. 9, pp. 2909-2922

Yang B.-D. , Kim L.-S. , 2004, An error pattern ROM compression method for continuous
data, Proceedings of the IEEE International Symposium on Circuits and Systems, Vol.
2, pp. II-845-II-848

Xie Y. , Chen J. , Xu X. , Wang Y. , Yang H. , 2020, A twofold lookup table architecture
for efficient approximation of activation functions, IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, Vol. 28, No. 12, pp. 2540-2550

Lim S. , Cho D. , You J. , 2024, Image data compression and decompression unit considering
brightness sensitivity, Journal of Semiconductor Technology and Science, Vol. 24,
No. 5, pp. 459-472

Weste N. H. E. , Harris D. , 2015, CMOS VLSI Design: A Circuits and Systems Perspective

Vesterbacka M. , 1999, A 14-transistor CMOS full adder with full voltage-swing nodes,
Proc. of the IEEE Workshop on Signal Processing Systems, pp. 713-722

Transistor count, Wikipedia

Multiplexers in digital logic, GeeksforGeeks

Heinbuch D. V. , 1988, CMOS3 Cell Library

Chen W. , Wang W. , Yang W. , Liu J. , 2018, Deep Retinex decomposition for low-light
enhancement, Proc. of the British Machine Vision Conference, pp. 1-12

Duan J. , Bian Z. , Li S. , Zhang X. , 2010, Tone-mapping high dynamic range images
by novel histogram adjustment, Pattern Recognition, Vol. 43, No. 5, pp. 1847-1862

Zhao L. , Abdelhamed A. , Brown M. S. , 2022, Learning tone curves for local image
enhancement, IEEE Access, Vol. 10, pp. 60099-60111

Jeonghun Lee received the B.S. degree in electronic and electrical engineering from
Hongik University, Seoul, Korea, in 2023, where he has been pursuing the M.S. degree
in electronic and electrical engineering since 2023. His research interests include
image signal processing and integrated display system design.
Sanghyun Lim received the B.S. and M.S. degrees in electronic and electrical engineering
from Hongik University, Seoul, Korea, in 2007 and 2009, respectively. In 2010, he
joined Samsung Display Inc., where he has been working in the area of display electronics
development. His current interests include touch sensors, stylus driving technologies,
sensor structures, and high dynamic range image processing.
Seongjo Youn received the B.S. degree in electronic and electrical engineering from
Hongik University, Seoul, Korea, in 2025, where he has been pursuing the M.S. degree
in electronic and electrical engineering since 2025. His research interests include
image signal processing and system semiconductor design.
Jaehee You received his B.S. degree in electronics engineering from Seoul National
University, Seoul, Korea, in 1985. He received his M.S. and Ph.D. degrees in electrical
engineering from Cornell University, Ithaca, NY, in 1987 and 1990, respectively. In
1990, he joined Texas Instruments, Dallas, TX, as a Member of Technical Staff. In
1991, he joined the faculty of the School of Electrical Engineering, Hongik University
in Seoul, Korea, where he is now supervising the Semiconductor Integrated System Laboratory.
He is currently Vice President of the Institute of Semiconductor Engineers and has
served as an Executive Director of the Drive technology and System research group
of the Korean Information Display Society. He was a recipient of the Korean Ministry
of Strategy and Finance, KEIT Chairman Award for Excellence, in 2011. He has worked
as a technical consultant for many companies, such as Samsung Semiconductor, SK Hynix,
Global Communication Technologies, P&K, Penta Micro, Nexia device, and Primenet. His
current research interests include integrated system design for display image signal
processing, and perceptual image quality enhancements.