I. INTRODUCTION
In contemporary computing, the rapid expansion of mobile devices and cloud-based data
centers has led to an unprecedented demand for high-performance and energy-efficient
computation. Mobile devices are increasingly required to execute computationally intensive
applications, such as image processing and machine learning inference, while operating
under stringent power constraints [1]. Simultaneously, data centers must process massive volumes of data with limited power
budgets and physical constraints, making energy-efficient computation a critical concern
[2]. However, conventional computing architectures, which rely on precise arithmetic
operations, struggle to meet these demands due to inherent limitations in power efficiency,
scalability, and computational speed [3]. The ever-growing complexity of workloads exacerbates these challenges, as exact
computation often incurs significant hardware overhead and energy consumption. In
response to these constraints, approximate computing has emerged as a promising paradigm
that strategically relaxes computational accuracy to achieve substantial gains in
power efficiency and performance [4,5]. Approximate computing leverages the observation that many application domains, such
as multimedia processing, machine learning, and signal processing, exhibit an intrinsic
tolerance to minor inaccuracies [6-9]. In image and video processing, for instance, slight distortions are often imperceptible
to the human eye, allowing arithmetic operations to be performed with reduced precision
without significantly degrading perceptual quality [10]. By exploiting this error tolerance, approximate computing facilitates the design
of arithmetic circuits with reduced hardware complexity, lower power consumption,
and improved computational efficiency, making it a compelling approach for next-generation
computing architectures.
Arithmetic circuits, particularly adders and multipliers, play a crucial role in high-performance
computing applications, such as signal processing, machine learning, and cryptographic
systems [11-14]. Among these, multipliers are of particular importance as they directly impact overall
system performance, power consumption, and area efficiency. Due to the inherent complexity
of multiplication, modern multiplier architectures are designed to optimize computational
efficiency by minimizing the number of operations required for partial product accumulation.
Tree-based multipliers, such as Wallace and Dadda multipliers, are widely employed
to enhance multiplication efficiency. These architectures aim to reduce the number
of sequential addition operations by utilizing a tree-like structure that compresses
partial products efficiently. Wallace multipliers employ a highly parallel reduction
scheme, using carry-save adders and compressors to rapidly reduce the number of partial
products at each stage. In contrast, Dadda multipliers adopt a more structured approach,
minimizing the number of reduction steps while maintaining a near-optimal delay. Both
architectures share a common three-stage multiplication process: 1) partial product
generation, 2) partial product reduction, and 3) final summation. Among these stages,
the partial product reduction stage is particularly critical, as it has a direct impact
on circuit delay, area, and power consumption [15]. To enhance the efficiency of this stage, approximate compressors have been increasingly
utilized for error-tolerant applications. These compressors simplify the reduction
process by intentionally introducing minor errors, thereby reducing the complexity
of arithmetic operations. By strategically balancing computational accuracy and hardware
efficiency, approximate compressors offer a promising approach to designing energy-efficient
and high-speed multipliers.
In this paper, we propose an error-aware approximate 4-2 compressor design that integrates
a lightweight error compensation mechanism for balanced accuracy and efficiency in
approximate multipliers. The first part of our design redefines the Boolean expressions
of an existing approximate 4-2 compressor to improve logic structure and reduce hardware
cost. The second part introduces a selective error correction logic that addresses
high-impact, high-probability error cases to improve overall computational accuracy.
As a result, in a 32-nm CMOS process, the proposed designs achieve up to 15% reductions
in area and power compared to prior approximate multipliers while maintaining competitive
accuracy performance. In particular, the compensated version improves accuracy without
incurring significant hardware overhead, demonstrating a favorable trade-off between
efficiency and precision. The key contributions of this paper are summarized as follows:
- We present a novel approximate 4-2 compressor that improves hardware efficiency by simplifying logic expressions while maintaining acceptable accuracy.
- We propose a selective error compensation mechanism that effectively reduces error distance with minimal hardware overhead, improving the accuracy of approximate multiplication.
- We validate the proposed designs through implementation and application to digital image processing and deep learning tasks, demonstrating their practical effectiveness in error-resilient computing scenarios.
II. RELATED WORK
In multiplier architectures, the partial product reduction stage plays a pivotal role
in determining overall circuit efficiency. Compressors, particularly 4-2 compressors,
are widely utilized in this stage to minimize the number of partial products and enhance
computational performance. The exact 4-2 compressor is a fundamental arithmetic component
designed to accurately reduce four input bits $X_1, X_2, X_3, X_4$ along with a carry-in
bit $C_{in}$ into three output signals: the carry-out bit $C_{out}$, the intermediate
carry bit $Carry$, and the final sum $Sum$. The behavior of an exact 4-2 compressor
is governed by the following Boolean equations:

$Sum = X_1 \oplus X_2 \oplus X_3 \oplus X_4 \oplus C_{in}$

$C_{out} = (X_1 \oplus X_2) \cdot X_3 + \overline{X_1 \oplus X_2} \cdot X_1$

$Carry = (X_1 \oplus X_2 \oplus X_3 \oplus X_4) \cdot C_{in} + \overline{X_1 \oplus X_2 \oplus X_3 \oplus X_4} \cdot X_4$
These expressions dictate how the input bits are processed to generate precise output
values, ensuring the exact representation of the reduced partial products without
introducing computational errors. The significance of the exact 4-2 compressor lies
in its ability to balance hardware complexity and computational accuracy. Unlike approximate
compressors, which introduce minor errors to achieve lower power consumption and reduced
circuit area, the exact 4-2 compressor prioritizes accuracy, making it particularly
suitable for applications requiring high numerical precision. Consequently, it remains
an essential component in conventional multiplier designs, especially in scenarios
where error tolerance is minimal and strict computational correctness is paramount.
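Because the three outputs jointly re-encode the input bit count, i.e., $X_1 + X_2 + X_3 + X_4 + C_{in} = Sum + 2\,(Carry + C_{out})$, the exact compressor can be checked exhaustively with a short behavioral model. The sketch below is in Python (the paper's actual implementation is Verilog RTL) and uses one standard gate-level formulation of the compressor:

```python
from itertools import product

def exact_42(x1, x2, x3, x4, cin):
    """Behavioral model of an exact 4-2 compressor."""
    s1 = x1 ^ x2 ^ x3 ^ x4        # intermediate XOR chain
    total = s1 ^ cin              # Sum output
    cout = x3 if x1 ^ x2 else x1  # Cout = (X1 xor X2)*X3 + not(X1 xor X2)*X1
    carry = cin if s1 else x4     # Carry = S1*Cin + not(S1)*X4
    return cout, carry, total

# Exhaustive check: the outputs must re-encode the number of '1' inputs.
for bits in product((0, 1), repeat=5):
    x1, x2, x3, x4, cin = bits
    cout, carry, s = exact_42(x1, x2, x3, x4, cin)
    assert s + 2 * (carry + cout) == sum(bits)
```

The assertion passes for all 32 input combinations, confirming that no input pattern loses information in the exact design.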
Exact 4-2 compressors maintain full computational accuracy but come at the cost of
increased hardware complexity, power consumption, and propagation delay. These drawbacks
become critical in large-scale multipliers where numerous compressors operate simultaneously.
To mitigate these issues, approximate 4-2 compressors have been developed, leveraging
strategic simplifications in Boolean logic to reduce transistor count while tolerating
minor computational errors. The most common approximation techniques involve removing
carry-in (i.e., $C_{in}$) and carry-out (i.e., $C_{out}$) signals and simplifying
sum and carry calculations. These modifications enable lower power consumption, reduced
circuit area, and improved speed, making approximate compressors highly suitable for
error-resilient applications, such as deep learning accelerators, image processing,
and signal processing.
Several approximate 4-2 compressor designs have been proposed to balance hardware
efficiency and computational accuracy, as summarized in Table 1. Each design employs a unique strategy, offering different trade-offs in terms of
error tolerance, circuit area, and power efficiency. Momeni et al. proposed a design
that removes $C_{in}$ and $C_{out}$, significantly reducing logic complexity [16]. This simplification leads to lower power consumption and area overhead. However,
eliminating carry propagation introduces larger error magnitudes, making this approach
less suitable for applications demanding high accuracy. Akbari's dual-quality compressors
(Akbari$_1$ and Akbari$_2$) offer a configurable approximation mode, allowing a trade-off
between precision and efficiency [17]. The first variant (Akbari$_1$) introduces a simplified carry computation, reducing
transistor count, while the second variant (Akbari$_2$) employs an optimized logic
expression for better accuracy. While these designs provide adaptive performance,
their switching mechanisms introduce additional control overhead, slightly increasing
circuit complexity. Venkatachalam et al. developed a compressor that approximates
basic arithmetic building blocks such as half-adders and full-adders [18]. This approach minimizes propagation delay and reduces power consumption, making
it particularly effective for high-speed applications. However, the accumulation of
small approximation errors over multiple compressor stages may result in higher total
error rates in large multipliers. Ahmadinejad et al. employed NOR-based and NAND-based
logic simplifications, significantly reducing transistor count while maintaining a
reasonable level of accuracy [19]. This design is advantageous in low-power embedded systems, but its reliance on basic
logic simplifications makes it susceptible to higher error variability across different
input conditions. Sabetz et al. introduced an extreme approximation by fixing the
sum ($Sum$) output to a constant value, dramatically reducing logic complexity and
power consumption [20]. While this method achieves the smallest possible hardware footprint, it leads to
a significant increase in computational error, limiting its applicability to only
highly error-tolerant applications. Pei's compressor uses an error compensation mechanism
that adjusts outputs based on specific input patterns [21]. This method reduces the overall error distance compared to other designs, improving
accuracy while maintaining moderate hardware efficiency. However, the additional compensation
logic slightly increases power consumption compared to more aggressively simplified
compressors. Zhang's design reconfigures signal paths and selectively eliminates certain
logic operations, achieving a balanced trade-off between area and error performance
[22]. By restructuring logic expressions, this design maintains a reasonable level of
accuracy with minimal hardware cost. However, its fixed logic modifications may not
be optimal for all input distributions, making its effectiveness application-dependent.
Table 1. Summary of existing approximate 4-2 compressors.

| Design | Carry equation | Sum equation |
|---|---|---|
| Momeni [16] | $(X_1 + X_2) + (X_3 + X_4)$ | $(X_1 \oplus X_2) + (X_3 \oplus X_4)$ |
| Akbari$_1$ [17] | $X_4$ | $(X_1 \oplus X_2) + (X_3 \oplus X_4)$ |
| Akbari$_2$ [17] | $(X_1 \cdot X_2) + (X_3 \cdot X_4)$ | $(X_1 \oplus X_2) + (X_3 \oplus X_4)$ |
| Venka [18] | $(X_1 \cdot X_2) + (X_3 \cdot X_4)$ | $(X_1 \oplus X_2) + (X_3 \oplus X_4) + (X_1 \cdot X_2) \cdot (X_3 \cdot X_4)$ |
| Ahma [19] | $\overline{(X_1 + X_2) + (X_3 + X_4)}$ | $\overline{(X_1 + X_2) \cdot (X_3 + X_4)}$ |
| Sabetz [20] | $(X_4 \cdot (X_1 + X_3)) + (X_1 \cdot X_3)$ | 1 |
| Pei [21] | 0 | $(X_1 \cdot X_2) + (X_3 \cdot X_4) + (X_1 + X_2) \cdot (X_3 + X_4)$ |
| Zhang [22] | $X_1 + X_2$ | $(X_1 \oplus X_2) + (X_3 + X_4)$ |
III. PROPOSED APPROXIMATE 4-2 COMPRESSOR AND COMPENSATOR
In this section, we present a novel optimized approximate 4-2 compressor that leverages
compound gates to enhance circuit efficiency while incorporating an error compensation
mechanism to improve computational accuracy. The proposed design follows two key strategies:
first, the systematic optimization of the compressor's Boolean logic, enabling a compound
gate-based implementation that reduces area, power consumption, and delay while maintaining
acceptable accuracy; second, an error compensation mechanism designed to selectively
correct errors occurring in high-probability input cases.
To improve the efficiency of the 4-2 compressor, the Boolean expressions governing
the carry and sum functions are systematically reformulated to enable compound gate-based
implementation. Instead of relying on separate NOR and NAND logic, which increases
gate count and critical path delay, we restructure the logical expressions using Boolean
algebra from the design in [19] to achieve a more efficient implementation. This baseline was chosen because its
Boolean structure can be directly reformulated into a compact form with minimal logic
depth. Compared with other approximate compressors that rely on irregular or complex
expressions, this design offers a practical balance between simplicity and accuracy,
making it a suitable starting point for further optimization and selective error compensation.
The optimized carry and sum functions are expressed as follows:

$Carry = (X_1 + X_2) \cdot (X_3 + X_4)$

$Sum = X_1 + X_2 + X_3 + X_4$
These optimized equations are implemented using compound gates to minimize logic depth
and transistor count. The carry function utilizes an OA22 gate, which integrates two
OR operations followed by an AND operation, reducing the number of logic stages. The
sum function is computed using an OR4 gate, which aggregates all four inputs directly
without intermediate logic layers. As a result, our preliminary synthesis using a
32-nm CMOS library shows that this compound-gate mapping reduces the compressor area
by approximately 33%, power by 38%, and critical path delay by 27% compared to the
original Boolean implementation. These preliminary results confirm the effectiveness
of the reformulated logic and provide the basis for the multiplier-level improvements
discussed in Section IV. However, certain input conditions remain error-prone, requiring
an additional error correction mechanism to improve overall computation accuracy.
While approximate 4-2 compressors reduce hardware complexity, they inherently introduce
errors in specific input conditions. To quantify these errors, we assume uniformly
distributed inputs to the multiplier, a commonly adopted assumption in approximate
arithmetic. Since each partial product is the output of an AND gate, the probability
of '1' is 1/4 and that of '0' is 3/4. As a result, the 4-2 compressor input pattern
"0000" occurs most frequently, with probability $(3/4)^4 = 81/256$, while "1111" is least frequent,
with probability $(1/4)^4 = 1/256$. Intermediate input cases follow accordingly, with single-one patterns
occurring with probability $27/256$, two-one patterns with $9/256$, and three-one
patterns with $3/256$. These derived probabilities are used to weight the error distance
of each input pattern. A detailed analysis of the truth table of the optimized approximate
compressor, as shown in Table 2, reveals that the dominant errors occur for the four input patterns
"0101", "0110", "1001", and "1010", each with probability $9/256$ and an error distance of +1.
These errors arise when the sum output is incorrectly
computed as '1' instead of '0', and they may propagate through subsequent arithmetic
operations and degrade computational accuracy. To mitigate this issue, a compensation
function is derived using Boolean algebraic transformations. The error correction
function $EC$, which is asserted exactly for these four patterns and XORed with the sum output, is optimized as follows:

$EC = (X_1 \oplus X_2) \cdot (X_3 \oplus X_4)$
Table 2. Truth table of the optimized compressor with and without proposed error compensation
mechanism.

| $X_4X_3X_2X_1$ | Prob. | C (w/o EC) | S (w/o EC) | ED (w/o EC) | C (w/ EC) | S (w/ EC) | ED (w/ EC) |
|---|---|---|---|---|---|---|---|
| 0 0 0 0 | 81/256 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 0 0 1 | 27/256 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0 0 1 0 | 27/256 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0 0 1 1 | 9/256 | 0 | 1 | −1 | 0 | 1 | −1 |
| 0 1 0 0 | 27/256 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0 1 0 1 | 9/256 | 1 | 1 | +1 | 1 | 0 | 0 |
| 0 1 1 0 | 9/256 | 1 | 1 | +1 | 1 | 0 | 0 |
| 0 1 1 1 | 3/256 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1 0 0 0 | 27/256 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 0 0 1 | 9/256 | 1 | 1 | +1 | 1 | 0 | 0 |
| 1 0 1 0 | 9/256 | 1 | 1 | +1 | 1 | 0 | 0 |
| 1 0 1 1 | 3/256 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1 1 0 0 | 9/256 | 0 | 1 | −1 | 0 | 1 | −1 |
| 1 1 0 1 | 3/256 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1 1 1 0 | 3/256 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1 1 1 1 | 1/256 | 1 | 1 | −1 | 1 | 1 | −1 |
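The entries of Table 2 can be regenerated from the reformulated compressor equations together with the partial product probabilities derived above. The Python sketch below models the compressor with and without compensation (assuming, as in the text, that each partial product bit is '1' with probability 1/4) and computes the probability-weighted mean error distance:

```python
from itertools import product
from fractions import Fraction

P_ONE = Fraction(1, 4)  # a partial product (AND of two uniform bits) is '1' with prob. 1/4

def proposed(x1, x2, x3, x4, compensate=False):
    """Optimized approximate compressor, optionally with error compensation."""
    carry = (x1 | x2) & (x3 | x4)   # OA22 compound gate
    s = x1 | x2 | x3 | x4           # OR4 gate
    if compensate:
        ec = (x1 ^ x2) & (x3 ^ x4)  # asserted only for the four +1-error patterns
        s ^= ec                     # XOR correction of the sum output
    return carry, s

def mean_error_distance(compensate):
    """Probability-weighted mean |ED| over all 16 input patterns (cf. Table 2)."""
    med = Fraction(0)
    for bits in product((0, 1), repeat=4):
        prob = Fraction(1)
        for b in bits:
            prob *= P_ONE if b else 1 - P_ONE
        c, s = proposed(*bits, compensate=compensate)
        med += prob * abs(2 * c + s - sum(bits))
    return med

print(mean_error_distance(False), mean_error_distance(True))  # 55/256 vs 19/256
```

The compensation logic removes the four +1 error cases (total weight $36/256$), reducing the weighted mean error distance from $55/256$ to $19/256$ while leaving the rarer −1 cases untouched.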
This compensation function selectively detects and corrects errors while minimizing
hardware overhead. The implementation is realized using compound gates, ensuring efficient
integration with the compressor logic. An OAI22 gate is utilized to detect error-prone
input conditions, followed by an AO221 gate that combines the detected error signal
with auxiliary logic terms. The final compensation output is XORed with the sum function
to correct identified errors while maintaining the area and power advantages of approximation.
The complete architecture of the proposed compressor, integrating both the optimized
compressor logic and the error compensation mechanism, is illustrated in Fig. 1. The design consists of an optimized compressor module, which employs compound gates
to minimize delay and power consumption, and an error compensation module, which selectively
corrects erroneous cases using an OAI22-AO221-based logic network. This architecture
achieves an improved trade-off between computational accuracy and hardware efficiency
while ensuring a minimal increase in circuit complexity.
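The compensation function implied by Table 2, $EC = (X_1 \oplus X_2) \cdot (X_3 \oplus X_4)$, admits an OAI22-based factorization, since $EC = \overline{X_1 X_2 + X_3 X_4 + \overline{(X_1 + X_2)(X_3 + X_4)}}$. The sketch below verifies this Boolean identity exhaustively; note that this particular decomposition is an assumption for illustration, not the exact OAI22-AO221 netlist of Fig. 1:

```python
from itertools import product

def ec(x1, x2, x3, x4):
    # Compensation function: exactly one '1' in each input pair.
    return (x1 ^ x2) & (x3 ^ x4)

def ec_gates(x1, x2, x3, x4):
    # OAI22 stage feeding a complemented AND-OR stage (one plausible mapping;
    # the paper's actual gate network may differ).
    oai22 = 1 - ((x1 | x2) & (x3 | x4))
    return 1 - ((x1 & x2) | (x3 & x4) | oai22)

# The two forms agree on all 16 input patterns.
assert all(ec(*b) == ec_gates(*b) for b in product((0, 1), repeat=4))
```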
Fig. 1. Proposed optimized approximate 4-2 compressor with error compensation logic.
Building upon the proposed approximate 4-2 compressor and error compensation mechanism,
we extend our designs to an optimized approximate 8 × 8 multiplier architecture that
strategically integrates the proposed compressors and error compensators to achieve a balance
between hardware efficiency and computational accuracy. The complete structure of
the proposed multiplier is illustrated in Fig. 2, which follows the $C$-$N$ configuration, a well-established technique that selectively
applies approximate compressors to the $N$ least significant columns of the partial
product matrix to optimize energy efficiency while maintaining acceptable precision.
In this design, the partial products are initially generated using conventional AND
gates, forming an 8 × 8 partial product matrix. These partial products are then compressed
using a selective arrangement of compressors based on their error tolerance characteristics.
Instances of the proposed optimized compressor without any error compensation scheme (marked in
red in Fig. 2) are primarily placed in the least significant columns, where minor computational
errors have a negligible impact on the overall accuracy of the multiplier. In contrast,
compressors with integrated compensation logic (marked in blue in Fig. 2) are selectively deployed in the seventh column, where error-prone cases are more
critical. This targeted placement ensures that error compensation is applied only
in regions where it provides the most benefit, effectively reducing significant computational
deviations while maintaining a compact hardware footprint. To preserve computational
precision in the final summation stages, exact 4-2 compressors, full adders, and half
adders are used in the most significant columns. After partial product compression
by the compressors, the remaining partial sums are processed by an accurate adder,
such as a ripple-carry adder, composed of conventional full adders and half adders,
to compute the final multiplication result. By strategically incorporating the proposed
compressors within this architecture, the multiplier achieves an optimal trade-off
between accuracy, power efficiency, and circuit complexity.
Fig. 2. Proposed 8 × 8 approximate multiplier structure based on C-N configuration,
applying optimized compressors with and without the proposed error compensation logic.
IV. EXPERIMENTAL RESULTS
To evaluate the effectiveness of the proposed approximate 4-2 compressors in multiplier
designs, we implemented two variants of the 8 × 8 multiplier, shown in Fig. 2, in Verilog HDL. The first variant, referred to as $Proposed$, incorporates the optimized
compressor without the error compensation module. The second variant, denoted as $Proposed_{EC}$,
includes the proposed compensation logic described in the previous section. Both designs
were synthesized using a 32-nm standard cell library to obtain post-synthesis metrics
including area, delay, and power consumption. It is noteworthy that, for the compressors,
compound gates were directly instantiated in the Verilog description, and synthesis
was performed with optimization options disabled to preserve the intended mapping.
For comparison, we also implemented and synthesized a baseline exact multiplier design
as well as eight representative approximate multipliers, each based on a different
4-2 compressor previously presented in the literature. These include designs utilizing
compressors proposed by Momeni [16], Akbari [17] (two variants), Venka [18], Ahma [19], Sabetz [20], Pei [21], and Zhang [22]. All multiplier architectures were configured identically in terms of partial product
generation and reduction tree structure to ensure a fair comparison.
Table 3 summarizes the performance of all multiplier designs in terms of hardware and error
metrics. For clarity and ease of comparison, the top two designs for each metric are
highlighted in bold. The hardware evaluation includes fundamental parameters, such
as area, power, and delay, as well as compound metrics including power-delay product
(PDP), energy-delay product (EDP), area-delay product (ADP), and power-delay-area
product (PDAP), which collectively provide insight into overall design efficiency.
Among all evaluated designs, both the $Proposed$ and $Proposed_{EC}$ achieve the smallest
area footprints, measuring 642.22 μm$^2$ and 658.49 μm$^2$, respectively. This represents
a 26.1% and 24.2% area reduction compared to the exact multiplier and a 5%∼10% improvement
over most other approximate designs. In terms of power consumption, the $Sabetz$ and
$Pei$ record the lowest values of 171.86 μW and 175.23 μW, respectively, due to their
extremely simplified logic, while the $Proposed$ follows closely at 177.36 μW, indicating
that the optimized compound-gate-based logic structure contributes to low switching
activity and efficient signal propagation. Despite including additional error correction
logic, the $Proposed_{EC}$ maintains a competitive power consumption of 183.02 μW,
which is still lower than most existing designs. With respect to delay, the approximate
multipliers exhibit nearly identical timing performance (1.163 ns for all designs except the $Pei$ at 1.151 ns), as they share the same reduction
tree structure. Consequently, the joint metrics more clearly differentiate overall
design efficiency. The proposed design achieves a PDP of 206.31 fJ and an ADP of 747.03
fm$^2 \cdot$s, which are among the best reported values. Notably, the PDAP, which
jointly reflects the impact of area, delay, and power, is only 132.49 aJ·m$^2$ for
the $Proposed$, outperforming all designs except the $Sabetz$, which sacrifices accuracy
significantly to minimize power.
Table 3. Performance summary of various multiplier designs in terms of various hardware
and error metrics.

| Design | Area (μm$^2$) | Power (μW) | Delay (ns) | PDP (fJ) | EDP (yJ·s) | ADP (fm$^2 \cdot$s) | PDAP (aJ·m$^2$) | Error Rate (%) | NMED ($\times 10^{-3}$) | MRED ($\times 10^{-2}$) | NoEB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Exact | 868.66 | 258.23 | 1.410 | 363.98 | 513.02 | 1224.38 | 316.17 | − | − | − | 16 |
| Momeni [16] | 752.01 | 203.99 | 1.163 | 237.27 | 275.98 | 874.70 | 178.43 | 93.38 | 1.63 | 9.03 | 8.92 |
| Akbari$_1$ [17] | 697.12 | 192.97 | 1.163 | 224.45 | 261.06 | 810.83 | 156.47 | 84.34 | 2.93 | 4.82 | 8.01 |
| Akbari$_2$ [17] | 719.99 | 191.00 | 1.163 | 222.21 | 258.53 | 837.64 | 159.99 | **63.45** | 1.29 | **1.29** | 8.98 |
| Venka [18] | 761.16 | 202.37 | 1.163 | 235.38 | 275.98 | 885.35 | 179.16 | **63.45** | **1.25** | **1.28** | **9.05** |
| Ahma [19] | 669.67 | 183.24 | 1.163 | 213.13 | 247.90 | 778.92 | 142.73 | 77.40 | 1.47 | 1.70 | 8.94 |
| Sabetz [20] | 660.52 | **171.86** | 1.163 | **199.90** | **232.52** | 768.29 | **132.04** | 97.90 | 1.93 | 9.52 | 8.71 |
| Pei [21] | 715.42 | **175.23** | **1.151** | **201.61** | **231.96** | 823.11 | 144.24 | 98.00 | 6.20 | 8.91 | 7.18 |
| Zhang [22] | 690.26 | 196.25 | 1.163 | 228.27 | 265.51 | 802.87 | 157.56 | 92.45 | 1.95 | 4.41 | 8.70 |
| Proposed | **642.22** | 177.36 | 1.163 | 206.31 | 239.97 | **747.03** | **132.49** | 77.40 | 1.47 | 1.70 | 8.94 |
| Proposed$_{EC}$ | **658.49** | 183.02 | 1.163 | 212.89 | 247.63 | **765.95** | 140.18 | 74.99 | **1.24** | 1.41 | **9.13** |
In terms of error metrics, we utilize four well-known indicators in approximate arithmetic:
error rate (ER), normalized mean error distance (NMED), mean relative error distance
(MRED), and the number of effective bits (NoEB) [23]. All figures are measured by exhaustive evaluation over all unsigned 8-bit input
pairs $a, b \in [0, 255]$ ($256 \times 256 = 65,536$ cases) under a uniform distribution,
where for each pair we compute the exact product $P$ and the approximate product $\tilde{P}$.
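This exhaustive sweep can be expressed compactly as follows (a Python sketch; the NoEB formula used here, $NoEB = 2n - \log_2(1 + RMSE)$, is one common formulation in the approximate-arithmetic literature [23], under which an exact $n \times n$ multiplier attains $NoEB = 2n$, and MRED is averaged over nonzero exact products):

```python
import math

def error_metrics(approx_mul, n=8):
    """ER (%), NMED, MRED, NoEB for an unsigned n x n approximate multiplier."""
    max_p = (2 ** n - 1) ** 2  # largest exact product, used to normalize MED
    errors, abs_sum, rel_sum, sq_sum, rel_cnt = 0, 0.0, 0.0, 0.0, 0
    for a in range(2 ** n):
        for b in range(2 ** n):
            e = approx_mul(a, b) - a * b
            if e:
                errors += 1
            abs_sum += abs(e)
            sq_sum += e * e
            if a * b:  # exclude zero products from the relative error average
                rel_sum += abs(e) / (a * b)
                rel_cnt += 1
    cases = 4 ** n
    er = 100.0 * errors / cases
    nmed = (abs_sum / cases) / max_p
    mred = rel_sum / rel_cnt
    noeb = 2 * n - math.log2(1 + math.sqrt(sq_sum / cases))
    return er, nmed, mred, noeb

# Sanity check: an exact multiplier scores ER = 0, NMED = 0, MRED = 0, NoEB = 16.
print(error_metrics(lambda a, b: a * b))
```

Plugging in any behavioral multiplier model in place of the exact lambda reproduces the evaluation flow behind the error columns of Table 3, though the exact figures depend on the modeled reduction tree.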
The $Akbari_2$ and $Venka$ achieve the lowest error rates of 63.45%, and also show
strong performance in MRED and NMED. However, these accuracy gains come at a higher
cost in area and power. In comparison, our $Proposed$ design exhibits an error rate
of 77.40% with moderate values of $1.47 \times 10^{-3}$ and $1.70 \times 10^{-2}$
in NMED and MRED, respectively, striking a better balance between resource cost and
approximate accuracy. The inclusion of compensation logic in the $Proposed_{EC}$ improves
all error metrics. Specifically, it reduces NMED to $1.24 \times 10^{-3}$ and MRED
to $1.41 \times 10^{-2}$, confirming that the compensation mechanism effectively corrects
high-probability, high-impact errors without incurring major hardware overhead. Moreover,
the NoEB of the $Proposed_{EC}$ reaches 9.13, the highest among all approximate designs,
suggesting that it preserves a greater number of statistically reliable bits.
Overall, both proposed designs demonstrate a highly competitive trade-off between
hardware efficiency and accuracy. The $Proposed$ variant is more power- and area-efficient,
while the $Proposed_{EC}$ offers enhanced accuracy with minimal additional cost, validating
the effectiveness of our compensation strategy within an approximate computing context.
Fig. 3 illustrates the trade-off between hardware efficiency and computational accuracy
by plotting PDAP against three representative error metrics: NMED, MRED, and NoEB.
For each plot, designs that are positioned lower on the vertical axis indicate better
hardware efficiency (lower PDAP), while the horizontal axis reflects computational
accuracy: smaller values are preferable for NMED and MRED, whereas larger values are preferable
for NoEB. In Fig. 3(a), the $Proposed$ and $Proposed_{EC}$ appear in the bottom-left region, indicating
strong hardware efficiency along with low normalized mean error distance. The $Proposed_{EC}$
shows slightly lower NMED than $Proposed$, validating the effectiveness of the error
compensation mechanism. While designs such as $Venka$ and $Akbari_2$ achieve marginally
lower NMED, they do so at the cost of significantly higher PDAP, reflecting increased
hardware complexity. In Fig. 3(b), a similar trend is observed. The proposed designs maintain their position near the
lower region of the graph, with the $Proposed_{EC}$ achieving notably lower MRED than
the $Proposed$. This again confirms that the compensation logic effectively mitigates
high-magnitude errors without compromising hardware efficiency. On the other hand,
designs like the $Sabetz$, while achieving a low PDAP, suffer from substantially higher
MRED—indicating poor relative accuracy despite being hardware-light. In Fig. 3(c), the $Proposed_{EC}$ occupies a position toward the bottom-right, demonstrating that
it offers one of the highest NoEB values while maintaining a low PDAP. This suggests
a strong preservation of effective bitwidth alongside hardware efficiency. By comparison,
designs such as the $Venka$ and $Momeni$ exhibit high NoEB but incur much greater
PDAP, again reflecting a bias toward accuracy at the cost of area and power.
Fig. 3. Tradeoff analysis of hardware efficiency and computation accuracy on various
multipliers: (a) PDAP vs NMED, (b) PDAP vs MRED, and (c) PDAP vs NoEB.
In summary, across all three plots, the proposed designs, particularly the $Proposed_{EC}$,
consistently occupy balanced, favorable regions, demonstrating their effectiveness
in jointly optimizing hardware and accuracy. While some designs prioritize one objective
over the other, our approach achieves a well-rounded trade-off, making it a compelling
candidate for approximate computing scenarios where both energy efficiency and output
fidelity are essential.
V. APPLICATIONS
1. Evaluation on Digital Image Processing
Image processing tasks often involve intensive arithmetic operations such as filtering,
denoising, and feature extraction, which place significant computational demands on
underlying hardware. To assess the practical effectiveness of the proposed approximate
multipliers, we evaluated their performance on Gaussian smoothing (i.e., blurring)
and Sobel edge detection tasks applied to a standard benchmark image. These operations
represent two common image processing workloads (low-pass filtering and gradient-based
feature extraction, respectively), where approximate arithmetic can reduce hardware
cost while maintaining acceptable visual quality. The output quality was measured
using the peak signal-to-noise ratio (PSNR), a widely used objective metric in image
processing. Although PSNR does not always perfectly correlate with human perceptual
quality, higher PSNR values generally indicate results that are closer to the ground
truth, making it a useful numerical benchmark for comparing hardware-level approximations.
This evaluation allows us to quantify the impact of our approximate designs not only
in terms of hardware efficiency but also in terms of real-world application-level
fidelity.
Gaussian smoothing, also known as Gaussian blur, is a widely used filtering technique
that suppresses high-frequency noise and detail by averaging neighboring pixel values
with a weighted kernel. In our experiment, we employed a discrete 5 × 5 Gaussian kernel
to perform smoothing over a grayscale input image [24]. The kernel weights are highest at the center and gradually decrease toward the edges,
effectively applying a low-pass filter that preserves the general structure of the
image while reducing sharp variations. To evaluate the impact of approximate arithmetic
on visual quality, we replaced all multiplication operations in the Gaussian convolution
process with those performed by exact and approximate multipliers, including our proposed
designs. The resulting images were then compared using PSNR to assess the fidelity
of each multiplier architecture. This experiment provides practical insight into how
approximate multipliers affect perceptual quality in noise-reduction tasks.
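This evaluation flow can be reproduced with a small script in which the innermost multiplication is injectable, so the same convolution runs with either an exact or an approximate multiplier model. The Python sketch below uses an illustrative integer-valued 5 × 5 binomial kernel (the paper's actual kernel coefficients and benchmark image are not specified here):

```python
import math

# Illustrative integer 5x5 Gaussian-like kernel (coefficients sum to 256).
KERNEL = [[1,  4,  6,  4, 1],
          [4, 16, 24, 16, 4],
          [6, 24, 36, 24, 6],
          [4, 16, 24, 16, 4],
          [1,  4,  6,  4, 1]]

def convolve(img, mul):
    """5x5 convolution with zero padding; `mul` is the (possibly approximate) multiplier."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for ky in range(5):
                for kx in range(5):
                    yy, xx = y + ky - 2, x + kx - 2
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += mul(img[yy][xx], KERNEL[ky][kx])
            out[y][x] = min(255, acc >> 8)  # divide by kernel sum (256) and clamp
    return out

def psnr(ref, test):
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    mse = sum((r - t) ** 2 for rr, tt in zip(ref, test) for r, t in zip(rr, tt))
    mse /= len(ref) * len(ref[0])
    return float("inf") if mse == 0 else 10 * math.log10(255 ** 2 / mse)
```

Running `convolve` once with the exact product (`lambda a, b: a * b`) and once with a behavioral model of an approximate multiplier, then comparing the two outputs with `psnr`, mirrors the comparison reported in Fig. 4.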
Fig. 4 shows the output images produced by applying Gaussian smoothing using both exact
and approximate multipliers, along with their corresponding PSNR values in decibels
(dB). This visual and quantitative comparison allows us to assess the impact of approximate
arithmetic on image quality. Among the approximate designs, multipliers such as the
$Momeni$ and $Sabetz$, which prioritize minimal hardware complexity, result in noticeable
visual degradation and low PSNR values of 26.91 dB and 26.56 dB, respectively. Visually,
these outputs exhibit darkened backgrounds and increased blurring in object boundaries,
indicating a loss of fine-grained detail due to excessive approximation. In particular,
high-frequency noise artifacts become more prominent in the smoother background regions,
reducing perceptual clarity. The $Pei$ design, with the lowest PSNR of 22.89 dB, exhibits
the most severe degradation, with overly smooth textures and poor contrast across
the image. On the other hand, designs such as $Akbari_2$ and $Venka$ yield relatively
higher PSNRs of 38.98 dB and 39.01 dB, respectively. Their outputs retain more structure
and edge definition, especially around the camera and tripod regions. However, subtle
background noise and minor edge distortions are still visible, suggesting residual
approximation errors. Our proposed designs demonstrate clear advantages in both objective
and perceptual quality. The $Proposed$ multiplier achieves a PSNR of 42.62 dB, matching
the best-performing reference design of the $Ahma$, while visually preserving sharpness
and smooth transitions across the image. The background remains clean, and key structural
features, such as the subject's coat edges and the camera details, are well preserved.
Most notably, the $Proposed_{EC}$ variant, which integrates targeted error compensation,
achieves the highest PSNR at 49.12 dB. In addition to this quantitative improvement,
it provides superior visual quality, with minimal background noise, enhanced uniformity,
and clearly defined object contours. The result closely resembles the exact multiplier
output, confirming the effectiveness of the compensation logic in mitigating high-impact
approximation errors while maintaining hardware efficiency.
Fig. 4. Gaussian smoothing results on cameraman image using exact and approximate
multipliers, with corresponding PSNR values.
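For reference, the PSNR values reported in Fig. 4 follow the standard definition for 8-bit images, computable as below; the sketch assumes the exact-multiplier output serves as the reference image.

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    # PSNR in dB between the exact-multiplier output (reference) and an
    # approximate result; identical images give infinite PSNR.
    diff = reference.astype(np.float64) - test.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```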
To further evaluate the robustness of the proposed approximate multipliers in feature-sensitive
applications, we applied them to Sobel edge detection using a 5 × 5 convolution operator.
This task highlights image gradients and edge structures, making it particularly sensitive
to arithmetic precision and local error accumulation [25,26]. In our implementation, both horizontal and vertical Sobel kernels were applied to
the input image to extract edge information in both directions. As in the previous
experiment, all multiplication operations within the convolution were replaced by
exact or approximate multipliers, including our proposed designs. After computing
the gradient magnitude at each pixel, the results were normalized to an 8-bit grayscale
range. To assess structural fidelity, we compared the output images against the exact
results using PSNR, providing a quantitative measure of how well each multiplier preserves
edge detail under approximate computation. Fig. 5 presents the results of Sobel edge detection using exact and approximate multipliers,
alongside their corresponding PSNR values. Similar to the Gaussian smoothing experiment,
low-complexity designs such as $Momeni$ and $Sabetz$ produce visually degraded
edge maps, with PSNR values of 17.70 dB and 17.69 dB, respectively. These results
reveal a significant loss of edge clarity and increased background artifacts, caused
by imprecise arithmetic during gradient computation. Intermediate designs, including
the $Pei$ (28.37 dB) and $Zhang$ (33.92 dB), exhibit moderate performance. While they
retain general edge structures, distortions remain noticeable, particularly around
fine details such as the tripod and background textures. These visual deviations suggest
that moderate arithmetic precision may still lead to undesirable errors in feature-sensitive
tasks. By contrast, high-precision approximate designs such as $Akbari_2$, $Venka$,
and $Ahma$, achieve PSNR values above 40 dB and produce edge maps visually close to
the exact result, preserving critical edge contours and gradient transitions. Our
$Proposed$ design matches this level of fidelity with a PSNR of 40.21 dB, confirming
the effectiveness of the restructured compressor logic. Most notably, the $Proposed_{EC}$
variant achieves the highest PSNR of 42.21 dB, producing clean, continuous edges with
minimal background noise, closely resembling the ground-truth edge map. These results
validate that the proposed approximate multipliers, particularly the compensated version,
are highly effective at preserving high-frequency information that is crucial for
edge-based analysis and downstream vision tasks.
Fig. 5. Sobel edge detection results on cameraman image using exact and approximate
multipliers, with corresponding PSNR values.
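The gradient computation used in this experiment can be sketched as follows. The particular 5 × 5 coefficients shown are one common extension of the 3 × 3 Sobel operator and are an assumption, not necessarily the exact kernels used here; since most approximate multipliers operate on unsigned operands, the sketch applies the kernel signs outside the multiplier.

```python
import numpy as np

# A 5x5 extension of the Sobel derivative kernels; these particular
# coefficients are an assumption, not necessarily the paper's kernels.
SOBEL_X = np.array([[-1,  -2, 0,  2, 1],
                    [-4,  -8, 0,  8, 4],
                    [-6, -12, 0, 12, 6],
                    [-4,  -8, 0,  8, 4],
                    [-1,  -2, 0,  2, 1]])
SOBEL_Y = SOBEL_X.T

def sobel_edges(image, mul):
    # mul(a, b) is the unsigned multiplier under test; kernel signs are
    # applied outside the multiplier, and the gradient magnitude is
    # normalized to the 8-bit grayscale range.
    h, w = image.shape
    padded = np.pad(image, 2, mode="edge").astype(np.int64)
    mag = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            gx = gy = 0
            for i in range(5):
                for j in range(5):
                    p = int(padded[y + i, x + j])
                    cx, cy = int(SOBEL_X[i, j]), int(SOBEL_Y[i, j])
                    gx += (1 if cx >= 0 else -1) * mul(p, abs(cx))
                    gy += (1 if cy >= 0 else -1) * mul(p, abs(cy))
            mag[y, x] = np.hypot(gx, gy)
    peak = mag.max()
    return (mag / peak * 255).astype(np.uint8) if peak > 0 else mag.astype(np.uint8)
```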
Taken together, the outcomes of both Gaussian smoothing and Sobel edge detection experiments
consistently highlight the practical benefits of our proposed designs across varying
computational demands. While many conventional approximate multipliers sacrifice perceptual
or structural fidelity to minimize hardware cost, our $Proposed_{EC}$ architecture
demonstrates that carefully targeted error compensation, combined with a hardware-efficient
compressor design, substantially mitigates high-impact errors and achieves superior
PSNR without incurring significant overhead. In particular, the proposed compressor
achieves the lowest NMED and the highest NoEB among all evaluated designs, indicating
that it not only reduces the average error distance but also preserves effective bitwidth,
which directly translates into higher perceptual quality in image-processing tasks.
This results in improved numerical accuracy and visual quality, even under feature-sensitive
conditions. As such, the proposed approach offers a promising design paradigm for
approximate arithmetic in energy-constrained, yet accuracy-aware applications.
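The NMED and NoEB metrics cited above can be computed exhaustively for small operand widths. The sketch below uses the definitions standard in the approximate-arithmetic literature, which we assume match this paper's usage: NMED is the mean error distance normalized by the maximum exact product, and NoEB is $2n - \log_2(1 + \mathrm{MED})$ for an $n \times n$ multiplier.

```python
import math

def error_metrics(mul, n=8):
    # Exhaustive NMED and NoEB for an n x n unsigned multiplier mul(a, b),
    # using the standard definitions (assumed to match the paper's usage):
    #   MED  = mean absolute error distance over all input pairs
    #   NMED = MED / maximum exact product
    #   NoEB = 2n - log2(1 + MED)
    total = 0
    for a in range(2 ** n):
        for b in range(2 ** n):
            total += abs(mul(a, b) - a * b)
    med = total / 4 ** n
    nmed = med / (2 ** n - 1) ** 2
    noeb = 2 * n - math.log2(1 + med)
    return nmed, noeb
```

An exact multiplier yields NMED = 0 and the full effective bit-width of 2n; lower NMED and higher NoEB indicate a more accurate approximate design.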
2. Evaluation on Convolutional Neural Networks
Deep learning inference, particularly with convolutional neural networks (CNNs), represents
one of the most important and compute-intensive workloads in modern AI applications.
Since CNNs rely heavily on multiply-accumulate (MAC) operations in convolutional and
fully connected layers, the efficiency and accuracy of multipliers directly affect
both system-level performance and energy. Therefore, evaluating approximate multipliers
in CNNs provides a strong indication of their suitability for real-world artificial
intelligence (AI) accelerators.
To evaluate the impact of the proposed approximate multipliers on CNN inference accuracy,
we conduct experiments using PyTorch with the AdaPT framework, which simulates approximate
DNN accelerators by substituting each multiplier with lookup table (LUT)-based approximate
arithmetic in the convolutional and fully connected layers [27]. The classification accuracy under the proposed and existing approximate multipliers
is measured on the CIFAR-10 dataset across three representative CNN models: VGG-19,
ResNet50, and DenseNet121. Fig. 6 summarizes the classification accuracies under the various multipliers. For reference, when
using the exact multiplier, the baseline accuracies are 93.80% on VGG-19, 93.61% on
ResNet50, and 93.93% on DenseNet121. These values serve as the accuracy baselines
against which all approximate designs are compared. Several approximate designs, such
as $Akbari_2$, $Venka$, and $Ahma$, achieve accuracies above 93%, confirming that
approximate arithmetic can be applied to CNNs without severe accuracy degradation.
Our proposed and $Proposed_{EC}$ designs also consistently fall into this high-accuracy
group. Specifically, $Proposed_{EC}$ achieves 93.40% on VGG-19, 93.17% on ResNet50,
and 93.61% on DenseNet121, all within 0.2 percentage points of the exact baseline.
These values are either on par with or slightly higher than the best-performing approximate
baselines. Notably, some designs, such as $Momeni$ and $Sabetz$, suffer from severe
degradation, with accuracies dropping to around 10%. Since CIFAR-10 has ten classes,
this level of accuracy is essentially equivalent to random guessing, indicating that
these multipliers fail to support meaningful CNN inference. In contrast, our proposed
designs maintain baseline-level performance, highlighting their robustness and practical
applicability. In addition to accuracy performance, our designs retain the hardware
benefits demonstrated earlier. With around 15% reduction in area and 10∼15% reduction
in power compared to other approximate compressors, the proposed multiplier provides
a highly competitive solution for CNN inference. This combination of near-baseline
accuracy and improved hardware efficiency underscores the practical value of the proposed
architecture for AI hardware accelerators.
Fig. 6. CNN classification accuracy with VGG-19, ResNet50, and DenseNet121 under different
multiplier designs.
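The LUT-based substitution that AdaPT performs can be illustrated schematically. This sketch is our own simplification and does not reflect the framework's actual API: the full product table of an approximate multiplier is precomputed once, and each MAC operation then replaces multiplication with a table lookup.

```python
import numpy as np

def build_lut(mul, bits=8):
    # Tabulate every product of an approximate multiplier once, so that
    # inference can replace multiplication with array indexing. These
    # function names are illustrative, not AdaPT's actual API.
    n = 2 ** bits
    lut = np.empty((n, n), dtype=np.int64)
    for a in range(n):
        for b in range(n):
            lut[a, b] = mul(a, b)
    return lut

def approx_mac(activations, weights, lut):
    # One multiply-accumulate step of a convolutional or fully connected
    # layer on quantized unsigned operands: products come from the LUT.
    return int(lut[activations, weights].sum())
```

Because the table is indexed rather than recomputed, inference over an entire CNN remains fast while faithfully reproducing the approximate multiplier's error behavior.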
VI. CONCLUSION
In this paper, we presented an error-aware approximate 4-2 compressor architecture
that achieves a balanced trade-off between hardware efficiency and computational accuracy
in approximate multiplier designs. The proposed design first reduces hardware cost
by simplifying the Boolean expressions of a baseline compressor, and further improves
accuracy through a lightweight error compensation logic that targets high-impact error
patterns. When implemented in a 32-nm CMOS technology, our baseline design achieves
up to 15.1% area and 9.5% power reduction compared to prior approximate counterparts,
while maintaining comparable accuracy performance. The compensated variant further
improves output fidelity, achieving the highest PSNR of 49.12 dB in Gaussian smoothing
and 42.21 dB in Sobel edge detection, surpassing all other tested designs. Both variants
also exhibit low PDAP values, indicating strong hardware efficiency, and maintain
classification accuracy within 0.5 percentage points of the baseline in CNN inference
tasks, further confirming their suitability for modern AI applications. Overall, the
proposed approach offers a practical and scalable framework for integrating approximate
arithmetic into future low-power computing architectures.
ACKNOWLEDGMENT
This work was supported by the National Research Foundation of Korea (NRF) grant funded
by the Korea government (MSIT) (RS-2024-00414964).
REFERENCES
Mendoza-Cardenas F., Aparcana-Tasayco A. J., Leon-Aguilar R. S., Quiroz-Arroyo J.
L., 2022, Cryptography for privacy in a resource-constrained IoT: A systematic literature
review, IEIE Transactions on Smart Processing and Computing, Vol. 11, No. 5, pp. 351-360

Yoon D.-H., Seo H., Lee J., Kim Y., 2024, Online electric vehicle charging strategy
in residential areas with limited power supply, IEEE Transactions on Smart Grid, Vol.
15, No. 3, pp. 3141-3151

Ryu S., 2022, Review and analysis of variable bit-precision MAC microarchitectures
for energy-efficient AI computation, Journal of Semiconductor Technology and Science,
Vol. 22, No. 5, pp. 353-360

Venkataramani S., Chakradhar S. T., Roy K., Raghunathan A., 2015, Approximate computing
and the quest for computing efficiency, Proc. of 2015 52nd ACM/EDAC/IEEE Design Automation
Conference (DAC), pp. 1-6

Xu Q., Mytkowicz T., Kim N. S., 2016, Approximate computing: A survey, IEEE Design
& Test, Vol. 33, No. 1, pp. 8-22

Kwak M., Kim J., Kim Y., 2023, TorchAxf: Enabling rapid simulation of approximate
DNN models using GPU-based floating-point computing framework, Proc. of 2023 31st
International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication
Systems (MASCOTS), pp. 1-8

Kwak M., Kim J., Kim Y., 2024, A comprehensive exploration of approximate DNN models
with a novel floating-point simulation framework, Performance Evaluation, Vol. 165,
pp. 102423

Seo H., Kim Y., 2024, Enabling quantum computer simulation under minimal precision
floating-point using irrational value decomposition, Proc. of 2024 32nd International
Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems
(MASCOTS), pp. 1-8

Hwang S., Seo H., Kim Y., 2025, Can less accurate be more accurate? Surpassing exact
multiplier with approximate design on NISQ quantum computers, Proc. of the 40th ACM/SIGAPP
Symposium on Applied Computing, pp. 590-591

Kwak M., Lee S., Kim Y., 2025, Design of approximate floating-point arithmetic units
using hardware-efficient rounding schemes, IEEE Embedded Systems Letters, pp. 1-4

Seo H., Seok H., Lee J., Han Y., Kim Y., 2023, Design of an approximate adder based
on modified full adder and nonzero truncation for machine learning, Journal of Semiconductor
Technology and Science, Vol. 23, No. 2, pp. 138-148

Seo H., Kim Y., 2023, A low latency approximate adder design based on dual sub-adders
with error recovery, IEEE Transactions on Emerging Topics in Computing, Vol. 11, No.
3, pp. 811-816

Hwang S., Seok H., Kim Y., 2024, Design of an approximate 4-2 compressor with error
recovery for efficient approximate multiplication, Journal of Semiconductor Technology
and Science, Vol. 24, No. 4, pp. 305-315

Gu J., Kim Y., 2022, Design and analysis of approximate 4-2 compressor for efficient
multiplication, IEIE Transactions on Smart Processing and Computing, Vol. 11, No.
3, pp. 162-168

Strollo A. G. M., Napoli E., De Caro D., Petra N., Di Meo G., 2020, Comparison and
extension of approximate 4-2 compressors for low-power approximate multipliers, IEEE
Transactions on Circuits and Systems I: Regular Papers, Vol. 67, No. 9, pp. 3021-3034

Momeni A., Han J., Montuschi P., Lombardi F., 2015, Design and analysis of approximate
compressors for multiplication, IEEE Transactions on Computers, Vol. 64, No. 4, pp.
984-994

Akbari O., Kamal M., Afzali-Kusha A., Pedram M., 2017, Dual-quality 4:2 compressors
for utilizing in dynamic accuracy configurable multipliers, IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, Vol. 25, No. 4, pp. 1352-1361

Venkatachalam S., Ko S.-B., 2017, Design of power and area efficient approximate multipliers,
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 5,
pp. 1782-1786

Ahmadinejad M., Moaiyeri M. H., Sabetzadeh F., 2019, Energy and area efficient imprecise
compressors for approximate multiplication at nanoscale, AEU - International Journal
of Electronics and Communications, Vol. 110, pp. 152859

Sabetzadeh F., Moaiyeri M. H., Ahmadinejad M., 2019, A majority-based imprecise multiplier
for ultra-efficient approximate image multiplication, IEEE Transactions on Circuits
and Systems I: Regular Papers, Vol. 66, No. 11, pp. 4200-4208

Pei H., Yi X., Zhou H., He Y., 2021, Design of ultra-low power consumption approximate
4-2 compressors based on the compensation characteristic, IEEE Transactions on Circuits
and Systems II: Express Briefs, Vol. 68, No. 1, pp. 461-465

Zhang M., Nishizawa S., Kimura S., 2023, Area efficient approximate 4-2 compressor
and probability-based error adjustment for approximate multiplier, IEEE Transactions
on Circuits and Systems II: Express Briefs, Vol. 70, No. 5, pp. 1714-1718

Esposito D., Strollo A. G. M., Napoli E., De Caro D., Petra N., 2018, Approximate
multipliers based on new approximate compressors, IEEE Transactions on Circuits and
Systems I: Regular Papers, Vol. 65, No. 12, pp. 4169-4182

Hwang S., Kwon K.-W., Kim Y., 2025, Design of a hardware-efficient approximate 4-2
compressor for multiplications in image processing, IEEE Embedded Systems Letters,
Vol. 17, No. 4, pp. 226-229

Chung Y., Kim Y., 2021, Comparison of approximate computing with sobel edge detection,
IEIE Transactions on Smart Processing and Computing, Vol. 10, No. 4, pp. 355-361

Joe H., Kim Y., 2020, Compact and power-efficient sobel edge detection with fully
connected cube-network-based stochastic computing, Journal of Semiconductor Technology
and Science, Vol. 20, No. 5, pp. 436-446

Danopoulos D., Zervakis G., Siozios K., Soudris D., Henkel J., 2023, AdaPT: Fast emulation
of approximate DNN accelerators in PyTorch, IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, Vol. 42, No. 6, pp. 2074-2078

Dongju Kim received his B.S. degree from the School of Computer Science and Engineering
at Kyungpook National University, Daegu, Republic of Korea in 2025, where he is currently
pursuing an M.S. degree. His research interests include computer architecture, approximate
computing, and quantum computing.
Yongtae Kim received B.S. and M.S. degrees in electrical engineering from Korea
University, Seoul, Republic of Korea, in 2007 and 2009, respectively, and a Ph.D.
degree from the Department of Electrical and Computer Engineering at Texas A&M
University, College Station, TX, in 2013. From 2013 to 2018, he was a software engineer
with Intel Corporation, Santa Clara, CA. Since 2018, he has been with the School of
Computer Science and Engineering at Kyungpook National University, Daegu, Republic
of Korea, where he is currently an Associate Professor. His research interests are
in energy-efficient integrated circuits and systems, particularly approximate computing,
quantum computing, neuromorphic computing, and new memory architecture.