
  1. (School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Korea)



Keywords: Approximate computing, approximate multiplier, approximate 4-2 compressor, error compensation

I. INTRODUCTION

In contemporary computing, the rapid expansion of mobile devices and cloud-based data centers has led to an unprecedented demand for high-performance and energy-efficient computation. Mobile devices are increasingly required to execute computationally intensive applications, such as image processing and machine learning inference, while operating under stringent power constraints [1]. Simultaneously, data centers must process massive volumes of data with limited power budgets and physical constraints, making energy-efficient computation a critical concern [2]. However, conventional computing architectures, which rely on precise arithmetic operations, struggle to meet these demands due to inherent limitations in power efficiency, scalability, and computational speed [3]. The ever-growing complexity of workloads exacerbates these challenges, as exact computation often incurs significant hardware overhead and energy consumption. In response to these constraints, approximate computing has emerged as a promising paradigm that strategically relaxes computational accuracy to achieve substantial gains in power efficiency and performance [4,5]. Approximate computing leverages the observation that many application domains, such as multimedia processing, machine learning, and signal processing, exhibit an intrinsic tolerance to minor inaccuracies [6-9]. In image and video processing, for instance, slight distortions are often imperceptible to the human eye, allowing arithmetic operations to be performed with reduced precision without significantly degrading perceptual quality [10]. By exploiting this error tolerance, approximate computing facilitates the design of arithmetic circuits with reduced hardware complexity, lower power consumption, and improved computational efficiency, making it a compelling approach for next-generation computing architectures.

Arithmetic circuits, particularly adders and multipliers, play a crucial role in high-performance computing applications, such as signal processing, machine learning, and cryptographic systems [11-14]. Among these, multipliers are of particular importance as they directly impact overall system performance, power consumption, and area efficiency. Due to the inherent complexity of multiplication, modern multiplier architectures are designed to optimize computational efficiency by minimizing the number of operations required for partial product accumulation. Tree-based multipliers, such as Wallace and Dadda multipliers, are widely employed to enhance multiplication efficiency. These architectures aim to reduce the number of sequential addition operations by utilizing a tree-like structure that compresses partial products efficiently. Wallace multipliers employ a highly parallel reduction scheme, using carry-save adders and compressors to rapidly reduce the number of partial products at each stage. In contrast, Dadda multipliers adopt a more structured approach, minimizing the number of reduction steps while maintaining a near-optimal delay. Both architectures share a common three-stage multiplication process: 1) partial product generation, 2) partial product reduction, and 3) final summation. Among these stages, the partial product reduction stage is particularly critical, as it has a direct impact on circuit delay, area, and power consumption [15]. To enhance the efficiency of this stage, approximate compressors have been increasingly utilized for error-tolerant applications. These compressors simplify the reduction process by intentionally introducing minor errors, thereby reducing the complexity of arithmetic operations. By strategically balancing computational accuracy and hardware efficiency, approximate compressors offer a promising approach to designing energy-efficient and high-speed multipliers.

In this paper, we propose an error-aware approximate 4-2 compressor design that integrates a lightweight error compensation mechanism for balanced accuracy and efficiency in approximate multipliers. The first part of our design redefines the Boolean expressions of an existing approximate 4-2 compressor to improve logic structure and reduce hardware cost. The second part introduces a selective error correction logic that addresses high-impact, high-probability error cases to improve overall computational accuracy. As a result, in a 32-nm CMOS process, the proposed designs achieve up to 15% reductions in area and power compared to prior approximate multipliers while maintaining competitive accuracy performance. In particular, the compensated version improves accuracy without incurring significant hardware overhead, demonstrating a favorable trade-off between efficiency and precision. The key contributions of this paper are summarized as follows:

  • We present a novel approximate 4-2 compressor that improves hardware efficiency by simplifying logic expressions while maintaining acceptable accuracy.

  • We propose a selective error compensation mechanism that effectively reduces error distance with minimal hardware overhead, improving the accuracy of approximate multiplication.

  • We validate the proposed designs through implementation and application to digital image processing and deep learning tasks, demonstrating their practical effectiveness in error-resilient computing scenarios.

II. RELATED WORK

In multiplier architectures, the partial product reduction stage plays a pivotal role in determining overall circuit efficiency. Compressors, particularly 4-2 compressors, are widely utilized in this stage to minimize the number of partial products and enhance computational performance. The exact 4-2 compressor is a fundamental arithmetic component designed to accurately reduce four input bits $X_1, X_2, X_3, X_4$ along with a carry-in bit $C_{in}$ into three output signals: the carry-out bit $C_{out}$, the intermediate carry bit $Carry$, and the final sum $Sum$. The behavior of an exact 4-2 compressor is governed by the following Boolean equations:

(1)
$C_{out} = (X_1 \oplus X_2) \cdot X_3 + \overline{(X_1 \oplus X_2)} \cdot X_1,$
(2)
$Carry = (X_1 \oplus X_2 \oplus X_3 \oplus X_4) \cdot C_{in} + \overline{(X_1 \oplus X_2 \oplus X_3 \oplus X_4)} \cdot X_4,$
(3)
$Sum = X_1 \oplus X_2 \oplus X_3 \oplus X_4 \oplus C_{in}.$

These expressions dictate how the input bits are processed to generate precise output values, ensuring the exact representation of the reduced partial products without introducing computational errors. The significance of the exact 4-2 compressor lies in its ability to balance hardware complexity and computational accuracy. Unlike approximate compressors, which introduce minor errors to achieve lower power consumption and reduced circuit area, the exact 4-2 compressor prioritizes accuracy, making it particularly suitable for applications requiring high numerical precision. Consequently, it remains an essential component in conventional multiplier designs, especially in scenarios where error tolerance is minimal and strict computational correctness is paramount.
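
The behavior defined by Eqs. (1)-(3) can be checked directly. The following sketch (Python, for illustration only) encodes the three equations as bit operations and exhaustively verifies the defining invariant of a 4-2 compressor: the five input bits must equal the weighted sum of the three output bits.

```python
# Behavioral model of the exact 4-2 compressor from Eqs. (1)-(3).
# A correct 4-2 compressor satisfies, for every input combination:
#   X1 + X2 + X3 + X4 + Cin == Sum + 2*(Carry + Cout)
from itertools import product

def exact_compressor(x1, x2, x3, x4, cin):
    cout = x3 if (x1 ^ x2) else x1      # Eq. (1): 2-to-1 mux form
    s4 = x1 ^ x2 ^ x3 ^ x4
    carry = cin if s4 else x4           # Eq. (2): 2-to-1 mux form
    s = s4 ^ cin                        # Eq. (3)
    return cout, carry, s

# Exhaustively verify the weighted-sum invariant over all 32 input cases.
for bits in product((0, 1), repeat=5):
    cout, carry, s = exact_compressor(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
print("exact 4-2 compressor verified for all 32 input combinations")
```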

Exact 4-2 compressors maintain full computational accuracy but come at the cost of increased hardware complexity, power consumption, and propagation delay. These drawbacks become critical in large-scale multipliers where numerous compressors operate simultaneously. To mitigate these issues, approximate 4-2 compressors have been developed, leveraging strategic simplifications in Boolean logic to reduce transistor count while tolerating minor computational errors. The most common approximation techniques involve removing carry-in (i.e., $C_{in}$) and carry-out (i.e., $C_{out}$) signals and simplifying sum and carry calculations. These modifications enable lower power consumption, reduced circuit area, and improved speed, making approximate compressors highly suitable for error-resilient applications, such as deep learning accelerators, image processing, and signal processing.

Several approximate 4-2 compressor designs have been proposed to balance hardware efficiency and computational accuracy, as summarized in Table 1. Each design employs a unique strategy, offering different trade-offs in terms of error tolerance, circuit area, and power efficiency. Momeni et al. proposed a design that removes $C_{in}$ and $C_{out}$, significantly reducing logic complexity [16]. This simplification leads to lower power consumption and area overhead. However, eliminating carry propagation introduces larger error magnitudes, making this approach less suitable for applications demanding high accuracy. Akbari's dual-quality compressors (Akbari$_1$ and Akbari$_2$) offer a configurable approximation mode, allowing a trade-off between precision and efficiency [17]. The first variant (Akbari$_1$) introduces a simplified carry computation, reducing transistor count, while the second variant (Akbari$_2$) employs an optimized logic expression for better accuracy. While these designs provide adaptive performance, their switching mechanisms introduce additional control overhead, slightly increasing circuit complexity. Venkatachalam et al. developed a compressor that approximates basic arithmetic building blocks such as half-adders and full-adders [18]. This approach minimizes propagation delay and reduces power consumption, making it particularly effective for high-speed applications. However, the accumulation of small approximation errors over multiple compressor stages may result in higher total error rates in large multipliers. Ahmadinejad et al. employed NOR-based and NAND-based logic simplifications, significantly reducing transistor count while maintaining a reasonable level of accuracy [19]. This design is advantageous in low-power embedded systems, but its reliance on basic logic simplifications makes it susceptible to higher error variability across different input conditions. Sabetz et al. 
introduced an extreme approximation by fixing the sum ($Sum$) output to a constant value, dramatically reducing logic complexity and power consumption [20]. While this method achieves the smallest possible hardware footprint, it leads to a significant increase in computational error, limiting its applicability to only highly error-tolerant applications. Pei's compressor uses an error compensation mechanism that adjusts outputs based on specific input patterns [21]. This method reduces the overall error distance compared to other designs, improving accuracy while maintaining moderate hardware efficiency. However, the additional compensation logic slightly increases power consumption compared to more aggressively simplified compressors. Zhang's design reconfigures signal paths and selectively eliminates certain logic operations, achieving a balanced trade-off between area and error performance [22]. By restructuring logic expressions, this design maintains a reasonable level of accuracy with minimal hardware cost. However, its fixed logic modifications may not be optimal for all input distributions, making its effectiveness application-dependent.

Table 1. Summary of existing approximate 4-2 compressors.

| Design | Carry equation | Sum equation |
|---|---|---|
| Momeni [16] | $(X_1 + X_2) + (X_3 + X_4)$ | $(X_1 \oplus X_2) + (X_3 \oplus X_4)$ |
| Akbari$_1$ [17] | $X_4$ | $(X_1 \oplus X_2) + (X_3 \oplus X_4)$ |
| Akbari$_2$ [17] | $(X_1 \cdot X_2) + (X_3 \cdot X_4)$ | $(X_1 \oplus X_2) + (X_3 \oplus X_4)$ |
| Venka [18] | $(X_1 \cdot X_2) + (X_3 \cdot X_4)$ | $(X_1 \oplus X_2) + (X_3 \oplus X_4) + (X_1 \cdot X_2) \cdot (X_3 \cdot X_4)$ |
| Ahma [19] | $\overline{(X_1 + X_2) + (X_3 + X_4)}$ | $\overline{(X_1 + X_2) \cdot (X_3 + X_4)}$ |
| Sabetz [20] | $(X_4 \cdot (X_1 + X_3)) + (X_1 \cdot X_3)$ | 1 |
| Pei [21] | 0 | $(X_1 \cdot X_2) + (X_3 \cdot X_4) + (X_1 + X_2) \cdot (X_3 + X_4)$ |
| Zhang [22] | $X_1 + X_2$ | $(X_1 \oplus X_2) + (X_3 + X_4)$ |
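
The trade-offs summarized in Table 1 can be probed numerically. The sketch below evaluates one representative entry, the Akbari$_2$ equations, by measuring the error distance of every 4-bit input pattern; the compressor's value is taken as $2 \cdot Carry + Sum$ against the exact column count $X_1 + X_2 + X_3 + X_4$ (carry-in and carry-out are omitted, as in these approximate designs).

```python
# Error-distance (ED) profile of the Akbari_2 approximate 4-2 compressor
# from Table 1: Carry = X1*X2 + X3*X4, Sum = (X1^X2) + (X3^X4).
from itertools import product

def akbari2(x1, x2, x3, x4):
    carry = (x1 & x2) | (x3 & x4)
    s = (x1 ^ x2) | (x3 ^ x4)
    return carry, s

eds = {}
for bits in product((0, 1), repeat=4):
    x1, x2, x3, x4 = bits
    carry, s = akbari2(*bits)
    eds[bits] = (2 * carry + s) - (x1 + x2 + x3 + x4)

wrong = [b for b, e in eds.items() if e != 0]
# 5 of 16 patterns are wrong; the all-ones input underestimates by 2.
print(f"{len(wrong)}/16 patterns erroneous, worst ED = {min(eds.values())}")
```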

III. PROPOSED APPROXIMATE 4-2 COMPRESSOR AND COMPENSATOR

In this section, we present a novel optimized approximate 4-2 compressor that leverages compound gates to enhance circuit efficiency while incorporating an error compensation mechanism to improve computational accuracy. The proposed design follows two key strategies: first, the systematic optimization of the compressor's Boolean logic, enabling a compound gate-based implementation that reduces area, power consumption, and delay while maintaining acceptable accuracy; second, an error compensation mechanism designed to selectively correct errors occurring in high-probability input cases.

To improve the efficiency of the 4-2 compressor, the Boolean expressions governing the carry and sum functions are systematically reformulated to enable compound gate-based implementation. Instead of relying on separate NOR and NAND logic, which increases gate count and critical path delay, we restructure the logical expressions using Boolean algebra from the design in [19] to achieve a more efficient implementation. This baseline was chosen because its Boolean structure can be directly reformulated into a compact form with minimal logic depth. Compared with other approximate compressors that rely on irregular or complex expressions, this design offers a practical balance between simplicity and accuracy, making it a suitable starting point for further optimization and selective error compensation. The optimized carry and sum functions are expressed as follows:

(4)
$Carry = \overline{\overline{(X_1 + X_2)} + \overline{(X_3 + X_4)}} = (X_1 + X_2) \cdot (X_3 + X_4),$
(5)
$Sum = \overline{\overline{(X_1 + X_2)} \cdot \overline{(X_3 + X_4)}} = (X_1 + X_2) + (X_3 + X_4).$

These optimized equations are implemented using compound gates to minimize logic depth and transistor count. The carry function utilizes an OA22 gate, which integrates two OR operations followed by an AND operation, reducing the number of logic stages. The sum function is computed using an OR4 gate, which aggregates all four inputs directly without intermediate logic layers. As a result, our preliminary synthesis using a 32-nm CMOS library shows that this compound-gate mapping reduces the compressor area by approximately 33%, power by 38%, and critical path delay by 27% compared to the original Boolean implementation. These preliminary results confirm the effectiveness of the reformulated logic and provide the basis for the multiplier-level improvements discussed in Section IV. However, certain input conditions remain error-prone, requiring an additional error correction mechanism to improve overall computation accuracy.
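
As a behavioral check, the OA22/OR4 mapping described above can be simulated against its truth table (Table 2). The sketch below confirms that the only overestimations (error distance +1) occur in the four patterns where exactly one of $\{X_1, X_2\}$ and one of $\{X_3, X_4\}$ is set, i.e., the probability-9/256 cases targeted by the compensation logic.

```python
# Behavioral sketch of the proposed compressor's gate mapping:
# carry is the OA22 function (X1+X2)(X3+X4), sum is a 4-input OR.
from itertools import product

def proposed(x1, x2, x3, x4):
    carry = (x1 | x2) & (x3 | x4)   # OA22 compound gate
    s = x1 | x2 | x3 | x4           # OR4 gate
    return carry, s

plus_one = []
for bits in product((0, 1), repeat=4):
    carry, s = proposed(*bits)
    ed = (2 * carry + s) - sum(bits)
    if ed == +1:
        plus_one.append(bits)

# The +1 errors arise exactly when one of {X1,X2} and one of {X3,X4}
# is set, matching the "w/o EC" columns of Table 2.
print(sorted(plus_one))
```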

While approximate 4-2 compressors reduce hardware complexity, they inherently introduce errors under specific input conditions. To quantify these errors, we assume uniformly distributed inputs to the multiplier, a commonly adopted assumption in approximate arithmetic. Since each partial product is the output of an AND gate, the probability of '1' is 1/4 and that of '0' is 3/4. As a result, the input pattern "0000" of the 4-2 compressor occurs most frequently, with probability $(3/4)^4 = 81/256$, while "1111" is least frequent, with probability $(1/4)^4 = 1/256$. Intermediate cases follow accordingly: single-one patterns occur with probability $27/256$, two-one patterns with $9/256$, and three-one patterns with $3/256$. These derived probabilities are used to weight the error distance of each input pattern. A detailed analysis of the truth table of the optimized approximate compressor, as shown in Table 2, reveals that errors predominantly occur in input patterns with probability $9/256$, leading to an error distance of +1. These errors arise when the sum output is incorrectly computed as '1' instead of '0,' and they may propagate through subsequent arithmetic operations and degrade computational accuracy. To mitigate this issue, a compensation function is derived using Boolean algebraic transformations. The error correction function $EC$ is optimized as follows:

(6)
$\begin{aligned} EC &= \overline{(X_1 \oplus X_2) \cdot (X_3 \oplus X_4)} \\ &= \overline{(X_1 + X_2) \cdot (\overline{X_1} + \overline{X_2}) \cdot (X_3 + X_4) \cdot (\overline{X_3} + \overline{X_4})} \\ &= \overline{(X_1 + X_2) \cdot \overline{(X_1 \cdot X_2)} \cdot (X_3 + X_4) \cdot \overline{(X_3 \cdot X_4)}} \\ &= \overline{(X_1 + X_2) \cdot (X_3 + X_4)} + X_1X_2 + X_3X_4 \\ &= \overline{(X_1 + X_2)} \cdot \overline{(X_3 + X_4)} + X_1X_2 + X_3X_4. \end{aligned}$

Table 2. Truth table of the optimized compressor with and without proposed error compensation mechanism.

| $X_4X_3X_2X_1$ | Prob. | C (w/o EC) | S (w/o EC) | ED (w/o EC) | C (w/ EC) | S (w/ EC) | ED (w/ EC) |
|---|---|---|---|---|---|---|---|
| 0000 | 81/256 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0001 | 27/256 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0010 | 27/256 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0011 | 9/256 | 0 | 1 | −1 | 0 | 1 | −1 |
| 0100 | 27/256 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0101 | 9/256 | 1 | 1 | +1 | 1 | 0 | 0 |
| 0110 | 9/256 | 1 | 1 | +1 | 1 | 0 | 0 |
| 0111 | 3/256 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1000 | 27/256 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1001 | 9/256 | 1 | 1 | +1 | 1 | 0 | 0 |
| 1010 | 9/256 | 1 | 1 | +1 | 1 | 0 | 0 |
| 1011 | 3/256 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1100 | 9/256 | 0 | 1 | −1 | 0 | 1 | −1 |
| 1101 | 3/256 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1110 | 3/256 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1111 | 1/256 | 1 | 1 | −1 | 1 | 1 | −1 |
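
Under the partial-product input distribution derived above, the pattern probabilities and the probability-weighted mean error distance of the uncompensated compressor can be reproduced exactly with rational arithmetic:

```python
# Pattern probabilities under uniformly random multiplier operands:
# each partial-product bit is an AND output, so P(1) = 1/4, P(0) = 3/4.
from fractions import Fraction
from itertools import product

P1, P0 = Fraction(1, 4), Fraction(3, 4)

def pattern_prob(bits):
    p = Fraction(1)
    for b in bits:
        p *= P1 if b else P0
    return p

probs = {bits: pattern_prob(bits) for bits in product((0, 1), repeat=4)}
assert probs[(0, 0, 0, 0)] == Fraction(81, 256)   # most frequent pattern
assert probs[(1, 1, 1, 1)] == Fraction(1, 256)    # least frequent pattern
assert sum(probs.values()) == 1

def approx_value(b):
    x1, x2, x3, x4 = b
    carry = (x1 | x2) & (x3 | x4)   # optimized carry (OA22)
    s = x1 | x2 | x3 | x4           # optimized sum (OR4)
    return 2 * carry + s

# Probability-weighted mean error distance before compensation:
# four +1 cases at 9/256 each, plus -1 cases at 9/256, 9/256, and 1/256.
med = sum(p * abs(approx_value(b) - sum(b)) for b, p in probs.items())
print(med)  # -> 55/256
```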

This compensation function selectively detects and corrects errors while minimizing hardware overhead. The implementation is realized using compound gates, ensuring efficient integration with the compressor logic. An OAI22 gate is utilized to detect error-prone input conditions, followed by an AO221 gate that combines the detected error signal with auxiliary logic terms. The final compensation output is XORed with the sum function to correct identified errors while maintaining the area and power advantages of approximation.
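
A minimal behavioral sketch of the compensated compressor follows. It implements the error flag of Eq. (6) and flips the sum only in the four +1-error cases; the flag's complement is folded into a single XOR here for clarity, whereas the gate-level realization uses the OAI22/AO221 network described above.

```python
# Compensated compressor: EC (Eq. 6) detects the four +1-error patterns
# and its complement is XORed into the sum to cancel them.
from itertools import product

def proposed_ec(x1, x2, x3, x4):
    carry = (x1 | x2) & (x3 | x4)
    s = x1 | x2 | x3 | x4
    ec = 1 - ((x1 ^ x2) & (x3 ^ x4))   # Eq. (6): EC = NOT((X1^X2)(X3^X4))
    s_ec = s ^ (1 - ec)                # flip the sum only in error cases
    return carry, s_ec

eds = {b: (2 * proposed_ec(*b)[0] + proposed_ec(*b)[1]) - sum(b)
       for b in product((0, 1), repeat=4)}

# After compensation no +1 errors remain; only the three -1 cases of
# Table 2 (X1X2 = 11 with X3X4 = 00, the converse, and all-ones) persist.
print(sorted(b for b, e in eds.items() if e != 0))
```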

The complete architecture of the proposed compressor, integrating both the optimized compressor logic and the error compensation mechanism, is illustrated in Fig. 1. The design consists of an optimized compressor module, which employs compound gates to minimize delay and power consumption, and an error compensation module, which selectively corrects erroneous cases using an OAI22-AO221-based logic network. This architecture achieves an improved trade-off between computational accuracy and hardware efficiency while ensuring a minimal increase in circuit complexity.

Fig. 1. Proposed optimized approximate 4-2 compressor with error compensation logic.


Building upon the proposed approximate 4-2 compressor and error compensation mechanism, we extend our designs to an optimized approximate 8 × 8 multiplier architecture that strategically integrates these compressors and error compensators to achieve a balance between hardware efficiency and computational accuracy. The complete structure of the proposed multiplier is illustrated in Fig. 2, which follows the $C$-$N$ configuration, a well-established technique that selectively applies approximate compressors to the $N$ least significant columns of the partial product matrix to optimize energy efficiency while maintaining acceptable precision. In this design, the partial products are initially generated using conventional AND gates, forming an 8 × 8 partial product matrix. These partial products are then compressed using a selective arrangement of compressors based on their error tolerance characteristics. The proposed optimized compressors without the error compensation scheme (marked in red in Fig. 2) are primarily placed in the least significant columns, where minor computational errors have a negligible impact on the overall accuracy of the multiplier. In contrast, the proposed compressors with integrated compensation logic (marked in blue in Fig. 2) are selectively deployed in the seventh column, where error-prone cases are more critical. This targeted placement ensures that error compensation is applied only in regions where it provides the most benefit, effectively reducing significant computational deviations while maintaining a compact hardware footprint. To preserve computational precision in the final summation stages, exact 4-2 compressors, full adders, and half adders are used in the most significant columns. After partial product compression by the compressors, the remaining partial sums are processed by an accurate adder, such as a ripple-carry adder, composed of conventional full adders and half adders, to compute the final multiplication result. By strategically incorporating the proposed compressors within this architecture, the multiplier achieves an optimal trade-off between accuracy, power efficiency, and circuit complexity.
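
A simplified behavioral model of the $C$-$N$ placement illustrates the idea: partial-product bits in the $N$ least significant columns are reduced with the uncompensated approximate compressor (carry to the next column, sum kept in place), and everything else is summed exactly. This is an illustrative sketch, not a gate-accurate replica of Fig. 2 (in particular it omits the compensated compressor in the seventh column), so its error statistics differ from those reported later.

```python
# Simplified behavioral model of a C-N approximate 8x8 multiplier:
# columns 0..n-1 use the approximate 4-2 compressor (no Cin/Cout),
# remaining bits are summed exactly.

def approx_compress4(b1, b2, b3, b4):
    carry = (b1 | b2) & (b3 | b4)   # proposed OA22 carry
    s = b1 | b2 | b3 | b4           # proposed OR4 sum
    return carry, s

def approx_mul8(a, b, n):
    cols = [[] for _ in range(16)]  # one bit list per output weight
    for i in range(8):
        for j in range(8):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    for c in range(n):              # approximate region
        while len(cols[c]) >= 4:
            group = [cols[c].pop() for _ in range(4)]
            carry, s = approx_compress4(*group)
            cols[c].append(s)       # sum stays at this weight
            cols[c + 1].append(carry)
    # exact final summation of every remaining bit
    return sum(bit << c for c, bits in enumerate(cols) for bit in bits)

print(approx_mul8(200, 100, 8), "vs exact", 200 * 100)
```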

Fig. 2. Proposed 8 × 8 approximate multiplier structure based on C-N configuration, applying optimized compressors with and without the proposed error compensation logic.


IV. EXPERIMENTAL RESULTS

To evaluate the effectiveness of the proposed approximate 4-2 compressors in multiplier designs, we implemented two variants of the 8 × 8 multiplier, shown in Fig. 2, in Verilog HDL. The first variant, referred to as $Proposed$, incorporates the optimized compressor without the error compensation module. The second variant, denoted as $Proposed_{EC}$, includes the proposed compensation logic described in the previous section. Both designs were synthesized using a 32-nm standard cell library to obtain post-synthesis metrics including area, delay, and power consumption. It is noteworthy that, for the compressors, compound gates were directly instantiated in the Verilog description, and synthesis was performed with optimization options disabled to preserve the intended mapping. For comparison, we also implemented and synthesized a baseline exact multiplier as well as eight representative approximate multipliers, each based on a different 4-2 compressor previously presented in the literature. These include designs utilizing the compressors proposed by Momeni [16], Akbari (both variants) [17], Venka [18], Ahma [19], Sabetz [20], Pei [21], and Zhang [22]. All multiplier architectures were configured identically in terms of partial product generation and reduction tree structure to ensure a fair comparison.

Table 3 summarizes the performance of all multiplier designs in terms of hardware and error metrics. For clarity and ease of comparison, the top two designs for each metric are highlighted in bold. The hardware evaluation includes fundamental parameters, such as area, power, and delay, as well as compound metrics including power-delay product (PDP), energy-delay product (EDP), area-delay product (ADP), and power-delay-area product (PDAP), which collectively provide insight into overall design efficiency. Among all evaluated designs, the $Proposed$ and $Proposed_{EC}$ achieve the smallest area footprints, measuring 642.22 μm$^2$ and 658.49 μm$^2$, respectively. This represents a 26.1% and 24.2% area reduction compared to the exact multiplier and a 5%∼10% improvement over most other approximate designs. In terms of power consumption, the $Sabetz$ and $Pei$ record the lowest values of 171.86 μW and 175.23 μW, respectively, owing to their extremely simplified logic, while the $Proposed$ follows closely at 177.36 μW, indicating that the optimized compound-gate-based logic structure contributes to low switching activity and efficient signal propagation. Despite including additional error correction logic, the $Proposed_{EC}$ maintains a competitive power consumption of 183.02 μW, which is still lower than that of most existing designs. With respect to delay, the approximate multipliers exhibit nearly identical timing performance, as they share the same reduction tree structure. Consequently, the joint metrics more clearly differentiate overall design efficiency. The $Proposed$ design achieves a PDP of 206.31 fJ and an ADP of 747.03 fm$^2 \cdot$s, which are among the best reported values. Notably, the PDAP, which jointly reflects the impact of area, delay, and power, is only 132.49 aJ·m$^2$ for the $Proposed$, outperforming all designs except the $Sabetz$, which sacrifices accuracy significantly to minimize power.

Table 3. Performance summary of various multiplier designs in terms of various hardware and error metrics.

| Design | Area (μm$^2$) | Power (μW) | Delay (ns) | PDP (fJ) | EDP (yJ·s) | ADP (fm$^2 \cdot$s) | PDAP (aJ·m$^2$) | Error rate (%) | NMED ($\times 10^{-3}$) | MRED ($\times 10^{-2}$) | NoEB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Exact | 868.66 | 258.23 | 1.410 | 363.98 | 513.02 | 1224.38 | 316.17 | - | - | - | 16 |
| Momeni [16] | 752.01 | 203.99 | 1.163 | 237.27 | 275.98 | 874.70 | 178.43 | 93.38 | 1.63 | 9.03 | 8.92 |
| Akbari$_1$ [17] | 697.12 | 192.97 | 1.163 | 224.45 | 261.06 | 810.83 | 156.47 | 84.34 | 2.93 | 4.82 | 8.01 |
| Akbari$_2$ [17] | 719.99 | 191.00 | 1.163 | 222.21 | 258.53 | 837.64 | 159.99 | 63.45 | 1.29 | 1.29 | 8.98 |
| Venka [18] | 761.16 | 202.37 | 1.163 | 235.38 | 275.98 | 885.35 | 179.16 | 63.45 | 1.25 | 1.28 | 9.05 |
| Ahma [19] | 669.67 | 183.24 | 1.163 | 213.13 | 247.90 | 778.92 | 142.73 | 77.40 | 1.47 | 1.70 | 8.94 |
| Sabetz [20] | 660.52 | 171.86 | 1.163 | 199.90 | 232.52 | 768.29 | 132.04 | 97.90 | 1.93 | 9.52 | 8.71 |
| Pei [21] | 715.42 | 175.23 | 1.151 | 201.61 | 231.96 | 823.11 | 144.24 | 98.00 | 6.20 | 8.91 | 7.18 |
| Zhang [22] | 690.26 | 196.25 | 1.163 | 228.27 | 265.51 | 802.87 | 157.56 | 92.45 | 1.95 | 4.41 | 8.70 |
| Proposed | 642.22 | 177.36 | 1.163 | 206.31 | 239.97 | 747.03 | 132.49 | 77.40 | 1.47 | 1.70 | 8.94 |
| Proposed$_{EC}$ | 658.49 | 183.02 | 1.163 | 212.89 | 247.63 | 765.95 | 140.18 | 74.99 | 1.24 | 1.41 | 9.13 |

In terms of error metrics, we utilize four well-known indicators in approximate arithmetic: error rate (ER), normalized mean error distance (NMED), mean relative error distance (MRED), and the number of effective bits (NoEB) [23]. All figures are measured by exhaustive evaluation over all unsigned 8-bit input pairs $a, b \in [0, 255]$ ($256 \times 256 = 65,536$ cases) under a uniform distribution, where for each pair we compute the exact product $P$ and the approximate product $\tilde{P}$. The $Akbari_2$ and $Venka$ achieve the lowest error rates of 63.45%, and also show strong performance in MRED and NMED. However, these accuracy gains come at a higher cost in area and power. In comparison, our $Proposed$ design exhibits an error rate of 77.40% with moderate values of $1.47 \times 10^{-3}$ and $1.70 \times 10^{-2}$ in NMED and MRED, respectively, striking a better balance between resource cost and approximate accuracy. The inclusion of compensation logic in the $Proposed_{EC}$ improves all error metrics. Specifically, it reduces NMED to $1.24 \times 10^{-3}$ and MRED to $1.41 \times 10^{-2}$, confirming that the compensation mechanism effectively corrects high-probability, high-impact errors without incurring major hardware overhead. Moreover, the NoEB of the $Proposed_{EC}$ reaches 9.13, the highest among all approximate designs, suggesting that it preserves a greater number of statistically reliable bits.
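
The error metrics above can be reproduced by exhaustive simulation. The sketch below uses the standard definitions; for NoEB we assume the common form $NoEB = 16 - \log_2(1 + RMSE)$ for a 16-bit output, which may differ slightly from the exact formulation used in [23].

```python
# Error metrics for an 8x8 multiplier, by exhaustive 8-bit simulation:
# ER   = fraction of input pairs with a wrong product,
# NMED = mean |error distance| normalized by the max product 255*255,
# MRED = mean |ED| / exact product (zero-product cases skipped),
# NoEB = 16 - log2(1 + RMSE)  (a common definition, assumed here).
import math

def error_metrics(approx_mul):
    n_err = tot_ed = tot_red = se = red_cnt = 0
    for a in range(256):
        for b in range(256):
            exact, approx = a * b, approx_mul(a, b)
            ed = abs(approx - exact)
            n_err += ed != 0
            tot_ed += ed
            se += ed * ed
            if exact:
                tot_red += ed / exact
                red_cnt += 1
    n = 256 * 256
    er = n_err / n
    nmed = (tot_ed / n) / (255 * 255)
    mred = tot_red / red_cnt
    noeb = 16 - math.log2(1 + math.sqrt(se / n))
    return er, nmed, mred, noeb

# Sanity check: an exact multiplier scores ER = 0, NMED = 0, NoEB = 16.
print(error_metrics(lambda a, b: a * b))
```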

Overall, both proposed designs demonstrate a highly competitive trade-off between hardware efficiency and accuracy. The $Proposed$ variant is more power- and area-efficient, while the $Proposed_{EC}$ offers enhanced accuracy with minimal additional cost, validating the effectiveness of our compensation strategy within an approximate computing context.

Fig. 3 illustrates the trade-off between hardware efficiency and computational accuracy by plotting PDAP against three representative error metrics: NMED, MRED, and NoEB. In each plot, designs positioned lower on the vertical axis offer better hardware efficiency (lower PDAP), while the horizontal axis reflects computational accuracy: smaller values are preferable for NMED and MRED, and larger values are preferable for NoEB. In Fig. 3(a), the $Proposed$ and $Proposed_{EC}$ appear in the bottom-left region, indicating strong hardware efficiency along with low normalized mean error distance. The $Proposed_{EC}$ shows slightly lower NMED than the $Proposed$, validating the effectiveness of the error compensation mechanism. While designs such as the $Venka$ and $Akbari_2$ achieve marginally lower NMED, they do so at the cost of significantly higher PDAP, reflecting increased hardware complexity. In Fig. 3(b), a similar trend is observed. The proposed designs maintain their position near the lower region of the graph, with the $Proposed_{EC}$ achieving notably lower MRED than the $Proposed$. This again confirms that the compensation logic effectively mitigates high-magnitude errors without compromising hardware efficiency. On the other hand, designs like the $Sabetz$, while achieving a low PDAP, suffer from substantially higher MRED, indicating poor relative accuracy despite being hardware-light. In Fig. 3(c), the $Proposed_{EC}$ occupies a position toward the bottom-right, demonstrating that it offers one of the highest NoEB values while maintaining a low PDAP. This suggests strong preservation of effective bitwidth alongside hardware efficiency. By comparison, designs such as the $Venka$ and $Momeni$ exhibit high NoEB but incur much greater PDAP, again reflecting a bias toward accuracy at the cost of area and power.

Fig. 3. Tradeoff analysis of hardware efficiency and computation accuracy on various multipliers: (a) PDAP vs NMED, (b) PDAP vs MRED, and (c) PDAP vs NoEB.


In summary, across all three plots, the proposed designs, particularly the $Proposed_{EC}$, consistently occupy balanced, favorable regions, demonstrating their effectiveness in jointly optimizing hardware and accuracy. While some designs prioritize one objective over the other, our approach achieves a well-rounded trade-off, making it a compelling candidate for approximate computing scenarios where both energy efficiency and output fidelity are essential.

V. APPLICATIONS

1. Evaluation on Digital Image Processing

Image processing tasks often involve intensive arithmetic operations such as filtering, denoising, and feature extraction, which place significant computational demands on the underlying hardware. To assess the practical effectiveness of the proposed approximate multipliers, we evaluated their performance on Gaussian smoothing (i.e., blurring) and Sobel edge detection tasks applied to a standard benchmark image. These operations represent common image processing workloads (low-pass filtering and gradient-based feature extraction, respectively) in which approximate arithmetic can reduce hardware cost while maintaining acceptable visual quality. The output quality was measured using the peak signal-to-noise ratio (PSNR), a widely used objective metric in image processing. Although PSNR does not always perfectly correlate with human perceptual quality, higher PSNR values generally indicate results that are closer to the ground truth, making it a useful numerical benchmark for comparing hardware-level approximations. This evaluation allows us to quantify the impact of our approximate designs not only in terms of hardware efficiency but also in terms of real-world application-level fidelity.

Gaussian smoothing, also known as Gaussian blur, is a widely used filtering technique that suppresses high-frequency noise and detail by averaging neighboring pixel values with a weighted kernel. In our experiment, we employed a discrete 5 × 5 Gaussian kernel to perform smoothing over a grayscale input image [24]. The kernel weights are highest at the center and gradually decrease toward the edges, effectively applying a low-pass filter that preserves the general structure of the image while reducing sharp variations. To evaluate the impact of approximate arithmetic on visual quality, we replaced all multiplication operations in the Gaussian convolution process with those performed by exact and approximate multipliers, including our proposed designs. The resulting images were then compared using PSNR to assess the fidelity of each multiplier architecture. This experiment provides practical insight into how approximate multipliers affect perceptual quality in noise-reduction tasks.
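
This experiment can be emulated in software by factoring the multiplication out of the convolution. The sketch below uses a standard integer 5 × 5 Gaussian kernel, a synthetic gradient image, and a crude truncation multiplier as a stand-in for the designs above; the image, kernel normalization, and stand-in multiplier are illustrative assumptions, not the paper's exact setup.

```python
# Gaussian smoothing with the multiplier factored out, plus PSNR.
import math

KERNEL = [[1, 4, 6, 4, 1],
          [4, 16, 24, 16, 4],
          [6, 24, 36, 24, 6],
          [4, 16, 24, 16, 4],
          [1, 4, 6, 4, 1]]           # integer Gaussian, weights sum to 256

def smooth(img, mul):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in range(-2, 3):
                for dx in range(-2, 3):
                    yy = min(max(y + dy, 0), h - 1)   # clamp at borders
                    xx = min(max(x + dx, 0), w - 1)
                    acc += mul(img[yy][xx], KERNEL[dy + 2][dx + 2])
            out[y][x] = min(acc >> 8, 255)            # divide by 256
    return out

def psnr(ref, test):
    mse = sum((r - t) ** 2 for rr, tt in zip(ref, test)
              for r, t in zip(rr, tt)) / (len(ref) * len(ref[0]))
    return float("inf") if mse == 0 else 10 * math.log10(255 ** 2 / mse)

# Synthetic 16x16 gradient image; truncation "multiplier" as a stand-in.
img = [[(x * 16 + y) % 256 for x in range(16)] for y in range(16)]
exact = smooth(img, lambda a, b: a * b)
approx = smooth(img, lambda a, b: (a * b) & ~0b1111)
print(f"PSNR = {psnr(exact, approx):.2f} dB")
```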

Fig. 4 shows the output images produced by applying Gaussian smoothing using both exact and approximate multipliers, along with their corresponding PSNR values in decibels (dB). This visual and quantitative comparison allows us to assess the impact of approximate arithmetic on image quality. Among the approximate designs, multipliers such as the $Momeni$ and $Sabetz$, which prioritize minimal hardware complexity, result in noticeable visual degradation and low PSNR values of 26.91 dB and 26.56 dB, respectively. Visually, these outputs exhibit darkened backgrounds and increased blurring at object boundaries, indicating a loss of fine-grained detail due to excessive approximation. In particular, high-frequency noise artifacts become more prominent in the smoother background regions, reducing perceptual clarity. The $Pei$, with the lowest PSNR of 22.89 dB, exhibits the most severe degradation, with overly smooth textures and poor contrast across the image. On the other hand, designs such as the $Akbari_2$ and $Venka$ yield relatively higher PSNRs of 38.98 dB and 39.01 dB, respectively. Their outputs retain more structure and edge definition, especially around the camera and tripod regions. However, subtle background noise and minor edge distortions are still visible, suggesting residual approximation errors. Our proposed designs demonstrate clear advantages in both objective and perceptual quality. The $Proposed$ multiplier achieves a PSNR of 42.62 dB, matching the best-performing reference design, the $Ahma$, while visually preserving sharpness and smooth transitions across the image. The background remains clean, and key structural features, such as the subject's coat edges and the camera details, are well preserved. Most notably, the $Proposed_{EC}$ variant, which integrates targeted error compensation, achieves the highest PSNR at 49.12 dB.
In addition to this quantitative improvement, it provides superior visual quality, with minimal background noise, enhanced uniformity, and clearly defined object contours. The result closely resembles the exact multiplier output, confirming the effectiveness of the compensation logic in mitigating high-impact approximation errors while maintaining hardware efficiency.

Fig. 4. Gaussian smoothing results on cameraman image using exact and approximate multipliers, with corresponding PSNR values.

../../Resources/ieie/JSTS.2025.25.6.633/fig4.png

To further evaluate the robustness of the proposed approximate multipliers in feature-sensitive applications, we applied them to Sobel edge detection using a 5 × 5 convolution operator. This task highlights image gradients and edge structures, making it particularly sensitive to arithmetic precision and local error accumulation [25,26]. In our implementation, both horizontal and vertical Sobel kernels were applied to the input image to extract edge information in both directions. As in the previous experiment, all multiplication operations within the convolution were performed by exact or approximate multipliers, including our proposed designs. After computing the gradient magnitude at each pixel, the results were normalized to an 8-bit grayscale range. To assess structural fidelity, we compared the output images against the exact results using PSNR, providing a quantitative measure of how well each multiplier preserves edge detail under approximate computation. Fig. 5 presents the results of Sobel edge detection using exact and approximate multipliers, alongside their corresponding PSNR values. As in the Gaussian smoothing experiment, low-complexity designs such as $Momeni$ and $Sabetz$ produce visually degraded edge maps, with PSNR values of 17.70 dB and 17.69 dB, respectively. These results reveal a significant loss of edge clarity and increased background artifacts, caused by imprecise arithmetic during gradient computation. Intermediate designs, including $Pei$ (28.37 dB) and $Zhang$ (33.92 dB), exhibit moderate performance. While they retain the general edge structure, distortions remain noticeable, particularly around fine details such as the tripod and background textures. These visual deviations suggest that moderate arithmetic precision can still introduce undesirable errors in feature-sensitive tasks.
By contrast, high-precision approximate designs such as $Akbari_2$, $Venka$, and $Ahma$ achieve PSNR values above 40 dB and produce edge maps visually close to the exact result, preserving critical edge contours and gradient transitions. Our $Proposed$ design matches this level of fidelity with a PSNR of 40.21 dB, confirming the effectiveness of the restructured compressor logic. Most notably, the $Proposed_{EC}$ variant achieves the highest PSNR of 42.21 dB, producing clean, continuous edges with minimal background noise and closely resembling the ground-truth edge map. These results validate that the proposed approximate multipliers, particularly the compensated version, are highly effective at preserving the high-frequency information that is crucial for edge-based analysis and downstream vision tasks.
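The gradient-magnitude computation described above can be sketched as follows. For brevity, this sketch uses the standard 3 × 3 Sobel kernels rather than the 5 × 5 operators of the experiment; the pluggable `mul` argument plays the same role as in the smoothing case.

```python
import numpy as np

# Standard 3x3 Sobel kernels (the experiment in the text uses 5x5 variants)
GX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])  # horizontal gradient
GY = GX.T                                            # vertical gradient

def sobel_edges(image, mul):
    """Gradient magnitude with every kernel product routed through `mul`."""
    h, w = image.shape
    padded = np.pad(image, 1, mode='edge')  # replicate borders
    mag = np.zeros((h, w), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            gx = gy = 0
            for i in range(3):
                for j in range(3):
                    p = int(padded[y + i, x + j])
                    gx += mul(p, int(GX[i, j]))
                    gy += mul(p, int(GY[i, j]))
            mag[y, x] = np.hypot(gx, gy)
    # Normalize to the 8-bit grayscale range, as described in the text
    peak = mag.max()
    if peak > 0:
        mag *= 255.0 / peak
    return np.rint(mag).astype(np.uint8)
```

Swapping an approximate multiplier in for `mul` perturbs `gx` and `gy` before the magnitude and normalization steps, which is why arithmetic error propagates directly into the edge map.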

Fig. 5. Sobel edge detection results on cameraman image using exact and approximate multipliers, with corresponding PSNR values.

../../Resources/ieie/JSTS.2025.25.6.633/fig5.png

Taken together, the outcomes of both the Gaussian smoothing and Sobel edge detection experiments consistently highlight the practical benefits of our proposed designs across varying computational demands. While many conventional approximate multipliers sacrifice perceptual or structural fidelity to minimize hardware cost, our $Proposed_{EC}$ architecture demonstrates that carefully targeted error compensation, combined with a hardware-efficient compressor design, substantially mitigates high-impact errors and achieves superior PSNR without incurring significant overhead. In particular, the proposed compressor achieves the lowest NMED and the highest NoEB among all evaluated designs, indicating that it not only reduces the average error distance but also preserves the effective bitwidth, which directly translates into higher perceptual quality in image-processing tasks. This results in improved numerical accuracy and visual quality, even under feature-sensitive conditions. As such, the proposed approach offers a promising design paradigm for approximate arithmetic in energy-constrained yet accuracy-aware applications.

2. Evaluation on Convolutional Neural Networks

Deep learning inference, particularly with convolutional neural networks (CNNs), represents one of the most important and compute-intensive workloads in modern artificial intelligence (AI) applications. Since CNNs rely heavily on multiply-accumulate (MAC) operations in their convolutional and fully connected layers, the efficiency and accuracy of the multipliers directly affect both system-level performance and energy consumption. Evaluating approximate multipliers in CNNs therefore provides a strong indication of their suitability for real-world AI accelerators.

To evaluate the impact of the proposed approximate multipliers on CNN inference accuracy, we conduct experiments using PyTorch with the AdaPT framework, which simulates approximate DNN accelerators by substituting each multiplier in the convolutional and fully connected layers with lookup table (LUT)-based approximate arithmetic [27]. The classification accuracy under the proposed and existing approximate multipliers is measured on the CIFAR-10 dataset across three representative CNN models: VGG-19, ResNet50, and DenseNet121. Fig. 6 summarizes the classification accuracies for the various multipliers. For reference, the exact multiplier yields baseline accuracies of 93.80% on VGG-19, 93.61% on ResNet50, and 93.93% on DenseNet121; these values serve as the accuracy baselines against which all approximate designs are compared. Several approximate designs, such as $Akbari_2$, $Venka$, and $Ahma$, achieve accuracies above 93%, confirming that approximate arithmetic can be applied to CNNs without severe accuracy degradation. Our $Proposed$ and $Proposed_{EC}$ designs also consistently fall into this high-accuracy group. Specifically, $Proposed_{EC}$ achieves 93.40% on VGG-19, 93.17% on ResNet50, and 93.61% on DenseNet121, all within 0.2 percentage points of the exact baseline. These values are either on par with or slightly higher than those of the best-performing approximate baselines. Notably, some designs, such as $Momeni$ and $Sabetz$, suffer from severe degradation, with accuracies dropping to around 10%. Since CIFAR-10 has ten classes, this level of accuracy is essentially equivalent to random guessing, indicating that these multipliers fail to support meaningful CNN inference. In contrast, our proposed designs maintain baseline-level performance, highlighting their robustness and practical applicability. In addition to their accuracy, our designs retain the hardware benefits demonstrated earlier:
with around a 15% reduction in area and a 10∼15% reduction in power compared to other approximate compressors, the proposed multiplier provides a highly competitive solution for CNN inference. This combination of near-baseline accuracy and improved hardware efficiency underscores the practical value of the proposed architecture for AI hardware accelerators.
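The LUT-based substitution performed by frameworks such as AdaPT can be illustrated with the following conceptual sketch; this is not the AdaPT API. Here `approx_mul` is a hypothetical 8-bit multiplier model standing in for a design under test, and `approx_matmul` is an illustrative layer kernel: a 256 × 256 table is filled once per design and then indexed for every product in the MAC operations, while accumulation stays exact.

```python
import numpy as np

def approx_mul(a, b):
    # Hypothetical 8-bit approximate multiplier model (stands in for a
    # compressor-based design): truncate the two LSBs of one operand.
    return ((a >> 2) << 2) * b

# Precompute one table entry per unsigned 8-bit operand pair
A, B = np.meshgrid(np.arange(256), np.arange(256), indexing='ij')
LUT = approx_mul(A, B).astype(np.int64)

def approx_matmul(x, w):
    """MAC-heavy layer kernel: products come from the LUT, sums stay exact."""
    # x: (m, k) activations, w: (k, n) weights, both uint8
    prods = LUT[x[:, :, None], w[None, :, :]]  # (m, k, n) elementwise products
    return prods.sum(axis=1)                   # (m, n): accumulate over k
```

Because only the multiplications are tabulated, evaluating a new compressor design requires regenerating `LUT` alone; the surrounding network graph is left unchanged.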

Fig. 6. CNN classification accuracy with VGG-19, ResNet50, and DenseNet121 under different multiplier designs.

../../Resources/ieie/JSTS.2025.25.6.633/fig6.png

VI. CONCLUSION

In this paper, we presented an error-aware approximate 4-2 compressor architecture that achieves a balanced trade-off between hardware efficiency and computational accuracy in approximate multiplier designs. The proposed design first reduces hardware cost by simplifying the Boolean expressions of a baseline compressor, and further improves accuracy through a lightweight error compensation logic that targets high-impact error patterns. When implemented in a 32-nm CMOS technology, our baseline design achieves up to 15.1% area and 9.5% power reduction compared to prior approximate counterparts, while maintaining comparable accuracy performance. The compensated variant further improves output fidelity, achieving the highest PSNR of 49.12 dB in Gaussian smoothing and 42.21 dB in Sobel edge detection, surpassing all other tested designs. Both variants also exhibit low PDAP values, indicating strong hardware efficiency, and maintain classification accuracy within 0.5 percentage points of the baseline in CNN inference tasks, further confirming their suitability for modern AI applications. Overall, the proposed approach offers a practical and scalable framework for integrating approximate arithmetic into future low-power computing architectures.

ACKNOWLEDGMENT

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2024-00414964).

REFERENCES

1 
Mendoza-Cardenas F., Aparcana-Tasayco A. J., Leon-Aguilar R. S., Quiroz-Arroyo J. L., 2022, Cryptography for privacy in a resource-constrained IoT: A systematic literature review, IEIE Transactions on Smart Processing and Computing, Vol. 11, No. 5, pp. 351-360
2 
Yoon D.-H., Seo H., Lee J., Kim Y., 2024, Online electric vehicle charging strategy in residential areas with limited power supply, IEEE Transactions on Smart Grid, Vol. 15, No. 3, pp. 3141-3151
3 
Ryu S., 2022, Review and analysis of variable bit-precision MAC microarchitectures for energy-efficient AI computation, Journal of Semiconductor Technology and Science, Vol. 22, No. 5, pp. 353-360
4 
Venkataramani S., Chakradhar S. T., Roy K., Raghunathan A., 2015, Approximate computing and the quest for computing efficiency, Proc. of 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1-6
5 
Xu Q., Mytkowicz T., Kim N. S., 2016, Approximate computing: A survey, IEEE Design & Test, Vol. 33, No. 1, pp. 8-22
6 
Kwak M., Kim J., Kim Y., 2023, TorchAxf: Enabling rapid simulation of approximate DNN models using GPU-based floating-point computing framework, Proc. of 2023 31st International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 1-8
7 
Kwak M., Kim J., Kim Y., 2024, A comprehensive exploration of approximate DNN models with a novel floating-point simulation framework, Performance Evaluation, Vol. 165, p. 102423
8 
Seo H., Kim Y., 2024, Enabling quantum computer simulation under minimal precision floating-point using irrational value decomposition, Proc. of 2024 32nd International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 1-8
9 
Hwang S., Seo H., Kim Y., 2025, Can less accurate be more accurate? Surpassing exact multiplier with approximate design on NISQ quantum computers, Proc. of the 40th ACM/SIGAPP Symposium on Applied Computing, pp. 590-591
10 
Kwak M., Lee S., Kim Y., 2025, Design of approximate floating-point arithmetic units using hardware-efficient rounding schemes, IEEE Embedded Systems Letters, pp. 1-4
11 
Seo H., Seok H., Lee J., Han Y., Kim Y., 2023, Design of an approximate adder based on modified full adder and nonzero truncation for machine learning, Journal of Semiconductor Technology and Science, Vol. 23, No. 2, pp. 138-148
12 
Seo H., Kim Y., 2023, A low latency approximate adder design based on dual sub-adders with error recovery, IEEE Transactions on Emerging Topics in Computing, Vol. 11, No. 3, pp. 811-816
13 
Hwang S., Seok H., Kim Y., 2024, Design of an approximate 4-2 compressor with error recovery for efficient approximate multiplication, Journal of Semiconductor Technology and Science, Vol. 24, No. 4, pp. 305-315
14 
Gu J., Kim Y., 2022, Design and analysis of approximate 4-2 compressor for efficient multiplication, IEIE Transactions on Smart Processing and Computing, Vol. 11, No. 3, pp. 162-168
15 
Strollo A. G. M., Napoli E., De Caro D., Petra N., Meo G. D., 2020, Comparison and extension of approximate 4-2 compressors for low-power approximate multipliers, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 67, No. 9, pp. 3021-3034
16 
Momeni A., Han J., Montuschi P., Lombardi F., 2015, Design and analysis of approximate compressors for multiplication, IEEE Transactions on Computers, Vol. 64, No. 4, pp. 984-994
17 
Akbari O., Kamal M., Afzali-Kusha A., Pedram M., 2017, Dual-quality 4:2 compressors for utilizing in dynamic accuracy configurable multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 4, pp. 1352-1361
18 
Venkatachalam S., Ko S.-B., 2017, Design of power and area efficient approximate multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 5, pp. 1782-1786
19 
Ahmadinejad M., Moaiyeri M. H., Sabetzadeh F., 2019, Energy and area efficient imprecise compressors for approximate multiplication at nanoscale, AEU - International Journal of Electronics and Communications, Vol. 110, p. 152859
20 
Sabetzadeh F., Moaiyeri M. H., Ahmadinejad M., 2019, A majority-based imprecise multiplier for ultra-efficient approximate image multiplication, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 66, No. 11, pp. 4200-4208
21 
Pei H., Yi X., Zhou H., He Y., 2021, Design of ultra-low power consumption approximate 4-2 compressors based on the compensation characteristic, IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 68, No. 1, pp. 461-465
22 
Zhang M., Nishizawa S., Kimura S., 2023, Area efficient approximate 4-2 compressor and probability-based error adjustment for approximate multiplier, IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 70, No. 5, pp. 1714-1718
23 
Esposito D., Strollo A. G. M., Napoli E., De Caro D., Petra N., 2018, Approximate multipliers based on new approximate compressors, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 65, No. 12, pp. 4169-4182
24 
Hwang S., Kwon K.-W., Kim Y., 2025, Design of a hardware-efficient approximate 4-2 compressor for multiplications in image processing, IEEE Embedded Systems Letters, Vol. 17, No. 4, pp. 226-229
25 
Chung Y., Kim Y., 2021, Comparison of approximate computing with sobel edge detection, IEIE Transactions on Smart Processing and Computing, Vol. 10, No. 4, pp. 355-361
26 
Joe H., Kim Y., 2020, Compact and power-efficient sobel edge detection with fully connected cube-network-based stochastic computing, Journal of Semiconductor Technology and Science, Vol. 20, No. 5, pp. 436-446
27 
Danopoulos D., Zervakis G., Siozios K., Soudris D., Henkel J., 2023, AdaPT: Fast emulation of approximate DNN accelerators in PyTorch, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 42, No. 6, pp. 2074-2078
Dongju Kim
../../Resources/ieie/JSTS.2025.25.6.633/au1.png

Dongju Kim received his B.S. degree from the School of Computer Science and Engineering at Kyungpook National University, Daegu, Republic of Korea, in 2025, where he is currently pursuing an M.S. degree. His research interests include computer architecture, approximate computing, and quantum computing.

Yongtae Kim
../../Resources/ieie/JSTS.2025.25.6.633/au2.png

Yongtae Kim received his B.S. and M.S. degrees in electrical engineering from Korea University, Seoul, Republic of Korea, in 2007 and 2009, respectively, and his Ph.D. degree in electrical and computer engineering from Texas A&M University, College Station, TX, in 2013. From 2013 to 2018, he was a software engineer with Intel Corporation, Santa Clara, CA. Since 2018, he has been with the School of Computer Science and Engineering at Kyungpook National University, Daegu, Republic of Korea, where he is currently an Associate Professor. His research interests are in energy-efficient integrated circuits and systems, particularly approximate computing, quantum computing, neuromorphic computing, and new memory architectures.