Mobile QR Code QR CODE

Main Menu

The Journal of Semiconductor Technology and Science (JSTS) is an international, peer-reviewed, and open-access journal that is published bimonthly.
- Scope: semiconductor processes, devices, circuits, and MEMS.
- Editor-in-Chief: Prof. Woo Young Choi (ECE, Seoul National University)
- Indexed within Science Citation Index Expanded (SCIE), SCOPUS, Korea Citation Index (KCI), and other databases.

Journal Search

[

Research article

]

JSTS(Journal of Semiconductor Technology and Science)

IEIE Vol. 25, No. 04, p.363-374

ISSN (print) :

1598-1657

ISSN (online) :

2233-4866

Received : 11 Dec. 2024Revised : 3 Mar. 2025Accepted : 9 Jun 2025

DOI :

https://doi.org/10.5573/JSTS.2025.25.4.363

Scalable Bit Partitioning for Hybrid Random Number Generation in Parallel Stochastic Computing

LeeDonghui¹ KimYongtae¹

(The School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Korea)

^* E-mail: yongtae@knu.ac.kr (Corresponding author: Yongtae Kim)

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.(www.theieie.org).

Abstract

Stochastic computing (SC) allows error tolerance design and the ability to perform complex binary operations using simple digital logic circuits, advantageous for low-power design. However, it requires a longer computation time for higher accuracy. To address this, we propose a novel random number generator (RNG) that produces multiple random numbers simultaneously, reducing computation time through parallel processing while achieving lower hardware resource consumption compared to conventional parallel RNG. Our design uses fixed upper bits and a single RNG to generate multiple random numbers in parallel. This approach demonstrates computation speeds up to 64 times faster than conventional methods while reducing area and power by up to 68.2% and 81.9%, respectively, when implemented in 65-$nm$ CMOS technology. Furthermore, we prove the efficacy of our proposed design in digital image processing applications.

Index Terms

Stochastic computing (SC), random number generator (RNG), bit partitioning, parallel computing

I. INTRODUCTION

Stochastic computing (SC) is an emerging computational paradigm that uses probabilistic methods to perform computations, unlike traditional deterministic binary computing ^[1]. The primary advantages of SC are its high error tolerance to external noise and hardware efficiency. By representing values as bit streams, SC can maintain overall computational accuracy even when individual bits are erroneous ^[2]. In SC, numbers are represented as bit streams, where the value is encoded in the probability of encountering a `1' in the stream, denoted as a stochastic number (SN). For example, an 8-bit binary stream ``10110010'' contains four occurrences of `1' in a total of eight bits, corresponding to a value $4/8 = 0.5$. The basic operation of SC involves generating stochastic bit streams using a stochastic number generator (SNG), which typically consists of a random number generator (RNG) and a comparator. In addition to its bit error resilience, another advantage of SC is that arithmetic operations can be performed using simple logic gates, such as AND gates for multiplication and multiplexers (MUX) for addition, significantly reducing hardware complexity compared to binary arithmetic units.

Fig. 1 exhibits the implementation of addition, subtraction, and multiplication operations using simple logic gates. These characteristics make SC applicable to applications where slight computational errors are acceptable, such as neural networks, image processing, and digital filters ^[3-^9], which is similar to the nature of approximate computing ^[10-^13]. However, SC requires longer bitstreams to achieve higher accuracy and precision, consequently increasing computation time. Specifically, to achieve equivalent precision for an n-bit binary number (BN) in SC, a $2^{n}$-bit SN bitstream is required. For instance, an 8-bit BN requires a 256-bit SN bitstream to maintain the same precision.

To address the computational latency caused by long bit streams for high accuracy in SC systems, we propose a scalable bit partitioning RNG architecture. While our previous work in ^[14] generated two parallel random numbers by fixing the most significant bit (MSB) to either 0 or 1, this approach was limited in scalability for higher parallel processing requirements. Our enhanced design in this work fixes multiple bits in the MSBs and employs a single RNG for the remaining bits, achieving improved scalability while reducing hardware complexity without compromising random number generation capability. This hybrid scheme allows for the generation of multiple random numbers simultaneously, improving the speed of computation in SC systems while maintaining the inherent benefits of SC. The main contributions of this work can be outlined as follows:

• We propose a hybrid RNG design that combines scalable bit partitioning between fixed bits for MSBs and random number generation for the remaining bits.

• We provide a comparative analysis of the proposed RNG against traditional RNG, evaluating and demonstrating both accuracy and hardware resource consumption.

• We validate our proposed RNG by applying it to digital image processing, demonstrating its efficacy in real-world applications.

Fig. 1. Stochastic computing (SC) arithmetic circuits: (a) addition, (b) subtraction, and (c) multiplication.

II. BACKGROUND AND RELATED WORK

The processing of computations in the SC domain necessitates the conversion between a BN and an SN bitstream, which is accomplished utilizing an SNG. The SNG operates by comparing a conventional binary input with a random number produced by an RNG. The RNG is widely used in various applications, including cryptography, clocking, Monte Carlo simulations, machine learning, and SC^[15-^17]. In SC systems, the RNG plays a crucial role in generating stochastic bitstreams that enable probabilistic computations. These bitstreams are essential for tasks such as image processing, neural network acceleration, and fault-tolerant computing, where approximate results are acceptable. The RNG can be divided into a true random number generator (TRNG) and a pseudo-random number generator (PRNG). A TRNG generates a number based on unpredictable physical entropy sources, such as thermal noise or clock jitter, making it highly random and suitable for security-critical applications like cryptography. However, a TRNG requires complex hardware and consumes significant power. In contrast, a PRNG uses a deterministic algorithm to generate number sequences that appear random, though they are based on a predictable pattern. Since SC systems prioritize efficiency in hardware, power, and speed, they primarily rely on PRNGs, which offer a good balance between randomness and computational cost. The RNG generally employs PRNG techniques such as the linear feedback shift register (LFSR), valued for its straightforward hardware implementation ^[18], and low discrepancy (LD) sequences, which provide uniform distribution properties ^[19]. If the input BN exceeds the generated random number, the SNG outputs a '1'; otherwise, it outputs a '0'. This process is repeated to generate a bit stream of the desired length, forming the basis of stochastic number representation. On the other hand, the conversion from stochastic to binary representation is implemented using a counter that accumulates the number of 1s in the stochastic bitstream. The final binary value is determined by the ratio of accumulated 1s to the total stream length. SC employs two primary encoding schemes: unipolar and bipolar. In unipolar encoding, numbers are represented in the range [0, 1], where the probability of ones in the stream directly corresponds to the value. On the other hand, bipolar encoding represents numbers in the range [-1, 1]. This flexibility allows SC to handle both positive-only and signed number representations. Arithmetic operations in SC are performed using remarkably simple circuits. Multiplication, for instance, is achieved using a single AND gate for unipolar encoding, while the addition uses a MUX, as shown in Fig. 1. This simplicity in arithmetic units contributes significantly to the hardware efficiency of SC systems. Also, one of the most notable features of SC is its inherent error tolerance. Due to the probabilistic nature of the computations, individual bit flips have minimal impact on the overall value represented by a stochastic stream. This enables SC to exhibit robust performance in environments susceptible to external errors and noise.

Fig. 2 illustrates two different configurations of SC. In the serial configuration shown in Fig. 2(a), an n-bit BN is converted to an SN using a single SNG, which consists of an RNG and a comparator. This SN bitstream is then processed arithmetic operations through the SC Core, and the result is converted back to an n-bit BN using a counter that accumulates the number of 1s in the stochastic bitstream. The entire process operates sequentially, one bit at a time. On the other hand, the parallel configuration illustrated in Fig. 2(b) employs multiple SNGs operating simultaneously to generate multiple SNs in parallel. These SNs are processed through multiple SC Cores concurrently, and the results are converted back to an n-bit BN using a parallel counter that performs the same accumulation process but for multiple streams simultaneously, enabling faster computation at the cost of increased hardware resources.

Despite its advantages, SC exhibits an inherent trade-off between computation speed and precision, where achieving enhanced accuracy requires longer bitstreams, consequently increasing the computation time. To reduce these long computation times, many schemes have parallelized SC using multiple RNGs that generate a number of random numbers simultaneously ^[20-^32]. However, these methods improve hardware efficiency and randomness, they often struggle to effectively reduce computation latency, which remains a significant limitation in SC-based systems. Ichihara et al. presented a method to generate multiple uncorrelated SNs through circular shifting and wire exchange of the RNG output ^[20]. Similarly, Salehi reduced the correlation between RNG generated multiple random numbers by permutating RNG outputs ^[21]. This approach introduced a similarity function to effectively identify permutations that minimize correlation between the RNG outputs but face scalability challenges as the search space grows rapidly with increasing bit width. Xie et al. proposed an efficient binary-to-stochastic interface using the LFSR sharing with shuffle-flip networks but is sensitive to the pattern selection of shuffle-flip networks ^[22]. Additionally, Kim et al. presented a parallel LFSR architecture to generate multiple random numbers simultaneously using only XOR gates, but increased complexity from the XOR gates can degrade system performance ^[23].

While this approach significantly reduces computation latency, it also leads to higher hardware resource consumption. Joe et al. introduced an efficient RNG approach using a fully connected cube network for generating multiple uncorrelated random numbers from a single RNG ^[24]. While this scheme eliminates additional hardware overhead, it still suffers from long computation times due to the reliance on extended bitstreams. Furthermore, as the network size increases, scalability becomes a challenge due to rising bitstream length constraints and growing network complexity. Although these methods have significantly improved SC efficiency, they still face challenges in computation speed, scalability, and hardware complexity. Whereas many techniques offer novel solutions, achieving a hardware efficient design that effectively reduces computation time remains a challenge.

Fig. 2. Architecture of SC implementation: (a) serial and (b) parallel configurations.

III. PROPOSED HYBRID RANDOM NUMBER GENERATOR

We address the challenge of efficiently generating multiple random numbers in parallel SC scenarios without significantly increasing hardware consumption. Our proposed architecture introduces a novel approach to generating multiple $n$-bit random numbers to enhance parallel SC using a fixed value. Fig. 3 shows the proposed hybrid RNG-based parallel SC architecture. The proposed RNG design involves a scalable bit partitioning scheme that divides an $n$-bit random number into two components: the upper $m$-bit fixed part and the lower $(n-m)$-bit random part. The proposed RNG requires a seed and a clock (CLK) as inputs. The seed sets the initial state for deterministic random number generation, while CLK synchronizes the generation process. The upper $m$-bit fixed part contains predetermined values ranging from 0 to $2^{m}-1$, representing all possible values that can be expressed using $m$-bit. For instance, when $m=3$, the upper three bits utilize eight $(=2^{3})$ combinations ranging from $000_{2}$ to $111_{2}$. This approach effectively splits the entire random number range into $k$ segments, where $k$ is $2^{m}$. For the lower $(n-m)$-bit random part, a single $(n-m)$-bit shared random source is used to generate random numbers within the defined range of each segment. It is noteworthy that the random part can be generated using any random sources, such as LFSR and low discrepancy (LD) sequences. The range for a given $m$-bit value is calculated from $k\times2^{(n-m)}$ to $(k+1)\times2^{(n-m)}-1$ inclusive, where $n$ denotes the total bit width, $m$ represents the fixed bit width, and $k$ indicates the segment index, which ranges from 0 to $2^{m}-1$. Therefore, each segment corresponds to $m$-bit configurations: the first segment, corresponding to the $m$-bit value 0, generates numbers from 0 to $2^{(n-m)}-1$. The second segment generates numbers from $2^{(n-m)}$ to $2\times2^{(n-m)}-1$. This pattern continues until the final segment, represented by the $m$-bit value $k$, which produces numbers from $k\times2^{(n-m)}$ to $(k+1)\times2^{(n-m)}-1$. As an example, we assume that the proposed design generates 8-bit random numbers using an $m=3$ fixed part combined with a 5-bit random source that generates $10110_{2}$. Then, the proposed design produces eight distinct values by combining each 3-bit fixed value with random bits: $00010110_{2}$, $00110110_{2}$, $01010110_{2}$, $01110110_{2}$, $10010110_{2}$, $10110110_{2}$, $11010110_{2}$, and $11110110_{2}$. These numbers are distributed across distinct segments, where each segment contains $32(=2^{(8-3)})$ values: the first segment ranges from 0 to 31 inclusive, the second extends from 32 to 63 inclusive, and the final segment ranges from 224 to 255 inclusive. This partitioning ensures that each m-bit combination generates random numbers within its designated range while maintaining distinct boundaries across the entire range. These generated random numbers (i.e., $R_{0}$ to $R_{k-1}$) serve as the outputs of the random source and are then fed into multiple comparators, which compare them with $n$-bit BN to produce parallel SNs (i.e., $SN_{0}$ to $SN_{k-1}$). Each comparator generates either 0 or 1 based on the comparison result. A key advantage of the proposed design is its ability to generate $k$ random numbers simultaneously in a single cycle, significantly enhancing random number generation efficiency for parallel processing applications. Additionally, the hardware efficiency is substantially improved by employing a single $(n-m)$-bit random source instead of multiple $n$-bit random sources, reducing hardware requirements for parallel implementations.

Furthermore, the proposed design introduces quasi-deterministic characteristics through the bit partitioning scheme that divides an $n$-bit random number into an upper $m$-bit fixed part and a lower $(n-m)$-bit random part. The primary property of quasi-deterministic is based on controlling random fluctuation, defined as bitstream deviations that impact computational precision in SC systems ^[34], through deterministic elements and achieving enhanced convergence characteristics ^[34]. Consider an 8-bit number where the upper 3 bits are fixed to 0102 and the lower 5 bits are randomly generated. This configuration generates random numbers inclusively between 64 and 95, instead of the full random range from 0 to 255. The proposed design utilizes fixed deterministic values in the upper $m$-bits to partition the entire range into multiple segments, constraining random fluctuation within defined boundaries for each segment. This bit partitioning enhances convergence characteristics by constraining random variations within predetermined ranges, enabling faster convergence.

Fig. 3. Proposed hybrid random number generator (RNG)-based parallel SC architecture.

IV. EXPERIMENTAL RESULTS

1. Accuracy Performance Analysis

To investigate the accuracy of our proposed design, we utilize 8-bit binary input numbers and use $m=2$, $3$, and $4$ for different configurations. A 6, 5, and 4-bit LFSR is employed as the RNG for each $m$ value, respectively, while conventional design employs an 8-bit LFSR. For the SC$_{\rm{serial}}$ and SC$_{\rm{parallel}}$ designs, each input uses its own random source to generate SNs. In contrast, the SCshare uses a single shared random source for generating SNs for all inputs. The SC$_{\rm{reverse}}$ and SC$_{\rm{shift}}$ methods employ a single random source and generate a stochastic number through bit-shuffling techniques. The SC$_{\rm{reverse}}$ reorders the bits of the random number in reverse order ^[4], while the SC$_{\rm{shift}}$ adjusts bit positions to improve input correlation ^[20]. The SC$_{\rm{serial}}$ generates one SN per cycle for each input, while the SC$_{\rm{parallel}}$ simultaneously generates four SNs per cycle for each input, processing them in parallel. We calculate the mean squared error (MSE) for all possible input combinations (i.e., $2^{8}\times2^{8}$) for addition, subtraction, and multiplication operations across different clock cycles. Fig. 4 presents MSE results for addition, subtraction, and multiplication operations under various $m$-bit values. The proposed designs demonstrate consistently excellent performance across addition, subtraction, and multiplication operations. When compared to the SC$_{\rm{parallel}}$, our design shows comparable performance in addition and multiplication, while significantly outperforming in subtraction operations with markedly lower MSE values. Furthermore, while SC$_{\rm{reverse}}$ and SC$_{\rm{shift}}$ improve accuracy over conventional methods by reordering or shifting bits, their effectiveness varies depending on the operation. At 256 cycles, both SCreverse and SCshift achieve lower MSE values than our proposed design in addition and multiplication. However, our method still maintains competitive accuracy while achieving significantly faster convergence. More importantly, in subtraction, our design clearly outperforms both SC$_{\rm{reverse}}$ and SC$_{\rm{shift}}$, with drastically lower MSE values. Additionally, a primary advantage of our design is its significantly enhanced convergence characteristics. As the $m$-bit increases, the convergence speed improves dramatically. For instance, with $m=2$, MSE convergence occurs within 16 to 32 cycles, while with $m=4$, convergence is achieved earlier, occurring between 4 and 8 cycles. In contrast, SC$_{\rm{reverse}}$ and SC$_{\rm{shift}}$ require longer cycles to reach similar accuracy levels. This represents a significant improvement over the SC$_{\rm{share}}$ design, which requires 256 cycles for convergence, demonstrating up to 64 times faster convergence speed while also surpassing SC$_everse$ and SC$_{\rm{shift}}$ in subtraction performance and overall computational efficiency.

Fig. 4. Accuracy comparison of arithmetic operations: addition, subtraction, and multiplication.

2. Hardware Performance Anaylsis

To compare the hardware efficiency of the conventional RNG and the proposed RNGs with $m=2, 3,$ and 4, we synthesized the RNGs using Verilog HDL with 65-$nm$ CMOS technology. Table 1 presents the area, power consumption, and energy-delay-area product (EDAP) of the RNGs. The proposed designs employ an ($8-m$)-bit LFSR with $m$-bit fixed value, while the conventional RNG uses an 8-bit LFSR. It is noteworthy that SC$_{\rm{reverse}}$ and SC$_{\rm{shift}}$, which rely solely on bit shuffling techniques without modifying the RNG structure, exhibit equivalent hardware resources to the conventional RNG without requiring additional area and power consumption. As the $m$ value increases in the proposed designs, there is a consistent decrease in both area and power consumption. The EDAP, which combines energy efficiency, delay, and area utilization, also shows significant improvements in the proposed designs. Specifically, when compared to the conventional RNG, the proposed design with $m=2$ achieves a 22.5% reduction in area, a 26.7% decrease in power, and 43.2% improvement in EDAP. For $m=3$, the area and power decreases by 33.7% and 49.4%, respectively, resulting in a 66.4% EDAP improvement. The most significant reductions are achieved with $m=4$, where the area and power improve by 52.5% and 59%, respectively. It is noteworthy that the proposed design can generate $k$ ($=2^{m}$) random numbers from a single RNG, unlike the conventional design that produces only one. For instance, the proposed design with $m=2$ can generate four $(=2^{2})$ random numbers at once, while the conventional design requires four separate RNG units to generate the same number of outputs, which increases the total area and power to 307.2 $\mu m^2$ and 504 $\mu W$, respectively. In contrast, the proposed design with $m=2$ achieves the same functionality with only 59.5 $\mu m^2$ and 92.4 $\mu W$, reducing area and power consumption by approximately 80.6% and 81.7%, respectively.

Table 1. Hardware resource consumption of RNG.

V. CASE STUDY:IMAGE PROCESSING

To validate the practical effectiveness of our proposed design, we consider two digital image processing applications as case studies: Robert’s cross-edge detection and median filter. These applications are commonly employed in SC for feature extraction and noise reduction and can be efficiently implemented using SC.

1. Robert’s Cross-Edge Detector

Robert’s cross-edge detection algorithm is well-known for its ability to highlight edges in images efficiently. It performs convolution operations using a $2\times2$ kernel on pixels from the input image. The algorithm calculates the image gradient, which can be expressed as follows:

(1)

$ Z_{i,j} =|X_{i,j} - X_{i+1, j+1}| + |X_{i,j+1} - X_{i+1, j}|, $

where $X_{i,j}$ represents the pixel value at location $(i, j)$ in the input image and $Z_{i,j}$ denotes the corresponding output pixel value after processing by the algorithm. The implementation of Robert’s cross-edge detector varies significantly between binary computing and SC-based approaches. Fig. 5(a) depicts the binary computing implementation of Robert’s cross-edge detector. The conventional implementation relies on multiple complex hardware components such as subtractors to compute differences between pixel values, absolute value circuits to ensure non-negative results, and an adder to determine gradient and edge strength. On the other hand, the SC-based implementation, shown in Fig. 5(b), achieves the same functionality through two XOR gates and one MUX. The XOR gates and MUX perform absolute subtraction operations and scaled addition, respectively, yielding significantly reduced hardware complexity.

Fig. 5. Implementation of edge detector: (a) binary computing Robert’s cross-edge detector and (b) SC-based Robert’s cross-edge detector.

Fig. 6. Original input image and output images by fixed-point arithmetic and SC-based Robert’s cross-edge detectors.

Fig. 7. PSNR performance comparison of SC-based Robert’s cross-edge detectors.

To evaluate the quality of the proposed SC-based image processing algorithms, we calculate the peak signal-to-noise ratio (PSNR) for each design. Fig. 6 shows the original image and the results of a binary fixed-point and eight SC-based Robert’s cross-edge detection designs, including the SC$_{\rm{share}}$, SC$_{\rm{serial}}$, SC$_{\rm{parallel}}$, SC$_{\rm{shift}}$, SC$_{\rm{reverse}}$, and proposed designs with $m=2, m=3,$ and $m=4$. The proposed SC with $m=2$, $m=3$, and $m=4$ achieves edge detection quality comparable to the SC$_{\rm{share}}$ while substantially reducing processing cycles, requiring only 16, 8, and 4 cycles, respectively, in contrast to the 256 cycles necessitated by the SC$_{\rm{share}}$. Fig. 7 presents a comprehensive PSNR performance analysis under cycles for different designs. The SC$_{\rm{serial}}$ and SC$_{\rm{parallel}}$ exhibit poor PSNR due to their use of uncorrelated inputs in the subtraction operations over all clock cycle configurations. On the other hand, the SC$_{\rm{share}}$ shows gradual improvement in PSNR from 17.7 dB at 2 cycles to 25.5 dB at 256 cycles. Obviously, the proposed SC exhibits superior convergence characteristics with increasing $m$ values. Specifically, the proposed design with $m=2$ achieves a PSNR value of 24.7 dB within 16 cycles, while the $m=4$ reaches equivalent performance in rarely 4 cycles. The proposed SC exhibits significant improvement in PSNR convergence speed as increasing $m$ values, achieving the equivalent quality up to 64 times faster than conventional SC-based approaches. The SC$_{\rm{shift}}$ and SC$_{\rm{reverse}}$ designs also show gradual improvements in PSNR. However, these designs require significantly more cycles to reach high-quality edge detection compared to the SC$_{\rm{proposed}}$ designs.

Additionally, Table 2 shows the PSNR values of various designs at 32 cycles, the point at which SC$_{\rm{parallel}}$ operation terminates. Although SC$_{\rm{serial}}$ and SC$_{\rm{parallel}}$ show lower PSNR performance, the SC$_{\rm{share}}$, SC$_{\rm{reverse}}$, and SC$_{\rm{shift}}$ demonstrate improved results. Notably, the proposed designs consistently attain the highest PSNR values for all benchmark images.

To evaluate the hardware efficiency of the proposed SC architecture, we implemented the SC-based Robert’s cross-edge detectors using Verilog HDL and synthesized them with 65-nm CMOS technology to obtain area and power metrics, using the same design methodology used in Section IV.2. Fig. 8 illustrates the hardware performance of the SC-based Robert’s cross-edge detectors. The SC$_{\rm{parallel}}$ shows the highest hardware consumption among all designs, primarily due to its requirement for independent RNG units for each operation. As described in Section IV.2, the SC$_{\rm{share}}$, SC$_{\rm{reverse}}$, and SC$_{\rm{shift}}$ designs share the same hardware architecture. The proposed design exhibits a gradual increase of hardware resources as $m$ values increase, which is attributed to the requirement of additional comparators and SC cores to process random numbers simultaneously. Specifically, compared to the SC$_{\rm{share/reverse/shift}}$ implementation, the area increases by $1.6\times$, $2.5\times$, and $4.3\times$ for $m=2$, $3$, and $4$, respectively. Similarly, the power shows corresponding growth with larger $m$ values. Compared to the SC$_{\rm{share/reverse/shift}}$, the proposed design with $m=2$, $3$, and $4$ consumes $1.2\times$, $1.6\times$, and $2.4\times$ more power than the SC$_{\rm{share/reverse/shift}}$, respectively. Importantly, although the area and power increase with higher $m$ values, the proposed design achieves a substantial reduction in processing cycles, leading to enhanced energy efficiency. The proposed architecture demonstrates an effective trade-off between hardware complexity and processing speed while maintaining superior edge detection quality.

Fig. 8. Hardware resource comparison of SC-based Robert’s cross-edge detectors.

Table 2. PSNR analysis of Robert’s cross-edge detection for various images at 32 cycles.

2. Median filter

The median filter is a widely used image processing technique that effectively removes salt-and-pepper noise while preserving edge information, making it especially useful in Internet of Thing (IoT)-based vision systems, where ensuring high image quality is essential for real-time processing ^[35]. The median filter processes the image by analyzing each pixel’s neighborhood within a defined window and replacing the center pixel value with the median of its surrounding values. Among various window sizes, the $3\times3$ median filter is commonly used due to its efficient noise removal, where its hardware implementation primarily relies on sorting networks to arrange the nine pixels in ascending or descending order ^[36]. Fig. 9 shows the sorting network used in the hardware implementation of the median filter ^[37].

Fig. 9(a) shows the basic sorter element, which takes two inputs, X and Y, and outputs their minimum and maximum values, serving as the fundamental unit in the median filter implementation. Fig. 9(b) represents a simplified symbol for the sorter, and Fig. 9(c) depicts the structure of the overall $3\times3$ median filter. Multiple sorter elements combine in a specific pattern to create a network that sorts input values and selects the median value among the nine input values. Fig. 10 illustrates an SC-based sorter architecture that efficiently determines the minimum and maximum values from two inputs using simple logic gates. The AND gate produces the minimum value while the OR gate determines the maximum value, offering a significantly simplified architecture compared to conventional binary computing. A $3\times3$ median filter utilizes a structured network of 19 sorter units, which process nine inputs to efficiently compute the median value.

Fig. 9. Implementation of a median filter using the sorting network: (a) basic sorter element, (b) sorter symbol, and (c) structure of the $3\times3$ median filter using sorting network.

Fig. 10. SC-based sorter architecture.

To evaluate the noise reduction capabilities of SC-based median filter designs, we assess filtered image quality using the PSNR metric. Fig. 11 exhibits the effectiveness of the median filter in removing noise under various configurations. The noise-added image is generated by applying salt-and-pepper noise with a density of 0.1. The proposed design achieves filtering quality equivalent to a fixed-point design. Compared to the SC$_{\rm{share}}$ which requires 256 cycles, the proposed SC with $m=2$, $3$, and $4$ improves remarkable reduction in processing time, requiring only 16, 8, and 4 cycles, respectively. Additionally, Fig. 12 presents a comparative analysis of PSNR performance over processing cycles for different SC-based median filter implementations. The SC$_{\rm{serial}}$ and SC$_{\rm{parallel}}$ approaches exhibit relatively low PSNR performance, maintaining approximately 15 dB across all operating cycles. The SC$_{\rm{share}}$ shows a gradual increase, improving from nearly 9 dB at 2 cycles to around 32 dB at 256 cycles. However, the proposed designs achieve similar PSNR with significantly fewer cycles. The design with $m=2$ starts at 15.8 dB at 2 cycles and reaches approximately 31 dB at 16 cycles. The design with $m=3$ begins at 23.2 dB at 2 cycles and surpasses 31 dB within 16 cycles. Additionally, the proposed SC with $m=4$ achieves a PSNR of 28.8 dB at 2 cycles and reaches 31 dB within only 4 cycles. The results indicate that as $m$ increases, the proposed designs consistently achieve higher PSNR values.

Table 3 provides the PSNR comparison of various images at 32 cycles for the SC-based median filter. The proposed design with $m=2$, $3$, and $4$ achieves significantly better noise reduction quality than conventional SC methods for all images. Consistent with Robert’s cross-edge detection, the proposed SC demonstrates enhanced PSNR convergence characteristics with increasing $m$ values, achieving comparable quality up to 64 times faster than traditional SC-based median filter.

Fig. 13 presents the hardware utilization, specifically area and power, using different SC-based median filter architectures implemented using the same design methodology applied to the Robert’s cross-edge detectors. The proposed design exhibits increased area and power consumption compared to the SC$_{\rm{share/reverse/shift}}$. For $m=2$, $3$, and $4$, the area increases $2.0\times$, $3.6\times$, and $6.6\times$, respectively. Similarly, power also increases with higher $m$ values, achieving increments of $1.4\times$, $2.1\times$, and $3.4\times$ compared to the SC$_{\rm{share/reverse/shift}}$ for $m=2$, $3$, and $4$, respectively. While the proposed design demands increased hardware resources, this trade-off facilitates substantial enhancement in computational speed. Notably, the proposed design with $m=4$ achieves equivalent image quality in only 4 cycles, compared to the 256 cycles necessitated by the SC$_{\rm{share/reverse/shift}}$. Despite the increase in area and power consumption, the proposed design achieves significantly improved energy efficiency through a substantial reduction in processing cycles. Similar to Robert's cross-edge detection results, the proposed architecture demonstrates an effective trade-off between hardware complexity and processing efficiency in median filtering applications. While requiring additional hardware resources, the significant reduction in processing cycles validates the effectiveness of our parallel processing approach across various SC applications.

Fig. 11. Noise-added input image and output images by fixed-point arithmetic and SC-based median filters.

Fig. 12. PSNR performance of different SC-based median filter across processing cycles.

Fig. 13. Hardware resource analysis of SC-based median filters.

Table 3. PSNR comparison of SC-based median filter among various images at 32 cycles.}}

VI. CONCLUSION

In this paper, we addressed one of the primary limitations of SC, which is long computation times. To achieve this, we have presented a novel parallel RNG design that significantly enhances SC efficiency. The proposed design adopts a scalable partitioned structure that combines m-bit fixed values and an ($n-m$)-bit random source to generate $k$ random numbers, each with n bits, simultaneously. The proposed design substantially reduces hardware resource requirements compared to conventional parallel SC architectures while dramatically improving processing speed. The proposed design demonstrates consistently excellent performance across addition, subtraction, and multiplication operations. When compared to the SC$_{\rm{serial}}$ and SC$_{\rm{parallel}}$, the proposed design maintains competitive MSE for addition and multiplication, while significantly improving subtraction. As the number of $m$-bits increases from 2 to 4, the convergence characteristics of the design improve with higher $m$-bit numbers, achieving faster convergence within 4 to 8 cycles with 4-bit compared to within 16 to 32 cycles with 2-bit. Furthermore, to validate the effectiveness of the proposed design in practical applications, we adopted two digital image processing applications, Robert’s cross-edge detector and median filter, achieving substantial processing speed improvements while comparable image quality. Notably, the proposed SC design with $m=4$ achieves equivalent performance in only 4 cycles, whereas SC$_{\rm{share}}$, SC$_{\rm{reverse}}$, and SC$_{\rm{shift}}$ counterparts require significantly more cycles to reach the same quality of processed images. The proposed design advances the SC-based image processing field through its highly efficient architecture, which achieves a substantial reduction in computational cycles while maintaining image quality.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00279770).

References

B. R. Gaines, `'Stochastic computing systems,'' in Tou, J.T. (eds) Advances in Information Systems Science, Springer, pp. 37-172, 1969.

A. Alaghi, W. Qian, and J. P. Hayes, ``The promise and challenge of stochastic computing,'' IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 8, pp. 1515-1531, 2018.

S. S. Tehrani, S. Mannor, and W. J. Gross, ``Fully parallel stochastic LDPC decoders,'' IEEE Transactions on Signal Processing, vol. 56, no. 11, pp. 5692-5703, 2008.

D. Lee, J. Baik, and Y. Kim, ``An accurate and efficient stochastic computing adder exploiting bit shuffle control scheme,'' Proc. of 19th International SoC Design Conference (ISOCC), pp. 51-52, 2022.

Y. Liu, S. Liu, Y. Wang, F. Lombardi, and J. Han, ``A survey of stochastic computing neural networks for machine learning applications,'' IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 7, pp. 2809-2824, 2021.

S. Liu and J. Han, ``Dynamic stochastic computing for digital signal processing applications,'' Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 604-609, 2020.

P. K. Muthappa, F. Neugebauer, I. Polian, and J. P. Hayes, ``Hardware-based fast real-time image classification with stochastic computing,'' Proc. of IEEE 38th International Conference on Computer Design (ICCD), pp. 340-347, 2020.

D. Lee and Y. Kim, ``Exploiting omega network and inexact accumulative parallel counter to enhance energy efficiency in stochastic computing,'' Proc. of 40th ACM/SIGAPP Symposium on Applied Computing (SAC), pp. 524–531, 2025.

Z. Chen, Y. Ma, and Z. Wang, ``Hybrid stochastic-binary computing for low-latency and high-precision inference of CNNs,'' IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, no. 7, pp. 2707-2720, 2022.

H. Seo and Y. Kim. ``A low latency approximate adder design based on dual sub-adders with error recovery,'' IEEE Transactions on Emerging Topics in Computing, vol. 11, no. 3, pp. 811-816, 2023.

Y. Kim, ``A novel approximate adder with enhanced low-cost carry prediction for error tolerant computing,'' IEIE Transactions on Smart Processing & Computing, vol. 8, no. 6, pp. 506-510, 2019.

Y. Chung and Y. Kim, ``Comparison of approximate computing with sobel edge detection,'' IEIE Transactions on Smart Processing & Computing, vol. 10, no. 4, pp. 355-361, 2021.

H. Seok, H. Seo, J. Lee, Y. Kim, ``Design optimization of a 4-2 compressor for low-cost approximate multipliers,'' IEIE Transactions on Smart Processing & Computing, vol. 11, no. 6, pp. 455-461, 2022.

D. Lee, H. Seo, and Y. Kim, ``Design of an efficient parallel random number generator using a single LFSR for stochastic computing,'' Proc. International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 775-777, 2024.

P. Choi, J. Kim, and D. Kim, ``Improving ring-oscillator-based true random number generators using multiple sampling,'' Journal of Semiconductor Technology and Science, vol. 19, no. 3, pp. 305-309, 2019.

E. Kim and J. Kim, ``A single-ring-oscillator based true-random-number generator with 3-edges collapse,'' Journal of Semiconductor Technology and Science, vol. 23, no. 6, pp. 396-402, 2023.

E. Lee, S.-M. Yu, Y. Kang, and J. Song, ``Phase-locked Loop with true random number generator for reference spur reduction,'' IEIE Transactions on Smart Processing & Computing, vol. 10, no. 2, pp. 146-150, 2021.

A. Zhakatayev, K. Kim, K. Choi, and J. Lee, ``An efficient and accurate stochastic number generator using even-distribution coding,'' IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 12, pp. 3056-3066, 2018.

A. Alaghi and J. P. Hayes, ``Fast and accurate computation using stochastic circuits,'' in Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1-4, 2014.

H. Ichihara, T. Sugino, S. Ishii, T. Iwagaki, and T. Inoue, ``Compact and accurate digital filters based on stochastic computing,'' EEE Transactions on Emerging Topics in Computing , vol. 7, no. 1, pp. 31-43, 2019.

S. A. Salehi, ``Low-cost stochastic number generators for stochastic computing,'' IEEE Transactions on Very Large Scale Integration (VLSI) Systems , vol. 28, no. 4, pp. 992-1001, 2020.

Y. Xie, S. Liao, B. Yuan, Y. Wang, and Z. Wang, ``Fully-parallel area-efficient deep neural network design using stochastic computing,'' IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 64, no. 12, pp. 1382-1386, Dec. 2017.

J. Kim, W. S. Jeong, Y. Jeong, and S. E. Lee, ``Parallel stochastic computing architecture for computationally intensive applications,'' Electronics, vol. 12, no. 7, 1749, 2023.

H. Joe and Y. Kim, ``Compact and power-efficient sobel edge detection with fully connected cube-network-based stochastic computing,'' Journal of Semiconductor Technology and Science, vol. 20, no. 5, pp. 436-446, 2020.

H. Joe and Y. Kim, ``Novel stochastic computing for energy-efficient image processors,'' Electronics, vol. 8, no. 6, 720, 2019.

D. Lee and Y. Kim, ``Design of a low-cost stochastic computing-based median filter for digital image processing,'' Proc. of 21th International SoC Design Conference (ISOCC), pp. 298-299, 2024.

D. Lee, J. Baik, and Y. Kim, ``Enhancing stochastic computing using a novel hybrid random number generator integrating LFSR and Halton sequence,'' Proc. of 20th International SoC Design Conference (ISOCC), pp. 7-8, 2023.

D. Lee and Y. Kim, ``Towards quantized stochastic computing by leveraging reduced precision binary numbers through bit truncation,'' Proc. of IEEE 41st International Conference on Computer Design (ICCD), pp. 419-422, 2023.

Z. Wang, D. Larso, M. Barker, S. Mohajer, and K. Bazargan, ``Deterministic shuffling networks to implement stochastic circuits in parallel,'' IEEE Transactions on Very Large Scale Integration (VLSI) Systems , vol. 28, no. 8, pp. 1821-1832, 2020.

X. Wei and L. Xiu, ``A VLSI digital circuit platform for performing deterministic stochastic computing in the time dimension using fraction operations on rational numbers,'' EEE Transactions on Emerging Topics in Computing , vol. 11, no. 1, pp. 194-207, 2023.

Z. Lin, G. Xie, W. Xu, J. Han, and Y. Zhang, ``Accelerating stochastic computing using deterministic Halton sequences,'' IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 68, no. 10, pp. 3351-3355, 2021.

V. Sehwag, N. Prasad, and I. Chakrabarti, ``A parallel stochastic number generator with bit permutation networks,'' IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 2, pp. 231-235, 2018.

A. Alaghi and J. P. Hayes, ``Survey of stochastic computing,'' ACM Transactions on Embedded Computing Systems, vol. 12, no. 2, p. 92, 2013.

M. I. Freidlin, ``Quasi-deterministic approximation, metastability, and stochastic resonance,'' Physica D, vol. 137, pp. 333-352, 2000.

A. Roy, S. Bandopadhaya, S. Chandra, and A. Suhag, ``Removal of impulse noise for multimedia-IoT applications at gateway level,'' Multimedia Tools and Applications, vol. 81, pp. 34463-34480, 2022.

D. E. Knuth, The Art of Computer Programming: Sorting and Searching, vol. 3, Pearson Education, 1998.

S. Liu and J. Han, ``Toward energy-efficient stochastic circuits using parallel Sobol sequences,'' IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 7, pp. 1326-1339, 2018.

Donghui Lee

Donghui Lee received his M.S. degree from the School of Computer Science and Engineering at Kyungpook National University, Daegu, Republic of Korea, in 2023, where he is pursuing a Ph.D. degree. His research interests include stochastic computing (SC) and hardware design.

Yongtae Kim

Yongtae Kim received his B.S. and M.S. degrees in electrical engineering from the Korea University, Seoul, Republic of Korea, in 2007 and 2009, respectively, and a Ph.D. degree from the Department of Electrical and Computer Engineering from the Texas A&M University, College Station, TX, in 2013. From 2013 to 2018, he was a software engineer with Intel Corporation, Santa Clara, CA. Since 2018, he has been with the School of Computer Science and Engineering at Kyungpook National University, Daegu, South Korea, where he is currently an Associate Professor. His research interests are in energy-efficient integrated circuits and systems, particularly, neuromorphic computing, approximate computing, quantum computing, and new memory devices and architecture.