LeeDonghui1
KimYongtae1
-
(The School of Computer Science and Engineering, Kyungpook National University, Daegu
41566, Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index Terms
Stochastic computing (SC), random number generator (RNG), bit partitioning, parallel computing
I. INTRODUCTION
Stochastic computing (SC) is an emerging computational paradigm that uses probabilistic
methods to perform computations, unlike traditional deterministic binary computing
[1]. The primary advantages of SC are its high error tolerance to external noise and
hardware efficiency. By representing values as bit streams, SC can maintain overall
computational accuracy even when individual bits are erroneous [2]. In SC, numbers are represented as bit streams, where the value is encoded in the
probability of encountering a `1' in the stream, denoted as a stochastic number (SN).
For example, an 8-bit binary stream ``10110010'' contains four occurrences of `1'
in a total of eight bits, corresponding to a value $4/8 = 0.5$. The basic operation
of SC involves generating stochastic bit streams using a stochastic number generator
(SNG), which typically consists of a random number generator (RNG) and a comparator.
In addition to its bit error resilience, another advantage of SC is that arithmetic
operations can be performed using simple logic gates, such as AND gates for multiplication
and multiplexers (MUX) for addition, significantly reducing hardware complexity compared
to binary arithmetic units.
Fig. 1 exhibits the implementation of addition, subtraction, and multiplication operations
using simple logic gates. These characteristics make SC applicable to applications
where slight computational errors are acceptable, such as neural networks, image processing,
and digital filters [3-9], which is similar to the nature of approximate computing [10-13]. However, SC requires longer bitstreams to achieve higher accuracy and precision,
consequently increasing computation time. Specifically, to achieve equivalent precision
for an n-bit binary number (BN) in SC, a $2^{n}$-bit SN bitstream is required. For
instance, an 8-bit BN requires a 256-bit SN bitstream to maintain the same precision.
To address the computational latency caused by long bit streams for high accuracy
in SC systems, we propose a scalable bit partitioning RNG architecture. While our
previous work in [14] generated two parallel random numbers by fixing the most significant bit (MSB) to
either 0 or 1, this approach was limited in scalability for higher parallel processing
requirements. Our enhanced design in this work fixes multiple bits in the MSBs and
employs a single RNG for the remaining bits, achieving improved scalability while
reducing hardware complexity without compromising random number generation capability.
This hybrid scheme allows for the generation of multiple random numbers simultaneously,
improving the speed of computation in SC systems while maintaining the inherent benefits
of SC. The main contributions of this work can be outlined as follows:
• We propose a hybrid RNG design that combines scalable bit partitioning between
fixed bits for MSBs and random number generation for the remaining bits.
• We provide a comparative analysis of the proposed RNG against traditional RNG,
evaluating and demonstrating both accuracy and hardware resource consumption.
• We validate our proposed RNG by applying it to digital image processing, demonstrating
its efficacy in real-world applications.
Fig. 1. Stochastic computing (SC) arithmetic circuits: (a) addition, (b) subtraction,
and (c) multiplication.
II. BACKGROUND AND RELATED WORK
The processing of computations in the SC domain necessitates the conversion between
a BN and an SN bitstream, which is accomplished utilizing an SNG. The SNG operates
by comparing a conventional binary input with a random number produced by an RNG.
The RNG is widely used in various applications, including cryptography, clocking,
Monte Carlo simulations, machine learning, and SC[15-17]. In SC systems, the RNG plays a crucial role in generating stochastic bitstreams
that enable probabilistic computations. These bitstreams are essential for tasks such
as image processing, neural network acceleration, and fault-tolerant computing, where
approximate results are acceptable. The RNG can be divided into a true random number
generator (TRNG) and a pseudo-random number generator (PRNG). A TRNG generates a number
based on unpredictable physical entropy sources, such as thermal noise or clock jitter,
making it highly random and suitable for security-critical applications like cryptography.
However, a TRNG requires complex hardware and consumes significant power. In contrast,
a PRNG uses a deterministic algorithm to generate number sequences that appear random,
though they are based on a predictable pattern. Since SC systems prioritize efficiency
in hardware, power, and speed, they primarily rely on PRNGs, which offer a good balance
between randomness and computational cost. The RNG generally employs PRNG techniques
such as the linear feedback shift register (LFSR), valued for its straightforward
hardware implementation [18], and low discrepancy (LD) sequences, which provide uniform distribution properties
[19]. If the input BN exceeds the generated random number, the SNG outputs a '1'; otherwise,
it outputs a '0'. This process is repeated to generate a bit stream of the desired
length, forming the basis of stochastic number representation. On the other hand,
the conversion from stochastic to binary representation is implemented using a counter
that accumulates the number of 1s in the stochastic bitstream. The final binary value
is determined by the ratio of accumulated 1s to the total stream length. SC employs
two primary encoding schemes: unipolar and bipolar. In unipolar encoding, numbers
are represented in the range [0, 1], where the probability of ones in the stream directly
corresponds to the value. On the other hand, bipolar encoding represents numbers in
the range [-1, 1]. This flexibility allows SC to handle both positive-only and signed
number representations. Arithmetic operations in SC are performed using remarkably
simple circuits. Multiplication, for instance, is achieved using a single AND gate
for unipolar encoding, while the addition uses a MUX, as shown in Fig. 1. This simplicity in arithmetic units contributes significantly to the hardware efficiency
of SC systems. Also, one of the most notable features of SC is its inherent error
tolerance. Due to the probabilistic nature of the computations, individual bit flips
have minimal impact on the overall value represented by a stochastic stream. This
enables SC to exhibit robust performance in environments susceptible to external errors
and noise.
Fig. 2 illustrates two different configurations of SC. In the serial configuration shown
in Fig. 2(a), an n-bit BN is converted to an SN using a single SNG, which consists of an RNG and
a comparator. This SN bitstream is then processed arithmetic operations through the
SC Core, and the result is converted back to an n-bit BN using a counter that accumulates
the number of 1s in the stochastic bitstream. The entire process operates sequentially,
one bit at a time. On the other hand, the parallel configuration illustrated in Fig. 2(b) employs multiple SNGs operating simultaneously to generate multiple SNs in parallel.
These SNs are processed through multiple SC Cores concurrently, and the results are
converted back to an n-bit BN using a parallel counter that performs the same accumulation
process but for multiple streams simultaneously, enabling faster computation at the
cost of increased hardware resources.
Despite its advantages, SC exhibits an inherent trade-off between computation speed
and precision, where achieving enhanced accuracy requires longer bitstreams, consequently
increasing the computation time. To reduce these long computation times, many schemes
have parallelized SC using multiple RNGs that generate a number of random numbers
simultaneously [20-32]. However, these methods improve hardware efficiency and randomness, they often struggle
to effectively reduce computation latency, which remains a significant limitation
in SC-based systems. Ichihara et al. presented a method to generate multiple uncorrelated
SNs through circular shifting and wire exchange of the RNG output [20]. Similarly, Salehi reduced the correlation between RNG generated multiple random
numbers by permutating RNG outputs [21]. This approach introduced a similarity function to effectively identify permutations
that minimize correlation between the RNG outputs but face scalability challenges
as the search space grows rapidly with increasing bit width. Xie et al. proposed an
efficient binary-to-stochastic interface using the LFSR sharing with shuffle-flip
networks but is sensitive to the pattern selection of shuffle-flip networks [22]. Additionally, Kim et al. presented a parallel LFSR architecture to generate multiple
random numbers simultaneously using only XOR gates, but increased complexity from
the XOR gates can degrade system performance [23].
While this approach significantly reduces computation latency, it also leads to higher
hardware resource consumption. Joe et al. introduced an efficient RNG approach using
a fully connected cube network for generating multiple uncorrelated random numbers
from a single RNG [24]. While this scheme eliminates additional hardware overhead, it still suffers from
long computation times due to the reliance on extended bitstreams. Furthermore, as
the network size increases, scalability becomes a challenge due to rising bitstream
length constraints and growing network complexity. Although these methods have significantly
improved SC efficiency, they still face challenges in computation speed, scalability,
and hardware complexity. Whereas many techniques offer novel solutions, achieving
a hardware efficient design that effectively reduces computation time remains a challenge.
Fig. 2. Architecture of SC implementation: (a) serial and (b) parallel configurations.
III. PROPOSED HYBRID RANDOM NUMBER GENERATOR
We address the challenge of efficiently generating multiple random numbers in parallel
SC scenarios without significantly increasing hardware consumption. Our proposed architecture
introduces a novel approach to generating multiple $n$-bit random numbers to enhance
parallel SC using a fixed value. Fig. 3 shows the proposed hybrid RNG-based parallel SC architecture. The proposed RNG design
involves a scalable bit partitioning scheme that divides an $n$-bit random number
into two components: the upper $m$-bit fixed part and the lower $(n-m)$-bit random
part. The proposed RNG requires a seed and a clock (CLK) as inputs. The seed sets
the initial state for deterministic random number generation, while CLK synchronizes
the generation process. The upper $m$-bit fixed part contains predetermined values
ranging from 0 to $2^{m}-1$, representing all possible values that can be expressed
using $m$-bit. For instance, when $m=3$, the upper three bits utilize eight $(=2^{3})$
combinations ranging from $000_{2}$ to $111_{2}$. This approach effectively splits
the entire random number range into $k$ segments, where $k$ is $2^{m}$. For the lower
$(n-m)$-bit random part, a single $(n-m)$-bit shared random source is used to generate
random numbers within the defined range of each segment. It is noteworthy that the
random part can be generated using any random sources, such as LFSR and low discrepancy
(LD) sequences. The range for a given $m$-bit value is calculated from $k\times2^{(n-m)}$
to $(k+1)\times2^{(n-m)}-1$ inclusive, where $n$ denotes the total bit width, $m$
represents the fixed bit width, and $k$ indicates the segment index, which ranges
from 0 to $2^{m}-1$. Therefore, each segment corresponds to $m$-bit configurations:
the first segment, corresponding to the $m$-bit value 0, generates numbers from 0
to $2^{(n-m)}-1$. The second segment generates numbers from $2^{(n-m)}$ to $2\times2^{(n-m)}-1$.
This pattern continues until the final segment, represented by the $m$-bit value $k$,
which produces numbers from $k\times2^{(n-m)}$ to $(k+1)\times2^{(n-m)}-1$. As an
example, we assume that the proposed design generates 8-bit random numbers using an
$m=3$ fixed part combined with a 5-bit random source that generates $10110_{2}$. Then,
the proposed design produces eight distinct values by combining each 3-bit fixed value
with random bits: $00010110_{2}$, $00110110_{2}$, $01010110_{2}$, $01110110_{2}$,
$10010110_{2}$, $10110110_{2}$, $11010110_{2}$, and $11110110_{2}$. These numbers
are distributed across distinct segments, where each segment contains $32(=2^{(8-3)})$
values: the first segment ranges from 0 to 31 inclusive, the second extends from 32
to 63 inclusive, and the final segment ranges from 224 to 255 inclusive. This partitioning
ensures that each m-bit combination generates random numbers within its designated
range while maintaining distinct boundaries across the entire range. These generated
random numbers (i.e., $R_{0}$ to $R_{k-1}$) serve as the outputs of the random source
and are then fed into multiple comparators, which compare them with $n$-bit BN to
produce parallel SNs (i.e., $SN_{0}$ to $SN_{k-1}$). Each comparator generates either
0 or 1 based on the comparison result. A key advantage of the proposed design is its
ability to generate $k$ random numbers simultaneously in a single cycle, significantly
enhancing random number generation efficiency for parallel processing applications.
Additionally, the hardware efficiency is substantially improved by employing a single
$(n-m)$-bit random source instead of multiple $n$-bit random sources, reducing hardware
requirements for parallel implementations.
Furthermore, the proposed design introduces quasi-deterministic characteristics through
the bit partitioning scheme that divides an $n$-bit random number into an upper $m$-bit
fixed part and a lower $(n-m)$-bit random part. The primary property of quasi-deterministic
is based on controlling random fluctuation, defined as bitstream deviations that impact
computational precision in SC systems [34], through deterministic elements and achieving enhanced convergence characteristics
[34]. Consider an 8-bit number where the upper 3 bits are fixed to 0102 and the lower
5 bits are randomly generated. This configuration generates random numbers inclusively
between 64 and 95, instead of the full random range from 0 to 255. The proposed design
utilizes fixed deterministic values in the upper $m$-bits to partition the entire
range into multiple segments, constraining random fluctuation within defined boundaries
for each segment. This bit partitioning enhances convergence characteristics by constraining
random variations within predetermined ranges, enabling faster convergence.
Fig. 3. Proposed hybrid random number generator (RNG)-based parallel SC architecture.
IV. EXPERIMENTAL RESULTS
1. Accuracy Performance Analysis
To investigate the accuracy of our proposed design, we utilize 8-bit binary input
numbers and use $m=2$, $3$, and $4$ for different configurations. A 6, 5, and 4-bit
LFSR is employed as the RNG for each $m$ value, respectively, while conventional design
employs an 8-bit LFSR. For the SC$_{\rm{serial}}$ and SC$_{\rm{parallel}}$ designs,
each input uses its own random source to generate SNs. In contrast, the SCshare uses
a single shared random source for generating SNs for all inputs. The SC$_{\rm{reverse}}$
and SC$_{\rm{shift}}$ methods employ a single random source and generate a stochastic
number through bit-shuffling techniques. The SC$_{\rm{reverse}}$ reorders the bits
of the random number in reverse order [4], while the SC$_{\rm{shift}}$ adjusts bit positions to improve input correlation [20]. The SC$_{\rm{serial}}$ generates one SN per cycle for each input, while the SC$_{\rm{parallel}}$
simultaneously generates four SNs per cycle for each input, processing them in parallel.
We calculate the mean squared error (MSE) for all possible input combinations (i.e.,
$2^{8}\times2^{8}$) for addition, subtraction, and multiplication operations across
different clock cycles. Fig. 4 presents MSE results for addition, subtraction, and multiplication operations under
various $m$-bit values. The proposed designs demonstrate consistently excellent performance
across addition, subtraction, and multiplication operations. When compared to the
SC$_{\rm{parallel}}$, our design shows comparable performance in addition and multiplication,
while significantly outperforming in subtraction operations with markedly lower MSE
values. Furthermore, while SC$_{\rm{reverse}}$ and SC$_{\rm{shift}}$ improve accuracy
over conventional methods by reordering or shifting bits, their effectiveness varies
depending on the operation. At 256 cycles, both SCreverse and SCshift achieve lower
MSE values than our proposed design in addition and multiplication. However, our method
still maintains competitive accuracy while achieving significantly faster convergence.
More importantly, in subtraction, our design clearly outperforms both SC$_{\rm{reverse}}$
and SC$_{\rm{shift}}$, with drastically lower MSE values. Additionally, a primary
advantage of our design is its significantly enhanced convergence characteristics.
As the $m$-bit increases, the convergence speed improves dramatically. For instance,
with $m=2$, MSE convergence occurs within 16 to 32 cycles, while with $m=4$, convergence
is achieved earlier, occurring between 4 and 8 cycles. In contrast, SC$_{\rm{reverse}}$
and SC$_{\rm{shift}}$ require longer cycles to reach similar accuracy levels. This
represents a significant improvement over the SC$_{\rm{share}}$ design, which requires
256 cycles for convergence, demonstrating up to 64 times faster convergence speed
while also surpassing SC$_everse$ and SC$_{\rm{shift}}$ in subtraction performance
and overall computational efficiency.
Fig. 4. Accuracy comparison of arithmetic operations: addition, subtraction, and multiplication.
2. Hardware Performance Anaylsis
To compare the hardware efficiency of the conventional RNG and the proposed RNGs with
$m=2, 3,$ and 4, we synthesized the RNGs using Verilog HDL with 65-$nm$ CMOS technology.
Table 1 presents the area, power consumption, and energy-delay-area product (EDAP) of the
RNGs. The proposed designs employ an ($8-m$)-bit LFSR with $m$-bit fixed value, while
the conventional RNG uses an 8-bit LFSR. It is noteworthy that SC$_{\rm{reverse}}$
and SC$_{\rm{shift}}$, which rely solely on bit shuffling techniques without modifying
the RNG structure, exhibit equivalent hardware resources to the conventional RNG without
requiring additional area and power consumption. As the $m$ value increases in the
proposed designs, there is a consistent decrease in both area and power consumption.
The EDAP, which combines energy efficiency, delay, and area utilization, also shows
significant improvements in the proposed designs. Specifically, when compared to the
conventional RNG, the proposed design with $m=2$ achieves a 22.5% reduction in area,
a 26.7% decrease in power, and 43.2% improvement in EDAP. For $m=3$, the area and
power decreases by 33.7% and 49.4%, respectively, resulting in a 66.4% EDAP improvement.
The most significant reductions are achieved with $m=4$, where the area and power
improve by 52.5% and 59%, respectively. It is noteworthy that the proposed design
can generate $k$ ($=2^{m}$) random numbers from a single RNG, unlike the conventional
design that produces only one. For instance, the proposed design with $m=2$ can generate
four $(=2^{2})$ random numbers at once, while the conventional design requires four
separate RNG units to generate the same number of outputs, which increases the total
area and power to 307.2 $\mu m^2$ and 504 $\mu W$, respectively. In contrast, the
proposed design with $m=2$ achieves the same functionality with only 59.5 $\mu m^2$
and 92.4 $\mu W$, reducing area and power consumption by approximately 80.6% and 81.7%,
respectively.
Table 1. Hardware resource consumption of RNG.
V. CASE STUDY:IMAGE PROCESSING
To validate the practical effectiveness of our proposed design, we consider two digital
image processing applications as case studies: Robert’s cross-edge detection and median
filter. These applications are commonly employed in SC for feature extraction and
noise reduction and can be efficiently implemented using SC.
1. Robert’s Cross-Edge Detector
Robert’s cross-edge detection algorithm is well-known for its ability to highlight
edges in images efficiently. It performs convolution operations using a $2\times2$
kernel on pixels from the input image. The algorithm calculates the image gradient,
which can be expressed as follows:
where $X_{i,j}$ represents the pixel value at location $(i, j)$ in the input image
and $Z_{i,j}$ denotes the corresponding output pixel value after processing by the
algorithm. The implementation of Robert’s cross-edge detector varies significantly
between binary computing and SC-based approaches. Fig. 5(a) depicts the binary computing implementation of Robert’s cross-edge detector. The
conventional implementation relies on multiple complex hardware components such as
subtractors to compute differences between pixel values, absolute value circuits to
ensure non-negative results, and an adder to determine gradient and edge strength.
On the other hand, the SC-based implementation, shown in Fig. 5(b), achieves the same functionality through two XOR gates and one MUX. The XOR gates
and MUX perform absolute subtraction operations and scaled addition, respectively,
yielding significantly reduced hardware complexity.
Fig. 5. Implementation of edge detector: (a) binary computing Robert’s cross-edge
detector and (b) SC-based Robert’s cross-edge detector.
Fig. 6. Original input image and output images by fixed-point arithmetic and SC-based
Robert’s cross-edge detectors.
Fig. 7. PSNR performance comparison of SC-based Robert’s cross-edge detectors.
To evaluate the quality of the proposed SC-based image processing algorithms, we calculate
the peak signal-to-noise ratio (PSNR) for each design. Fig. 6 shows the original image and the results of a binary fixed-point and eight SC-based
Robert’s cross-edge detection designs, including the SC$_{\rm{share}}$, SC$_{\rm{serial}}$,
SC$_{\rm{parallel}}$, SC$_{\rm{shift}}$, SC$_{\rm{reverse}}$, and proposed designs
with $m=2, m=3,$ and $m=4$. The proposed SC with $m=2$, $m=3$, and $m=4$ achieves
edge detection quality comparable to the SC$_{\rm{share}}$ while substantially reducing
processing cycles, requiring only 16, 8, and 4 cycles, respectively, in contrast to
the 256 cycles necessitated by the SC$_{\rm{share}}$. Fig. 7 presents a comprehensive PSNR performance analysis under cycles for different designs.
The SC$_{\rm{serial}}$ and SC$_{\rm{parallel}}$ exhibit poor PSNR due to their use
of uncorrelated inputs in the subtraction operations over all clock cycle configurations.
On the other hand, the SC$_{\rm{share}}$ shows gradual improvement in PSNR from 17.7
dB at 2 cycles to 25.5 dB at 256 cycles. Obviously, the proposed SC exhibits superior
convergence characteristics with increasing $m$ values. Specifically, the proposed
design with $m=2$ achieves a PSNR value of 24.7 dB within 16 cycles, while the $m=4$
reaches equivalent performance in rarely 4 cycles. The proposed SC exhibits significant
improvement in PSNR convergence speed as increasing $m$ values, achieving the equivalent
quality up to 64 times faster than conventional SC-based approaches. The SC$_{\rm{shift}}$
and SC$_{\rm{reverse}}$ designs also show gradual improvements in PSNR. However, these
designs require significantly more cycles to reach high-quality edge detection compared
to the SC$_{\rm{proposed}}$ designs.
Additionally, Table 2 shows the PSNR values of various designs at 32 cycles, the point at which SC$_{\rm{parallel}}$
operation terminates. Although SC$_{\rm{serial}}$ and SC$_{\rm{parallel}}$ show lower
PSNR performance, the SC$_{\rm{share}}$, SC$_{\rm{reverse}}$, and SC$_{\rm{shift}}$
demonstrate improved results. Notably, the proposed designs consistently attain the
highest PSNR values for all benchmark images.
To evaluate the hardware efficiency of the proposed SC architecture, we implemented
the SC-based Robert’s cross-edge detectors using Verilog HDL and synthesized them
with 65-nm CMOS technology to obtain area and power metrics, using the same design
methodology used in Section IV.2. Fig. 8 illustrates the hardware performance of the SC-based Robert’s cross-edge detectors.
The SC$_{\rm{parallel}}$ shows the highest hardware consumption among all designs,
primarily due to its requirement for independent RNG units for each operation. As
described in Section IV.2, the SC$_{\rm{share}}$, SC$_{\rm{reverse}}$, and SC$_{\rm{shift}}$
designs share the same hardware architecture. The proposed design exhibits a gradual
increase of hardware resources as $m$ values increase, which is attributed to the
requirement of additional comparators and SC cores to process random numbers simultaneously.
Specifically, compared to the SC$_{\rm{share/reverse/shift}}$ implementation, the
area increases by $1.6\times$, $2.5\times$, and $4.3\times$ for $m=2$, $3$, and $4$,
respectively. Similarly, the power shows corresponding growth with larger $m$ values.
Compared to the SC$_{\rm{share/reverse/shift}}$, the proposed design with $m=2$, $3$,
and $4$ consumes $1.2\times$, $1.6\times$, and $2.4\times$ more power than the SC$_{\rm{share/reverse/shift}}$,
respectively. Importantly, although the area and power increase with higher $m$ values,
the proposed design achieves a substantial reduction in processing cycles, leading
to enhanced energy efficiency. The proposed architecture demonstrates an effective
trade-off between hardware complexity and processing speed while maintaining superior
edge detection quality.
Fig. 8. Hardware resource comparison of SC-based Robert’s cross-edge detectors.
Table 2. PSNR analysis of Robert’s cross-edge detection for various images at 32 cycles.
2. Median filter
The median filter is a widely used image processing technique that effectively removes
salt-and-pepper noise while preserving edge information, making it especially useful
in Internet of Thing (IoT)-based vision systems, where ensuring high image quality
is essential for real-time processing [35]. The median filter processes the image by analyzing each pixel’s neighborhood within
a defined window and replacing the center pixel value with the median of its surrounding
values. Among various window sizes, the $3\times3$ median filter is commonly used
due to its efficient noise removal, where its hardware implementation primarily relies
on sorting networks to arrange the nine pixels in ascending or descending order [36]. Fig. 9 shows the sorting network used in the hardware implementation of the median filter
[37].
Fig. 9(a) shows the basic sorter element, which takes two inputs, X and Y, and outputs their
minimum and maximum values, serving as the fundamental unit in the median filter implementation.
Fig. 9(b) represents a simplified symbol for the sorter, and Fig. 9(c) depicts the structure of the overall $3\times3$ median filter. Multiple sorter elements
combine in a specific pattern to create a network that sorts input values and selects
the median value among the nine input values. Fig. 10 illustrates an SC-based sorter architecture that efficiently determines the minimum
and maximum values from two inputs using simple logic gates. The AND gate produces
the minimum value while the OR gate determines the maximum value, offering a significantly
simplified architecture compared to conventional binary computing. A $3\times3$ median
filter utilizes a structured network of 19 sorter units, which process nine inputs
to efficiently compute the median value.
Fig. 9. Implementation of a median filter using the sorting network: (a) basic sorter
element, (b) sorter symbol, and (c) structure of the $3\times3$ median filter using
sorting network.
Fig. 10. SC-based sorter architecture.
To evaluate the noise reduction capabilities of SC-based median filter designs, we
assess filtered image quality using the PSNR metric. Fig. 11 exhibits the effectiveness of the median filter in removing noise under various configurations.
The noise-added image is generated by applying salt-and-pepper noise with a density
of 0.1. The proposed design achieves filtering quality equivalent to a fixed-point
design. Compared to the SC$_{\rm{share}}$ which requires 256 cycles, the proposed
SC with $m=2$, $3$, and $4$ improves remarkable reduction in processing time, requiring
only 16, 8, and 4 cycles, respectively. Additionally, Fig. 12 presents a comparative analysis of PSNR performance over processing cycles for different
SC-based median filter implementations. The SC$_{\rm{serial}}$ and SC$_{\rm{parallel}}$
approaches exhibit relatively low PSNR performance, maintaining approximately 15 dB
across all operating cycles. The SC$_{\rm{share}}$ shows a gradual increase, improving
from nearly 9 dB at 2 cycles to around 32 dB at 256 cycles. However, the proposed
designs achieve similar PSNR with significantly fewer cycles. The design with $m=2$
starts at 15.8 dB at 2 cycles and reaches approximately 31 dB at 16 cycles. The design
with $m=3$ begins at 23.2 dB at 2 cycles and surpasses 31 dB within 16 cycles. Additionally,
the proposed SC with $m=4$ achieves a PSNR of 28.8 dB at 2 cycles and reaches 31 dB
within only 4 cycles. The results indicate that as $m$ increases, the proposed designs
consistently achieve higher PSNR values.
Table 3 provides the PSNR comparison of various images at 32 cycles for the SC-based median
filter. The proposed design with $m=2$, $3$, and $4$ achieves significantly better
noise reduction quality than conventional SC methods for all images. Consistent with
Robert’s cross-edge detection, the proposed SC demonstrates enhanced PSNR convergence
characteristics with increasing $m$ values, achieving comparable quality up to 64
times faster than traditional SC-based median filter.
Fig. 13 presents the hardware utilization, specifically area and power, using different SC-based
median filter architectures implemented using the same design methodology applied
to the Robert’s cross-edge detectors. The proposed design exhibits increased area
and power consumption compared to the SC$_{\rm{share/reverse/shift}}$. For $m=2$,
$3$, and $4$, the area increases $2.0\times$, $3.6\times$, and $6.6\times$, respectively.
Similarly, power also increases with higher $m$ values, achieving increments of $1.4\times$,
$2.1\times$, and $3.4\times$ compared to the SC$_{\rm{share/reverse/shift}}$ for $m=2$,
$3$, and $4$, respectively. While the proposed design demands increased hardware resources,
this trade-off facilitates substantial enhancement in computational speed. Notably,
the proposed design with $m=4$ achieves equivalent image quality in only 4 cycles,
compared to the 256 cycles necessitated by the SC$_{\rm{share/reverse/shift}}$. Despite
the increase in area and power consumption, the proposed design achieves significantly
improved energy efficiency through a substantial reduction in processing cycles. Similar
to Robert's cross-edge detection results, the proposed architecture demonstrates an
effective trade-off between hardware complexity and processing efficiency in median
filtering applications. While requiring additional hardware resources, the significant
reduction in processing cycles validates the effectiveness of our parallel processing
approach across various SC applications.
Fig. 11. Noise-added input image and output images by fixed-point arithmetic and SC-based
median filters.
Fig. 12. PSNR performance of different SC-based median filter across processing cycles.
Fig. 13. Hardware resource analysis of SC-based median filters.
Table 3. PSNR comparison of SC-based median filter among various images at 32 cycles.}}
VI. CONCLUSION
In this paper, we addressed one of the primary limitations of SC, which is long computation
times. To achieve this, we have presented a novel parallel RNG design that significantly
enhances SC efficiency. The proposed design adopts a scalable partitioned structure
that combines m-bit fixed values and an ($n-m$)-bit random source to generate $k$
random numbers, each with n bits, simultaneously. The proposed design substantially
reduces hardware resource requirements compared to conventional parallel SC architectures
while dramatically improving processing speed. The proposed design demonstrates consistently
excellent performance across addition, subtraction, and multiplication operations.
When compared to the SC$_{\rm{serial}}$ and SC$_{\rm{parallel}}$, the proposed design
maintains competitive MSE for addition and multiplication, while significantly improving
subtraction. As the number of $m$-bits increases from 2 to 4, the convergence characteristics
of the design improve with higher $m$-bit numbers, achieving faster convergence within
4 to 8 cycles with 4-bit compared to within 16 to 32 cycles with 2-bit. Furthermore,
to validate the effectiveness of the proposed design in practical applications, we
adopted two digital image processing applications, Robert’s cross-edge detector and
median filter, achieving substantial processing speed improvements while comparable
image quality. Notably, the proposed SC design with $m=4$ achieves equivalent performance
in only 4 cycles, whereas SC$_{\rm{share}}$, SC$_{\rm{reverse}}$, and SC$_{\rm{shift}}$
counterparts require significantly more cycles to reach the same quality of processed
images. The proposed design advances the SC-based image processing field through its
highly efficient architecture, which achieves a substantial reduction in computational
cycles while maintaining image quality.
ACKNOWLEDGMENTS
This work was supported by the National Research Foundation of Korea (NRF) grant
funded by the Korea government (MSIT) (RS-2023-00279770).
References
B. R. Gaines, `'Stochastic computing systems,'' in Tou, J.T. (eds) Advances in Information
Systems Science, Springer, pp. 37-172, 1969.

A. Alaghi, W. Qian, and J. P. Hayes, ``The promise and challenge of stochastic computing,''
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.
37, no. 8, pp. 1515-1531, 2018.

S. S. Tehrani, S. Mannor, and W. J. Gross, ``Fully parallel stochastic LDPC decoders,''
IEEE Transactions on Signal Processing, vol. 56, no. 11, pp. 5692-5703, 2008.

D. Lee, J. Baik, and Y. Kim, ``An accurate and efficient stochastic computing adder
exploiting bit shuffle control scheme,'' Proc. of 19th International SoC Design Conference
(ISOCC), pp. 51-52, 2022.

Y. Liu, S. Liu, Y. Wang, F. Lombardi, and J. Han, ``A survey of stochastic computing
neural networks for machine learning applications,'' IEEE Transactions on Neural Networks
and Learning Systems, vol. 32, no. 7, pp. 2809-2824, 2021.

S. Liu and J. Han, ``Dynamic stochastic computing for digital signal processing applications,''
Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 604-609,
2020.

P. K. Muthappa, F. Neugebauer, I. Polian, and J. P. Hayes, ``Hardware-based fast real-time
image classification with stochastic computing,'' Proc. of IEEE 38th International
Conference on Computer Design (ICCD), pp. 340-347, 2020.

D. Lee and Y. Kim, ``Exploiting omega network and inexact accumulative parallel counter
to enhance energy efficiency in stochastic computing,'' Proc. of 40th ACM/SIGAPP Symposium
on Applied Computing (SAC), pp. 524–531, 2025.

Z. Chen, Y. Ma, and Z. Wang, ``Hybrid stochastic-binary computing for low-latency
and high-precision inference of CNNs,'' IEEE Transactions on Circuits and Systems
I: Regular Papers, vol. 69, no. 7, pp. 2707-2720, 2022.

H. Seo and Y. Kim. ``A low latency approximate adder design based on dual sub-adders
with error recovery,'' IEEE Transactions on Emerging Topics in Computing, vol. 11,
no. 3, pp. 811-816, 2023.

Y. Kim, ``A novel approximate adder with enhanced low-cost carry prediction for error
tolerant computing,'' IEIE Transactions on Smart Processing & Computing, vol. 8, no.
6, pp. 506-510, 2019.

Y. Chung and Y. Kim, ``Comparison of approximate computing with sobel edge detection,''
IEIE Transactions on Smart Processing & Computing, vol. 10, no. 4, pp. 355-361, 2021.

H. Seok, H. Seo, J. Lee, Y. Kim, ``Design optimization of a 4-2 compressor for low-cost
approximate multipliers,'' IEIE Transactions on Smart Processing & Computing, vol.
11, no. 6, pp. 455-461, 2022.

D. Lee, H. Seo, and Y. Kim, ``Design of an efficient parallel random number generator
using a single LFSR for stochastic computing,'' Proc. International Conference on
Artificial Intelligence in Information and Communication (ICAIIC), pp. 775-777, 2024.

P. Choi, J. Kim, and D. Kim, ``Improving ring-oscillator-based true random number
generators using multiple sampling,'' Journal of Semiconductor Technology and Science,
vol. 19, no. 3, pp. 305-309, 2019.

E. Kim and J. Kim, ``A single-ring-oscillator based true-random-number generator with
3-edges collapse,'' Journal of Semiconductor Technology and Science, vol. 23, no.
6, pp. 396-402, 2023.

E. Lee, S.-M. Yu, Y. Kang, and J. Song, ``Phase-locked Loop with true random number
generator for reference spur reduction,'' IEIE Transactions on Smart Processing &
Computing, vol. 10, no. 2, pp. 146-150, 2021.

A. Zhakatayev, K. Kim, K. Choi, and J. Lee, ``An efficient and accurate stochastic
number generator using even-distribution coding,'' IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, vol. 37, no. 12, pp. 3056-3066, 2018.

A. Alaghi and J. P. Hayes, ``Fast and accurate computation using stochastic circuits,''
in Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE), pp.
1-4, 2014.

H. Ichihara, T. Sugino, S. Ishii, T. Iwagaki, and T. Inoue, ``Compact and accurate
digital filters based on stochastic computing,'' EEE Transactions on Emerging Topics
in Computing , vol. 7, no. 1, pp. 31-43, 2019.

S. A. Salehi, ``Low-cost stochastic number generators for stochastic computing,''
IEEE Transactions on Very Large Scale Integration (VLSI) Systems , vol. 28, no. 4,
pp. 992-1001, 2020.

Y. Xie, S. Liao, B. Yuan, Y. Wang, and Z. Wang, ``Fully-parallel area-efficient deep
neural network design using stochastic computing,'' IEEE Transactions on Circuits
and Systems II: Express Briefs, vol. 64, no. 12, pp. 1382-1386, Dec. 2017.

J. Kim, W. S. Jeong, Y. Jeong, and S. E. Lee, ``Parallel stochastic computing architecture
for computationally intensive applications,'' Electronics, vol. 12, no. 7, 1749,
2023.

H. Joe and Y. Kim, ``Compact and power-efficient sobel edge detection with fully connected
cube-network-based stochastic computing,'' Journal of Semiconductor Technology and
Science, vol. 20, no. 5, pp. 436-446, 2020.

H. Joe and Y. Kim, ``Novel stochastic computing for energy-efficient image processors,''
Electronics, vol. 8, no. 6, 720, 2019.

D. Lee and Y. Kim, ``Design of a low-cost stochastic computing-based median filter
for digital image processing,'' Proc. of 21th International SoC Design Conference
(ISOCC), pp. 298-299, 2024.

D. Lee, J. Baik, and Y. Kim, ``Enhancing stochastic computing using a novel hybrid
random number generator integrating LFSR and Halton sequence,'' Proc. of 20th International
SoC Design Conference (ISOCC), pp. 7-8, 2023.

D. Lee and Y. Kim, ``Towards quantized stochastic computing by leveraging reduced
precision binary numbers through bit truncation,'' Proc. of IEEE 41st International
Conference on Computer Design (ICCD), pp. 419-422, 2023.

Z. Wang, D. Larso, M. Barker, S. Mohajer, and K. Bazargan, ``Deterministic shuffling
networks to implement stochastic circuits in parallel,'' IEEE Transactions on Very
Large Scale Integration (VLSI) Systems , vol. 28, no. 8, pp. 1821-1832, 2020.

X. Wei and L. Xiu, ``A VLSI digital circuit platform for performing deterministic
stochastic computing in the time dimension using fraction operations on rational numbers,''
EEE Transactions on Emerging Topics in Computing , vol. 11, no. 1, pp. 194-207, 2023.

Z. Lin, G. Xie, W. Xu, J. Han, and Y. Zhang, ``Accelerating stochastic computing using
deterministic Halton sequences,'' IEEE Transactions on Circuits and Systems II: Express
Briefs, vol. 68, no. 10, pp. 3351-3355, 2021.

V. Sehwag, N. Prasad, and I. Chakrabarti, ``A parallel stochastic number generator
with bit permutation networks,'' IEEE Transactions on Circuits and Systems II: Express
Briefs, vol. 65, no. 2, pp. 231-235, 2018.

A. Alaghi and J. P. Hayes, ``Survey of stochastic computing,'' ACM Transactions on
Embedded Computing Systems, vol. 12, no. 2, p. 92, 2013.

M. I. Freidlin, ``Quasi-deterministic approximation, metastability, and stochastic
resonance,'' Physica D, vol. 137, pp. 333-352, 2000.

A. Roy, S. Bandopadhaya, S. Chandra, and A. Suhag, ``Removal of impulse noise for
multimedia-IoT applications at gateway level,'' Multimedia Tools and Applications,
vol. 81, pp. 34463-34480, 2022.

D. E. Knuth, The Art of Computer Programming: Sorting and Searching, vol. 3, Pearson
Education, 1998.

S. Liu and J. Han, ``Toward energy-efficient stochastic circuits using parallel Sobol
sequences,'' IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.
26, no. 7, pp. 1326-1339, 2018.

Donghui Lee received his M.S. degree from the School of Computer Science and Engineering
at Kyungpook National University, Daegu, Republic of Korea, in 2023, where he is pursuing
a Ph.D. degree. His research interests include stochastic computing (SC) and hardware
design.
Yongtae Kim received his B.S. and M.S. degrees in electrical engineering from the
Korea University, Seoul, Republic of Korea, in 2007 and 2009, respectively, and a
Ph.D. degree from the Department of Electrical and Computer Engineering from the Texas
A&M University, College Station, TX, in 2013. From 2013 to 2018, he was a software
engineer with Intel Corporation, Santa Clara, CA. Since 2018, he has been with the
School of Computer Science and Engineering at Kyungpook National University, Daegu,
South Korea, where he is currently an Associate Professor. His research interests
are in energy-efficient integrated circuits and systems, particularly, neuromorphic
computing, approximate computing, quantum computing, and new memory devices and architecture.