Energy Efficient CMOS Stochastic Bit-based Bayesian Inference Accelerator
Honggu Kim¹
Yong Shim¹
¹(Department of Intelligent Semiconductor Engineering, Chung-Ang University, Seoul,
Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index terms
Stochastic computing, Bayesian inference, CMOS stochastic bit
I. INTRODUCTION
Bayesian inference using stochastic computing is a highly effective technique for
applications demanding statistical analysis, such as transportation systems [1] and medical diagnostics [2]. However, its computational complexity is substantially higher than that of traditional
arithmetic-based computing, due to the requirement of generating probabilistic posterior
distributions [3]. This increased computational complexity, coupled with the exponential growth in
data volumes associated with the IoT era [4], has led to a substantial rise in the energy requirements for executing computationally
intensive Bayesian inference [5].
To address this energy challenge, prior work [5] introduced a CMOS-based Bayesian inference accelerator that improves the energy efficiency
by optimizing data paths. However, this work still relies on a significant number
of arithmetic logic units for complex probabilistic computations, limiting the comprehensive
energy optimization in Bayesian inference operations.
In this work, we propose an energy-efficient CMOS stochastic bit-based Bayesian inference
accelerator. The CMOS stochastic bit in our design serves dual functions: 1) as a
stochastic computing unit and 2) as a 1-bit memory element. This dual-role functionality
significantly improves the energy efficiency of the Bayesian inference tasks by leveraging
its in-memory stochastic computing mechanism. The effectiveness of the proposed system
was demonstrated using a 3-layer, 4-variable Bayesian network model, achieving an
energy consumption of 1.5 nJ.
II. PRELIMINARIES
1. Bayesian Inference
Bayesian inference is a statistical method that uses Bayes' theorem to update the
probability estimate for a hypothesis (event H) as new incoming data (event D) are observed.
It is mathematically expressed as follows:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)} = \frac{P(H \cap D)}{P(D)} \tag{1}$$

where two key operations are highlighted: the joint probability of events H and
D (multiplication), and the normalization by the probability of the conditioning event
D (division).
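The two highlighted operations, multiplication into a joint probability and normalization by the evidence, can be checked numerically. A minimal Python sketch follows; the probabilities used here are illustrative placeholders, not values from the text:

```python
# Numerical check of Bayes' theorem: P(H|D) = P(D|H) * P(H) / P(D).
# The probabilities below are illustrative placeholders, not values
# taken from the text.
p_h = 0.3              # prior P(H)
p_d_given_h = 0.8      # likelihood P(D | H)
p_d_given_not_h = 0.2  # likelihood P(D | not H)

# Joint probability of H and D (the multiplication step).
p_joint = p_d_given_h * p_h

# Total probability of D (the normalization term in the denominator).
p_d = p_joint + p_d_given_not_h * (1 - p_h)

posterior = p_joint / p_d  # P(H | D)
```

The same two steps, joint probability followed by division by the evidence, are exactly what the hardware described later implements with AND gates and a rate equalizer.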
Fig. 1 illustrates an exemplified diagram of a Bayesian network model for predicting the
likelihood of wet grass based on multiple correlated factors. Given a prior probability
of cloudy weather, $P(C = T)$, equal to 0.5, the conditional probability of rain, $P(R = T \mid C = T)$,
is 0.8. If it rains, the probability of the grass being wet, $P(W = T \mid R = T)$, is 0.9. Meanwhile,
another factor influencing whether the grass is wet is the operation of a sprinkler.
When the weather is not cloudy, the likelihood of the sprinkler being turned on, $P(S
= T \mid C = F)$, is 0.5. If the sprinkler is on, the probability of the grass being wet increases
accordingly. As depicted in the diagram of Fig. 1, when it rains and the sprinkler is on ($R = T$, $S = T$), the probability of wet grass,
$P(W = T \mid R = T, S = T)$, rises significantly to 0.99.
Fig. 1. Exemplified diagram of Bayesian inference in wet grass prediction scenario.
The objective of Bayesian inference is to calculate the probability of a prior event
(e.g., the sprinkler is on) given the occurrence of the posterior event (e.g., the grass is
wet). This can be formulated as follows:

$$P(S = T \mid W = T) = \frac{P(S = T \cap W = T)}{P(W = T)} \tag{2}$$

where the calculation involves the joint probability of the sprinkler and wet-grass events (multiplication)
and the normalization by the overall probability of wet grass (division).
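This posterior can be computed exactly by enumerating the joint distribution of the wet-grass network. In the sketch below, the CPT entries not stated in the text (e.g., $P(R = T \mid C = F)$) are filled with illustrative placeholders, so the resulting posterior is for demonstration only:

```python
from itertools import product

# Conditional probability tables for the wet-grass network of Fig. 1.
# Entries marked "placeholder" are NOT given in the text; they are
# illustrative values only.
p_c_true = 0.5
p_r_given_c = {True: 0.8, False: 0.2}    # False entry: placeholder
p_s_given_c = {True: 0.1, False: 0.5}    # True entry: placeholder
p_w_given_rs = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9,      # placeholder
                (False, False): 0.01}    # placeholder

def joint(c, r, s, w):
    """Joint probability P(C=c, R=r, S=s, W=w) via the chain rule."""
    p = p_c_true if c else 1 - p_c_true
    p *= p_r_given_c[c] if r else 1 - p_r_given_c[c]
    p *= p_s_given_c[c] if s else 1 - p_s_given_c[c]
    pw = p_w_given_rs[(r, s)]
    return p * (pw if w else 1 - pw)

# P(S=T | W=T): joint probability (multiplication), then normalization
# by the overall probability of wet grass (division).
num = sum(joint(c, r, True, True) for c, r in product([True, False], repeat=2))
den = sum(joint(c, r, s, True) for c, r, s in product([True, False], repeat=3))
posterior = num / den
```

Exhaustive enumeration like this is what the stochastic hardware approximates with pulse trains instead of floating-point sums.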
2. Stochastic Bit
The CMOS stochastic bit employed in this study, originally introduced in [6], functions as both 1) a stochastic computing unit and 2) a 1-bit memory element.
The schematic of the stochastic bit is depicted in Fig. 2(a), comprising a pair of cross-coupled inverters, a PMOS Noise Cell (PNC), and an NMOS
Footer Cell (NFC). The PNC introduces asymmetric pull-up currents to the pull-up paths
of the cross-coupled inverters. This asymmetry, in conjunction with 1) the noise sources
inherent in CMOS devices (thermal noise and flicker noise) and 2) the metastability
inherent in the cross-coupled inverter configuration (when the input and output are
shorted), induces a proportional stochastic switching behavior. Additionally, the NFC is an optional
cell for offset calibration of the cross-coupled inverter: it selectively activates
codes on either side to compensate for strength mismatches. The calibration can be
performed iteratively by comparing the stochastic bit’s spike count with a reference
value representing the zero-offset condition, updating the NFC codes over multiple
time steps until the spike probability converges to the target probability [7].
Fig. 2. (a) Schematic of CMOS Stochastic bit, (b) operational waveform of the Stochastic
bit and (c) switching probability of Stochastic bit with entire PNC code (PNCC) range.
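The iterative NFC calibration described above can be sketched behaviorally. The sigmoid code-to-probability model, its slope `k`, and the one-code step size are modeling assumptions, not circuit-extracted values:

```python
import math
import random

rng = random.Random(0)

def spike_prob(pncc, offset_mismatch, nfc_code, k=0.15):
    """Behavioral model (assumption): the switching probability follows a
    sigmoid of the PNCC code, shifted by device mismatch and pulled back
    by the NFC calibration code."""
    return 1.0 / (1.0 + math.exp(-k * (pncc - 32 + offset_mismatch - nfc_code)))

def calibrate_nfc(offset_mismatch, window=255, steps=30):
    """Iteratively tune the NFC code so that the spike count at the
    midpoint PNCC (nominal 50%) matches the zero-offset reference,
    in the spirit of the calibration loop of [7]."""
    nfc = 0
    reference = window // 2              # expected spikes at exactly 50%
    for _ in range(steps):
        p = spike_prob(32, offset_mismatch, nfc)
        spikes = sum(rng.random() < p for _ in range(window))
        if spikes > reference:
            nfc += 1                     # too many spikes: push probability down
        elif spikes < reference:
            nfc -= 1                     # too few spikes: push probability up
    return nfc

# Example: a mismatch worth ~5 code steps is absorbed by the NFC code.
nfc = calibrate_nfc(offset_mismatch=5)
p_final = spike_prob(32, 5, nfc)         # should settle close to 0.5
```

The loop mirrors the described procedure: count spikes over a window, compare against the zero-offset reference, and nudge the NFC code until the spike probability converges.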
The operational waveform of the stochastic bit is illustrated in Fig. 2(b). Initially, when the enable signal (EN) is low, all pull-up and pull-down current
paths (PNC and NFC) are disconnected, and the inputs and outputs of the cross-coupled
inverters are shorted to establish an equivalent voltage potential. When EN transitions
to high, the pull-up and pull-down paths are activated, and the a/b nodes of the cross-coupled
inverters are probabilistically switched to either 1/0 or 0/1 based on the pre-programmed
PNC Codes (PNCCs) - functioning as 1-bit stochastic computing element. After the probabilistic
switching is complete, the cross-coupled inverter latches and holds the resulting
bit until the subsequent reset phase (EN low). In this phase, the switched bit remains
stored in the cross-coupled inverter, similar to an SRAM cell - functioning as a 1-bit
memory element.
The upper side of Fig. 2(c) demonstrates the mapping strategy of PNCCs onto the left and right pull-up noise cells.
When the MSB of the PNCC ($PNCC<5>$) is low, all right pull-up noise cells (RCs) are deactivated (set to high),
and the remaining 5 bits ($PNCC<4:0>$) are directly mapped to the left pull-up noise
cells (LCs). Conversely, when $PNCC<5>$ is set to 1, all LCs are deactivated (set
to high), and the remaining 5 bits are inverted and mapped to the RCs. As a result,
the output pulse probability of the stochastic bit exhibits a sigmoid response across
the entire PNCC code range, as shown in Fig. 2(c).
To sum up, the CMOS stochastic bit exhibits two main behavioral characteristics: 1)
it performs 1-bit-wise stochastic computing, and 2) the computed value is stored within
the cross-coupled inverter, thereby concurrently functioning as a 1-bit memory element.
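These two behavioral characteristics can be captured in a compact software model. The sigmoid mapping below follows Fig. 2(c) only qualitatively, with 50% at the midpoint code 32 (b'100000); the slope parameter `k` is an assumption:

```python
import math
import random

def pncc_to_prob(pncc, k=0.15):
    """Behavioral model (assumption): sigmoid switching probability over
    the 6-bit PNCC range, crossing 50% at the midpoint code 32 (b'100000),
    qualitatively matching Fig. 2(c)."""
    return 1.0 / (1.0 + math.exp(-k * (pncc - 32)))

class StochasticBit:
    """The two roles of the CMOS stochastic bit: a probabilistic switch on
    the EN rising edge, then a 1-bit memory holding the result."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = None                # cross-coupled inverter contents

    def evaluate(self, pncc):
        # Eval phase (EN high): probabilistic switching per the PNCC code.
        self.state = self.rng.random() < pncc_to_prob(pncc)
        return self.state

    def read(self):
        return self.state                # held like an SRAM cell until reset

    def reset(self):
        self.state = None                # reset phase (EN low): nodes equalized
```

Repeated `evaluate()` calls at a fixed code produce a Bernoulli pulse stream whose rate tracks the programmed probability, while `read()` returns the latched bit unchanged until the next reset.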
III. CMOS STOCHASTIC BIT-BASED BAYESIAN INFERENCE ACCELERATOR
Fig. 3 shows the overall architecture of the proposed Bayesian inference accelerator, tailored
for the Bayesian inference process in the wet-grass prediction scenario outlined in
Fig. 1. The architecture implements a 3-layer, 4-variable Bayesian network, where each dice
(representing the individual variable state) incorporates a single Stochastic bit.
Each Stochastic bit generates Poisson distributed pulse trains (e.g., OA1), where
its pulse generation rate represents the corresponding conditional probabilities.
The Bayesian inference logic unit selects two pulse trains (e.g., OA22 and OA3) and
performs the final Bayesian inference (e.g. $P(S = T | W = T)$). A decoder is used
to update the PNCCs for each dice, while a timing controller (TCON) synchronizes all
control signals for the Bayesian network and Bayesian inference logic.
Fig. 3. Overall architecture of the proposed Bayesian inference accelerator.
1. CMOS Stochastic bit-based Bayesian Dice
Fig. 4(a) shows the detailed schematic of a single Bayesian Dice unit, consisting of a single
Stochastic bit and two clocked latches (C-LATs). The switching probability of the
Stochastic bit is modulated by PMOS Noise Cell Codes ($PNCC<5:0>$), which are stored
in registers. The first clocked latch, right after the Stochastic bit, latches the
evaluated output pulse, generating the OA_LAT signal. The subsequent C-LAT captures
the current state of the dice (OA_LAT) and propagates it to subsequent layers via
the SEL signal.
Fig. 4(b) shows the detailed operational timing diagram of a single Bayesian Dice. When EN
is low (Reset phase), the a/b nodes of the cross-coupled inverter are equalized at
the same voltage potential. Subsequently, EN transitions to high (Eval phase), and
the a/b nodes of the cross-coupled inverters probabilistically switch to 1/0, generating
the output pulse (OA signal) when RD = 1. At the same time, when RD = 1, the first
C-LAT captures the OA value, resulting in OA_LAT signal. Subsequently, when CC = 1,
then the second C-LAT latches the OA_LAT value, generating SEL signal.
Fig. 4. (a) Schematic of a single Bayesian Dice and (b) its operational timing diagram.
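The RD/CC latch pipeline of the dice can be sketched as follows; a seeded Bernoulli draw stands in for the analog stochastic bit, and the class name is illustrative:

```python
import random

class BayesianDice:
    """One Bayesian Dice: a stochastic bit followed by two clocked
    latches (C-LATs). RD captures the evaluated pulse (OA -> OA_LAT);
    CC then forwards the dice state to the next layer (OA_LAT -> SEL).
    A Bernoulli draw stands in for the analog stochastic bit."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.oa = self.oa_lat = self.sel = 0

    def eval_phase(self, prob):
        self.oa = int(self.rng.random() < prob)   # EN high: probabilistic switch

    def rd_pulse(self):
        self.oa_lat = self.oa                     # first C-LAT: latch OA

    def cc_pulse(self):
        self.sel = self.oa_lat                    # second C-LAT: drive SEL

# One read cycle: Eval phase, then RD, then CC.
dice = BayesianDice(seed=2)
dice.eval_phase(0.9)
dice.rd_pulse()
dice.cc_pulse()
```

Separating RD and CC in the model reflects the timing diagram: the evaluated bit is captured first, and only the CC pulse exposes it to the next layer as SEL.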
2. CMOS Stochastic Bit-based Bayesian network
Fig. 5 shows the detailed implementation of the 3-layer, 4-variable Bayesian network. Each
PNCC, stored in the registers, causes the Bayesian dice to generate the corresponding
conditional probability based on its logical state (T or F). While the Bayesian Dice
at the 1st layer has a single PNCC bundle for 50% probability configuration (i.e.,
$P(C = T)$), the dices from the 2nd layer have an additional multiplexer along with
multiple PNCC register bundles for conditional probability computation (e.g., $P(S
= T | C)$). For instance, if DICE 1 generates SEL1 = 1 (Cloudy = T), then the 2-to-1 multiplexer in DICE 21 selects the PNCC_R_T code bundle
and the stochastic bit generates the output pulse (OA21) with a 90% probability. Furthermore, if DICE 21 and DICE 22 generate SEL21 = 1 and SEL22 = 0 (Rain = T and Sprinkler = F), then the 4-to-1 multiplexer in DICE 3 selects the PNCC_W_TF code
bundle, making OA3 generate output pulses at a 90% probability (i.e., $P(W = T | R = T, S = F)$). In
this way, the conditional probability of each variable is computed step by step using
PNCC registers and selection logic. This allows the Bayesian network to perform probabilistic
inference directly in hardware.
Fig. 5. Detailed implementation of 3-layer, 4-variable Bayesian network.
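The layer-by-layer selection logic can be emulated in software, in the spirit of the PyTorch-based emulation described in Section IV. For clarity, probabilities stand in for PNCC codes, and CPT entries not stated in the text are placeholders:

```python
import random

rng = random.Random(42)
bern = lambda p: int(rng.random() < p)

def sample_network():
    """One evaluation of the 3-layer network: each dice draws its state
    with the probability 'bundle' selected by its parents' SEL bits,
    mirroring the 2-to-1 and 4-to-1 multiplexers of Fig. 5. Entries
    marked 'placeholder' are not given in the text."""
    c = bern(0.5)                                   # DICE 1: P(C = T)
    r = bern({1: 0.8, 0: 0.2}[c])                   # DICE 21 (0.2: placeholder)
    s = bern({1: 0.1, 0: 0.5}[c])                   # DICE 22 (0.1: placeholder)
    w = bern({(1, 1): 0.99, (1, 0): 0.9,            # DICE 3: 4-to-1 mux
              (0, 1): 0.9, (0, 0): 0.01}[(r, s)])   # ((0, *): placeholders)
    return c, r, s, w

# One 255-cycle read window, as used on chip.
samples = [sample_network() for _ in range(255)]
p_w = sum(w for _, _, _, w in samples) / len(samples)
```

Each dictionary lookup plays the role of a multiplexer indexed by the parents' SEL bits, so ancestral sampling proceeds layer by layer exactly as in the hardware.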
3. CMOS Stochastic Bit-based Bayesian Inference Logic
Fig. 6 illustrates the detailed schematic of the Bayesian inference logic. Two 4-to-1 multiplexers
are incorporated to select two pulses (OAA and OAB) from among the four variable states of the Bayesian network (OA1, OA21, OA22, OA3).
These selected pulse trains represent the probabilistic states required for inference.
Subsequently, two AND gates perform stochastic bitwise multiplication on the selected
signals. The upper AND gate performs the stochastic multiplication between OAA and OAB, generating OANum ($P(Num = T) = P(A = T \cap B = T)$). The lower AND gate performs stochastic multiplication
between OAB and OADiv, resulting in OADen ($P(Den = T) = P(B = T) \cdot P(Div = T)$). The OADiv pulses, generated by the Stochastic bit (behind the Rate Equalizer), are initially configured
to produce an output pulse train at 50% by initializing the code modulator to zero
($PNCC\_DIV<5:0> = 000000$).
A Rate Equalizer - consisting of two counters, a digital comparator, and a code modulator
- is used to synchronize the data rates between OADen and OANum pulses. Specifically, each counter tracks the pulse occurrences (OADen and OANum) over one iteration window of 255 read cycles, and the subsequent digital comparator
determines which of the counted values (N and D) is larger. If N > D, the Code Modulator
decreases the PNCC_DIV codes to increase the data rate of OADiv ($P(Div = T)$). Conversely, if N < D, the Code Modulator increases the PNCC_DIV
codes to decrease the data rate of OADiv. After multiple iterations, the data rate of OADen ($OA_B \cdot OA_{Div}$) asymptotically
approaches that of OANum (i.e., $P(B = T) \cdot P(Div = T) \cong P(Num = T)$), meaning that OADiv generates output pulses at the ratio of the data rates of OANum and OAB ($P(Div = T) = P(Num = T)/P(B = T)$). Substituting $P(Num = T) = P(A = T \cap B = T)$, the output pulse rate of OADiv is finalized as follows:

$$P(Div = T) = \frac{P(A = T \cap B = T)}{P(B = T)} = P(A = T \mid B = T) \tag{3}$$

representing the fundamental outcome of the Bayesian inference operation, i.e.,
the probability of the prior event given the occurrence of the posterior event.
Fig. 6. Detailed implementation of the Bayesian inference logic.
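The Rate Equalizer's feedback loop can be modeled behaviorally. The sigmoid code-to-rate mapping is an assumption, and in this model a higher code yields a higher rate (the physical PNCC polarity described in the text is the inverse); the counter/comparator/modulator structure follows the description above:

```python
import math
import random

rng = random.Random(7)

def sig(code, k=0.15):
    # Behavioral code-to-rate model (assumption): sigmoid over the 6-bit
    # range with a 50% rate at the midpoint code 32. Note: higher code =
    # higher rate here, the inverse of the chip's PNCC_DIV polarity.
    return 1.0 / (1.0 + math.exp(-k * (code - 32)))

def rate_equalizer(p_a, p_b, windows=200, window=255):
    """Stochastic division: drive P(Div) toward P(Num)/P(B) = P(A) for
    independent pulse trains. Counters N and D accumulate OANum and
    OADen pulses per 255-cycle window; the code modulator then steps
    PNCC_DIV by one code in the direction that closes the gap."""
    code = 32                         # start at the nominal 50% code (assumption)
    history = []
    for _ in range(windows):
        n = d = 0
        for _ in range(window):
            a = rng.random() < p_a
            b = rng.random() < p_b
            div = rng.random() < sig(code)
            n += a and b              # OANum = OAA AND OAB
            d += b and div            # OADen = OAB AND OADiv
        if n > d:
            code = min(63, code + 1)  # raise P(Div) (model polarity)
        elif n < d:
            code = max(0, code - 1)   # lower P(Div)
        history.append(sig(code))
    return sum(history[-50:]) / 50    # averaged rate over the final windows

p_div = rate_equalizer(p_a=0.3, p_b=0.6)  # expected to settle near P(A) = 0.3
```

Because the code moves one step per 255-cycle window, the loop behaves like a first-order delta modulator: the rate dithers around the quantized target and its window-averaged value approaches the true quotient.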
IV. PERFORMANCE ANALYSIS
Fig. 7(a) shows the layout of the proposed Stochastic bit-based Bayesian inference accelerator
chip, implemented using the TSMC 65 nm GP process technology. The chip operates with
a supply voltage of 1 V and a main clock frequency of 200 MHz. The total power consumption
of the entire chip is 122.5 µW. Fig. 7(b) illustrates the breakdown of both area and power consumption. Notably, the Bayesian
network and inference logic occupy the largest portion of both the silicon area and
the power consumption, largely due to the high count of Stochastic bits. Nevertheless,
the power consumption of the Bayesian network (containing 4 CMOS Stochastic bits)
is comparable to that of the simple digital-logic-based Timing Controller, indicating that the
CMOS stochastic bit used in this work is energy efficient.
Additionally, each Stochastic bit has been intentionally designed with a redundant
silicon area to mitigate parasitic coupling noise with surrounding digital logic.
This design consideration is crucial, as the Stochastic bits generate Poisson-distributed
pulses influenced by multiple noise sources (thermal noise and flicker noise). The
presence of parasitic coupling noise has the potential to significantly degrade the
performance of the stochastic bits. Future work aims to optimize the area efficiency
of the Stochastic bit by eliminating redundant silicon space, while still ensuring
the integrity of stochastic behavior.
Fig. 7. (a) Implemented layout of the Bayesian inference accelerator chip and (b)
area and power breakdowns.
Fig. 8(a) shows the simulated switching events occurring in the 3-layer, 4-variable
Bayesian network over 255 iterations, following the conditional probability diagram shown
in Fig. 1. In Fig. 8(b), the computed probability table is shown alongside the absolute error (|err|) for
each variable, calculated with respect to its target probability. Each Bayesian dice
successfully generates Poisson distributed pulse trains, closely aligning with the
target conditional probabilities across variable states, achieving a worst-case absolute
error (|err|) of only 2.3%.
Fig. 8. (a) Simulation result of the switching events in 3-layer, 4-variable Bayesian
network reflecting the conditional probability-based output pulse generation rate
and (b) the computed probability table with absolute error (|err|) relative to the
target probability.
Fig. 9 shows the simulated result of the Bayesian inference logic across multiple iterations.
In this simulation, Bayesian inference logic is configured to generate the output
pulse at the probability of sprinkler’s operation given that the grass is wet ($P(S
= T | W = T)$), as given in equation (2). In the initial phase, the output pulse generation
rate is about 50%, since the PNCC_DIV code is initially set to zero. Over multiple Bayesian
inference iterations, the switching probability of OADIV (i.e., $P(Div = T)$) asymptotically converges to the target probability of 28.35%.
Consequently, $P(Div = T)$ first crosses the target probability at the fifth iteration,
resulting in a total elapsed time of 12.25 µs.
Fig. 9. Simulated result of the Bayesian inference logic across multiple iterations,
showing an asymptotic adjustment of P(Div=T) to converge toward the target probability
defined in Equation (3).
To evaluate the scalability of the proposed stochastic bit-based Bayesian inference,
we developed a PyTorch-based software emulation framework to simulate a 37-node ALARM
(A Logical Alarm Reduction and Monitoring) Bayesian network [5] with embedded stochastic bit operations. Each variable node was implemented with
the identical structure of the Bayesian dice shown in Fig. 4(a).
As the number of variable nodes and inter-node connections increases, the area overhead
associated with the PNCC registers grows correspondingly. Furthermore, for a Bayesian
dice with N prior event dependencies, $2^N$ PNCC register bundles are required, leading
to a significant increase in area overhead. Meanwhile, since all stochastic bits in
the network utilize identical PNCC values to define the pulse generation rate (e.g.,
PNCC = b’100000 for a 50% probability), maintaining separate PNCC registers for each
Bayesian unit introduces area redundancy. To enhance area efficiency, a single global
PNCC Look-Up Table (LUT) can be shared across the entire network, eliminating the
need for multiple PNCC registers for individual Bayesian dices.
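The shared-LUT idea can be sketched as follows. The inverse-sigmoid quantization, its parameters, and the class and function names are illustrative assumptions, not the paper's implementation:

```python
import math

def inv_sigmoid_code(p, k=0.15):
    """Quantize a target probability to the nearest 6-bit PNCC code under
    a sigmoid model centered at code 32 (model parameters are assumptions)."""
    code = 32 + math.log(p / (1 - p)) / k
    return max(0, min(63, round(code)))

class GlobalPNCCLUT:
    """Hypothetical shared table: each distinct PNCC code is stored once,
    and every Bayesian dice keeps only a small index into the table
    instead of its own 2^N register bundles."""
    def __init__(self):
        self.codes = []

    def index_of(self, p):
        code = inv_sigmoid_code(p)
        if code not in self.codes:
            self.codes.append(code)
        return self.codes.index(code)

lut = GlobalPNCCLUT()
# Probabilities reused across the wet-grass network (values from Fig. 1);
# repeated 0.5 and 0.9 entries would otherwise occupy separate registers.
probs = [0.5, 0.8, 0.9, 0.5, 0.99, 0.9, 0.5]
indices = [lut.index_of(p) for p in probs]
```

Since many dices share the same conditional probabilities, the table holds far fewer entries than the $2^N$ per-dice bundle count, which is the area saving argued above.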
Using the PyTorch-based software emulation framework, the spike probabilities of all
37 variable nodes were accurately generated, with a worst-case absolute error (|err|)
of 4.2%. Furthermore, the Bayesian inference logic can be realized using the same
structure shown in Fig. 6, with the sole modification of extending the input ports of the two pulse (OA) selection
multiplexers from 4 to 37. The PyTorch-emulated Bayesian inference logic achieved
the target probability by the sixth iteration, corresponding to a total elapsed time
of 15 µs. The scaled stochastic bit-based Bayesian network consumed a total power
of 997.8 µW, resulting in an energy consumption of 14.7 nJ for the Bayesian inference
in the ALARM network.
Table 1 compares this work with other state-of-the-art studies. Unlike previous approaches
that rely on arithmetic logic for stochastic computation [5], which may not achieve optimal energy efficiency, this work leverages a stochastic
bit-based operation specifically designed for energy-efficient Bayesian inference.
Consequently, our approach achieved a normalized energy of 1.5 nJ on the 4-variable
wet-grass prediction task, demonstrating a 5.5-fold reduction in normalized energy
consumption compared to [5]. Even when scaled to the 37-variable ALARM network, the normalized energy is 1.59
nJ, a marginal increase over the 4-variable case, corresponding to a 5.18-fold energy
reduction relative to [5]. This indicates that the stochastic bit-based accelerator is feasible for large-scale
Bayesian networks.
Table 1. Comparison between arithmetic logic-based and CMOS stochastic bit-based computing
| | Arithmetic logic [5] | CMOS Stochastic Bit | CMOS Stochastic Bit |
|---|---|---|---|
| Process | 65 nm | 65 nm | 65 nm |
| Supply voltage | 0.5 V | 1 V | 1 V |
| Main clock frequency | 33 MHz | 200 MHz | 200 MHz |
| Stochastic computing unit | Arithmetic logic | CMOS stochastic bit | CMOS stochastic bit |
| Application | ALARM network | Wet grass prediction | ALARM network* |
| Number of variables (N) | 37 | 4 | 37 |
| Execution time | 350 µs | 12.25 µs | 15 µs* |
| Total energy (E) | 76.2 nJ | 1.5 nJ | 14.7 nJ* |
| Normalized energy $((4 \cdot E)/N)$** | 8.24 nJ | 1.5 nJ | 1.59 nJ* |

\* Results from the PyTorch-based software emulation of the 37-node ALARM network (Section IV).
\** Energy normalized by the number of variables, scaled to the 4-variable case.
V. CONCLUSIONS
In this work, we proposed an energy-efficient stochastic bit-based accelerator for
Bayesian inference. The stochastic bit in this design serves dual roles: 1) as a stochastic
computing unit and 2) as a 1-bit memory element, substantially improving the energy
efficiency of Bayesian inference procedures. Implemented using the TSMC 65 nm GP process
technology, the architecture demonstrates high energy efficiency on a 3-layer, 4-variable
Bayesian network model, achieving a total energy consumption of 1.5 nJ.
ACKNOWLEDGEMENT
This research was supported in part by the National Research Foundation of Korea (NRF)
grant through Korea government (MSIT) under Grant No. 2021R1C1C100875214, and in part
by the KIST Institutional Program through Korea government (MSIT) under Grant No.
2E33581. The EDA tool was supported by the IC Design Education Center (IDEC), Korea.
REFERENCES
Gindele T., Brechtel S., Dillmann R., 2010, A probabilistic model for estimating driver
behaviors and vehicle trajectories in traffic environments, Proc. of 13th International
IEEE Conference on Intelligent Transportation Systems, pp. 1625-1631

Abideen Z. U., Ghahoor M., Munir K., Saqib M., Ullah A., Zia T., Tariq S. A., Ahmed
G., Zahra A., 2020, Uncertainty Assisted Robust Tuberculosis Identification With Bayesian
Convolutional Neural Networks, IEEE Access, Vol. 8, pp. 22812-22825

Dorrance R., Dasalukunte D., Wang H., Liu R., Carlton B. R., 2023, An energy-efficient
Bayesian neural network accelerator with CiM and a time-interleaved Hadamard digital
GRNG using 22-nm FinFET, IEEE Journal of Solid-State Circuits, Vol. 58, No. 10, pp.
2826-2838

Atzori L., Iera A., Morabito G., 2010, The Internet of Things: A survey, Computer
Networks, Vol. 54, No. 15, pp. 2787-2805

Khan O. U., Wentzloff D. D., 2016, Hardware accelerator for probabilistic inference
in 65-nm CMOS, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol.
24, No. 3, pp. 837-845

Koo M., Srinivasan G., Shim Y., Roy K., 2020, sBSNN: Stochastic-bits enabled binary
spiking neural network with on-chip learning for energy efficient neuromorphic computing
at the edge, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 67,
No. 8, pp. 2546-2555

Kim H., An Y., Kim M., Heo G.-C., Shim Y., 2025, All stochastic-spiking neural network
(AS-SNN): Noise induced spike pulse generator for input and output neurons with resistive
synaptic array, IEEE Transactions on Circuits and Systems II: Express Briefs, Vol.
72, No. 1, pp. 78-82

Heckerman D., 2008, A Tutorial on Learning with Bayesian Networks, pp. 33-82

Honggu Kim received his B.S. degrees from the School of Electrical and Electronics
Engineering, Chung-Ang University (CAU), Seoul, Korea, in 2022. He is currently working
toward an Integrated M.S and Ph.D. degree course in intelligent semiconductor engineering.
His research interests include neuromorphic hardware and software co-optimization
and analog compute-in-memory architecture.
Yong Shim received his B.S. and M.S. degrees in electronics engineering from Korea
University, in 2004 and 2006, respectively, and a Ph.D. degree from the School of
Electrical and Computer Engineering, Purdue University, West Lafayette, IN, in 2018.
He was a Memory Interface Designer with Samsung Electronics, Hwaseong, from 2006 to
2013. At Samsung, he has worked on the design and development of a memory interface
for synchronous DRAMs (DDR1 and DDR4). He is currently an Assistant Professor with
Chung-Ang University. Prior to joining Chung-Ang University, in 2020, he was an SRAM
Designer with Intel Corporation, Hillsboro, OR, from 2018 to 2020, where he was involved
in designing circuits for super-scaled next generation SRAM cache design. His research
interests include neuromorphic hardware and algorithm, in-memory computing, robust
memory interface design, as well as emerging devices (RRAM, MRAM, and STO) based unconventional
computing models.