Energy Efficient CMOS Stochastic Bit-based Bayesian Inference Accelerator
Honggu Kim¹
Yong Shim¹
¹(Department of Intelligent Semiconductor Engineering, Chung-Ang University, Seoul,
Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index terms
Stochastic computing, Bayesian inference, CMOS stochastic bit
I. INTRODUCTION
Bayesian inference using stochastic computing is a highly effective technique for
applications demanding statistical analysis, such as transportation systems [1] and medical diagnostics [2]. However, its computational complexity is substantially higher than that of traditional
arithmetic-based computing, due to the requirement of generating probabilistic posterior
distributions [3]. This increased computational complexity, coupled with the exponential growth in
data volumes associated with the IoT era [4], has led to a substantial rise in the energy requirements for executing computationally
intensive Bayesian inference [5].
To address this energy challenge, prior work [5] introduced a CMOS-based Bayesian inference accelerator that improves the energy efficiency
by optimizing data paths. However, this work still relies on a significant number
of arithmetic logic units for complex probabilistic computations, limiting the comprehensive
energy optimization in Bayesian inference operations.
In this work, we propose an energy-efficient CMOS stochastic bit-based Bayesian inference
accelerator. The CMOS stochastic bit in our design serves dual functions: 1) as a
stochastic computing unit and 2) as a 1-bit memory element. This dual-role functionality
significantly improves the energy efficiency of the Bayesian inference tasks by leveraging
its in-memory stochastic computing mechanism. The effectiveness of the proposed system
was demonstrated using a 3-layer, 4-variable Bayesian network model, achieving an
energy consumption of 1.5 nJ.
II. PRELIMINARIES
1. Bayesian Inference
Bayesian inference is a statistical method that uses Bayes' theorem to update the
probability estimate for a hypothesis (event H) as new incoming data (event D) are observed.
It is mathematically expressed as follows:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)} = \frac{P(H \cap D)}{P(D)} \tag{1}$$

where two key operations are highlighted: the joint probability of events H and
D (multiplication), and the normalization by the probability of the conditioning event
D (division).
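The two highlighted operations, multiplication into a joint probability and normalization by the evidence, can be checked numerically. A minimal Python sketch follows; the probabilities used here are illustrative placeholders, not values from the text:

```python
# Numerical check of Bayes' theorem: P(H|D) = P(D|H) * P(H) / P(D).
# The probabilities below are illustrative placeholders, not values
# taken from the text.
p_h = 0.3              # prior P(H)
p_d_given_h = 0.8      # likelihood P(D | H)
p_d_given_not_h = 0.2  # likelihood P(D | not H)

# Joint probability of H and D (the multiplication step).
p_joint = p_d_given_h * p_h

# Total probability of D (the normalization term in the denominator).
p_d = p_joint + p_d_given_not_h * (1 - p_h)

posterior = p_joint / p_d  # P(H | D)
```

The same two steps, joint probability followed by division by the evidence, are exactly what the hardware described later implements with AND gates and a rate equalizer.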
Fig. 1 illustrates an exemplified diagram of a Bayesian network model for predicting the
likelihood of wet grass based on multiple correlated factors. Given a prior probability
of cloudy weather, $P(C = T)$, equal to 0.5, the conditional probability of rain, $P(R = T \mid C = T)$,
is 0.8. If it rains, the probability of the grass being wet, $P(W = T \mid R = T)$, is 0.9. Meanwhile,
another factor influencing whether the grass is wet is the operation of a sprinkler.
When the weather is not cloudy, the likelihood of the sprinkler being turned on, $P(S
= T \mid C = F)$, is 0.5. If the sprinkler is on, the probability of the grass being wet increases
accordingly. As depicted in the diagram of Fig. 1, when it rains and the sprinkler is on ($R = T$, $S = T$), the probability of wet grass,
$P(W = T \mid R = T, S = T)$, rises significantly to 0.99.
Fig. 1. Exemplified diagram of Bayesian inference in wet grass prediction scenario.
The objective of Bayesian inference is to calculate the probability of a prior event
(e.g., the sprinkler is on) given the occurrence of the posterior event (e.g., the grass is
wet). This can be formulated as follows:

$$P(S = T \mid W = T) = \frac{P(S = T \cap W = T)}{P(W = T)} \tag{2}$$

where the calculation involves the joint probability of the sprinkler and wet-grass events (multiplication)
and the normalization by the overall probability of wet grass (division).
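This posterior can be computed exactly by enumerating the joint distribution of the wet-grass network. In the sketch below, the CPT entries not stated in the text (e.g., $P(R = T \mid C = F)$) are filled with illustrative placeholders, so the resulting posterior is for demonstration only:

```python
from itertools import product

# Conditional probability tables for the wet-grass network of Fig. 1.
# Entries marked "placeholder" are NOT given in the text; they are
# illustrative values only.
p_c_true = 0.5
p_r_given_c = {True: 0.8, False: 0.2}    # False entry: placeholder
p_s_given_c = {True: 0.1, False: 0.5}    # True entry: placeholder
p_w_given_rs = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9,      # placeholder
                (False, False): 0.01}    # placeholder

def joint(c, r, s, w):
    """Joint probability P(C=c, R=r, S=s, W=w) via the chain rule."""
    p = p_c_true if c else 1 - p_c_true
    p *= p_r_given_c[c] if r else 1 - p_r_given_c[c]
    p *= p_s_given_c[c] if s else 1 - p_s_given_c[c]
    pw = p_w_given_rs[(r, s)]
    return p * (pw if w else 1 - pw)

# P(S=T | W=T): joint probability (multiplication), then normalization
# by the overall probability of wet grass (division).
num = sum(joint(c, r, True, True) for c, r in product([True, False], repeat=2))
den = sum(joint(c, r, s, True) for c, r, s in product([True, False], repeat=3))
posterior = num / den
```

Exhaustive enumeration like this is what the stochastic hardware approximates with pulse trains instead of floating-point sums.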
2. Stochastic Bit
The CMOS stochastic bit employed in this study, originally introduced in [6], functions as both 1) a stochastic computing unit and 2) a 1-bit memory element.
The schematic of the stochastic bit is depicted in Fig. 2(a), comprising a pair of cross-coupled inverters, a PMOS Noise Cell (PNC), and an NMOS
Footer Cell (NFC). The PNC introduces asymmetric pull-up currents to the pull-up paths
of the cross-coupled inverters. This asymmetry, in conjunction with 1) the noise sources
inherent in CMOS devices (thermal noise and flicker noise) and 2) the metastability
inherent in the cross-coupled inverter configuration (when the input and output are
shorted), induces a proportional stochastic switching behavior. Additionally, the NFC is an optional
cell for offset calibration of the cross-coupled inverter: it selectively activates
codes on either side to compensate for strength mismatches. The calibration can be
performed iteratively by comparing the stochastic bit’s spike count with a reference
value representing the zero-offset condition, updating the NFC codes over multiple
time steps until the spike probability converges to the target probability [7].
Fig. 2. (a) Schematic of CMOS Stochastic bit, (b) operational waveform of the Stochastic
bit and (c) switching probability of Stochastic bit with entire PNC code (PNCC) range.
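The iterative NFC calibration described above can be sketched behaviorally. The sigmoid code-to-probability model, its slope `k`, and the one-code step size are modeling assumptions, not circuit-extracted values:

```python
import math
import random

rng = random.Random(0)

def spike_prob(pncc, offset_mismatch, nfc_code, k=0.15):
    """Behavioral model (assumption): the switching probability follows a
    sigmoid of the PNCC code, shifted by device mismatch and pulled back
    by the NFC calibration code."""
    return 1.0 / (1.0 + math.exp(-k * (pncc - 32 + offset_mismatch - nfc_code)))

def calibrate_nfc(offset_mismatch, window=255, steps=30):
    """Iteratively tune the NFC code so that the spike count at the
    midpoint PNCC (nominal 50%) matches the zero-offset reference,
    in the spirit of the calibration loop of [7]."""
    nfc = 0
    reference = window // 2              # expected spikes at exactly 50%
    for _ in range(steps):
        p = spike_prob(32, offset_mismatch, nfc)
        spikes = sum(rng.random() < p for _ in range(window))
        if spikes > reference:
            nfc += 1                     # too many spikes: push probability down
        elif spikes < reference:
            nfc -= 1                     # too few spikes: push probability up
    return nfc

# Example: a mismatch worth ~5 code steps is absorbed by the NFC code.
nfc = calibrate_nfc(offset_mismatch=5)
p_final = spike_prob(32, 5, nfc)         # should settle close to 0.5
```

The loop mirrors the described procedure: count spikes over a window, compare against the zero-offset reference, and nudge the NFC code until the spike probability converges.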
The operational waveform of the stochastic bit is illustrated in Fig. 2(b). Initially, when the enable signal (EN) is low, all pull-up and pull-down current
paths (PNC and NFC) are disconnected, and the inputs and outputs of the cross-coupled
inverters are shorted to establish an equivalent voltage potential. When EN transitions
to high, the pull-up and pull-down paths are activated, and the a/b nodes of the cross-coupled
inverters are probabilistically switched to either 1/0 or 0/1 based on the pre-programmed
PNC Codes (PNCCs) - functioning as 1-bit stochastic computing element. After the probabilistic
switching is complete, the cross-coupled inverter latches and holds the resulting
bit until the subsequent reset phase (EN low). In this phase, the switched bit remains
stored in the cross-coupled inverter, similar to an SRAM cell - functioning as a 1-bit
memory element.
The upper side of Fig. 2(c) demonstrates the mapping strategy of PNCCs onto the left and right pull-up noise cells.
When the MSB of the PNCC ($PNCC<5>$) is low, all right pull-up noise cells (RCs) are deactivated (set to high),
and the remaining 5 bits ($PNCC<4:0>$) are directly mapped to the left pull-up noise
cells (LCs). Conversely, when $PNCC<5>$ is set to 1, all LCs are deactivated (set
to high), and the remaining 5 bits are inverted and mapped to the RCs. As a result,
the output pulse probability of the stochastic bit exhibits a sigmoid response across
the entire PNCC code range, as shown in Fig. 2(c).
To sum up, the CMOS stochastic bit exhibits two main behavioral characteristics: 1)
it performs 1-bit-wise stochastic computing, and 2) the computed value is stored within
the cross-coupled inverter, thereby concurrently functioning as a 1-bit memory element.
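These two behavioral characteristics can be captured in a compact software model. The sigmoid mapping below follows Fig. 2(c) only qualitatively, with 50% at the midpoint code 32 (b'100000); the slope parameter `k` is an assumption:

```python
import math
import random

def pncc_to_prob(pncc, k=0.15):
    """Behavioral model (assumption): sigmoid switching probability over
    the 6-bit PNCC range, crossing 50% at the midpoint code 32 (b'100000),
    qualitatively matching Fig. 2(c)."""
    return 1.0 / (1.0 + math.exp(-k * (pncc - 32)))

class StochasticBit:
    """The two roles of the CMOS stochastic bit: a probabilistic switch on
    the EN rising edge, then a 1-bit memory holding the result."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = None                # cross-coupled inverter contents

    def evaluate(self, pncc):
        # Eval phase (EN high): probabilistic switching per the PNCC code.
        self.state = self.rng.random() < pncc_to_prob(pncc)
        return self.state

    def read(self):
        return self.state                # held like an SRAM cell until reset

    def reset(self):
        self.state = None                # reset phase (EN low): nodes equalized
```

Repeated `evaluate()` calls at a fixed code produce a Bernoulli pulse stream whose rate tracks the programmed probability, while `read()` returns the latched bit unchanged until the next reset.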
III. CMOS STOCHASTIC BIT-BASED BAYESIAN INFERENCE ACCELERATOR
Fig. 3 shows the overall architecture of the proposed Bayesian inference accelerator, tailored
for the Bayesian inference process in the wet-grass prediction scenario outlined in
Fig. 1. The architecture implements a 3-layer, 4-variable Bayesian network, where each dice
(representing the individual variable state) incorporates a single Stochastic bit.
Each Stochastic bit generates Poisson distributed pulse trains (e.g., OA1), where
its pulse generation rate represents the corresponding conditional probabilities.
The Bayesian inference logic unit selects two pulse trains (e.g., OA22 and OA3) and
performs the final Bayesian inference (e.g. $P(S = T | W = T)$). A decoder is used
to update the PNCCs for each dice, while a timing controller (TCON) synchronizes all
control signals for the Bayesian network and Bayesian inference logic.
Fig. 3. Overall architecture of the proposed Bayesian inference accelerator.
1. CMOS Stochastic bit-based Bayesian Dice
Fig. 4(a) shows the detailed schematic of a single Bayesian Dice unit, consisting of a single
Stochastic bit and two clocked latches (C-LATs). The switching probability of the
Stochastic bit is modulated by PMOS Noise Cell Codes ($PNCC<5:0>$), which are stored
in registers. The first clocked latch, right after the Stochastic bit, latches the
evaluated output pulse, generating the OA_LAT signal. The subsequent C-LAT captures
the current state of the dice (OA_LAT) and propagates it to subsequent layers via
the SEL signal.
Fig. 4(b) shows the detailed operational timing diagram of a single Bayesian Dice. When EN
is low (Reset phase), the a/b nodes of the cross-coupled inverter are equalized at
the same voltage potential. Subsequently, EN transitions to high (Eval phase), and
the a/b nodes of the cross-coupled inverters probabilistically switch to 1/0, generating
the output pulse (OA signal) when RD = 1. At the same time, when RD = 1, the first
C-LAT captures the OA value, resulting in OA_LAT signal. Subsequently, when CC = 1,
then the second C-LAT latches the OA_LAT value, generating SEL signal.
Fig. 4. (a) Schematic of a single Bayesian Dice and (b) its operational timing diagram.
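The RD/CC latch pipeline of the dice can be sketched as follows; a seeded Bernoulli draw stands in for the analog stochastic bit, and the class name is illustrative:

```python
import random

class BayesianDice:
    """One Bayesian Dice: a stochastic bit followed by two clocked
    latches (C-LATs). RD captures the evaluated pulse (OA -> OA_LAT);
    CC then forwards the dice state to the next layer (OA_LAT -> SEL).
    A Bernoulli draw stands in for the analog stochastic bit."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.oa = self.oa_lat = self.sel = 0

    def eval_phase(self, prob):
        self.oa = int(self.rng.random() < prob)   # EN high: probabilistic switch

    def rd_pulse(self):
        self.oa_lat = self.oa                     # first C-LAT: latch OA

    def cc_pulse(self):
        self.sel = self.oa_lat                    # second C-LAT: drive SEL

# One read cycle: Eval phase, then RD, then CC.
dice = BayesianDice(seed=2)
dice.eval_phase(0.9)
dice.rd_pulse()
dice.cc_pulse()
```

Separating RD and CC in the model reflects the timing diagram: the evaluated bit is captured first, and only the CC pulse exposes it to the next layer as SEL.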
2. CMOS Stochastic Bit-based Bayesian network
Fig. 5 shows the detailed implementation of the 3-layer, 4-variable Bayesian network. Each
PNCC, stored in the registers, causes the Bayesian dice to generate the corresponding
conditional probability based on its logical state (T or F). While the Bayesian Dice
at the 1st layer has a single PNCC bundle for 50% probability configuration (i.e.,
$P(C = T)$), the dices from the 2nd layer have an additional multiplexer along with
multiple PNCC register bundles for conditional probability computation (e.g., $P(S
= T | C)$). For instance, if DICE 1 generates SEL1 = 1 (Cloudy = T), then the 2-to-1 multiplexer in DICE 21 selects the PNCC_R_T code bundle
and the stochastic bit generates the output pulse (OA21) with a 90% probability. Furthermore, if DICE 21 and DICE 22 generate SEL21 = 1 and SEL22 = 0 (Rain = T and Sprinkler = F), then the 4-to-1 multiplexer in DICE 3 selects the PNCC_W_TF code
bundle, making OA3 generate output pulses at a 90% probability (i.e., $P(W = T | R = T, S = F)$). In
this way, the conditional probability of each variable is computed step by step using
PNCC registers and selection logic. This allows the Bayesian network to perform probabilistic
inference directly in hardware.
Fig. 5. Detailed implementation of 3-layer, 4-variable Bayesian network.
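The layer-by-layer selection logic can be emulated in software, in the spirit of the PyTorch-based emulation described in Section IV. For clarity, probabilities stand in for PNCC codes, and CPT entries not stated in the text are placeholders:

```python
import random

rng = random.Random(42)
bern = lambda p: int(rng.random() < p)

def sample_network():
    """One evaluation of the 3-layer network: each dice draws its state
    with the probability 'bundle' selected by its parents' SEL bits,
    mirroring the 2-to-1 and 4-to-1 multiplexers of Fig. 5. Entries
    marked 'placeholder' are not given in the text."""
    c = bern(0.5)                                   # DICE 1: P(C = T)
    r = bern({1: 0.8, 0: 0.2}[c])                   # DICE 21 (0.2: placeholder)
    s = bern({1: 0.1, 0: 0.5}[c])                   # DICE 22 (0.1: placeholder)
    w = bern({(1, 1): 0.99, (1, 0): 0.9,            # DICE 3: 4-to-1 mux
              (0, 1): 0.9, (0, 0): 0.01}[(r, s)])   # ((0, *): placeholders)
    return c, r, s, w

# One 255-cycle read window, as used on chip.
samples = [sample_network() for _ in range(255)]
p_w = sum(w for _, _, _, w in samples) / len(samples)
```

Each dictionary lookup plays the role of a multiplexer indexed by the parents' SEL bits, so ancestral sampling proceeds layer by layer exactly as in the hardware.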
3. CMOS Stochastic Bit-based Bayesian Inference Logic
Fig. 6 illustrates the detailed schematic of the Bayesian inference logic. Two 4-to-1 multiplexers
are incorporated to select two pulses (OAA and OAB) from among the four variable states of the Bayesian network (OA1, OA21, OA22, OA3).
These selected pulse trains represent the probabilistic states required for inference.
Subsequently, two AND gates perform stochastic bitwise multiplication on the selected
signals. The upper AND gate performs the stochastic multiplication between OAA and OAB, generating OANum ($P(Num = T) = P(A = T \cap B = T)$). The lower AND gate performs stochastic multiplication
between OAB and OADiv, resulting in OADen ($P(Den = T) = P(B = T) \cdot P(Div = T)$). The OADiv pulses, generated by the Stochastic bit (behind the Rate Equalizer), are initially configured
to produce an output pulse train at 50% by initializing the code modulator to zero
($PNCC\_DIV<5:0> = 000000$).
A Rate Equalizer - consisting of two counters, a digital comparator, and a code modulator
- is used to synchronize the data rates between OADen and OANum pulses. Specifically, each counter tracks the pulse occurrences (OADen and OANum) over one iteration window of 255 read cycles, and the subsequent digital comparator
determines which of the counted values (N and D) is larger. If N > D, the Code Modulator
decreases the PNCC_DIV codes to increase the data rate of OADiv ($P(Div = T)$). Conversely, if N < D, the Code Modulator increases the PNCC_DIV
codes to decrease the data rate of OADiv. After multiple iterations, the data rate of OADen ($OA_B \cdot OA_{Div}$) asymptotically
approaches that of OANum (i.e., $P(B = T) \cdot P(Div = T) \cong P(Num = T)$), meaning that OADiv generates output pulses at the ratio of the data rates of OANum and OAB ($P(Div = T) = P(Num = T)/P(B = T)$). Substituting $P(Num = T) = P(A = T \cap B = T)$, the output pulse rate of OADiv is finalized as follows:

$$P(Div = T) = \frac{P(A = T \cap B = T)}{P(B = T)} = P(A = T \mid B = T) \tag{3}$$

representing the fundamental outcome of the Bayesian inference operation, i.e.,
the probability of the prior event given the occurrence of the posterior event.
Fig. 6. Detailed implementation of the Bayesian inference logic.
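The Rate Equalizer's feedback loop can be modeled behaviorally. The sigmoid code-to-rate mapping is an assumption, and in this model a higher code yields a higher rate (the physical PNCC polarity described in the text is the inverse); the counter/comparator/modulator structure follows the description above:

```python
import math
import random

rng = random.Random(7)

def sig(code, k=0.15):
    # Behavioral code-to-rate model (assumption): sigmoid over the 6-bit
    # range with a 50% rate at the midpoint code 32. Note: higher code =
    # higher rate here, the inverse of the chip's PNCC_DIV polarity.
    return 1.0 / (1.0 + math.exp(-k * (code - 32)))

def rate_equalizer(p_a, p_b, windows=200, window=255):
    """Stochastic division: drive P(Div) toward P(Num)/P(B) = P(A) for
    independent pulse trains. Counters N and D accumulate OANum and
    OADen pulses per 255-cycle window; the code modulator then steps
    PNCC_DIV by one code in the direction that closes the gap."""
    code = 32                         # start at the nominal 50% code (assumption)
    history = []
    for _ in range(windows):
        n = d = 0
        for _ in range(window):
            a = rng.random() < p_a
            b = rng.random() < p_b
            div = rng.random() < sig(code)
            n += a and b              # OANum = OAA AND OAB
            d += b and div            # OADen = OAB AND OADiv
        if n > d:
            code = min(63, code + 1)  # raise P(Div) (model polarity)
        elif n < d:
            code = max(0, code - 1)   # lower P(Div)
        history.append(sig(code))
    return sum(history[-50:]) / 50    # averaged rate over the final windows

p_div = rate_equalizer(p_a=0.3, p_b=0.6)  # expected to settle near P(A) = 0.3
```

Because the code moves one step per 255-cycle window, the loop behaves like a first-order delta modulator: the rate dithers around the quantized target and its window-averaged value approaches the true quotient.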
IV. PERFORMANCE ANALYSIS
Fig. 7(a) shows the layout of the proposed Stochastic bit-based Bayesian inference accelerator
chip, implemented using the TSMC 65 nm GP process technology. The chip operates with
a supply voltage of 1 V and a main clock frequency of 200 MHz. The total power consumption
of the entire chip is 122.5 µW. Fig. 7(b) illustrates the breakdown of both area and power consumption. Notably, the Bayesian
network and inference logic occupy the largest portion of both the silicon area and
the power consumption, largely due to the high count of Stochastic bits. Nevertheless,
the power consumption of the Bayesian network (containing 4 CMOS Stochastic bits)
is comparable to that of the simple digital-logic-based Timing Controller, indicating that the
CMOS stochastic bit used in this work is energy efficient.
Additionally, each Stochastic bit has been intentionally designed with a redundant
silicon area to mitigate parasitic coupling noise with surrounding digital logic.
This design consideration is crucial, as the Stochastic bits generate Poisson-distributed
pulses influenced by multiple noise sources (thermal noise and flicker noise). The
presence of parasitic coupling noise has the potential to significantly degrade the
performance of the stochastic bits. Future work aims to optimize the area efficiency
of the Stochastic bit by eliminating redundant silicon space, while still ensuring
the integrity of stochastic behavior.
Fig. 7. (a) Implemented layout of the Bayesian inference accelerator chip and (b)
area and power breakdowns.
Fig. 8(a) shows the simulated switching events occurring in the 3-layer, 4-variable
Bayesian network over 255 iterations, following the conditional probability diagram shown
in Fig. 1. In Fig. 8(b), the computed probability table is shown alongside the absolute error (|err|) for
each variable, calculated with respect to its target probability. Each Bayesian dice
successfully generates Poisson distributed pulse trains, closely aligning with the
target conditional probabilities across variable states, achieving a worst-case absolute
error (|err|) of only 2.3%.
Fig. 8. (a) Simulation result of the switching events in 3-layer, 4-variable Bayesian
network reflecting the conditional probability-based output pulse generation rate
and (b) the computed probability table with absolute error (|err|) relative to the
target probability.
Fig. 9 shows the simulated result of the Bayesian inference logic across multiple iterations.
In this simulation, Bayesian inference logic is configured to generate the output
pulse at the probability of sprinkler’s operation given that the grass is wet ($P(S
= T | W = T)$), as given in equation (2). In the initial phase, the output pulse generation
rate is about 50%, since the PNCC_DIV code is initially set to zero. Over multiple Bayesian
inference iterations, the switching probability of OADIV (i.e., $P(Div = T)$) asymptotically converges to the target probability of 28.35%.
Consequently, $P(Div = T)$ first crosses the target probability at the fifth iteration,
resulting in a total elapsed time of 12.25 µs.
Fig. 9. Simulated result of the Bayesian inference logic across multiple iterations,
showing an asymptotic adjustment of P(Div=T) to converge toward the target probability
defined in Equation (3).
To evaluate the scalability of the proposed stochastic bit-based Bayesian inference,
we developed a PyTorch-based software emulation framework to simulate a 37-node ALARM
(A Logical Alarm Reduction and Monitoring) Bayesian network [5] with embedded stochastic bit operations. Each variable node was implemented with
the identical structure of the Bayesian dice shown in Fig. 4(a).
As the number of variable nodes and inter-node connections increases, the area overhead
associated with the PNCC registers grows correspondingly. Furthermore, for a Bayesian
dice with N prior event dependencies, $2^N$ PNCC register bundles are required, leading
to a significant increase in area overhead. Meanwhile, since all stochastic bits in
the network utilize identical PNCC values to define the pulse generation rate (e.g.,
PNCC = b’100000 for a 50% probability), maintaining separate PNCC registers for each
Bayesian unit introduces area redundancy. To enhance area efficiency, a single global
PNCC Look-Up Table (LUT) can be shared across the entire network, eliminating the
need for multiple PNCC registers for individual Bayesian dices.
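The shared-LUT idea can be sketched as follows. The inverse-sigmoid quantization, its parameters, and the class and function names are illustrative assumptions, not the paper's implementation:

```python
import math

def inv_sigmoid_code(p, k=0.15):
    """Quantize a target probability to the nearest 6-bit PNCC code under
    a sigmoid model centered at code 32 (model parameters are assumptions)."""
    code = 32 + math.log(p / (1 - p)) / k
    return max(0, min(63, round(code)))

class GlobalPNCCLUT:
    """Hypothetical shared table: each distinct PNCC code is stored once,
    and every Bayesian dice keeps only a small index into the table
    instead of its own 2^N register bundles."""
    def __init__(self):
        self.codes = []

    def index_of(self, p):
        code = inv_sigmoid_code(p)
        if code not in self.codes:
            self.codes.append(code)
        return self.codes.index(code)

lut = GlobalPNCCLUT()
# Probabilities reused across the wet-grass network (values from Fig. 1);
# repeated 0.5 and 0.9 entries would otherwise occupy separate registers.
probs = [0.5, 0.8, 0.9, 0.5, 0.99, 0.9, 0.5]
indices = [lut.index_of(p) for p in probs]
```

Since many dices share the same conditional probabilities, the table holds far fewer entries than the $2^N$ per-dice bundle count, which is the area saving argued above.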
Using the PyTorch-based software emulation framework, the spike probabilities of all
37 variable nodes were accurately generated, with a worst-case absolute error (|err|)
of 4.2%. Furthermore, the Bayesian inference logic can be realized using the same
structure shown in Fig. 6, with the sole modification of extending the input ports of the two pulse (OA) selection
multiplexers from 4 to 37. The PyTorch-emulated Bayesian inference logic achieved
the target probability by the sixth iteration, corresponding to a total elapsed time
of 15 µs. The scaled stochastic bit-based Bayesian network consumed a total power
of 997.8 µW, resulting in an energy consumption of 14.7 nJ for the Bayesian inference
in the ALARM network.
Table 1 compares this work with other state-of-the-art studies. Unlike previous approaches
that rely on arithmetic logic for stochastic computation [5], which may not achieve optimal energy efficiency, this work leverages a stochastic
bit-based operation specifically designed for energy-efficient Bayesian inference.
Consequently, our approach achieved a normalized energy of 1.5 nJ on the 4-variable
wet-grass prediction task, demonstrating a 5.5-fold reduction in normalized energy
consumption compared to [5]. Even when scaled to the 37-variable ALARM network, the normalized energy is 1.59
nJ, a marginal increase over the 4-variable case, corresponding to a 5.18-fold energy
reduction relative to [5]. This indicates that the stochastic bit-based accelerator is feasible for large-scale
Bayesian networks.
Table 1. Comparison between arithmetic logic-based and CMOS stochastic bit-based computing
| | Arithmetic logic [5] | CMOS Stochastic Bit | CMOS Stochastic Bit |
|---|---|---|---|
| Process | 65 nm | 65 nm | 65 nm |
| Supply voltage | 0.5 V | 1 V | 1 V |
| Main clock frequency | 33 MHz | 200 MHz | 200 MHz |
| Stochastic computing unit | Arithmetic logic | CMOS stochastic bit | CMOS stochastic bit |
| Application | ALARM network | Wet grass prediction | ALARM network* |
| Number of variables (N) | 37 | 4 | 37 |
| Execution time | 350 µs | 12.25 µs | 15 µs* |
| Total energy (E) | 76.2 nJ | 1.5 nJ | 14.7 nJ* |
| Normalized energy $((4 \cdot E)/N)$** | 8.24 nJ | 1.5 nJ | 1.59 nJ* |

\* Results from the PyTorch-based software emulation of the 37-node ALARM network (Section IV).
\** Energy normalized by the number of variables, scaled to the 4-variable case.
V. CONCLUSIONS
In this work, we proposed an energy-efficient stochastic bit-based accelerator for
Bayesian inference. The stochastic bit in this design serves dual roles: 1) as a stochastic
computing unit and 2) as a 1-bit memory element, substantially improving the energy
efficiency of Bayesian inference procedures. Implemented using the TSMC 65 nm GP process
technology, the architecture demonstrates high energy efficiency on a 3-layer, 4-variable
Bayesian network model, achieving a total energy consumption of 1.5 nJ.
ACKNOWLEDGEMENT
This research was supported in part by the National Research Foundation of Korea (NRF)
grant through Korea government (MSIT) under Grant No. 2021R1C1C100875214, and in part
by the KIST Institutional Program through Korea government (MSIT) under Grant No.
2E33581. The EDA tool was supported by the IC Design Education Center (IDEC), Korea.
REFERENCES
Gindele T., Brechtel S., Dillmann R., 2010, A probabilistic model for estimating driver
behaviors and vehicle trajectories in traffic environments, Proc. of 13th International
IEEE Conference on Intelligent Transportation Systems, pp. 1625-1631

Abideen Z. U., Ghahoor M., Munir K., Saqib M., Ullah A., Zia T., Tariq S. A., Ahmed
G., Zahra A., 2020, Uncertainty Assisted Robust Tuberculosis Identification With Bayesian
Convolutional Neural Networks, IEEE Access, Vol. 8, pp. 22812-22825

Dorrance R., Dasalukunte D., Wang H., Liu R., Carlton B. R., 2023, An energy-efficient
Bayesian neural network accelerator with CiM and a time-interleaved Hadamard digital
GRNG using 22-nm FinFET, IEEE Journal of Solid-State Circuits, Vol. 58, No. 10, pp.
2826-2838

Atzori L., Iera A., Morabito G., 2010, The Internet of Things: A survey, Computer
Networks, Vol. 54, No. 15, pp. 2787-2805

Khan O. U., Wentzloff D. D., 2016, Hardware accelerator for probabilistic inference
in 65-nm CMOS, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol.
24, No. 3, pp. 837-845

Koo M., Srinivasan G., Shim Y., Roy K., 2020, sBSNN: Stochastic-bits enabled binary
spiking neural network with on-chip learning for energy efficient neuromorphic computing
at the edge, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 67,
No. 8, pp. 2546-2555

Kim H., An Y., Kim M., Heo G.-C., Shim Y., 2025, All stochastic-spiking neural network
(AS-SNN): Noise induced spike pulse generator for input and output neurons with resistive
synaptic array, IEEE Transactions on Circuits and Systems II: Express Briefs, Vol.
72, No. 1, pp. 78-82

Heckerman D., 2008, A Tutorial on Learning with Bayesian Networks, pp. 33-82

Honggu Kim received his B.S. degrees from the School of Electrical and Electronics
Engineering, Chung-Ang University (CAU), Seoul, Korea, in 2022. He is currently working
toward an Integrated M.S and Ph.D. degree course in intelligent semiconductor engineering.
His research interests include neuromorphic hardware and software co-optimization
and analog compute-in-memory architecture.
Yong Shim received his B.S. and M.S. degrees in electronics engineering from Korea
University, in 2004 and 2006, respectively, and a Ph.D. degree from the School of
Electrical and Computer Engineering, Purdue University, West Lafayette, IN, in 2018.
He was a Memory Interface Designer with Samsung Electronics, Hwaseong, from 2006 to
2013. At Samsung, he has worked on the design and development of a memory interface
for synchronous DRAMs (DDR1 and DDR4). He is currently an Assistant Professor with
Chung-Ang University. Prior to joining Chung-Ang University, in 2020, he was an SRAM
Designer with Intel Corporation, Hillsboro, OR, from 2018 to 2020, where he was involved
in designing circuits for super-scaled next generation SRAM cache design. His research
interests include neuromorphic hardware and algorithm, in-memory computing, robust
memory interface design, as well as emerging devices (RRAM, MRAM, and STO) based unconventional
computing models.