
  1. (Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea)
  2. (Memory Business Division, Samsung Electronics, Hwaseong, Republic of Korea)



Keywords: Computing-in-memory, mixed-mode computing, mixed-precision, energy-efficient, AI accelerator

I. INTRODUCTION

Computing-in-memory (CIM) processors have emerged as a promising solution for accelerating deep neural network (DNN) computations, particularly to address the high energy consumption of traditional von Neumann architectures. These conventional architectures face significant challenges due to the extensive data transfers required between memory and processing units, which contribute substantially to power usage. CIM processors [1-6] mitigate this issue by performing computations directly within the memory, greatly reducing data movement. Recent advanced CIM processors, such as [4], have demonstrated impressive energy efficiency by simultaneously activating thousands of word lines for multiply-and-accumulate (MAC) operations, accumulating partial products along bit lines, and subsequently converting the accumulated results via analog-to-digital converters (ADCs), as shown in Fig. 1.

Despite these advancements, prior CIM processors have been limited to a uniform bit-precision across layers or, at best, layer-wise variability. However, previous research [7] indicates that the optimal bit-precision may vary not only between layers but even among elements of the same layer. Mixed-precision processing, which assigns lower bit-widths to the majority of inputs while reserving higher bit-widths for outliers, can significantly reduce the average bit-precision without sacrificing accuracy. A mixed-precision strategy applied to the CIFAR100 dataset with ResNet18 [8] showed that 97% of input features could be processed at 4-bit fixed-point precision, with only 3% requiring 8-bit fixed-point precision, yielding a 48.5% reduction in average bit-width compared to layer-wise variable precision.
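The quoted 48.5% figure can be sanity-checked with a one-line weighted average; the uniform 8-bit baseline is an assumption made here for the comparison.

```python
# Back-of-envelope check of the mixed-precision figures quoted above:
# 97% of input features at 4-bit, 3% at 8-bit. The uniform 8-bit
# baseline is an assumption made here for the comparison.
inlier_ratio, outlier_ratio = 0.97, 0.03
avg_bits = inlier_ratio * 4 + outlier_ratio * 8  # weighted average bit-width
reduction = 1 - avg_bits / 8                     # savings vs. 8-bit baseline

print(f"average bit-width: {avg_bits:.2f} bits")  # 4.12 bits
print(f"reduction: {reduction:.1%}")              # 48.5%
```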

Fig. 2(a) shows the multi-bit input extension architecture, where different input precisions can be processed in parallel through the decoder and ADC stages. Fig. 2(b) illustrates the relationship between input bit-width and energy efficiency, highlighting how reducing bit precision substantially improves TOPS/W. Fig. 2(c) demonstrates a mixed-precision DNN processing scheme, in which the majority of inputs are handled at low precision while only a small fraction of outliers require higher precision, thus reducing the overall bit-width without degrading accuracy.

However, existing CIM architectures based on bit-serial (BS) operation struggle to fully leverage mixed-precision processing due to limitations in throughput and ADC range. Previous CIM processors employ BS operation, processing each bit position of the input feature maps sequentially over multiple cycles. For mixed-precision feature maps, the BS operation divides into two distinct stages: the Dense Accumulation Slice (DAS) stage and the Sparse Accumulation Slice (SAS) stage.
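The bit-serial scheme described above can be sketched functionally: each cycle processes one bit position of the input feature maps, and the per-cycle partial sums are shift-accumulated (a behavioral sketch, not the circuit):

```python
def bit_serial_mac(inputs, weights, n_bits=4):
    """Bit-serial MAC: one input bit position is processed per cycle,
    and the per-cycle partial sums are shift-accumulated.
    `inputs` are unsigned n_bits-wide activations."""
    acc = 0
    for b in range(n_bits):                          # one cycle per bit plane
        bit_plane = [(x >> b) & 1 for x in inputs]   # current input bits
        partial = sum(xb * w for xb, w in zip(bit_plane, weights))
        acc += partial << b                          # weight the plane by 2^b
    return acc

# matches the direct dot product
inputs, weights = [5, 3, 9, 14], [2, -1, 4, 1]
assert bit_serial_mac(inputs, weights) == sum(x * w for x, w in zip(inputs, weights))
```

The key point for what follows is that the cycle count grows with the input bit-width, so a few high-bit outliers dictate the schedule for the whole array.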

The operational sequence begins with the CIM processor handling both low-bit inliers and high-bit outliers during the DAS stage, which accumulates a large volume of inputs. Subsequently, in the SAS stage, the processor processes only the outliers, requiring sparse accumulation of a small number of inputs that are irregularly scattered across the WLs. As a result, the throughput for mixed-precision feature maps is constrained by the high-bit outliers: even if only a single outlier exists among many WLs, previous CIM architectures must still execute the SAS stage. Additionally, the ADC must be configured with high output precision to support the DAS stage, leading to inefficient ADC power usage, since the SAS stage requires significantly lower output precision than the DAS stage. Overcoming these challenges calls for an approach that combines different processing techniques.
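The DAS/SAS split can be illustrated with a small sketch (the 4-bit inlier width follows the example in Section I; the function name is illustrative):

```python
def split_das_sas(activations, inlier_bits=4):
    """Split activations into a dense DAS slice (the low inlier_bits of
    every input) and a sparse SAS slice (the remaining high bits, kept
    only for outliers, together with their WL indices)."""
    mask = (1 << inlier_bits) - 1
    das = [x & mask for x in activations]             # dense: every input
    sas = [(i, x >> inlier_bits)                      # sparse: outliers only
           for i, x in enumerate(activations) if x >> inlier_bits]
    return das, sas

acts = [3, 7, 130, 2, 1, 0, 5, 200]   # two 8-bit outliers among 4-bit inliers
das, sas = split_das_sas(acts)
# the two slices together reconstruct the original activations
recon = das[:]
for i, hi in sas:
    recon[i] += hi << 4
assert recon == acts
```

Note that even this tiny example produces a non-empty SAS slice for just two outliers, which is exactly the case where a bit-serial array is forced into an inefficient extra stage.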

This paper presents an energy-efficient mixed-mode CIM processor with two key features: 1) a mixed-mode mixed-precision CIM (mixed-CIM) architecture, which uses both analog and digital accumulation paths to enhance energy efficiency, and 2) a novel digital-CIM design for in-memory MAC operations that optimizes throughput. By matching each accumulation path to the corresponding stage of mixed-precision processing, energy efficiency is improved by 55.5%, and the digital-CIM design improves throughput by 41.3%. Consequently, the mixed-CIM processor achieves 85.7 TOPS/W on CIFAR100. Section II presents the overall architecture of this work, and Section III details the CIM architecture. Sections IV and V present the measurement results and the conclusion, respectively.

Fig. 1. Conventional analog-based accumulation.


Fig. 2. (a) Multi-bit input processing. (b) Energy efficiency across input precision. (c) Mixed precision processing in DNN.


II. OVERALL ARCHITECTURE

The proposed CIM processor incorporates a mixed-mode CIM macro to facilitate energy-efficient mixed-precision DNN processing. The overall architecture, depicted in Fig. 3, comprises a 33 KB input memory, a 4 KB output memory, four mixed-CIM cores, an aggregation core, a top controller, and a precision encoding unit. These components are interconnected through a 2-D mesh network-on-chip (NoC). The input memory includes 1 KB dedicated to sparse accumulation slice (SAS) memory and 32 KB allocated for dense accumulation slice (DAS) memory. The aggregation core gathers partial sums produced by the mixed-CIM cores and transfers the finalized results to the output memory. The top controller coordinates communication among components and executes activation functions.

Each mixed-CIM core comprises buffers for DAS and SAS, an analog decoder, a digital decoder, a $1152 \times 256$ mixed-CIM cell array, and a peripheral accumulation circuit. DAS bits are processed through the analog-CIM path utilizing the DAS buffer and analog decoder, while SAS bits are decoded and handled by the digital-CIM path with the SAS buffer and digital decoder. Outputs from both analog-CIM and digital-CIM paths are combined in the peripheral accumulation circuit. Fig. 4 shows the proposed analog-digital combined CIM architecture, demonstrating how inlier and outlier bits are respectively directed to the analog-CIM and digital-CIM paths for efficient mixed-precision processing.

Fig. 3. Overall architecture.


Fig. 4. Proposed analog-digital combined CIM.


III. PROPOSED CIM ARCHITECTURE

1. Mixed-CIM Architecture

One of the primary challenges in mixed-precision DNN processing arises from the different accumulation properties of the DAS and SAS stages. Analog CIM enhances energy efficiency by distributing the significant power demand of the ADC across a large number of accumulation operations. For the DAS stage, where numerous accumulations are consistently required, conventional analog-CIM architectures continue to perform efficiently. Conversely, the energy efficiency of the SAS stage is restricted due to its irregular accumulation of fewer inputs, followed by power-intensive ADC operations. Therefore, replacing ADC operations with digital logic for handling sparse accumulations in the SAS stage can significantly improve energy efficiency.

Fig. 5(a) compares the energy efficiency of analog-CIM and digital-CIM under varying levels of input sparsity. The distinction between DAS and SAS is determined based on an input sparsity threshold of 95.8%, where the energy efficiency of analog CIM and digital CIM is balanced.
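A toy model with assumed energy constants (not the paper's measured values) shows why such a crossover sparsity exists: the analog path amortizes a fixed ADC conversion energy over however many inputs are active, while the digital path pays a roughly flat cost per active input.

```python
# Hypothetical energy constants (arbitrary units) -- NOT the paper's
# measured values; they only illustrate the shape of the trade-off.
E_ADC = 10.0   # fixed energy per ADC conversion (analog path)
E_DIG = 0.25   # digital accumulation energy per active input
N_WL = 1152    # word lines per column, as in the mixed-CIM array

def energy_per_accumulation(sparsity):
    """Energy per active accumulation for each path at a given input
    sparsity: the ADC cost is amortized over the active inputs, while
    the digital cost is flat per active input."""
    active = max(N_WL * (1 - sparsity), 1)
    return E_ADC / active, E_DIG   # (analog, digital)

a_dense, d_dense = energy_per_accumulation(0.50)    # dense: analog wins
a_sparse, d_sparse = energy_per_accumulation(0.99)  # sparse: digital wins
assert a_dense < d_dense and a_sparse > d_sparse
```

With measured rather than assumed constants, the crossover lands at the 95.8% threshold the paper reports.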

Fig. 5. (a) Energy-efficiency comparison of analog-CIM and digital-CIM. (b) Separate computation path for inlier & outlier. (c) Shared bitcell.


The mixed-CIM architecture incorporates heterogeneous data paths to manage DAS and SAS stage operations independently. The overall data flow involves computing DAS bits with analog CIM and processing SAS bits with digital CIM. For the dense DAS workload, the architecture utilizes a method similar to previous CIM designs that leverage multiple word-line activations and ADCs. In contrast, the sparse accumulation of SAS bits employs digital CIM, which substitutes the power-hungry ADC with digital peripheral circuits. Digital CIM approaches have proven advantageous for handling sparse and irregular accumulations.

Fig. 5(b) illustrates how the mixed-CIM framework separates the computation path for inlier (DAS) and outlier (SAS) bits, enabling the use of analog-CIM for dense accumulations and digital-CIM for sparse accumulations.

The proposed mixed-CIM offers advantages in two key areas. First, throughput is enhanced without requiring additional memory by separately managing the SAS stage with digital CIM, allowing simultaneous processing of analog CIM and digital CIM within each cell. This approach eliminates the need for duplicating weight data for analog CIM. Second, by using digital peripherals, energy efficiency during sparse accumulation improves, eliminating the need for ADC operations.

Fig. 5(c) shows the shared 8T 1C bit-cell structure, which supports both analog-CIM and digital-CIM operations within a single memory array, thereby minimizing hardware overhead while maximizing parallelism.

2. Detailed Mixed-CIM Architecture

As illustrated in Fig. 6, the mixed-CIM includes $1152 \times 256$ 8T 1C SRAM cells. Each cell comprises a Compute-WL (CWL), a Compute-BL (CBL), a cell capacitor, a bit-line (BL), a complementary bit-line (BLB), and a word-line (WL). The cell capacitors are linked to each 6T SRAM using a 2T transmission gate, with 1.14 fF cell capacitors integrated as custom-layout metal-oxide-metal (MOM) capacitors stacked above each 8T SRAM cell. The analog-CIM and digital-CIM data paths within each cell operate independently.

The analog-CIM data path involves cell capacitors, CWLs, CBLs, and 8-bit ADCs. Each CBL connects to 1152 cells through cell capacitors, which are driven by DAS bit values activating the CWLs. The partial product of each DAS bit and corresponding weight bit is bootstrapped to the CBL, where accumulated voltages form a partial sum. Fig. 7 shows the analog-CIM datapath for dense accumulations along with its operational waveforms. The operation sequence begins with resetting the CBLs to zero using RST signals. Next, DAS bit values trigger CWL drivers inside the analog decoder, followed by 8-bit ADCs converting the accumulated analog voltage to digital. The ADC operation requires nine cycles, during which the CWL drivers must pause until the next DAS bit is processed.
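An idealized behavioral model of one analog-CIM column (charge sharing and quantization only; all circuit non-idealities ignored, and the function name is ours) could look like:

```python
import random

def analog_cbl_column(das_bits, weight_bits, adc_bits=8, n_wl=1152):
    """Idealized one-column model: each active (input bit AND weight bit)
    couples one unit cell capacitor onto the CBL, the accumulated voltage
    is proportional to the pop-count, and the ADC quantizes it."""
    count = sum(x & w for x, w in zip(das_bits, weight_bits))
    v_cbl = count / n_wl                        # normalized CBL voltage
    return round(v_cbl * (2 ** adc_bits - 1))   # 8-bit ADC output code

random.seed(0)
x = [random.randint(0, 1) for _ in range(1152)]
w = [random.randint(0, 1) for _ in range(1152)]
code = analog_cbl_column(x, w)
assert 0 <= code <= 255
```

This also makes the ADC-range limitation from Section I concrete: the 8-bit code must cover up to 1152 simultaneous partial products, a range the sparse SAS stage never needs.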

Fig. 6. Detailed architecture of proposed mixed-CIM.


Fig. 7. Analog-CIM datapath and waveforms.


For the digital-CIM data path, BLs, BLBs, and WLs are utilized within each cell. To achieve high energy efficiency in digital-CIM operations, a hierarchical cell array configuration is employed. Each BL with 1152 cells is segmented into four Local Cell Array Groups (LAGs), which operate in parallel using independent digital peripheral circuits. The outputs from each LAG are accumulated by the accumulation circuit. Each 288-cell LAG comprises 18 Local Cell Arrays (LCAs), each grouping 16 8T 1C cells, a local pre-charger, and a local bit-line (LBL) with its complement (LBLB).

The energy distribution of the mixed-CIM per 1-bit input operation on CIFAR100 (ResNet18) is depicted in Fig. 8. Analog CIM and digital CIM account for 81% and 19% of energy usage, respectively. Replacing analog CIM with digital CIM for SAS bits results in a 76.6% reduction in computation energy for these bits. The parallel processing of analog CIM and digital CIM enables up to a $2\times$ increase in throughput, leading to a 55.5% improvement in energy efficiency compared to analog CIM alone as shown in Fig. 9.

Fig. 8. Breakdown of mixed-CIM.


Fig. 9. Results of proposed mixed-CIM architecture.


3. Digital-CIM for In-Memory MAC

By processing DAS and SAS bits concurrently with analog CIM and digital CIM, the mixed CIM can significantly enhance throughput. The processing duration of analog CIM depends on the number of DAS bits multiplied by the ADC processing time per bit. To maximize throughput, the processing time of digital CIM must be shorter than that of analog CIM. However, conventional digital-CIM architectures [9,10] are unable to fulfill the throughput requirements of the mixed-CIM, mainly due to limited parallelism, as they typically handle only one multiplication or addition per cycle.

This paper introduces a new digital-CIM architecture capable of performing two multiplications and one addition within a single cycle. Fig. 10 illustrates the operational flow of this design, which offers two significant improvements over previous implementations. First, it pre-charges the local bit-lines (LBLs) with the logical AND of two input feature map values, instead of a single input value. Second, the word lines (WLs) are activated with the input feature map values. Together, these advancements enable the digital CIM to accumulate two partial products within a single Local Cell Array (LCA).

Fig. 11 illustrates the digital-CIM datapath for sparse accumulation, and Fig. 12 presents the corresponding waveforms. The digital-CIM operation consists of four main stages. First, the LBL and its complement (LBLB) are pre-charged to the supply voltage (VDD), gated by the logical AND of the two inputs. Next, one WL is driven by the first input feature map value, performing a bit-wise AND between the LBL and the cell data when the feature map value is non-zero; the WL voltage is set to 70% of VDD to ensure correct AND operation. Subsequently, the second WL is driven by the other input feature map value. Finally, the Global Bit-Line (GBL) drivers set GBL and GBLB according to the LBL and LBLB states, and the 2Cell-Adder and full-adder digital circuits complete the MAC operation. The LCA stages are pipelined to avoid throughput loss.
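Functionally, one dual-WL cycle computes two bit-wise products and their sum; a sketch that abstracts away the LBL/LBLB charge-level mechanics (our simplification, not the circuit itself) might be:

```python
def lca_cycle(x1, w1, x2, w2):
    """One dual-WL digital-CIM cycle, modeled functionally: two bit-wise
    products (the two WL activations) and one addition (the 2Cell-Adder),
    returned as (sum bit, carry bit)."""
    p1 = x1 & w1            # first WL activation: partial product 1
    p2 = x2 & w2            # second WL activation: partial product 2
    total = p1 + p2
    return total & 1, total >> 1

# exhaustively matches x1*w1 + x2*w2 for all bit combinations
for n in range(16):
    x1, w1, x2, w2 = n & 1, (n >> 1) & 1, (n >> 2) & 1, (n >> 3) & 1
    s, c = lca_cycle(x1, w1, x2, w2)
    assert s + 2 * c == x1 * w1 + x2 * w2
```

A conventional single-WL digital CIM would need two such cycles plus a separate addition for the same result, which is the parallelism gain the text describes.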

The proposed design belongs to the digital-CIM category, as it performs computations using 8T 1C cells and digital logic-based adders. A key distinction from conventional CIM architectures is that it activates dual WLs to execute two partial products simultaneously in a single operation. This digital-CIM design not only boosts throughput but also minimizes BL pre-charge power, as shown in Fig. 13. By reducing the bit-line capacitance, the hierarchical cell arrays achieve a 94.5% decrease in pre-charge power, and pre-charging gated by the input-based logical AND further lowers the total pre-charge power by up to 99.7%. Because the digital-CIM processing time can be hidden within the analog-CIM cycle time, the overall mixed-CIM throughput increases by 41.3% for CIFAR100 (ResNet18, 5% outliers) compared to conventional digital-CIM designs. Detailed power breakdowns of the proposed digital-CIM are presented in Fig. 14.

Fig. 10. (a) Single AND per cycle. (b) Dual-WL activation. (c) Precharging with logical AND of two input values.


Fig. 11. Digital-CIM datapath.


Fig. 12. Operational waveforms of digital-CIM.


Fig. 13. Results of digital-CIM design.


Fig. 14. Digital-CIM power breakdown.


IV. IMPLEMENTATION RESULTS

The implementation of the proposed mixed-CIM processor was conducted using 28 nm CMOS technology, with the design consuming 27.6 mW and occupying a die area of 1.96 mm$^2$. Fig. 15 illustrates the 8T 1C cell layout, which has a total cell area of 0.912 $\mu$m$^2$. The chip layout, shown in Fig. 16, comprises four mixed-CIM cores, each with a memory capacity of 36 KB.

Fig. 15. 8T 1C Cell layout.


Fig. 16. Layout photograph.


Table 1 presents the performance summary of this chip. The analog-CIM path operates at a clock frequency of 200 MHz for the cell array and 20 MHz for the ADCs, using a 1.0 V supply. The digital-CIM path runs at 200 MHz, with a 1.0 V supply for the cell array and 0.7 V for the decoders and peripheral circuits.

Table 2 shows the DNN benchmark results. The proposed design achieves the highest energy efficiency and throughput while maintaining an accuracy of 77.4% on CIFAR-100 with ResNet18 at TT corner, 1.0 V (Core), $25^\circ$C. The impact of PVT variations was analyzed, and the results showed that the worst case occurred at SS corner, 0.9 V (Core), $-10^\circ$C, while the best case was observed at FF corner, 1.1 V (Core), $25^\circ$C. Under these conditions, the accuracy at 4-bit input activation was measured to be 70.1% and 78.2%, respectively.

Table 3 compares the proposed mixed-CIM with previous SRAM-based CIM architectures. Notably, this design is the first to integrate both analog and digital computation methods within a single CIM processor. Compared to other analog-based CIM designs [6], the mixed-CIM provides superior throughput and energy efficiency, while the digital-CIM approach in [9] falls short in both energy efficiency and scalable input-precision support.

Table 1. Performance summary

Process: 28 nm CMOS
Die Area: 1.4 × 1.4 mm$^2$
Supply Voltage: 1.0 V (Core) / 0.7 V (Peri.)
Macro Size: 36 KB
Clock Frequency: 200 MHz
Power: 27.56 mW

Table 2. DNN benchmarks

Dataset: CIFAR100
Model: ResNet18
Weight Precision: 6-bit
Input Precision: 4-bit / 8-bit
Outlier Ratio: 3% / 5%
Accuracy: 77.1% / 77.4%

Table 3. Comparison table


1) CIFAR100 (ResNet18), outlier ratio 3%, no retraining after quantization

2) Normalized per macro, scaled with bits for input feature map and weight

3) Scaled with 6-bit weight

V. CONCLUSION

This paper presents a mixed-mode computing-in-memory processor designed to support energy-efficient mixed-precision DNN processing. The mixed-CIM processor enhances both energy efficiency and throughput through two primary features: 1) The mixed-precision mixed-mode CIM (mixed-CIM) architecture, which leverages analog CIM for dense inlier accumulation and digital CIM for sparse outlier accumulation, resulting in a 55.5% boost in energy efficiency compared to solely using analog CIM. 2) The digital CIM for in-memory MAC, which delivers a 41.3% increase in throughput and a 99.7% reduction in pre-charge power by employing AND gate pre-charging and WL activation with input data.

These innovations enable the proposed processor to achieve a power consumption of 27.6 mW and a peak throughput of 398.5 GOPS, while maintaining high accuracy levels of 77.4% on CIFAR100 (ResNet18). The mixed-CIM sets a new benchmark with state-of-the-art energy efficiency, achieving 85.7 TOPS/W for CIFAR100.

ACKNOWLEDGMENTS

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) under the Graduate School of Artificial Intelligence Semiconductor (IITP-2024-RS-2023-00256472) grant funded by the Korea government (MSIT).

References

[1] A. Biswas and A. P. Chandrakasan, ``Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN based machine learning applications,'' Proc. of 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 488-490, 2018.
[2] X. Si, Y.-N. Tu, W.-H. Huang, J.-W. Su, P.-J. Lu, and J.-H. Wang, ``15.5 A 28nm 64Kb 6T SRAM computing-in-memory macro with 8b MAC operation for AI edge chips,'' Proc. of 2020 IEEE International Solid-State Circuits Conference (ISSCC), pp. 246-248, 2020.
[3] M. Kang, S. K. Gonugondla, A. Patil, and N. R. Shanbhag, ``A multi-functional in-memory inference processor using a standard 6T SRAM array,'' IEEE Journal of Solid-State Circuits, vol. 53, no. 2, pp. 642-655, Feb. 2018.
[4] H. Jia, M. Ozatay, Y. Tang, H. Valavi, R. Pathak, J. Lee, and N. Verma, ``15.1 A programmable neural-network inference accelerator based on scalable in-memory computing,'' Proc. of 2021 IEEE International Solid-State Circuits Conference (ISSCC), pp. 236-238, 2021.
[5] J. Yue, X. Feng, Y. He, Y. Huang, Y. Wang, and Z. Yuan, ``15.2 A 2.75-to-75.9TOPS/W computing-in-memory NN processor supporting set-associate block-wise zero skipping and ping-pong CIM with simultaneous computation and weight updating,'' Proc. of 2021 IEEE International Solid-State Circuits Conference (ISSCC), pp. 238-240, 2021.
[6] R. Guo, Z. Yue, X. Si, T. Hu, H. Li, and L. Tang, ``15.4 A 5.99-to-691.1TOPS/W tensor-train in-memory-computing processor using bit-level-sparsity-based optimization and variable-precision quantization,'' Proc. of 2021 IEEE International Solid-State Circuits Conference (ISSCC), pp. 242-244, 2021.
[7] E. Park, S. Yoo, and P. Vajda, ``Value-aware quantization for training and inference of neural networks,'' Proc. of the European Conference on Computer Vision (ECCV), pp. 580-595, 2018.
[8] A. Krizhevsky and G. Hinton, ``Learning multiple layers of features from tiny images,'' University of Toronto, Toronto, ON, Canada, Technical Report, vol. 1, no. 4, p. 7, 2009.
[9] J.-H. Kim, J. Lee, J. Lee, J. Heo, and J.-Y. Kim, ``Z-PIM: A sparsity-aware processing-in-memory architecture with fully variable weight bit-precision for energy-efficient deep neural networks,'' IEEE Journal of Solid-State Circuits, vol. 56, no. 4, pp. 1093-1104, Apr. 2021.
[10] J. Lee, J. Kim, W. Jo, S. Kim, S. Kim, and H.-J. Yoo, ``ECIM: Exponent computing in memory for an energy efficient heterogeneous floating-point DNN training processor,'' IEEE Micro, vol. 42, no. 1, pp. 99-107, 2022.
[11] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F.-F. Li, ``ImageNet: A large-scale hierarchical image database,'' Proc. of 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.
Byeongcheol Kim

Byeongcheol Kim received his B.S. degree in electronic engineering from Sogang University, Seoul, South Korea, in 2018. In the same year, he joined Samsung Electronics, Hwaseong, South Korea, where he has been involved with the DRAM Design Team. He is currently pursuing the M.S. degree with the Graduate School of AI Semiconductor, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. His current research interests include energy-efficient processing-in-memory accelerators and deep learning processors.

Wooyoung Jo

Wooyoung Jo received his B.S. and M.S. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2020 and 2022, respectively, where he is currently pursuing a Ph.D. degree. His current research interests include energy-efficient processing-in-memory accelerators, ASIC/SoCs especially focused on computer vision and deep-learning algorithms for efficient processing.

Sangjin Kim

Sangjin Kim received his B.S., M.S., and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2019, 2021, and 2024, respectively. He is currently a Postdoctoral Associate at KAIST. His research interests include computing-in-memory for low-power AI accelerators, processing-in-memory for energy-efficient AI systems, and hardware-software co-optimization for generative AI models.

Soyeon Um

Soyeon Um received her B.S., M.S., and Ph.D. degrees in the School of Electrical Engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2020, 2021, and 2025, respectively. She is currently a Postdoctoral Associate in Electrical Engineering and Computer Science from Massachusetts Institute of Technology (MIT), Cambridge, MA, USA. Her current research interests include low-power deep learning and intelligent vision system-on-chip (SoC) design, energy-efficient processing-memory architecture, and application-specific neuromorphic hardware.

Hoi-Jun Yoo

Hoi-Jun Yoo served as a member of the Executive Committee for the International Solid-State Circuits Conference (ISSCC), the Symposium on Very Large-Scale Integration (VLSI), and the Asian Solid-State Circuits Conference (A-SSCC), the TPC Chair for the A-SSCC 2008 and the International Symposium on Wearable Computer (ISWC) 2010, the IEEE Distinguished Lecturer from 2010 to 2011, the Far East Chair for the ISSCC from 2011 to 2012, the Technology Direction Sub-Committee Chair for the ISSCC in 2013, the TPC Vice-Chair for the ISSCC in 2014, and the TPC Chair for the ISSCC in 2015. He is currently an ICT Chair Professor with the School of Electrical Engineering, KAIST, where he is also the Director of the System Design Innovation and Application Research Center (SDIA), PIM Semiconductor Design Research Center (AI-PIM), and the KAIST Institute of Information Technology Convergence, Daejeon, South Korea. He is the Dean of the Graduate School of AI Semiconductor, KAIST. More details are available at http://ssl.kaist.ac.kr