Byeongcheol Kim¹,²
Wooyoung Jo¹
Sangjin Kim¹
Soyeon Um¹
Hoi-Jun Yoo¹
¹ Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
² Memory Business Division, Samsung Electronics, Hwaseong, Republic of Korea
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index Terms
Computing-in-memory, mixed-mode computing, mixed-precision, energy-efficient, AI accelerator
I. INTRODUCTION
Computing-in-memory (CIM) processors have emerged as a promising solution for accelerating
deep neural network (DNN) computations, particularly to address the high energy consumption
associated with traditional von Neumann architectures. These architectures face
significant challenges from the extensive data transfers required between memory
and processing units, which account for a substantial share of total power. CIM processors
[1-6] mitigate this issue by performing computations directly within the memory array, greatly
reducing data movement. Recent advanced CIM processors, such as [4], have demonstrated impressive energy efficiency by enabling the simultaneous activation
minimizing data movement. Recent advanced CIM processors, such as [4], have demonstrated impressive energy efficiency by enabling the simultaneous activation
of thousands of word lines for multiply-and-accumulate (MAC) operations, accumulating
partial products along bit lines, and subsequently converting the accumulated results
via analog-to-digital converters (ADCs), as shown in Fig. 1.
Despite these advancements, prior CIM processors have been limited to supporting uniform
bit-precision across layers or, at best, layer-wise variability. However, previous research
[7] indicates that the optimal bit-precision varies not only between layers but also among
elements within the same layer. Mixed-precision processing, which
assigns low bit-widths to the majority of inputs and reserves higher bit-widths
for outliers, can significantly reduce the average bit-precision without sacrificing accuracy.
Applying this strategy to ResNet18 on the CIFAR100 dataset [8] showed that 97% of input features can be processed at 4-bit fixed-point precision,
with only 3% requiring 8-bit fixed-point precision, yielding a 48.5% reduction
in average bit-width compared to layer-wise variable precision.
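For concreteness, the following sketch reproduces this arithmetic in Python. The thresholding function is a hypothetical illustration, since the exact outlier criterion is not specified here; the 97%/3% split and the 48.5% figure come from the text.

```python
import numpy as np

def split_inliers_outliers(x, threshold):
    """Split a feature map into 4-bit inliers and 8-bit outliers.
    The magnitude `threshold` is a hypothetical stand-in for the
    actual (unspecified) outlier criterion."""
    outlier = np.abs(x) >= threshold
    return x[~outlier], x[outlier]

# Worked arithmetic from the text: 97% of inputs at 4 bits, 3% at 8 bits.
avg_bits = 0.97 * 4 + 0.03 * 8            # 4.12 bits on average
reduction = 1 - avg_bits / 8.0            # vs. a uniform 8-bit layer
print(f"average bit-width: {avg_bits:.2f} bits")   # 4.12
print(f"reduction: {reduction:.1%}")               # 48.5%
```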
Fig. 2(a) shows the multi-bit input extension architecture, where different input precisions
can be processed in parallel through the decoder and ADC stages. Fig. 2(b) illustrates the relationship between input bit-width and energy efficiency, highlighting
how reducing bit precision substantially improves TOPS/W. Fig. 2(c) demonstrates a mixed-precision DNN processing scheme, in which the majority of inputs
are handled at low precision while only a small fraction of outliers require higher
precision, thus reducing the overall bit-width without degrading accuracy.
However, existing CIM architectures with bit-serial (BS) operations struggle to fully
leverage mixed-precision processing due to limitations in throughput and ADC range.
Previous CIM processors use BS operations, processing each bit position of the
input feature maps sequentially over multiple cycles. For mixed-precision feature maps,
the BS operations divide into two distinct stages: the Dense Accumulation Slice (DAS)
stage and the Sparse Accumulation Slice (SAS) stage.
The operational sequence begins with the CIM processor handling both low-bit inliers
and high-bit outliers during the DAS stage, which involves accumulating a large volume
of inputs. Subsequently, in the SAS stage, the processor focuses solely on outliers,
requiring sparse accumulation for a small number of inputs that are irregularly scattered
across the WLs. As a result, the throughput for mixed-precision feature maps is constrained
by the high-bit outliers: even if only a single outlier exists among many WLs,
previous CIM architectures must still execute the SAS stage. Additionally, the
ADC must be configured with high output precision to support the DAS stage, which
wastes ADC power because the SAS stage requires significantly lower output
precision. Overcoming these challenges calls for a more sophisticated
approach that combines different processing techniques.
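The following sketch models this bit-serial cycle accounting under simplified assumptions (4-bit inliers, 8-bit outliers, one cycle per bit position). It is illustrative only, but it captures why a single outlier forces the full SAS stage.

```python
def bit_serial_cycles(num_outliers, inlier_bits=4, outlier_bits=8,
                      cycles_per_bit=1):
    """Cycle count for a conventional bit-serial CIM over one WL group.

    DAS stage: the low `inlier_bits` bit positions of every input.
    SAS stage: the remaining high bit positions, executed whenever
    ANY outlier exists among the word lines.
    """
    das = inlier_bits * cycles_per_bit
    sas = (outlier_bits - inlier_bits) * cycles_per_bit if num_outliers > 0 else 0
    return das + sas

# Even one outlier among 1152 word lines doubles the cycle count:
print(bit_serial_cycles(num_outliers=0))  # 4 cycles (DAS only)
print(bit_serial_cycles(num_outliers=1))  # 8 cycles (DAS + SAS)
```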
This paper presents an energy-efficient mixed-mode CIM processor that includes two
key features: 1) the mixed-mode mixed-precision CIM (mixed-CIM) architecture, which
uses both analog and digital accumulation paths to enhance energy efficiency, and
2) a novel digital CIM design for in-memory MAC operations that optimizes throughput.
By addressing the specific limitations of mixed-precision processing with these two
accumulation paths, energy efficiency improves by 55.5%; the digital-CIM design improves
throughput by 41.3%. Consequently, the mixed-CIM processor achieves 85.7 TOPS/W on CIFAR100.
Section II presents the overall architecture of this work, and Section III details the
CIM architecture. Sections IV and V present the measurement results and the conclusion, respectively.
Fig. 1. Conventional analog-based accumulation.
Fig. 2. (a) Multi-bit input processing. (b) Energy efficiency across input precision.
(c) Mixed precision processing in DNN.
II. OVERALL ARCHITECTURE
The proposed CIM processor incorporates a mixed-mode CIM macro to facilitate energy-efficient
mixed-precision DNN processing. The overall architecture, depicted in Fig. 3, comprises a 33 KB input memory, a 4 KB output memory, four mixed-CIM cores, an aggregation
core, a top controller, and a precision encoding unit. These components are interconnected
through a 2-D mesh network-on-chip (NoC). The input memory includes 1 KB dedicated
to sparse accumulation slice (SAS) memory and 32 KB allocated for dense accumulation
slice (DAS) memory. The aggregation core gathers partial sums produced by the mixed-CIM
cores and transfers the finalized results to the output memory. The top controller
coordinates communication among components and executes activation functions.
Each mixed-CIM core comprises buffers for DAS and SAS, an analog decoder, a digital
decoder, a $1152 \times 256$ mixed-CIM cell array, and a peripheral accumulation
circuit. DAS bits are processed through the analog-CIM path utilizing the DAS buffer
and analog decoder, while SAS bits are decoded and handled by the digital-CIM path
with the SAS buffer and digital decoder. Outputs from both analog-CIM and digital-CIM
paths are combined in the peripheral accumulation circuit. Fig. 4 shows the proposed analog-digital combined CIM architecture, demonstrating how inlier
and outlier bits are respectively directed to analog-CIM and digital-CIM paths for
efficient mixed-precision processing.
Fig. 3. Overall architecture.
Fig. 4. Proposed analog-digital combined CIM.
III. PROPOSED CIM ARCHITECTURE
1. Mixed-CIM Architecture
One of the primary challenges in mixed-precision DNN processing arises from the different
accumulation properties of the DAS and SAS stages. Analog CIM enhances energy efficiency
by distributing the significant power demand of the ADC across a large number of accumulation
operations. For the DAS stage, where numerous accumulations are consistently required,
conventional analog-CIM architectures continue to perform efficiently. Conversely,
the energy efficiency of the SAS stage is restricted due to its irregular accumulation
of fewer inputs, followed by power-intensive ADC operations. Therefore, replacing
ADC operations with digital logic for handling sparse accumulations in the SAS stage
can significantly improve energy efficiency.
Fig. 5(a) compares the energy efficiency of analog-CIM and digital-CIM under varying levels
of input sparsity. The distinction between DAS and SAS is determined based on an input
sparsity threshold of 95.8%, where the energy efficiency of analog CIM and digital
CIM is balanced.
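The crossover can be illustrated with a simple cost model in which the analog path pays one fixed ADC conversion per column readout while the digital path pays per active input. The energy constants below are hypothetical placeholders chosen only to reproduce the 95.8% threshold; they are not measured values.

```python
def crossover_sparsity(e_adc, e_digital_per_input, n_rows=1152):
    """Input sparsity above which digital accumulation beats analog.

    Analog cost per column readout is dominated by one ADC conversion,
    independent of how many inputs are active; digital cost scales with
    the number of active inputs. Energies are hypothetical placeholders.
    """
    n_active_at_crossover = e_adc / e_digital_per_input
    return 1 - n_active_at_crossover / n_rows

# Placeholder energies chosen to reproduce the paper's 95.8% threshold:
print(f"{crossover_sparsity(e_adc=48.4, e_digital_per_input=1.0):.1%}")
```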
Fig. 5. (a) Energy-efficiency comparison of analog-CIM and digital-CIM. (b) Separate
computation path for inlier & outlier. (c) Shared bitcell.
The mixed-CIM architecture incorporates heterogeneous data paths to manage DAS and
SAS stage operations independently. The overall data flow involves computing DAS bits
with analog CIM and processing SAS bits with digital CIM. For the dense DAS workload,
the architecture utilizes a method similar to previous CIM designs that leverage multiple
word-line activations and ADCs. In contrast, the sparse accumulation of SAS bits employs
digital CIM, which substitutes the power-hungry ADC with digital peripheral circuits.
Digital CIM approaches have proven advantageous for handling sparse and irregular
accumulations.
Fig. 5(b) illustrates how the mixed-CIM framework separates the computation path for inlier
(DAS) and outlier (SAS) bits, enabling the use of analog-CIM for dense accumulations
and digital-CIM for sparse accumulations.
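Behaviorally, this split-and-recombine datapath can be sketched as follows. The function name is hypothetical, and the analog partial sum is modeled as exact rather than ADC-quantized.

```python
import numpy as np

def mixed_cim_mac(x, w, inlier_bits=4, outlier_bits=8):
    """Behavioral model of the mixed-CIM dot product (hypothetical names).

    The low bit positions (DAS) of all inputs go to the analog path; the
    high bit positions (SAS), non-zero only for outliers, go to the digital
    path. Both partial sums are combined with a bit-position shift.
    """
    x = x.astype(np.int64) & ((1 << outlier_bits) - 1)
    das = x & ((1 << inlier_bits) - 1)      # dense low bits -> analog CIM
    sas = x >> inlier_bits                  # sparse high bits -> digital CIM
    analog_psum = np.dot(das, w)            # ADC-quantized in real silicon
    digital_psum = np.dot(sas, w)           # exact digital accumulation
    return analog_psum + (digital_psum << inlier_bits)

x = np.array([3, 7, 2, 200, 1])             # one 8-bit outlier (200)
w = np.array([1, 2, 1, 3, 2])
assert mixed_cim_mac(x, w) == np.dot(x, w)
```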
The proposed mixed-CIM offers advantages in two key areas. First, throughput is enhanced
without requiring additional memory by separately managing the SAS stage with digital
CIM, allowing simultaneous processing of analog CIM and digital CIM within each cell.
This approach eliminates the need to duplicate weight data for analog CIM. Second,
digital peripherals replace the ADC during sparse accumulation, improving
energy efficiency in the SAS stage.
Fig. 5(c) shows the shared 8T 1C bit-cell structure, which supports both analog-CIM and digital-CIM
operations within a single memory array, thereby minimizing hardware overhead while
maximizing parallelism.
2. Detailed Mixed-CIM Architecture
As illustrated in Fig. 6, the mixed-CIM includes $1152 \times 256$ 8T 1C SRAM cells. Each cell comprises
a compute word-line (CWL), a compute bit-line (CBL), a cell capacitor, a bit-line (BL), a complementary bit-line
(BLB), and a word-line (WL). The cell capacitors are linked to each 6T SRAM using
a 2T transmission gate, with 1.14 fF cell capacitors integrated as custom-layout metal-oxide-metal
(MOM) capacitors stacked above each 8T SRAM cell. The analog-CIM and digital-CIM data
paths within each cell operate independently.
The analog-CIM data path involves cell capacitors, CWLs, CBLs, and 8-bit ADCs. Each
CBL connects to 1152 cells through cell capacitors, which are driven by DAS bit values
activating the CWLs. The partial product of each DAS bit and corresponding weight
bit is bootstrapped onto the CBL, where the accumulated voltages form a partial sum. Fig. 7 shows the analog-CIM datapath for dense accumulation, along with its waveforms. The operation sequence
begins with resetting CBLs to zero using RST signals. Next, DAS bit values trigger
CWL drivers inside the analog decoder, followed by 8-bit ADCs converting the accumulated
analog voltage to digital. The ADC operation requires nine cycles, during which CWL
drivers must pause until the next DAS bit is processed.
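An idealized model of this charge-domain accumulation is sketched below, assuming equal, parasitic-free cell capacitors so that the CBL settles to the mean of the bootstrapped contributions; real silicon behavior will deviate from this.

```python
def analog_cbl_voltage(partial_products, vdd=1.0):
    """Ideal charge-sharing voltage on one CBL (no parasitics assumed).

    Each of the N cell capacitors couples VDD or 0 onto the shared CBL
    depending on its 1-bit partial product (DAS bit AND weight bit), so
    the settled voltage is the mean of the contributions.
    """
    n = len(partial_products)
    return vdd * sum(partial_products) / n

def adc_8b(v, vdd=1.0):
    """8-bit ADC: quantize the CBL voltage to a digital code."""
    return min(255, round(v / vdd * 255))

pp = [1] * 300 + [0] * 852                 # 300 of 1152 products are 1
code = adc_8b(analog_cbl_voltage(pp))      # ~ (300/1152) * 255 = 66
print(code)
```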
Fig. 6. Detailed architecture of proposed mixed-CIM.
Fig. 7. Analog-CIM datapath and waveforms.
For the digital-CIM data path, BLs, BLBs, and WLs are utilized within each cell. To
achieve high energy efficiency in digital-CIM operations, a hierarchical cell array
configuration is employed. Each BL with 1152 cells is segmented into four Local Cell
Array Groups (LAGs), which operate in parallel using independent digital peripheral
circuits. The outputs from each LAG are accumulated by the accumulation circuit. Each
288-cell LAG comprises 18 Local Cell Arrays (LCAs), each grouping 16 8T 1C cells,
a local pre-charger, and Local BL (LBL) and LBLB.
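The hierarchy arithmetic can be checked directly, as in the sketch below. Note that the ideal capacitance ratio it prints is more optimistic than the 94.5% pre-charge power reduction reported later, since local peripherals and global lines add overhead.

```python
# Hierarchy arithmetic from the text (per-cell capacitance not modeled).
CELLS_PER_BL = 1152
LAGS_PER_BL = 4
LCAS_PER_LAG = 18
CELLS_PER_LCA = 16

assert LAGS_PER_BL * LCAS_PER_LAG * CELLS_PER_LCA == CELLS_PER_BL

# Pre-charge energy scales with switched capacitance (C * VDD^2), so a
# 16-cell local bit-line switches ~16/1152 of a flat bit-line's capacitance.
local_vs_flat = CELLS_PER_LCA / CELLS_PER_BL
print(f"local/flat switched capacitance: {local_vs_flat:.1%}")  # ~1.4%
```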
The energy distribution of the mixed-CIM per 1-bit input operation on CIFAR100 (ResNet18)
is depicted in Fig. 8. Analog CIM and digital CIM account for 81% and 19% of energy usage, respectively.
Replacing analog CIM with digital CIM for SAS bits results in a 76.6% reduction in
computation energy for these bits. The parallel processing of analog CIM and digital
CIM enables up to a $2\times$ increase in throughput, leading to a 55.5% improvement
in energy efficiency compared to analog CIM alone as shown in Fig. 9.
Fig. 8. Breakdown of mixed-CIM.
Fig. 9. Results of proposed mixed-CIM architecture.
3. Digital-CIM for In-Memory MAC
By processing DAS and SAS bits concurrently with analog CIM and digital CIM, the mixed-CIM
can significantly enhance throughput. The processing duration of analog CIM depends
on the number of DAS bits multiplied by the ADC processing time per bit. To maximize
throughput, the processing time of digital CIM must be shorter than that of analog
CIM. However, conventional digital-CIM architectures [9,10] are unable to fulfill the throughput requirements of the mixed-CIM, mainly due to
limited parallelism, as they typically handle only one multiplication or addition
per cycle.
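As a rough illustration of this timing constraint, the sketch below compares the two path latencies, assuming the nine-cycle ADC conversion noted earlier plus one CWL drive per DAS bit; the actual cycle accounting of the fabricated design may differ.

```python
import math

ADC_CYCLES = 9          # one ADC conversion takes nine cycles (Section III.2)

def analog_latency(n_das_bits):
    """Analog-path cycles: one CWL drive plus one ADC conversion per DAS
    bit (a simplified model; real pipeline overlap may differ)."""
    return n_das_bits * (1 + ADC_CYCLES)

def digital_latency(n_sas_products, products_per_cycle):
    """Digital-path cycles for the sparse SAS partial products."""
    return math.ceil(n_sas_products / products_per_cycle)

# The digital path is hidden only if it finishes within the analog window.
# Dual-WL activation doubles products_per_cycle, halving digital latency:
window = analog_latency(n_das_bits=4)      # 40 cycles in this model
print(digital_latency(60, 1) <= window)    # one product/cycle: not hidden
print(digital_latency(60, 2) <= window)    # dual-WL, two/cycle: hidden
```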
This paper introduces a new digital-CIM architecture capable of performing two multiplications
and one addition within a single cycle. Fig. 10 illustrates the operational flow of this new Digital-CIM design, which has two significant
improvements over previous implementations. First, it leverages the logical AND results
of two input feature maps for pre-charging the local bit-lines (LBLs), instead of
a single input value.
Second, word lines (WLs) are activated with the input feature map values. These advancements
enable the digital CIM to perform the accumulation of two partial products within
a single Local Cell Array (LCA).
Fig. 11 illustrates the digital-CIM datapath for sparse accumulation, and Fig. 12 presents the corresponding waveforms. The digital CIM operation consists of four
main stages. Initially, both the LBL and its LBLB are pre-charged using the logical
AND of the inputs and the supply voltage (VDD). Next, one WL is driven by the first
input feature map value, enabling a bit-wise AND operation between the LBL and the
cell data value if the feature map is non-zero. The WL voltage is set to 70%
of VDD to ensure reliable AND operation. Subsequently, the second WL is driven
with another input feature map value. Finally, the Global Bit-Line (GBL) drivers adjust
GBL and GBLB based on LBL and LBLB statuses. The 2Cell-Adder and Full-Adder digital
circuits complete the MAC operations. Pipelining is used for the LCA stages to prevent
throughput delays.
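Functionally, one dual-WL cycle reduces to two 1-bit multiplications and one addition, as the following behavioral model shows; it deliberately abstracts away the charge-domain pre-charge and WL-voltage details described above.

```python
def lca_dual_wl_step(in1, in2, w1, w2):
    """Behavioral model of one dual-WL digital-CIM cycle (1-bit operands).

    Pre-charging with (in1 AND in2) and the two WL activations together
    realize the two partial products on the LBL/LBLB pair, and the
    2Cell-Adder produces their sum.
    """
    p1 = in1 & w1           # first partial product (WL1 activation)
    p2 = in2 & w2           # second partial product (WL2 activation)
    return p1 + p2          # one addition, all within a single cycle

# Exhaustive check of all 1-bit input/weight combinations:
for bits in range(16):
    a, b, wa, wb = [(bits >> i) & 1 for i in range(4)]
    assert lca_dual_wl_step(a, b, wa, wb) == a * wa + b * wb
```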
The proposed design belongs to the category of Digital-CIM, as it performs computations
using an 8T 1C cell and a digital logic-based adder. A key distinction from conventional
CIM architectures is that it activates dual WLs to execute two partial products simultaneously
in a single operation. This innovative Digital-CIM design not only boosts throughput
but also minimizes BL pre-charge power, as shown in Fig. 13. By reducing the bit-line capacitance, the hierarchical cell arrays achieve a 94.5%
decrease in pre-charge power. Utilizing input-based logical AND for pre-charging further
lowers total pre-charge power by up to 99.7%. The processing time of the Digital-CIM
can be concealed within the cycle time of the Analog-CIM, resulting in an overall
mixed-CIM throughput increase of 41.3% for CIFAR100 (ResNet18, 5% outliers) compared
to conventional Digital-CIM designs. Detailed power breakdowns for the proposed Digital-CIM
are presented in Fig. 14.
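The two quoted reductions compose as follows; the 5% pre-charge activity used here is a hypothetical figure, chosen to match the sparse SAS workload, that reproduces the 99.7% total.

```python
# Worked arithmetic behind the quoted pre-charge power reductions.
hier_reduction = 0.945                    # hierarchical cell arrays (quoted)
residual_after_hier = 1 - hier_reduction  # 5.5% of flat-BL pre-charge power

# AND-gated pre-charge only fires when an input pair is active; with a
# hypothetical ~5% activity (matching the sparse SAS workload) the residual
# drops to roughly the quoted 99.7% total reduction.
activity = 0.05
residual_total = residual_after_hier * activity
print(f"total pre-charge power reduction: {1 - residual_total:.1%}")  # ~99.7%
```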
Fig. 10. (a) Single AND per cycle. (b) Dual-WL activation. (c) Precharging with logical
AND of two input values.
Fig. 11. Digital-CIM datapath.
Fig. 12. Operational waveforms of digital-CIM.
Fig. 13. Results of digital-CIM design.
Fig. 14. Digital-CIM power breakdown.
IV. IMPLEMENTATION RESULTS
The implementation of the proposed mixed-CIM processor was conducted using 28 nm CMOS
technology, with the design consuming 27.6 mW and occupying a die area of 1.96 mm$^2$.
Fig. 15 illustrates the 8T 1C cell layout, which has a total cell area of 0.912 $\mu$m$^2$.
The chip layout, shown in Fig. 16, comprises four mixed-CIM cores, each with a memory capacity of 36 KB.
Fig. 15. 8T 1C Cell layout.
Fig. 16. Layout photograph.
Table 1 presents the performance summary of this chip. The analog-CIM path operates at a
clock frequency of 200 MHz for the cell array and 20 MHz for the ADCs, using a supply
voltage of 1.0 V. In contrast, the digital-CIM path runs at 200 MHz with a supply
voltage of 1.0 V for the cell array and 0.7 V for decoders and peripheral circuits.
Table 2 shows the DNN benchmark results. The proposed design achieves the highest energy
efficiency and throughput while maintaining an accuracy of 77.4% on CIFAR-100 with
ResNet18 at TT corner, 1.0 V (Core), $25^\circ$C. The impact of PVT variations was
analyzed, and the results showed that the worst case occurred at SS corner, 0.9 V
(Core), $-10^\circ$C, while the best case was observed at FF corner, 1.1 V (Core),
$25^\circ$C. Under these conditions, the accuracy at 4-bit input activation was measured
to be 70.1% and 78.2%, respectively.
Table 3 compares the proposed mixed-CIM with previous SRAM-based CIM architectures. Notably,
this design is the first to integrate both analog and digital computation methods
within a CIM processor. Compared to analog-based CIM designs such as [6], the mixed-CIM provides superior throughput and energy efficiency, while the digital-CIM
approach of [9] falls short in both energy efficiency and support for scalable input
precision.
Table 1. Performance summary
Process | 28 nm CMOS
Die Area | 1.4 × 1.4 mm$^2$
Supply Voltage | 1.0 V (Core) / 0.7 V (Peri.)
Macro Size | 36 KB
Clock Frequency | 200 MHz
Power | 27.56 mW
Table 2. DNN benchmarks
Dataset | CIFAR100
Model | ResNet18
Weight Precision | 6-bit
Input Precision | 4-bit / 8-bit
Outlier Ratio | 3% / 5%
Accuracy | 77.1% / 77.4%
Table 3. Comparison table
1) CIFAR100 (ResNet18), outlier ratio 3%, no retraining after quantization.
2) Normalized per macro, scaled with bits for input feature map and weight.
3) Scaled with 6-bit weight.
V. CONCLUSION
This paper presents a mixed-mode computing-in-memory processor designed to support
energy-efficient mixed-precision DNN processing. The mixed-CIM processor enhances
both energy efficiency and throughput through two primary features: 1) The mixed-precision
mixed-mode CIM (mixed-CIM) architecture, which leverages analog CIM for dense inlier
accumulation and digital CIM for sparse outlier accumulation, resulting in a 55.5%
boost in energy efficiency compared to solely using analog CIM. 2) The digital CIM
for in-memory MAC, which delivers a 41.3% increase in throughput and a 99.7% reduction
in pre-charge power by employing AND gate pre-charging and WL activation with input
data.
These innovations enable the proposed processor to achieve a power consumption of
27.6 mW and a peak throughput of 398.5 GOPS, while maintaining high accuracy levels
of 77.4% on CIFAR100 (ResNet18). The mixed-CIM sets a new benchmark with state-of-the-art
energy efficiency, achieving 85.7 TOPS/W for CIFAR100.
ACKNOWLEDGMENTS
This work was supported by the Institute of Information & Communications Technology
Planning & Evaluation (IITP) under the Graduate School of Artificial Intelligence
Semiconductor (IITP-2024-RS-2023-00256472) grant funded by the Korea government (MSIT).
References
[1] A. Biswas and A. P. Chandrakasan, ``Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN based machine learning applications,'' Proc. of 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 488-490, 2018.

[2] X. Si, Y.-N. Tu, W.-H. Huang, J.-W. Su, P.-J. Lu, and J.-H. Wang, ``15.5 A 28nm 64Kb 6T SRAM computing-in-memory macro with 8b MAC operation for AI edge chips,'' Proc. of 2020 IEEE International Solid-State Circuits Conference (ISSCC), pp. 246-248, 2020.

[3] M. Kang, S. K. Gonugondla, A. Patil, and N. R. Shanbhag, ``A multi-functional in-memory inference processor using a standard 6T SRAM array,'' IEEE Journal of Solid-State Circuits, vol. 53, no. 2, pp. 642-655, Feb. 2018.

[4] H. Jia, M. Ozatay, Y. Tang, H. Valavi, R. Pathak, J. Lee, and N. Verma, ``15.1 A programmable neural-network inference accelerator based on scalable in-memory computing,'' Proc. of 2021 IEEE International Solid-State Circuits Conference (ISSCC), pp. 236-238, 2021.

[5] J. Yue, X. Feng, Y. He, Y. Huang, Y. Wang, and Z. Yuan, ``15.2 A 2.75-to-75.9TOPS/W computing-in-memory NN processor supporting set-associate block-wise zero skipping and ping-pong CIM with simultaneous computation and weight updating,'' Proc. of 2021 IEEE International Solid-State Circuits Conference (ISSCC), pp. 238-240, 2021.

[6] R. Guo, Z. Yue, X. Si, T. Hu, H. Li, and L. Tang, ``15.4 A 5.99-to-691.1TOPS/W tensor-train in-memory-computing processor using bit-level-sparsity-based optimization and variable-precision quantization,'' Proc. of 2021 IEEE International Solid-State Circuits Conference (ISSCC), pp. 242-244, 2021.

[7] E. Park, S. Yoo, and P. Vajda, ``Value-aware quantization for training and inference of neural networks,'' Proc. of the European Conference on Computer Vision (ECCV), pp. 580-595, 2018.

[8] A. Krizhevsky and G. Hinton, ``Learning multiple layers of features from tiny images,'' University of Toronto, Toronto, ON, USA, Technical Report, vol. 1, no. 4, p. 7, 2009.

[9] J.-H. Kim, J. Lee, J. Lee, J. Heo, and J.-Y. Kim, ``Z-PIM: A sparsity-aware processing-in-memory architecture with fully variable weight bit-precision for energy-efficient deep neural networks,'' IEEE Journal of Solid-State Circuits, vol. 56, no. 4, pp. 1093-1104, Apr. 2021.

[10] J. Lee, J. Kim, W. Jo, S. Kim, S. Kim, and H.-J. Yoo, ``ECIM: Exponent computing in memory for an energy efficient heterogeneous floating-point DNN training processor,'' IEEE Micro, vol. 42, no. 1, pp. 99-107, 2022.

[11] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F.-F. Li, ``ImageNet: A large-scale hierarchical image database,'' Proc. of 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.

Byeongcheol Kim received his B.S. degree in electronic engineering from Sogang
University, Seoul, South Korea, in 2018. In the same year, he joined Samsung Electronics,
Hwaseong, South Korea, where he has been involved with the DRAM Design Team. He is
currently pursuing the M.S. degree with the Graduate School of AI Semiconductor, Korea
Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. His current
research interests include energy-efficient processing-in-memory accelerators and
deep learning processors.
Wooyoung Jo received his B.S. and M.S. degrees in electrical engineering from the
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea,
in 2020 and 2022, respectively, where he is currently pursuing a Ph.D. degree. His
current research interests include energy-efficient processing-in-memory accelerators,
and ASICs/SoCs, especially those focused on computer vision and deep-learning
algorithms for efficient processing.
Sangjin Kim received his B.S., M.S., and Ph.D. degrees in electrical engineering
from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South
Korea, in 2019, 2021, and 2024, respectively. He is currently a Postdoctoral Associate
at KAIST. His research interests include computing-in-memory for low-power AI accelerators,
processing-in-memory for energy-efficient AI systems, and hardware-software co-optimization
for generative AI models.
Soyeon Um received her B.S., M.S., and Ph.D. degrees from the School of Electrical
Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon,
South Korea, in 2020, 2021, and 2025, respectively. She is currently a Postdoctoral
Associate in Electrical Engineering and Computer Science at the Massachusetts Institute
of Technology (MIT), Cambridge, MA, USA. Her current research interests include low-power
deep learning and intelligent vision system-on-chip (SoC) design, energy-efficient
processing-memory architecture, and application-specific neuromorphic hardware.
Hoi-Jun Yoo served as a member of the Executive Committee for the International
Solid-State Circuits Conference (ISSCC), the Symposium on Very Large-Scale Integration
(VLSI), and the Asian Solid-State Circuits Conference (A-SSCC), the TPC Chair for
the A-SSCC 2008 and the International Symposium on Wearable Computer (ISWC) 2010,
the IEEE Distinguished Lecturer from 2010 to 2011, the Far East Chair for the ISSCC
from 2011 to 2012, the Technology Direction Sub-Committee Chair for the ISSCC in 2013,
the TPC Vice-Chair for the ISSCC in 2014, and the TPC Chair for the ISSCC in 2015.
He is currently an ICT Chair Professor with the School of Electrical Engineering,
KAIST, where he is also the Director of the System Design Innovation and Application
Research Center (SDIA), PIM Semiconductor Design Research Center (AI-PIM), and the
KAIST Institute of Information Technology Convergence, Daejeon, South Korea. He is
the Dean of the Graduate School of AI Semiconductor, KAIST. More details are available
at http://ssl.kaist.ac.kr