Sangmyoung Lee1
Seryeong Kim1
Soyeon Kim1
Soyeon Um1
Sangjin Kim1
Sangyeob Kim1
Wooyoung Jo1
Hoi-Jun Yoo1
(Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index Terms
1T1C eDRAM, computing-in-memory, network-on-chip, spiking neural network, system-level efficiency
I. INTRODUCTION
Recently, Computing-In-Memory (CIM) has been actively researched to achieve high energy
efficiency in processing Deep Neural Networks (DNNs). Traditional CIM processors [1,2] employed multi-word-line (WL) driving techniques to improve energy and area efficiency.
CIM mitigates data movement bottlenecks by enabling in-memory data processing, offering
significant advantages over conventional digital computing architectures: reduced
data movement, support for parallel processing, and improved energy and area efficiency
through the integration of memory and processors. These characteristics enable high-speed
and energy-efficient large-scale deep learning computation. However, while Analog-to-Digital
Converters (ADCs) play a critical role in CIM architectures, they also present significant
challenges, including high power consumption, limited area efficiency, and their impact
on computational accuracy. In particular, ADCs account for approximately 59% of total
power consumption [1] and 20% of processor area [2], creating a major obstacle to achieving higher energy and area efficiency.
To address these limitations, Spiking Neural Network (SNN)-based CIM [3,4] has been proposed as an effective alternative. By eliminating the power and area
overhead caused by ADCs, SNN-CIM offers a promising solution. SNN-CIM achieves macro-level
energy efficiency through two architectural features: spike-based accumulation and
1-bit thresholding. First, in conventional CIM architectures, multiply-and-accumulate
(MAC) operations are central to data processing, requiring significant
computational resources and energy. In contrast, SNN-CIM replaces these complex operations
with simple accumulation tasks. This is made possible by the inherent characteristics
of Spiking Neural Networks, where signals are represented as spikes, simplifying the
computational process. As shown in Fig. 2, during the integration phase, input signals are accumulated over time, enabling
data processing without the need for multiplication operations. This effectively eliminates
the computational cost associated with multiplication, significantly reducing overall
power consumption. Second, during the firing phase, SNN-CIM replaces the complex ADCs
typically used to determine whether input signals meet a neuron's firing threshold
with a simple 1-bit comparator. In traditional CIM architectures, ADCs convert multi-bit
signals to digital form for further processing, which involves considerable complexity.
By introducing the firing phase, SNN-CIM avoids using ADCs entirely, relying instead
on 1-bit comparators to determine firing conditions. These two features enable SNN-CIM
to overcome the inefficiencies of traditional CIM architectures, achieving substantial
improvements in energy and area efficiency at the macro level.
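To make these two phases concrete, the following minimal NumPy sketch models one SNN-CIM time step in software. The array shapes, variable names, and soft-reset behavior are illustrative assumptions, not the processor's exact datapath.

```python
import numpy as np

def snn_cim_step(weights, spikes, v_mem, v_th):
    """One software-modeled SNN-CIM time step: integration, then firing."""
    # Integration phase: binary input spikes reduce MAC to accumulation;
    # weight rows whose spike is 1 are simply summed (no multipliers).
    v_mem = v_mem + weights[spikes == 1].sum(axis=0)

    # Firing phase: a 1-bit threshold comparison replaces the multi-bit ADC.
    out_spikes = (v_mem >= v_th).astype(np.int8)
    v_mem = np.where(out_spikes == 1, v_mem - v_th, v_mem)  # soft reset (assumed)
    return out_spikes, v_mem
```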
However, one drawback of SNN-CIM is the large input activation memory requirement,
as shown in Fig. 3. This stems from the representation method of SNNs, in which n-bit data is expanded
into a $2^n$-length spike train composed of binary spikes (0 or 1). Despite its macro-level efficiency,
frequent external memory accesses (EMA) for input activation degrade system-level
energy efficiency. To mitigate the EMA issue, layer fusion has been introduced [5], allowing multiple layers to be processed by loading their weights simultaneously.
Consequently, a high-density SNN-CIM is necessary to accommodate the extensive memory
footprint required for such layer fusion. Existing SNN-CIM designs [3] face limitations due to a large peripheral logic area ($>69.5$%) and reliance on
8T SRAM cells, highlighting the need for more area-efficient peripheral logic and
denser memory technologies like eDRAM.
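Assuming the $2^n$-length rate-coded reading above, a quick back-of-the-envelope sketch shows why the activation footprint, and hence EMA, balloons; the function below is illustrative only.

```python
def spike_train_footprint(n_bits, n_activations):
    """Compare binary vs. rate-coded spike-train activation storage."""
    binary_bits = n_activations * n_bits          # packed n-bit values
    spike_bits = n_activations * (2 ** n_bits)    # 2**n one-bit spikes each
    return binary_bits, spike_bits, spike_bits / binary_bits

# e.g. 6-bit activations: 2**6 / 6 is roughly a 10.7x larger footprint
print(spike_train_footprint(6, 1_000_000))
```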
Additionally, the rigid architectures of prior SNN-CIMs [3,4] are not adaptable to varying layer configurations, such as input channel sizes and
kernel dimensions. Optimizing hardware utilization and throughput requires a flexible
ratio of memory to peripheral logic that adapts to specific layer requirements. Previous
fixed architectures result in suboptimal hardware utilization and diminished throughput.
A reconfigurable architecture is therefore essential to dynamically adjust the memory
and peripheral logic balance, ensuring optimal configurations for each layer and enhancing
system-level energy efficiency.
This paper presents a high-density and reconfigurable SNN eDRAM-based CIM processor
that introduces two major innovations to boost system-level energy efficiency: 1)
A high-density Reconfigurable Neuro-Cell Array (ReNCA) design using a 1T1C eDRAM cell
and area-efficient SNN peripheral logic, achieving a 41% reduction in area and 90%
reduction in power compared to prior works. 2) A reconfigurable CIM architecture leveraging
ReNCA and Dynamic Adjustable Neuron Link (DAN Link) to adapt memory and peripheral
configurations to diverse layer requirements, thereby maximizing energy efficiency.
The proposed processor is implemented using a 28nm CMOS process and demonstrates outstanding
performance: achieving 157.15 TOPS/W excluding EMA and 0.03 TOPS/W including EMA on
ResNet-18 with CIFAR-10, which is $10\times$ higher than [3], and 86.63 TOPS/W excluding EMA and 0.53 TOPS/W including EMA on ResNet-18 with ImageNet.
Fig. 1. Advantages and bottleneck of analog-CIM.
Fig. 2. Operation of SNN-CIM.
Fig. 3. Limitation of previous SNN-CIM.
II. OVERALL ARCHITECTURE
As shown in Fig. 4, the proposed SNN eDRAM-CIM processor incorporates a high-density, reconfigurable
1T1C eDRAM CIM bank as its core component. The overall architecture includes 37 KB
of input memory, 37 KB of output memory, four 1T1C eDRAM CIM banks, a top controller,
and a spike encoder that converts input data into spike trains. These components are
interconnected via a custom 2D mesh Network-on-Chip (NoC).
Each 1T1C eDRAM CIM bank comprises a $76\times16$ array of reconfigurable neuro-cell
arrays (ReNCAs), a dynamic adjustable neuron link (DAN Link), and a bank controller. The ReNCA
includes an accumulation path designed to combine weights across ReNCAs. Each ReNCA
is composed of a $64\times 64$ array of 1T1C cells, supported by 64 charge pumps (CPs)
[9] and 64 firing sense amplifiers (firing SAs). This structure can function in either
memory mode (mem-mode) or peripheral mode (peri-mode), with the bank controller dynamically
configuring the mode based on the specific layer requirements.
The DAN Link facilitates layer fusion by connecting multiple ReNCAs within the bank.
In the $64\times64$ 1T1C cell array, cells associated with even-numbered word lines
(WLs) are connected to the bit line (BL), while cells on odd-numbered WLs are linked
to the bit line bar (BLB). This separation prevents overlap between the WLs associated
with BL and BLB connections.
Each 1T1C bit cell occupies an area of 0.276 $\mu$m$^2$ and has a capacitance of 0.87
fF. The capacitor of the cell is constructed using both MOS and MOM capacitors, ensuring
a compact and efficient design.
Fig. 4. Overall architecture.
III. RECONFIGURABLE NEURO-CELL ARRAY
Although SNN-CIM demonstrates superior macro-level energy efficiency over earlier
CNN CIM designs [6-8], its system-level energy efficiency is limited due to the substantial power consumption
incurred by intermediate activations. When input data is represented as a spike train,
the power consumption for intermediate EMA increases by $3.88\times$ (ResNet-50, ImageNet,
6-bit inputs, 8-bit weights, Micron 16 Gb DDR4 SDRAM $\times16$ at 600 MHz, based on the architecture
in [3]). The state-of-the-art SNN-CIM [3], which uses 8T SRAM cells and extensive peripheral logic, suffers from insufficient
memory density, making efficient layer fusion impractical and limiting system-level
energy efficiency to just 0.003 TOPS/W, including EMA power.
The proposed reconfigurable neuro-cell array (ReNCA) overcomes these limitations through
three key innovations, as shown in Fig. 5. First, vertical and horizontal charge-sharing accumulation eliminates the requirement
for additional integrated capacitors by leveraging the parasitic capacitance of bit
lines (BL) and data bus lines (DB). Second, a charge pump (CP) with only two MOSFETs
[9] is employed for membrane potential accumulation, reusing the 1T1C cell capacitors
as the secondary capacitor (C2) in the CP structure.
Third, the eDRAM-designed sense amplifier (SA) is reused as firing logic, functioning
in both memory (mem-mode) and peripheral (peri-mode) modes through switch control.
By effectively repurposing the 1T1C cell array to serve as both eDRAM bit cells and
peripheral logic, the ReNCA architecture enhances system-level energy efficiency.
The compact and high-density structure of the design allows for the integration of
multiple ReNCAs, facilitating parallel operations within and across banks.
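The charge-sharing accumulation follows the standard charge-conservation relation $V = \sum_i C_i V_i / \sum_i C_i$: shorting bit lines together sums their charge without any dedicated adder. The sketch below illustrates this; the 0.87 fF value reuses the cell-capacitance figure purely as an example.

```python
def charge_share(voltages, caps):
    """Final voltage after connecting capacitors: sum(C*V) / sum(C)."""
    total_charge = sum(c * v for c, v in zip(caps, voltages))
    return total_charge / sum(caps)

# Two bit lines with equal parasitic capacitance settle to the mean,
# i.e. a summation scaled by a known constant factor.
print(charge_share([0.6, 0.2], [0.87e-15, 0.87e-15]))  # -> 0.4 (V)
```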
Fig. 5. Features of reconfigurable 1T1C neuro-cell array architecture.
1. 1T1C ReNCA Architecture
Fig. 6 illustrates the detailed architecture of the ReNCA. Switches are included to support
both membrane accumulation and multi-bit operations on BL pairs. Each BL pair in the
ReNCA is equipped with a charge pump (CP), which consists of two MOSFETs, four input
switches, and two output switches.
Membrane integration between arrays is accomplished solely through the CP within the
ReNCA, without requiring additional peripheral logic. This approach achieves a 17%
reduction in area and a 90% reduction in power consumption compared to folding circuits
[3] used for inter-cell array membrane integration. As depicted in Fig. 6, the charge pump (CP) input can be linked to the BL and BLB of the Nth ReNCA via
switches (1, 2), while switches (3, 4) allow it to connect to the BL and BLB of the
($N+1$)th ReNCA. The CP output can then connect to the BL and BLB of the ($N+1$)th
ReNCA via switches (5, 6). The firing SA operates either as a sense amplifier in memory
mode or as a comparator in peripheral mode, depending on the selected mode. This integrated
design reduces the area by 41% compared to architectures that rely on individually
implemented cell arrays, CPs, SAs, and comparators.
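For reference, the switch connectivity just described can be summarized as a small lookup table; the dictionary below is a purely illustrative restatement of Fig. 6, not a netlist.

```python
# Switch numbers follow the description of Fig. 6 (names illustrative).
CP_SWITCHES = {
    1: ("CP input",  "BL of ReNCA N"),
    2: ("CP input",  "BLB of ReNCA N"),
    3: ("CP input",  "BL of ReNCA N+1"),
    4: ("CP input",  "BLB of ReNCA N+1"),
    5: ("CP output", "BL of ReNCA N+1"),
    6: ("CP output", "BLB of ReNCA N+1"),
}
```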
Fig. 6. Comprehensive architecture of the reconfigurable neuro-cell array (ReNCA).
2. 1-bit SNN Computation with ReNCA
The computation process for a 1-bit-weight SNN using ReNCA, illustrated in Fig. 7, is divided into two main phases: integration (❶–❸) and firing (❹).
The steps involved are as follows: 1) Spatial Summation (❶): Weights are summed along
the spatial dimension. 2) Temporal Accumulation (❷): The output from spatial summation
is added to the membrane potential. 3) Input Channel Summation (❸): Weights from input
channels are accumulated, and the result is further added to the membrane potential
using temporal accumulation.
This sequence (❶ → ❷ → ❸ → ❷) skips zero-valued
spikes and continues until all input spikes for the current time step are processed.
Finally, at the firing phase (❹), the output spike is generated.
The detailed operation of the CIM processor is described as follows. In step 1, word
lines (WLs) corresponding to active spikes in ReNCA 0 and ReNCA 1 are enabled. While
the WLs are driven, the firing SA writes back the data. Once the WLs are disabled,
spatial summation is executed between the two ReNCAs through charge sharing, utilizing
BL parasitic capacitors. Parallel calculations are performed in the output channel
direction for each ReNCA.
The BL voltage is then amplified by an amplifier and sent to the charge pump (CP)
input. The CP utilizes its C2 capacitor, which reuses the BL parasitic capacitance
and cell capacitors within the ReNCA, to perform temporal accumulation. As depicted
in Fig. 5(b), the simulated output voltage of the charge pump exhibited high linearity when four
cell capacitors were utilized for temporal accumulation. Next, each ReNCA input driver
skips WLs that correspond to zero-valued (inactive) spikes and activates only the
WLs linked to active spikes.
After all the weights from the input spikes of the current time step are aggregated,
the firing phase (❹) begins. In this phase, the membrane potential on each BL of ReNCA
2 is connected to the firing SA through switches, and output spikes are generated
by comparing the BL voltage to a predefined threshold.
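The overall flow over a full spike train, including zero-spike skipping and the final comparison, can be summarized in the short software model below. Shapes and names are assumptions for illustration; in hardware, the sums are performed by charge sharing and the charge pump rather than arithmetic units.

```python
import numpy as np

def run_spike_train(weights, spike_train, v_th):
    """Software model of the ❶-❹ flow over a whole spike train.

    weights: (n_in, n_out) 1-bit weights held in the ReNCAs.
    spike_train: (T, n_in) binary input spikes; names are assumptions.
    """
    v_mem = np.zeros(weights.shape[1])
    out_train = np.zeros((spike_train.shape[0], weights.shape[1]), dtype=np.int8)
    for t, spikes in enumerate(spike_train):
        # ❶/❸ Summation over active WLs only; zero spikes are skipped,
        # ❷ and the result is accumulated onto the membrane potential.
        v_mem += weights[spikes == 1].sum(axis=0)
        # ❹ Firing begins once all inputs of time step t are processed.
        out_train[t] = v_mem >= v_th
        v_mem = np.where(out_train[t] == 1, v_mem - v_th, v_mem)  # assumed reset
    return out_train
```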
Fig. 7. 1-bit weight SNN computation flow.
IV. RECONFIGURABLE CIM ARCHITECTURE
Traditional CIM processors utilize a fixed ratio of memory to peripheral logic. To
maximize hardware utilization across various layers, a flexible hardware configuration
is essential. This paper presents a reconfigurable CIM architecture featuring a dual-mode
Reconfigurable Neuro-Cell Array (ReNCA) and a Dynamic Adjustable Neuron Link (DAN
Link), as illustrated in Fig. 8, for interconnecting multiple ReNCAs.
The ReNCA supports dual-mode operation. In memory mode (mem-mode), the ReNCA serves
as a storage array for SNN weights, where the firing SA replaces the conventional
sense amplifier. Conversely, in peripheral mode (peri-mode), the ReNCA serves as a
peripheral circuit by reusing the 1T1C cell array. In this mode, the charge pump is
used for membrane potential accumulation and supports multi-bit operations using the
1T1C cell capacitor. Additionally, the firing SA is repurposed as a comparator.
The DAN Link, designed as a 2D custom mesh-type NoC, interconnects ReNCAs and enables
the bank controller to configure each ReNCA mode according to per-layer memory and
peripheral logic requirements. This NoC structure facilitates layer fusion by establishing
dynamic computational pathways within the bank.
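A hypothetical sketch of this per-layer configuration step is shown below: ReNCAs needed to hold the current (fused) layers' weights stay in mem-mode, and the bank controller switches the rest to peri-mode over the DAN Link. The function and counts are illustrative assumptions.

```python
def configure_bank(n_rencas, n_weight_rencas):
    """Assign mem-mode to weight-holding ReNCAs, peri-mode to the rest."""
    return ["mem" if i < n_weight_rencas else "peri" for i in range(n_rencas)]

# A 76x16 bank holds 1216 ReNCAs; a layer needing 900 of them for
# weights leaves 316 free to act as charge pumps and firing SAs.
modes = configure_bank(76 * 16, 900)
print(modes.count("peri"))  # -> 316
```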
Previous fixed CIM architectures suffer from a rigid memory-to-peripheral hardware
ratio, leading to underutilization and reduced operation density. In contrast, the
proposed reconfigurable CIM architecture dynamically adjusts this ratio,
resulting in higher operation and memory densities. Specifically, it improves operation
and memory density by up to 68.2% compared to conventional fixed CIM architectures.
Furthermore, the proposed ReNCA and DAN Link-based CIM design achieves a $2.82{\times}$
improvement in density FoM compared to fixed CIM architectures. Ultimately, this reconfigurable
architecture enhances system efficiency by a factor of $10\times$ for ResNet-18 on
CIFAR-10 compared to the previous SNN-CIM [3].
Fig. 8. Reconfigurable CIM with dual-mode ReNCA.
V. MULTI-BIT EXTENSION IN RENCA
The charge pump (CP) in ReNCA is designed to be expandable, allowing it to handle
multi-bit SNN operations. The CP processes 4-bit weights by converting them into an
analog voltage. The operation begins by resetting the voltages of all cells in the
array. Multi-bit functionality is realized by generating an analog output voltage
(Vout) that reflects the significance of each bit position. The output voltage (Vout)
is scaled by adjusting the value of the C2 capacitor in the CP, with a scaling factor
of one-half per bit position. The capacitance of C2 is determined by the number of
unit capacitors connected to the bit line (BL) or bit line bar (BLB).
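As a worked example of the one-half-per-bit scaling, the output voltage can be modeled as a binary-weighted sum, with each lower bit position halved via the C2 setting. The function below is an idealized illustration (v_unit and the MSB-first ordering are assumptions) that ignores analog non-idealities.

```python
def multibit_vout(weight_bits, v_unit):
    """Idealized CP output: V_out = v_unit * sum(b_i / 2**i), MSB first."""
    return v_unit * sum(b / 2 ** i for i, b in enumerate(weight_bits))

# 4-bit weight 0b1011 (MSB first): (1 + 0/2 + 1/4 + 1/8) * 0.1 V
print(multibit_vout([1, 0, 1, 1], 0.1))  # -> 0.1375 (V)
```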
VI. IMPLEMENTATION RESULTS
Fig. 9 presents the layout photograph of the proposed 1T1C eDRAM-based SNN-CIM processor,
fabricated using 28 nm CMOS technology. The processor integrates 19 Mb of memory,
offering over $20\times$ the capacity of the state-of-the-art SNN-CIM [3]. The chip occupies an area of 10.47 mm$^2$ and includes four banks. The processor
achieves an accuracy of 94.13% on ResNet-18 with CIFAR-10 and 72.03% on ResNet-50
with ImageNet, based on CNN-to-SNN conversion, operating at a clock frequency of 250
MHz with a low power consumption of 0.096 W. Table 1 highlights the comparison with prior state-of-the-art CIM processors. This is the
first SNN-CIM processor utilizing 1T1C eDRAM technology. Furthermore, unlike previous
CIM architectures, it supports a reconfigurable architecture that dynamically adjusts
the memory-to-peripheral logic ratio based on various layer requirements. The proposed
CIM processor achieves remarkable energy efficiency on two levels: At the macro level,
it delivers 382 TOPS/W to 1531.3 TOPS/W, outperforming previous CNN CIMs [6-8]. It provides a $7729\times$ and $55.6\times$ improvement in density FoM over prior
1T1C CIM [7] and SNN-CIM [3], respectively. Ultimately, the processor achieves a $10\times$ increase in system-level
energy efficiency compared to [3], as evaluated on ResNet-18, CIFAR-10, considering EMA.
Fig. 9. Chip photo & performance summary.
Table 1. Comparison table.
VII. CONCLUSION
The proposed 1T1C eDRAM-based SNN-CIM processor significantly enhances system-level
efficiency through two main features. First, the high-density Reconfigurable Neuro-Cell
Array (ReNCA), utilizing 1T1C bit cells and compact SNN peripheral logic, achieves
a 41% reduction in area and 90% reduction in power consumption compared to [3]. Second, the dual-mode ReNCA and Dynamic Adjustable Neuron Link (DAN Link) enable
a reconfigurable CIM architecture, improving system-level efficiency, including EMA,
by $10\times$ over [3]. As a result, the processor demonstrates a state-of-the-art macro-level efficiency
of 1531.3 TOPS/W and achieves $10\times$ higher system-level efficiency than the previous
SNN-CIM [3].
ACKNOWLEDGMENTS
This work was supported by Institute of Information & communications Technology Planning
& Evaluation (IITP) under the Graduate School of Artificial Intelligence Semiconductor
(IITP-2024-RS-2023-00256472) grant funded by the Korea government (MSIT).
References
J. Lee, H. Valavi, Y. Tang, and N. Verma, ``Fully row/column parallel in-memory computing
SRAM macro employing capacitor based mixed-signal computation with 5-b inputs,'' Proc.
of 2021 Symposium on VLSI Circuits, pp. 1-2, 2021.

H. Jia and M. Ozatay, ``A programmable neural-network inference accelerator based
on scalable in-memory computing,'' Proc. of 2021 IEEE International Solid-State Circuits
Conference (ISSCC), pp. 236-237, 2021.

S. Kim, S. Kim, S. Um, S. Kim, K. Kim, and H. -J. Yoo, ``Neuro-CIM: A 310.4 TOPS/W
neuromorphic computing-in-memory processor with low WL/BL activity and digital-analog
mixed-mode Neuron firing,'' Proc. of 2022 IEEE Symposium on VLSI Technology and Circuits
(VLSI Technology and Circuits), pp. 38-39, 2022.

Y. Wang, ``Energy efficient RRAM spiking neural network for real time classification,''
Proc. of the 25th edition on Great Lakes Symposium on VLSI, pp. 189-194, 2015.

K. Goetschalckx and M. Verhelst, ``DepFiN: A 12 nm, 3.8TOPs depth first CNN processor
for high res. image processing,'' Proc. of 2021 Symposium on VLSI Circuits, pp. 1-2,
2021.

Z. Chen, X. Chen, and J. Gu, ``15.3 A 65 nm 3T dynamic analog RAM-based computing-in-memory
macro and CNN accelerator with retention enhancement, adaptive analog sparsity and
44 TOPS/W system energy efficiency,'' Proc. of 2021 IEEE International Solid-State
Circuits Conference (ISSCC), pp. 240-242, 2021.

S. Xie, C. Ni, A. Sayal, P. Jain, F. Hamzaoglu, and J. P. Kulkarni, ``16.2 eDRAM-CIM:
Compute-in-memory design with reconfigurable embedded-dynamic-memory array realizing
adaptive data converters and charge-domain computing,'' Proc. of 2021 IEEE International
Solid-State Circuits Conference (ISSCC), pp. 248-250, 2021.

S. Xie, C. Ni, P. Jain, F. Hamzaoglu, and J. P. Kulkarni, ``Gain-cell CIM: Leakage
and bitline swing aware 2T1C gain-cell eDRAM compute in memory design with bitline
precharge DACs and compact schmitt trigger ADCs,'' Proc. of 2022 IEEE Symposium on
VLSI Technology and Circuits (VLSI Technology and Circuits), pp. 112-113, 2022.

T. Tanzawa and T. Tanaka, ``A dynamic analysis of the Dickson charge pump circuit,''
IEEE Journal of Solid-State Circuits, vol. 32, no. 8, pp. 1231-1240, Aug. 1997.

Sangmyoung Lee received his B.S. degree in electrical engineering from the Korea
Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2024,
where he is currently pursuing an M.S. degree in the Graduate School of AI Semiconductor.
His current research interests include energy-efficient processing-in-memory accelerators
and application-specific neuromorphic hardware.
Seryeong Kim received her B.S. degree in electrical engineering from the Korea Aerospace
University, South Korea, in 2021, and an M.S. degree in electrical engineering from
the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea,
in 2023, where she is currently pursuing a Ph.D. degree. Her current research interests
include SW-HW co-optimization, low-power system-on-chip design, and application-specific
processor design.
Soyeon Kim received her B.S. and M.S. degrees from the School of Electrical Engineering,
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea,
in 2019 and 2021, respectively, where she is currently pursuing a Ph.D. degree. Her
research interests include energy-efficient deep learning processor design and intelligent
computer vision systems.
Soyeon Um received her B.S. and M.S. degrees from the School of Electrical Engineering,
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea,
in 2020 and 2021, respectively, where she is currently pursuing a Ph.D. degree. Her
current research interests include low-power deep learning and intelligent vision
system-on-chip (SoC) design, energy-efficient processing-memory architecture, and
application-specific neuromorphic hardware.
Sangjin Kim received his B.S., M.S., and Ph.D. degrees in electrical engineering from
the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea,
in 2019, 2021, and 2024, respectively. He is currently a Postdoctoral Associate at
KAIST. His research interests include computing-in-memory for low-power AI accelerators,
processing-in-memory for energy-efficient AI systems, and hardware-software co-optimization
for generative AI models.
Sangyeob Kim received his B.S., M.S., and Ph.D. degrees from the School of Electrical
Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon,
South Korea, in 2018, 2020, and 2023, respectively. He is currently a Postdoctoral
Associate at KAIST. His current research interests include energy-efficient
system-on-chip design, especially focused on deep neural network accelerators, neuromorphic
hardware, and computing-in-memory accelerators.
Wooyoung Jo received his B.S. and M.S. degrees in electrical engineering from the Korea
Advanced Institute of Science and Technology (KAIST), Daejeon, Korea in 2020 and 2022,
respectively, where he is currently pursuing a Ph.D. degree. His current research
interests include energy-efficient ASIC/SoCs especially focused on computer vision,
large language models and deep learning algorithms for efficient processing.
Hoi-Jun Yoo graduated from the Department of Electronics, Seoul National University,
Seoul, South Korea, in 1983. He received his M.S. and Ph.D. degrees in electrical
engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon,
South Korea, in 1985 and 1988, respectively. Dr. Yoo has served as a member of the
Executive Committee for the International Solid-State Circuits Conference (ISSCC),
the Symposium on Very Large-Scale Integration (VLSI), and the Asian Solid-State Circuits
Conference (A-SSCC), the TPC Chair of the A-SSCC 2008 and the International Symposium
on Wearable Computer (ISWC) 2010, the IEEE Distinguished Lecturer from 2010 to 2011,
the Far East Chair of the ISSCC from 2011 to 2012, the Technology Direction Sub-Committee
Chair of the ISSCC in 2013, the TPC Vice-Chair of the ISSCC in 2014, and the TPC Chair
of the ISSCC in 2015. He is currently an ICT chair professor in the School of Electrical
Engineering at KAIST, where he also serves as director of the System Design Innovation
and Application Research Center (SDIA), PIM Semiconductor Design Research Center (AI-PIM),
and the KAIST Institute of Information Technology Convergence. He is also the dean of the
Graduate School of AI Semiconductor at KAIST. More details are available at http://ssl.kaist.ac.kr.