I. INTRODUCTION
Over the past few decades, data storage and throughput have increased exponentially. However, the von Neumann architecture, in which the processing unit that operates on information and the memory unit that stores it are physically separated, has reached its limits owing to the memory wall, data movement overhead, and increased leakage current [1]. To overcome this problem, in-memory computing (IMC) and computing-in-memory (CIM), which add a small number of processing units inside and around the memory, have been proposed in previous studies [2,3].
New non-volatile memories, such as resistive random access memory (RRAM), phase-change memory (PCM), and spin-transfer torque magnetic random access memory (STT-MRAM), have been used to build CIMs. Among them, STT-MRAM offers high speed, low power, and excellent durability compared with PCM and RRAM [4]. Additionally, STT-MRAMs have a small cell size, relatively small device-to-device variation, and near-zero cycle-to-cycle variation [5]; therefore, many CIMs based on STT-MRAM have been proposed. Moreover, the critical switching current ($I_{C}$) of the magnetic tunnel junction (MTJ) decreases with device size, making STT-MRAM a strong candidate to overcome the scaling challenges of conventional memories such as SRAM, dynamic RAM, and flash memory [6,7,8].
However, because arithmetic logic must be performed in the memory, the area overhead of the additional logic units and the sense amplifier (SA) cannot be avoided [9,10]. To reduce this overhead, an area-efficient CIM based on STT-MRAM (hereafter, the previous CIM) was recently proposed [9], and a high-performance multi-bit full adder (FA) operating in the analog domain with a charge saving and sharing (CSS) circuit has been proposed to minimize the area overhead and reduce the number of stages required [11].
In this study, we address, from an architectural perspective, the drawback that the previous CIM uses two reference word lines (WLs) for AND/OR operations. We propose an area-optimized CIM platform based on STT-MRAM that performs AND/OR operations with one fewer reference WL. In addition, for the reference WL structure, which has a small area overhead [12], an offset-canceling current-sampling SA (OCCS-SA) [13] is used instead of a pre-charge SA (PCSA) [14] to improve the read yield. According to the simulation results, the proposed CIM achieves a read yield of 100% with an SA area about 5.5 times smaller than that of the previous CIM, and the number of reference WLs required for CIM operation is reduced by one.
The remainder of this paper is organized as follows: Section II briefly introduces
the fundamentals of STT-MRAM. In Section III, we present the proposed area-optimized
CIM platform. Section IV presents and discusses the simulation results. Section V
presents the comparison. Finally, Section VI concludes the paper.
III. PROPOSED CIM
Fig. 3 shows a simplified array structure of the proposed CIM. In this section, we describe the modified reference scheme, decoder, and SA of the proposed CIM, explain how they reduce the area and improve the read stability, and compare the proposed CIM with the previous CIM using STT-MRAM.
Fig. 3. Simplified array structure of the proposed CIM.
1. Reference Scheme
Fig. 4(a) shows the reference cells of the previous CIM for general memory and CIM (AND, OR) operations in the multiple-cell reference (MCR) scheme [17], which is the optimal reference scheme. $R_{\rm P}$ and $R_{\rm AP}$ are 3 k$\Omega$ and 9 k$\Omega$, respectively, corresponding to a TMR ratio of 200% [9]. The SL switch in Fig. 4 has the form shown in Fig. 2 but is drawn as an NMOS transistor for convenience. When 16 cells are used in the MCR scheme, the conventional $R_{\rm REF}$ ($=2\times(R_{\rm P} // R_{\rm AP})$) for the memory operation can be achieved at $R_{\rm P}:R_{\rm AP} = 12 : 4$. For the CIM AND and OR operations, two additional WLs ($L_{2}$ with $R_{\rm P}$ and $L_{3}$ with $R_{\rm AP}$) are used. The previous CIM performs a normal memory operation when only Ref-WL is turned on. When Ref-WL and $L_{2}$ ($L_{3}$) are turned on simultaneously, the AND (OR) operation is performed with a reference value of $R_{\rm REF} // R_{\rm P}$ ($R_{\rm REF} // R_{\rm AP}$). The resistance required for each operation is shown in Fig. 5. The BL, WL, and SL values required for each operation are determined by the decoder, as discussed in detail in Section III-2.
Fig. 4. MCR Scheme. (a) Previous CIM. (b) Proposed CIM.
(a)
(b)
Fig. 5. Resistance during logic operations.
The proposed CIM uses only two reference WLs, unlike the previous CIM, which uses three. Ref-WL uses the same MCR scheme with a ratio of $R_{\rm P}:R_{\rm AP} = 12 : 4$ as in the previous CIM. However, the proposed CIM uses only one WL ($L_{C}$) for the CIM (AND, OR) operations. The resistors on $L_{C}$ are all $R_{\rm AP}$, as shown in Fig. 4(b).
When only Ref-WL is turned on, a normal memory operation is performed. When Ref-WL and $L_{C}$ are turned on simultaneously, an AND or OR operation is performed. Whether AND or OR is performed is determined by the SLs used, which set the reference value required for each operation. When Ref-WL, $L_{C}$, SL1, SL5, SL9, and SL13 are selected, the reference becomes $R_{\rm REF} // R_{\rm P}$ and the AND operation is performed. When Ref-WL, $L_{C}$, SL3, SL7, SL11, and SL15 are selected, the reference becomes $R_{\rm REF} // R_{\rm AP}$ and the OR operation is performed.
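The reference values quoted in this subsection can be checked numerically. The following is a minimal sketch, not the authors' simulation: it takes $R_{\rm P}$, $R_{\rm AP}$, and the three reference expressions from the text, and it assumes (for illustration only) that the two operand cells of a CIM operation appear in parallel on the sensed line and that a resistance below the reference is read as 1, with state 1 stored as $R_{\rm P}$ and state 0 as $R_{\rm AP}$. The helper names are hypothetical.

```python
def parallel(*resistances):
    """Equivalent resistance (ohms) of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)

# Values quoted in the text.
R_P, R_AP = 3e3, 9e3

tmr = (R_AP - R_P) / R_P            # TMR ratio as a fraction
r_ref = 2 * parallel(R_P, R_AP)     # memory reference, 2 x (R_P // R_AP)
r_ref_and = parallel(r_ref, R_P)    # AND reference, R_REF // R_P
r_ref_or = parallel(r_ref, R_AP)    # OR reference, R_REF // R_AP

def operand_resistance(a, b):
    """Assumption: two operand cells in parallel; state 1 -> R_P, state 0 -> R_AP."""
    cell = {1: R_P, 0: R_AP}
    return parallel(cell[a], cell[b])

def cim(op, a, b):
    """Idealized sensing: output 1 when the data resistance is below the reference."""
    ref = r_ref_and if op == "AND" else r_ref_or
    return int(operand_resistance(a, b) < ref)

if __name__ == "__main__":
    print(f"TMR = {tmr:.0%}, R_REF = {r_ref:.0f} ohm")
    print(f"AND reference = {r_ref_and:.0f} ohm, OR reference = {r_ref_or:.0f} ohm")
    for op in ("AND", "OR"):
        for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
            print(op, (a, b), operand_resistance(a, b), "->", cim(op, a, b))
```

Under these assumptions the AND reference falls between the (1,1) and (1,0) operand resistances and the OR reference falls between the (0,1) and (0,0) ones, which is why a single comparison against the chosen reference yields the logic result.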
2. Design of the decoder
A decoder is used to select the desired BL, WL, and SL from the memory. The numbers
of BLs, WLs, and SLs determine the number of signals required by the decoder, and
this study covers the design of the decoder when the numbers of BLs, WLs, and SLs
are 16, 128, and 16, respectively.
In a typical memory, the SL is selected together with the corresponding BL, so the five signals of the BL decoder (S0-S3 and bank select (BS)), shown in Fig. 6, are also used for the SL decoder. The MCR scheme has two banks connected to the SA. When data are selected, the bank to which the data cells belong is activated, and the other bank, called the deactivated bank, selects the reference cells for sensing. BL0 and Ref-WL must be used to create a reference voltage or current in the deactivated bank. The SL decoder uses five signals (S0-S3, BS), as shown in Fig. 7(a). In this case, the BS signal is 1 (0) in the activated (deactivated) bank. The WL decoder (shown in Fig. 8(a), ignoring the CIM_AND_EN and CIM_OR_EN signals) uses eight signals (W0-W6, BS). Thus, a total of 12 signals (S0-S3, W0-W6, BS) are required in a typical memory without CIM operations.
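The signal counts above follow directly from the array dimensions. As a small sketch (the function name is hypothetical, and binary-encoded select signals are assumed), the count can be reproduced as follows:

```python
def select_bits(n_lines: int) -> int:
    """Number of binary select signals needed to address n_lines lines."""
    return (n_lines - 1).bit_length()

N_BL, N_WL, N_SL = 16, 128, 16   # array dimensions quoted in the text

s_signals = select_bits(N_BL)    # S0-S3, shared by the BL and SL decoders
w_signals = select_bits(N_WL)    # W0-W6
bank_select = 1                  # BS

typical_memory_signals = s_signals + w_signals + bank_select
cim_signals = typical_memory_signals + 2   # plus CIM_AND_EN and CIM_OR_EN

if __name__ == "__main__":
    print("typical memory:", typical_memory_signals)
    print("with CIM enables:", cim_signals)
```

Because the SL decoder reuses S0-S3, the 16 SLs add no select signals of their own.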
Fig. 7. SL decoder. (a) Typical memory and previous CIM. (b) Proposed CIM.
(a)
(b)
Fig. 8. WL decoder. (a) Previous CIM. (b) Proposed CIM.
(a)
(b)
Additional signals are required for the memory to support the CIM operation. First, the BL decoder has the same structure in the previous and proposed CIMs, as depicted in Fig. 6. The previous CIM has the same SL decoder structure as the typical memory, as shown in Fig. 7(a), because SL3, SL7, SL11, and SL15 are used for all of the Ref, AND, and OR operations. The SL decoder in the proposed CIM has a slightly different design, with an additional CIM_AND_EN signal, as shown in Fig. 7(b). The NOR of BS and CIM_AND_EN is connected to $V_{g\_SL3}$, $V_{g\_SL7}$, $V_{g\_SL11}$, and $V_{g\_SL15}$, and CIM_AND_EN is connected to $V_{g\_SL1}$, $V_{g\_SL5}$, $V_{g\_SL9}$, and $V_{g\_SL13}$ to obtain the resistance value required for each operation.
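The gating just described can be sketched as Boolean logic. This is a simplified illustration, not the circuit of Fig. 7(b): it models only the two gate connections named in the text (the NOR of BS and CIM_AND_EN, and CIM_AND_EN itself) and ignores the S0-S3 decoding of the activated bank. The function name is hypothetical; the per-operation signal values are those listed in Table 1.

```python
SLS_NOR = (3, 7, 11, 15)   # gates driven by NOR(BS, CIM_AND_EN)
SLS_AND = (1, 5, 9, 13)    # gates driven by CIM_AND_EN

def enabled_reference_sls(bs: int, cim_and_en: int) -> list:
    """Return the reference-side SLs turned on by the two gate connections."""
    nor = int(not (bs or cim_and_en))
    enabled = []
    if nor:
        enabled.extend(SLS_NOR)
    if cim_and_en:
        enabled.extend(SLS_AND)
    return sorted(enabled)

# Signal values per operation, taken from Table 1.
OPERATIONS = {
    "Read/Write data": {"bs": 1, "cim_and_en": 0},
    "Ref": {"bs": 0, "cim_and_en": 0},
    "AND": {"bs": 0, "cim_and_en": 1},
    "OR": {"bs": 0, "cim_and_en": 0},
}

if __name__ == "__main__":
    for name, signals in OPERATIONS.items():
        print(name, enabled_reference_sls(**signals))
```

In this sketch the Ref and OR operations enable the same SLs and differ only in the WL decoder, while the AND operation switches to SL1, SL5, SL9, and SL13.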
In the WL decoder design, the two operands of a CIM operation are the data stored in neighboring WLs. For example, WL0 and WL1 form a pair for the CIM operation. Compared with a typical memory, both CIMs use two additional signals (CIM_AND_EN and CIM_OR_EN) in the WL decoder. To perform both general memory and AND/OR operations, the WL decoder of the previous CIM generates not only Ref-WL ($= V_{g\_ref\text{-}wl}$) but also the $L_{2}$ ($= V_{g\_L2}$) and $L_{3}$ ($= V_{g\_L3}$) signals, as shown in Fig. 8(a). Since $L_{2}$ or $L_{3}$ must be turned on simultaneously with Ref-WL for the AND/OR operation, each enable signal is combined with Ref-WL using an AND logic gate. The proposed CIM can be designed as shown in Fig. 8(b) because it only requires Ref-WL and $L_{C}$ to be turned on simultaneously for both the AND and OR operations.
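The difference between the two reference-WL decoders can be illustrated with a short logic sketch. The AND gating of $L_{2}$ and $L_{3}$ follows the description above; the expression used for $L_{C}$ (Ref-WL combined with the OR of the two enable signals) is an assumption made here for illustration, since the exact gate arrangement is given only in Fig. 8(b). Function names are hypothetical.

```python
def previous_reference_wls(ref_wl: int, cim_and_en: int, cim_or_en: int) -> dict:
    """Previous CIM: each enable signal is ANDed with Ref-WL (as described in the text)."""
    return {
        "Ref-WL": ref_wl,
        "L2": ref_wl & cim_and_en,
        "L3": ref_wl & cim_or_en,
    }

def proposed_reference_wls(ref_wl: int, cim_and_en: int, cim_or_en: int) -> dict:
    """Proposed CIM: one shared line LC (assumed gating: Ref-WL AND (AND_EN OR OR_EN))."""
    return {
        "Ref-WL": ref_wl,
        "LC": ref_wl & (cim_and_en | cim_or_en),
    }

if __name__ == "__main__":
    for name, (and_en, or_en) in {"Ref": (0, 0), "AND": (1, 0), "OR": (0, 1)}.items():
        print(name, previous_reference_wls(1, and_en, or_en),
              proposed_reference_wls(1, and_en, or_en))
```

Either way, the proposed decoder drives two reference lines instead of three, which is the source of the area saving.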
Table 1. Signal of each operation.
| | S0-S3 | W0-W6 | CIM_AND_EN | CIM_OR_EN | BS |
| Read/Write data | SL, BL select signal | WL select signal | 0 | 0 | 1 |
| Ref | | | 0 | 0 | 0 |
| AND | | | 1 | 0 | 0 |
| OR | | | 0 | 1 | 0 |
The total number of signals required for the previous and proposed CIM decoders is 14 (S0-S3, W0-W6, BS, CIM_AND_EN, and CIM_OR_EN); that is, two additional signals are required for CIM operation. The WL decoders of both the previous and proposed CIMs have one more logic stage than that of a typical memory. The proposed CIM is similar to the previous CIM in terms of decoder design; however, it is more area-efficient because it reduces the number of reference WLs required for CIM operation by one. Because the BS signal is located at the beginning of the decoder, the designed decoder blocks the unused BLs, SLs, and WLs and thereby prevents the internal nodes from toggling, which also saves energy.
3. SA
The PCSA, shown in Fig. 9, charges the OUT node connected to the reference cell and the OUTB node connected to the data cell to $V_{\rm DD}$ in the pre-charge stage. In the discharge stage, $I_{\rm data}$ and $I_{\rm ref}$ create a difference in the discharge rates of OUT and OUTB, which is sensed through positive feedback. The PCSA is used in many logic circuits because of its fast sensing speed, low sensing power, and reduced standby power.
Fig. 9. Schematic of previous CIM.
However, for the proposed CIM with the MCR scheme, a serious sensing error occurs when the PCSA is used. In the MCR scheme, the reference cells are located at the end of the memory array, at WL127. When the sensed data cell is far from the reference cells, that is, close to WL0, serious read failures occur. Unlike $R_{\rm P}$, whose read yield remains close to 100%, $R_{\rm AP}$ exhibits a sharp decrease in read yield to 50% from WL80 and 100% read failure from WL30, as depicted in Fig. 10. This is because the PCSA senses the Elmore delay rather than the resistance difference between the data and reference cells.
Fig. 10. Memory read yield of the previous CIM when TMR is 200%.
Fig. 11. Simplified previous CIM.
A simplified representation of the previous CIM is shown in Fig. 11. The Elmore delay at the OUT node can be written for two limiting cases: $R_{\rm BL} \approx C_{\rm BL} \approx 0$ for a data cell located at WL0, and $R_{\rm SL} \approx C_{\rm SL} \approx 0$ for a data cell located at WL127. If the BL and SL are assumed to have the same parasitics, so that $R_{\rm SL}=R_{\rm BL}=R$ and $C_{\rm SL}=C_{\rm BL}=C$, and the terms common to $d_{\rm WL0}$ and $d_{\rm WL127}$ are removed, the Elmore delays at WL0 and WL127 can be compared directly.
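As an illustrative sketch of this comparison (not necessarily the exact expressions associated with Fig. 11), assume a lumped discharge path from the sensing node to GND through $R_{\rm SA}$, $R_{\rm BL}$, $R_{\rm MTJ}$, $R_{\rm SL}$, and $R_{\rm SLMUX}$ in series, with $C_{\rm SA}$ at the sensing node and with $C_{\rm BL}$ and $C_{\rm SL}$ lumped at the two terminals of the MTJ. The placement of $C_{\rm BL}$ and $C_{\rm SL}$ is an assumption made here; only the $C_{\rm SA}$ term appears explicitly in the text.

```latex
% Assumed ladder: OUT(C_SA) - R_SA - R_BL - (C_BL) - R_MTJ - (C_SL) - R_SL - R_SLMUX - GND
\begin{align}
d &= C_{\rm SA}\,(R_{\rm SA}+R_{\rm BL}+R_{\rm MTJ}+R_{\rm SL}+R_{\rm SLMUX})
   + C_{\rm BL}\,(R_{\rm MTJ}+R_{\rm SL}+R_{\rm SLMUX})
   + C_{\rm SL}\,(R_{\rm SL}+R_{\rm SLMUX}) \\
% Data cell at WL0: R_BL ~ C_BL ~ 0
d_{\rm WL0} &= C_{\rm SA}\,(R_{\rm SA}+R_{\rm MTJ}+R_{\rm SL}+R_{\rm SLMUX})
   + C_{\rm SL}\,(R_{\rm SL}+R_{\rm SLMUX}) \\
% Data cell at WL127: R_SL ~ C_SL ~ 0
d_{\rm WL127} &= C_{\rm SA}\,(R_{\rm SA}+R_{\rm BL}+R_{\rm MTJ}+R_{\rm SLMUX})
   + C_{\rm BL}\,(R_{\rm MTJ}+R_{\rm SLMUX}) \\
% With R_SL = R_BL = R and C_SL = C_BL = C, after removing the common terms:
d_{\rm WL0}' &= C\,R, \qquad d_{\rm WL127}' = C\,R_{\rm MTJ}
\end{align}
```

Under these assumptions, $d_{\rm WL0} < d_{\rm WL127}$ whenever $R_{\rm MTJ} > R$, that is, whenever the MTJ resistance exceeds the line resistance, and the gap between the two delays is set by the line capacitance rather than by $C_{\rm SA}$.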
Even for the same data, a data cell located at WL0 discharges faster than one located at WL127 because $d_{\rm WL0}<d_{\rm WL127}$. If the reference cell is located at WL127, OUT unconditionally becomes 1 ($V_{\rm DD}$) as the data cell approaches WL0, because the discharge rate of the OUTB node connected to the data cell increases regardless of the value of $R_{\rm data}$. Therefore, the closer the data cell is to WL0, the higher the yield of $R_{\rm P}$ (state 1), which should discharge faster than $R_{\rm ref}$, and the lower the yield of $R_{\rm AP}$ (state 0), which should discharge more slowly.
Therefore, to use the PCSA, it is necessary to adopt the reference BL structure, in which the data and reference cells share the same WL [12]. However, because the reference BL structure incurs a large area overhead, the proposed CIM uses 1) the MCR scheme [16], which is based on the reference WL structure, and 2) the OCCS-SA [13], which makes the proposed CIM insensitive to the Elmore delay.
The OCCS-SA in Fig. 13 charges the OUT (OUTB) node with the sampled current $I_{M1}$ ($I_{M2}$), while M12 (M13) discharges the OUT (OUTB) node with the current $I_{\rm data}$ ($I_{\rm ref}$). The OCCS-SA obtains strong positive feedback by creating a current path to GND through M12 (M13). In addition, M12 (M13) acts as a data-dependent reference generator (DDRG) that doubles $I_{\rm SA}$ ($= I_{\rm data} - I_{\rm ref}$). Owing to offset cancellation, the OCCS-SA is highly reliable and relatively fast, although it is slower than the PCSA; the PCSA is fast, but its result is not reliable because offset cancellation is not applied. However, the OCCS-SA used in the proposed CIM uses more transistors than the PCSA in the previous CIM. Since the number and size of the transistors directly determine the area overhead, replacing the SA is not worthwhile if the additional SA area is large compared with the area saved by removing one reference WL. Therefore, we compare the area of each SA required for a yield of 100%. Here, the yield is evaluated for the data $R_{\rm AP}$ located at WL0, which is the most difficult to read, and the area is estimated from the width and length of the transistors used.
Fig. 12. $R_{\rm AP}$ read yield of OCCS-SA and PCSA according to SA area.
The OCCS-SA of the proposed CIM achieves an $R_{\rm AP}$ read yield of 100% with an area of 2 $\mu$m$^{2}$, as shown in Fig. 12. In contrast, the PCSA shows severe read failures, with 0% yield even at 11 $\mu$m$^{2}$ (about 5.5 times the OCCS-SA area), and requires 21 $\mu$m$^{2}$ (about 10.5 times the OCCS-SA area) to achieve 100% yield. This is because the PCSA senses the Elmore delay, as described above. Increasing the SA size reduces $R_{\rm SA}$ and increases $C_{\rm SA}$, so the weight of $C_{\rm SA}$ in the Elmore delay increases. Thus, in the Elmore delay equation, the term $C_{\rm SA}(R_{\rm SA} + R_{\rm BL} + R_{\rm MTJ} + R_{\rm SL} + R_{\rm SLMUX})$ becomes dominant. The difference between $d_{\rm WL0}$ and $d_{\rm WL127}$ is determined by $R_{\rm MTJ}$. Accurate sensing with the PCSA therefore requires $C_{\rm BL}$ ($C_{\rm SL}$) $\ll C_{\rm SA}$, which forces the SA to be large. Consequently, to reach a yield of 100%, the PCSA requires a large area overhead, which can lead to power consumption and delay problems, making it unsuitable for the proposed CIM. By using the OCCS-SA instead of the PCSA, the proposed CIM can increase reliability while keeping the area overhead small.
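The role of $C_{\rm SA}$ in this argument can be illustrated with a lumped-RC sketch. This is not the authors' HSPICE setup: the ladder topology (sensing node, $R_{\rm SA}$, $R_{\rm BL}$, MTJ, $R_{\rm SL}$, $R_{\rm SLMUX}$, GND, with the line capacitances lumped at the MTJ terminals) is an assumption, and every parasitic value below is a hypothetical placeholder chosen only to show the trend. Only $R_{\rm AP}$ = 9 k$\Omega$ and the 4.5 k$\Omega$ memory reference ($2\times(R_{\rm P} // R_{\rm AP})$) come from the text.

```python
def elmore_delay(r_mtj, at_wl0, c_sa,
                 r_sa=500.0, r_mux=500.0, r_wire=200.0, c_wire=50e-15):
    """Elmore delay (s) of the assumed ladder; all default parasitics are hypothetical."""
    # Cell at WL0: BL parasitics ~ 0 and full SL; cell at WL127: full BL and SL ~ 0.
    r_bl, c_bl = (0.0, 0.0) if at_wl0 else (r_wire, c_wire)
    r_sl, c_sl = (r_wire, c_wire) if at_wl0 else (0.0, 0.0)
    return (c_sa * (r_sa + r_bl + r_mtj + r_sl + r_mux)
            + c_bl * (r_mtj + r_sl + r_mux)
            + c_sl * (r_sl + r_mux))

R_AP = 9e3     # data cell in state 0 (from the text)
R_REF = 4.5e3  # memory reference, 2 x (R_P // R_AP) (from the text)

def ap_read_ok(c_sa):
    """R_AP at WL0 is read correctly only if it discharges slower than the reference at WL127."""
    data = elmore_delay(R_AP, at_wl0=True, c_sa=c_sa)
    ref = elmore_delay(R_REF, at_wl0=False, c_sa=c_sa)
    return data > ref

if __name__ == "__main__":
    for c_sa in (5e-15, 50e-15, 500e-15):   # hypothetical SA capacitances
        print(f"C_SA = {c_sa * 1e15:.0f} fF -> R_AP read correct: {ap_read_ok(c_sa)}")
```

With a small $C_{\rm SA}$ the line capacitance dominates and the $R_{\rm AP}$ cell at WL0 discharges faster than the reference, which corresponds to the misread described above; only when $C_{\rm SA}$ is much larger than the line capacitance does the resistance difference decide the outcome.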
Fig. 13. Transistor-level schematic and transistor size of the OCCS-SA used.
IV. SIMULATION
1. Simulation Conditions
The proposed CIM is simulated with HSPICE using a 28 nm process technology. Fig. 3 shows a simplified array structure of the proposed CIM, and the transistor-level schematic and transistor sizes of the SA are shown in Fig. 13. In the other circuits, the NMOS transistors have a width of 2 $\mu$m and a length of 0.03 $\mu$m, and the PMOS transistors have a width of 4 $\mu$m and a length of 0.03 $\mu$m. $R_{\rm P}$ is 3 k$\Omega$ and $R_{\rm AP}$ is 9 k$\Omega$, giving a TMR of 200%. A 4% variation in the MTJ process was added for reliability verification. In this simulation, the PCSA used for comparison was designed with an area of 8 $\mu$m$^{2}$, similar to that of the OCCS-SA used (7.84 $\mu$m$^{2}$).
Fig. 14. Memory read yield of the proposed CIM when TMR is 200%.
2. Simulation Results
Fig. 14 shows the read yield of the proposed CIM as a function of the WL of the data when $V_{\rm DD} = 1$ V, the gate voltage of the clamp NMOS ($V_{G\_clamp}$) is 0.8 V, the gate voltage of the access transistor ($V_{access}$) is 1.2 V, and the sensing time ($t_{sen}$) is 10 ns. Compared with the previous CIM (Fig. 10), whose yield rapidly decreases as the data approaches WL0, the proposed CIM maintains 100% yield down to WL0. Even when all other conditions are kept the same and only the TMR is reduced to 100% ($R_{\rm P} = 3$ k$\Omega$, $R_{\rm AP} = 6$ k$\Omega$), the proposed CIM shows a yield of 100%, as shown in Fig. 15, whereas the previous CIM suffers serious read failures, with an $R_{\rm AP}$ (state 0) yield of 0% from WL90.
Fig. 15. Memory read yield when TMR is 100%. (a) Previous CIM. (b) Proposed CIM.
(a)
(b)
The worst cases for sensing the CIM AND and OR operations are AND(1,1) and OR(0,1), respectively, because the difference between $R_{\rm ref}$ and $R_{\rm data}$ is smallest for these inputs, as depicted in Fig. 5. Fig. 16 shows the worst-case yield of the previous and proposed CIMs. The proposed CIM has 100% yield in all cases, including AND(1,1) and OR(0,1). In contrast, the previous CIM shows high failure rates in AND(1,0), AND(0,0), and OR(0,0), where the output should be zero, rather than in AND(1,1) and OR(0,1), where the resistance difference is small. This is because the PCSA senses the Elmore delay, which depends on the WL of the data, rather than the difference between $R_{\rm ref}$ and $R_{\rm data}$, as described in Section III-3. Consequently, the previous CIM unconditionally outputs 1 as the data approaches WL0; its result is therefore unreliable even though it gives a 100% yield for AND(1,1), OR(0,1), and OR(1,1).
Fig. 16. Worst case CIM yield. (a) Previous CIM. (b) Proposed CIM.
(a)
(b)
V. COMPARISON
Table 2 presents a brief comparison of the previous and proposed CIMs. As described earlier, both CIMs require more signals than a typical memory to perform general memory and CIM operations, and the total number of signals required is 14. When the reference voltage or current required for the CIM operation is generated using the MCR scheme, the proposed CIM requires one fewer reference WL than the previous CIM. Therefore, the proposed CIM is more area-optimized and has the advantage of increasing the memory capacity. When the memory read yields at WL0, WL30, WL60, WL90, and WL127 are simulated and averaged at a TMR of 200%, the proposed CIM shows a read yield of 100%, whereas the previous CIM shows an average $R_{\rm AP}$ read yield of only 41.8%. In addition, the AND(0,1) operation of the previous CIM yields 0% at all WLs, indicating that the previous CIM does not work properly, whereas the proposed CIM shows a 100% yield in all CIM operations. As shown in Fig. 12, the PCSA area (21 $\mu$m$^{2}$) required for an $R_{\rm AP}$ yield of 100% is approximately 2.7 times the area (7.84 $\mu$m$^{2}$) of the OCCS-SA used. Since AND/OR operations are harder to sense than READ and require a larger SA area, the PCSA area needed for a CIM yield of 100% is even larger.
Table 2. Comparison of the previous and proposed CIMs.
| | Previous CIM (TMAG'21 [5]) | Proposed CIM |
| Total number of signals | 14 | 14 |
| Required number of reference WLs 1) | 3 (Ref-WL, L2, L3) | 2 (Ref-WL, LC) |
| Type of SA | PCSA | OCCS-SA |
| Memory read yield 2) | RP: 100%, RAP: 41.8% | RP: 100%, RAP: 100% |
| Worst case CIM read yield | 0% 3) | 100% |
1) Using the MCR scheme.
2) Average memory read yield when data is located at WL0, WL30, WL60, WL90, and WL126. TMR $= 200\%$ ($R_{\rm P} = 3$ k$\Omega$, $R_{\rm AP} = 9$ k$\Omega$) is used.
3) AND(0,1) fails on all WLs.