1. (Department of EE, Incheon National University, Incheon 22012, Korea)



Area-optimized, computing-in-memory (CIM), reference scheme, reliable, spin-transfer torque magnetic random access memory (STT-MRAM)

I. INTRODUCTION

Over the past few decades, data storage and throughput have increased exponentially. However, the von Neumann architecture, in which the processing unit that processes information and the memory unit that stores it are physically separated, has reached its limits owing to the memory wall, data movement overhead, and increased leakage current [1]. To overcome this problem, in-memory computing (IMC) and computing-in-memory (CIM), which add a small number of processing units inside and around the memory, have been proposed [2,3]. Emerging non-volatile memories, such as resistive random access memory (RRAM), phase-change memory (PCM), and spin-transfer torque magnetic random access memory (STT-MRAM), have been used to implement CIM. Among them, STT-MRAM offers high speed, low power, and excellent endurance compared to PCM and RRAM [4]. Additionally, STT-MRAM has a small cell size, relatively small device-to-device variation, and near-zero cycle-to-cycle variation [5]; therefore, many CIMs based on STT-MRAM have been proposed. Moreover, the critical switching current ($I_{C}$) of the magnetic tunnel junction (MTJ) decreases with device size, making STT-MRAM a strong candidate to overcome the scaling challenges of conventional memories such as SRAM, dynamic RAM, and flash memory [6,7,8]. However, because arithmetic logic must be performed in the memory, the area overhead of the additional logic units and the sense amplifier (SA) cannot be avoided [9,10].

Recently, an area-efficient CIM based on STT-MRAM (hereafter, the previous CIM) was proposed [9]. In addition, a high-performance multi-bit full adder (FA) operating in the analog domain with a charge saving and sharing (CSS) circuit has been proposed to minimize the area overhead and reduce the number of stages required [11]. In this study, we take an architectural perspective and address a drawback of the previous CIM, namely that it uses two reference word lines (WLs) for AND/OR operations. We propose an area-optimized CIM platform based on STT-MRAM that performs AND/OR operations with one fewer reference WL. In addition, for a reference WL structure with a small area overhead [12], an offset-canceling current-sampling SA (OCCS-SA) [13] is used instead of a pre-charge SA (PCSA) [14] to improve the read yield. According to the simulation results, the proposed CIM achieves a read yield of 100% with an SA area about 5.5 times smaller than that of the previous CIM, and the number of reference WLs required for CIM operation is reduced by one.

The remainder of this paper is organized as follows: Section II briefly introduces the fundamentals of STT-MRAM. In Section III, we present the proposed area-optimized CIM platform. Section IV presents and discusses the simulation results. Section V presents the comparison. Finally, Section VI concludes the paper.

II. PRELIMINARIES OF PROPOSED CIM

1. Preliminaries of STT-MRAM and MTJ

Fig. 1(a) shows a perpendicular magnetic anisotropy (PMA) magnetic tunnel junction (MTJ) of a general STT-MRAM cell. The MTJ comprises an oxide barrier layer sandwiched between a free layer (FL) and a pinned layer (PL), both of which are composed of ferromagnetic material. The relative magnetization directions of the FL and PL take one of two states: parallel (P) and antiparallel (AP). Depending on the magnetization direction of the FL, the MTJ exhibits two resistance states owing to the tunnel magnetoresistance (TMR) effect. As shown in Fig. 1(b), the case in which the FL is magnetized in the same direction as the PL is denoted as $R_{\rm P}$ (low resistance, state 1), and the case in which it is magnetized in the opposite direction is denoted as $R_{\rm AP}$ (high resistance, state 0). The difference in resistance between $R_{\rm P}$ and $R_{\rm AP}$ affects the read performance. It is quantified by the TMR ratio, which is calculated as $(R_{\rm AP}-R_{\rm P})/R_{\rm P}\times 100$ [15,16].
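As a quick numerical check of the TMR ratio defined above, the following minimal Python sketch evaluates it for the two resistance pairs used in this paper's simulations (3 k$\Omega$/9 k$\Omega$ and 3 k$\Omega$/6 k$\Omega$). The function name `tmr_ratio_percent` is only an illustrative choice.

```python
def tmr_ratio_percent(r_p: float, r_ap: float) -> float:
    """TMR ratio in percent: (R_AP - R_P) / R_P * 100."""
    return (r_ap - r_p) / r_p * 100.0


# Resistance pairs quoted in the simulation sections of this paper (ohms).
tmr_200 = tmr_ratio_percent(r_p=3e3, r_ap=9e3)
tmr_100 = tmr_ratio_percent(r_p=3e3, r_ap=6e3)

print(f"R_P = 3 kOhm, R_AP = 9 kOhm -> TMR = {tmr_200:.0f}%")
print(f"R_P = 3 kOhm, R_AP = 6 kOhm -> TMR = {tmr_100:.0f}%")
```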

Fig. 1. (a) Structure of PMA-MTJ. (b) Resistance state based on magnetization direction.


2. BL and SL Characteristics

To write and read the desired data in the memory, it is necessary to select the corresponding bit line (BL), WL, and source line (SL). Voltage is applied to the desired BL, WL, and SL using the signals of the BL, WL, and SL decoders, creating a path through which current can flow from $V_{\rm DD}$ to GND through the selected cell. Generally, unlike the WL access device, which consists of one NMOS, the BL and SL switches in STT-MRAM are each composed of one NMOS, one PMOS, and one inverter, as depicted in Fig. 2. This is called a BL (SL) switch. The switches take this form because an NMOS is suited to pulling a node down: it passes a strong 0 but only a weak 1. When only an NMOS is used, at most $V_{\rm DD}-V_{\rm TH}$ reaches the memory, even if $V_{\rm DD}$ is applied. Therefore, the voltage is supplied stably by adding a PMOS, which passes a strong 1, and an inverter that drives the PMOS gate to GND when $V_{\rm DD}$ is applied to the NMOS gate.
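The threshold-voltage drop described above can be illustrated with a first-order Python sketch. It is an idealized model, not a circuit simulation: the threshold voltage `V_TH = 0.4` V is an assumed illustrative value, body effect is ignored, and the function names are hypothetical.

```python
V_DD = 1.0  # supply voltage in volts (the simulations in this paper use 1 V)
V_TH = 0.4  # ASSUMED threshold voltage in volts, for illustration only


def nmos_only_pass(v_in: float, v_gate: float = V_DD, v_th: float = V_TH) -> float:
    """An NMOS pass device cannot pull its output above v_gate - v_th."""
    return min(v_in, v_gate - v_th)


def nmos_pmos_switch_pass(v_in: float) -> float:
    """Idealized BL (SL) switch: the parallel PMOS passes the full high level."""
    return v_in


nmos_high = nmos_only_pass(V_DD)
switch_high = nmos_pmos_switch_pass(V_DD)
print(f"NMOS only:   {nmos_high:.2f} V")
print(f"NMOS + PMOS: {switch_high:.2f} V")
```

Under these assumptions, the NMOS-only path delivers $V_{\rm DD}-V_{\rm TH}$, whereas the NMOS and PMOS pair delivers the full $V_{\rm DD}$.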

Fig. 2. BL, SL switch.


III. PROPOSED CIM

Fig. 3 shows a simplified array structure of the proposed CIM. In this section, we describe the modified reference scheme, decoder, and SA of the proposed CIM, explain how they reduce the area and improve the read stability, and compare the proposed CIM with the previous STT-MRAM-based CIM.

Fig. 3. Simplified array structure of the proposed CIM.


1. Reference Scheme

Fig. 4(a) shows the previous CIM reference cell for general memory and CIM (AND, OR) operations in the multiple-cell reference (MCR) scheme [17], which is the optimal reference scheme. $R_{\rm P}$ and $R_{\rm AP}$ are 3 k$\Omega$ and 9 k$\Omega$, respectively, with a TMR ratio of 200% [9]. The SL switch in Fig. 4 has the form shown in Fig. 2, but it is drawn as an NMOS for convenience. When 16 cells are used in the MCR scheme, the conventional $R_{\rm REF}$ ($= 2\times(R_{\rm P} // R_{\rm AP})$) for the memory operation can be achieved at $R_{\rm P}:R_{\rm AP} = 12 : 4$. For the CIM AND and OR operations, two WLs ($L_{2}$ with $R_{\rm P}$ and $L_{3}$ with $R_{\rm AP}$) are used. The previous CIM performs a normal memory operation when only Ref-WL is turned on. When Ref-WL and $L_{2}$ ($L_{3}$) are turned on simultaneously, the AND (OR) operation is performed with the value of $R_{\rm REF} // R_{\rm P}$ ($R_{\rm REF} // R_{\rm AP}$). The required resistance value for each operation is shown in Fig. 5. The required BL, WL, and SL values for each operation are determined by the decoder, as discussed in detail in Section III-2.
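The three reference values named above follow directly from the stated expressions. The short Python sketch below evaluates them for $R_{\rm P} = 3$ k$\Omega$ and $R_{\rm AP} = 9$ k$\Omega$. It only evaluates the formulas in the text and does not model how the 16 reference cells are wired; the helper name `parallel` is illustrative.

```python
def parallel(r_a: float, r_b: float) -> float:
    """Equivalent resistance of two resistors in parallel (the '//' operator)."""
    return r_a * r_b / (r_a + r_b)


R_P = 3e3   # parallel-state resistance (ohms)
R_AP = 9e3  # antiparallel-state resistance (ohms)

r_ref = 2 * parallel(R_P, R_AP)   # memory read reference: 2 x (R_P // R_AP)
r_ref_and = parallel(r_ref, R_P)  # AND reference: R_REF // R_P
r_ref_or = parallel(r_ref, R_AP)  # OR reference:  R_REF // R_AP

print(f"R_REF         = {r_ref / 1e3:.2f} kOhm")
print(f"R_REF // R_P  = {r_ref_and / 1e3:.2f} kOhm (AND)")
print(f"R_REF // R_AP = {r_ref_or / 1e3:.2f} kOhm (OR)")
```

With these values, the read reference (4.5 k$\Omega$) lies between $R_{\rm P}$ and $R_{\rm AP}$, and the AND and OR references evaluate to 1.8 k$\Omega$ and 3 k$\Omega$, respectively.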

Fig. 4. MCR Scheme. (a) Previous CIM. (b) Proposed CIM.


Fig. 5. Resistance during logic operations.


The proposed CIM uses only two reference WLs, unlike the previous CIM, which uses three. Ref-WL uses the same MCR scheme with an $R_{\rm P}:R_{\rm AP} = 12 : 4$ ratio as in the previous CIM. However, the proposed CIM uses only one WL ($L_{C}$) for the CIM (AND, OR) operations. All cells on $L_{C}$ are set to $R_{\rm AP}$, as shown in Fig. 4(b). When only Ref-WL is turned on, a normal memory operation is performed. When Ref-WL and $L_{C}$ are turned on simultaneously, an AND or OR operation is performed. Whether the operation is AND or OR is determined by the SLs that are selected, which set the reference value required for each operation. When Ref-WL, $L_{C}$, SL1, SL5, SL9, and SL13 are selected, the reference becomes $R_{\rm REF} // R_{\rm P}$ and an AND operation is performed. When Ref-WL, $L_{C}$, SL3, SL7, SL11, and SL15 are selected, the reference becomes $R_{\rm REF} // R_{\rm AP}$ and an OR operation is performed.

2. Design of the decoder

A decoder is used to select the desired BL, WL, and SL from the memory. The numbers of BLs, WLs, and SLs determine the number of signals required by the decoder, and this study covers the design of the decoder when the numbers of BLs, WLs, and SLs are 16, 128, and 16, respectively.

In a typical memory, each SL is selected together with its BL. Therefore, the five signals (S0-S3 and bank select (BS)) of the BL decoder, shown in Fig. 6, are also used for the SL decoder, as shown in Fig. 7(a). The MCR scheme has two banks connected to the SA. When data are selected, the bank to which the data cells belong is activated, and the other bank, called the deactivated bank, selects the reference cells for sensing. BL0 and Ref-WL must be used to create the reference voltage or current in the deactivated bank. The BS signal is 1 (0) in the activated (deactivated) bank. The WL decoder (shown in Fig. 8(a), ignoring the CIM_AND_EN and CIM_OR_EN signals) uses eight signals (W0-W6 and BS). Thus, a total of 12 signals (S0-S3, W0-W6, and BS) are used in a typical memory without CIM operations.

Fig. 6. BL decoder.


Fig. 7. SL decoder. (a) Typical memory and previous CIM. (b) Proposed CIM.


Fig. 8. WL decoder. (a) Previous CIM. (b) Proposed CIM.


Additional signals are required for the memory to support the CIM operation. First, the BL decoder has the same structure in the previous and proposed CIMs, as depicted in Fig. 6. The previous CIM has the same SL decoder structure as the typical memory, as shown in Fig. 7(a), because SL3, SL7, SL11, and SL15 are used for all of the Ref, AND, and OR operations. The SL decoder in the proposed CIM has a slightly different design, with an additional CIM_AND_EN signal, as shown in Fig. 7(b). The NOR of BS and CIM_AND_EN is connected to $V_{\rm g\_SL3}$, $V_{\rm g\_SL7}$, $V_{\rm g\_SL11}$, and $V_{\rm g\_SL15}$, and CIM_AND_EN is connected to $V_{\rm g\_SL1}$, $V_{\rm g\_SL5}$, $V_{\rm g\_SL9}$, and $V_{\rm g\_SL13}$ to obtain the resistance value required for each operation.
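The two gating terms described for the proposed SL decoder can be written as a small truth-function sketch in Python. It models only the NOR(BS, CIM_AND_EN) and CIM_AND_EN connections stated in the text; the S0-S3 address decoding of Fig. 7(b) is not modeled, and the function name `reference_sl_gates` is hypothetical.

```python
def reference_sl_gates(bs: int, cim_and_en: int) -> dict:
    """Gate levels of the SLs used for reference generation in the proposed CIM.

    SL3/7/11/15 are driven by NOR(BS, CIM_AND_EN); SL1/5/9/13 by CIM_AND_EN.
    """
    nor_bs_and = int(not (bs or cim_and_en))
    gates = {f"SL{n}": nor_bs_and for n in (3, 7, 11, 15)}
    gates.update({f"SL{n}": cim_and_en for n in (1, 5, 9, 13)})
    return gates


def selected(gates: dict) -> list:
    """Names of the SLs whose gate level is 1, in ascending numeric order."""
    return sorted((name for name, level in gates.items() if level),
                  key=lambda name: int(name[2:]))


# Deactivated (reference) bank, BS = 0.
print("Ref / OR:", selected(reference_sl_gates(bs=0, cim_and_en=0)))
print("AND     :", selected(reference_sl_gates(bs=0, cim_and_en=1)))
```

In this sketch, the Ref and OR cases share SL3, SL7, SL11, and SL15, while the AND case switches to SL1, SL5, SL9, and SL13, which matches the SL selections given in Section III-1.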

In the WL decoder design, the two operands used in the CIM operation are the data located in neighboring WLs. For example, WL0 and WL1 form a pair for the CIM operation. Compared with a typical memory, both CIMs use two additional signals (CIM_AND_EN and CIM_OR_EN) in the WL decoder. To perform both general memory and AND/OR operations, the WL decoder of the previous CIM generates not only Ref-WL ($= V_{\rm g\_ref\text{-}wl}$), but also the $L_{2}$ ($= V_{\rm g\_L2}$) and $L_{3}$ ($= V_{\rm g\_L3}$) signals, as shown in Fig. 8(a). Since $L_{2}$ or $L_{3}$ must be turned on at the same time as Ref-WL for the AND/OR operation, each enable signal is combined with Ref-WL using an AND logic gate. The proposed CIM can be designed as shown in Fig. 8(b) because it only requires Ref-WL and $L_{C}$ to be turned on simultaneously for both AND and OR operations.
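A minimal Python sketch of the reference-WL enables in the two WL decoders is given below. The AND gating of $L_{2}$ and $L_{3}$ with Ref-WL follows the text. Two points are assumptions made only for illustration, since the gate-level details are in Fig. 8: Ref-WL is modeled as NOT(BS), and $L_{C}$ is modeled as Ref-WL AND (CIM_AND_EN OR CIM_OR_EN).

```python
def previous_ref_wls(bs: int, cim_and_en: int, cim_or_en: int) -> dict:
    """Reference WL enables of the previous CIM (three reference WLs)."""
    ref_wl = int(not bs)  # ASSUMPTION: Ref-WL is driven in the deactivated bank
    return {
        "Ref-WL": ref_wl,
        "L2": ref_wl & cim_and_en,  # AND of enable signal and Ref-WL (from text)
        "L3": ref_wl & cim_or_en,   # AND of enable signal and Ref-WL (from text)
    }


def proposed_ref_wls(bs: int, cim_and_en: int, cim_or_en: int) -> dict:
    """Reference WL enables of the proposed CIM (two reference WLs)."""
    ref_wl = int(not bs)  # ASSUMPTION, as above
    # ASSUMED gate: L_C is on together with Ref-WL for either AND or OR.
    return {"Ref-WL": ref_wl, "L_C": ref_wl & (cim_and_en | cim_or_en)}


for name, and_en, or_en in (("Ref", 0, 0), ("AND", 1, 0), ("OR", 0, 1)):
    print(name, previous_ref_wls(0, and_en, or_en), proposed_ref_wls(0, and_en, or_en))
```

The point of the sketch is the line count: the previous CIM needs separate $L_{2}$ and $L_{3}$ rows, whereas a single $L_{C}$ row serves both operations in the proposed CIM.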

Table 1. Signal of each operation.

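| Operation | S0-S3 | W0-W6 | CIM_AND_EN | CIM_OR_EN | BS |
|---|---|---|---|---|---|
| Read/Write data | SL, BL select signal | WL select signal | 0 | 0 | 1 |
| Ref | | | 0 | 0 | 0 |
| AND | | | 1 | 0 | 0 |
| OR | | | 0 | 1 | 0 |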

The total number of signals required for the previous and proposed CIM decoders is 14 (S0-S3, W0-W6, BS, CIM_AND_EN, and CIM_OR_EN); that is, two additional signals are required for the CIM operation. The WL decoders of both the previous and proposed CIMs have one more logic stage than that of a typical memory. The proposed CIM is similar to the previous CIM in terms of decoder design; however, it is more area-efficient because it requires one fewer reference WL for the CIM operation. Because the BS signal is placed at the first stage, the designed decoder blocks the unused BLs, SLs, and WLs and thereby prevents the internal nodes from toggling, which also makes it energy-efficient.
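The signal counts quoted in this section follow from the array dimensions (16 BLs, 128 WLs, and 16 SLs). The Python sketch below reproduces the arithmetic; `address_bits` is an illustrative helper name.

```python
def address_bits(n_lines: int) -> int:
    """Number of binary select signals needed to address n_lines lines."""
    return (n_lines - 1).bit_length()


N_BL = N_SL = 16  # BL and SL share the same select signals S0-S3
N_WL = 128

s_bits = address_bits(N_BL)          # S0-S3
w_bits = address_bits(N_WL)          # W0-W6
typical_total = s_bits + w_bits + 1  # plus bank select (BS)
cim_total = typical_total + 2        # plus CIM_AND_EN and CIM_OR_EN

print(s_bits, w_bits, typical_total, cim_total)  # 4 7 12 14
```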

3. SA

The PCSA, shown in Fig. 9, charges the OUT node connected to the reference cell and the OUTB node connected to the data cell to $V_{\rm DD}$ in the pre-charge stage. In the discharge stage, $I_{\rm data}$ and $I_{\rm ref}$ create a difference in the discharge rate between OUT and OUTB, which is sensed through positive feedback. The PCSA is used in many logic circuits because it offers fast sensing, low sensing power, and reduced standby power.

Fig. 9. Schematic of previous CIM.


However, for the proposed CIM with the MCR scheme, a serious sensing error occurs when the PCSA is used. In the MCR scheme, the reference cells are located at the end of the memory array, at WL127. When the sensed data cell is far from the reference cells, that is, when the data cell is close to WL0, serious read failures occur. Unlike $R_{\rm P}$, whose read yield stays close to 100%, $R_{\rm AP}$ exhibits a sharp decrease in read yield, to 50% from WL80 and to 0% (100% read failure) from WL30, as depicted in Fig. 10. This is because the PCSA output is determined by the Elmore delay rather than by the resistance difference between the data and the reference.

Fig. 10. Memory read yield of the previous CIM when TMR is 200%.


Fig. 11. Simplified previous CIM.


A simplified representation of the previous CIM is shown in Fig. 11. The Elmore delay at the OUT node can be expressed as

$ \begin{aligned} \text{Elmore delay} ={}& C_{\rm SLMUX}R_{\rm SLMUX}+C_{\rm SL}\left(R_{\rm SL}+R_{\rm SLMUX}\right)\\ &+C_{\rm BL}\left(R_{\rm BL}+R_{\rm MTJ}+R_{\rm SL}+R_{\rm SLMUX}\right)\\ &+C_{\rm SA}\left(R_{\rm SA}+R_{\rm BL}+R_{\rm MTJ}+R_{\rm SL}+R_{\rm SLMUX}\right)\\ &+C_{L}\left(R_{L}+R_{\rm SA}+R_{\rm BL}+R_{\rm MTJ}+R_{\rm SL}+R_{\rm SLMUX}\right). \end{aligned} $

For a data cell located at WL0, $R_{\rm BL} \approx 0$ and $C_{\rm BL} \approx 0$, and for a data cell located at WL127, $R_{\rm SL} \approx 0$ and $C_{\rm SL} \approx 0$. The Elmore delays in the two cases can therefore be summarized as

$ \begin{aligned} d_{WL0} ={}& C_{\rm SLMUX}R_{\rm SLMUX}+C_{\rm SL}\left(R_{\rm SL}+R_{\rm SLMUX}\right)\\ &+C_{\rm SA}\left(R_{\rm SA}+R_{\rm MTJ}+R_{\rm SL}+R_{\rm SLMUX}\right)\\ &+C_{L}\left(R_{L}+R_{\rm SA}+R_{\rm MTJ}+R_{\rm SL}+R_{\rm SLMUX}\right),\\ d_{WL127} ={}& C_{\rm SLMUX}R_{\rm SLMUX}+C_{\rm BL}\left(R_{\rm BL}+R_{\rm MTJ}+R_{\rm SLMUX}\right)\\ &+C_{\rm SA}\left(R_{\rm SA}+R_{\rm BL}+R_{\rm MTJ}+R_{\rm SLMUX}\right)\\ &+C_{L}\left(R_{L}+R_{\rm SA}+R_{\rm BL}+R_{\rm MTJ}+R_{\rm SLMUX}\right). \end{aligned} $

If the BL and SL are assumed to have the same parasitics, so that $R_{\rm SL}=R_{\rm BL}=R$ and $C_{\rm SL}=C_{\rm BL}=C$, and the terms common to $d_{WL0}$ and $d_{WL127}$ are removed, the remaining Elmore delays at WL0 and WL127 are expressed as

$ d_{WL0}=0,~d_{WL127}=C R_{\rm MTJ}. $

Even for the same data, the data cell located at WL0 discharges faster than the data cell located at WL127 because $d_{WL0}<d_{WL127}$. If the reference cell is located at WL127, the discharge rate of the OUTB node connected to the data cell increases as the data cell approaches WL0, regardless of the value of $R_{\rm data}$, so OUT unconditionally becomes 1 ($V_{\rm DD}$). Therefore, as the data cell approaches WL0, the yield of $R_{\rm P}$ (state 1), which should discharge faster than $R_{\rm ref}$, increases, whereas the yield of $R_{\rm AP}$ (state 0), which should discharge more slowly, decreases.
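The simplification above can be checked numerically. The Python sketch below implements the general Elmore delay expression term by term and evaluates it for the WL0 and WL127 cases. All parasitic values ($R$, $C$, and the SLMUX, SA, and load values) are assumed purely for illustration and are not taken from the simulated design; only $R_{\rm AP} = 9$ k$\Omega$ comes from the paper.

```python
def elmore_delay(r_mtj, r_bl, c_bl, r_sl, c_sl,
                 r_slmux=500.0, c_slmux=5e-15,
                 r_sa=1e3, c_sa=10e-15,
                 r_l=100.0, c_l=5e-15):
    """Elmore delay at the SA output node, following the general expression.

    Default SLMUX, SA, and load parasitics are ASSUMED illustrative values.
    """
    return (c_slmux * r_slmux
            + c_sl * (r_sl + r_slmux)
            + c_bl * (r_bl + r_mtj + r_sl + r_slmux)
            + c_sa * (r_sa + r_bl + r_mtj + r_sl + r_slmux)
            + c_l * (r_l + r_sa + r_bl + r_mtj + r_sl + r_slmux))


R = 200.0    # ASSUMED line resistance (ohms), R_SL = R_BL = R
C = 50e-15   # ASSUMED line capacitance (farads), C_SL = C_BL = C
R_AP = 9e3   # antiparallel MTJ resistance from the paper (ohms)

# Data cell at WL0: R_BL ~ 0 and C_BL ~ 0.
d_wl0 = elmore_delay(R_AP, r_bl=0.0, c_bl=0.0, r_sl=R, c_sl=C)
# Data cell at WL127: R_SL ~ 0 and C_SL ~ 0.
d_wl127 = elmore_delay(R_AP, r_bl=R, c_bl=C, r_sl=0.0, c_sl=0.0)

delay_gap = d_wl127 - d_wl0
print(f"d_WL127 - d_WL0 = {delay_gap * 1e12:.1f} ps")
print(f"C * R_MTJ       = {C * R_AP * 1e12:.1f} ps")
```

For any choice of the common parasitics, the difference between the two delays reduces to $C R_{\rm MTJ}$, which is why the same data discharges at a different rate depending on its WL position.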

Therefore, to use the PCSA, it is necessary to use the reference BL structure, in which the data and reference cells use the same WL [12]. However, because the reference BL structure inevitably incurs a large area overhead, the proposed CIM uses 1) the MCR scheme [17], which is based on the reference WL structure, and 2) the OCCS-SA [13], which makes the proposed CIM insensitive to the Elmore delay.

The OCCS-SA in Fig. 13 charges the OUT (OUTB) node with the sampled current $I_{M1}$ ($I_{M2}$), while M12 (M13) discharges the OUT (OUTB) node with the current $I_{\rm data}$ ($I_{\rm ref}$). The OCCS-SA achieves strong positive feedback by creating a current path to GND through M12 (M13). In addition, M12 (M13) acts as a data-dependent reference generator (DDRG), which doubles $I_{\rm SA}$ ($= I_{\rm data} - I_{\rm ref}$). Owing to offset cancellation, the OCCS-SA is highly reliable and relatively fast, although it is slower than the PCSA. The PCSA is fast, but its result is less reliable because offset cancellation is not applied. However, the OCCS-SA used in the proposed CIM uses more transistors than the PCSA in the previous CIM. Since the number and size of transistors directly determine the area overhead, replacing the SA is not worthwhile if the additional SA area exceeds the area saved by removing one reference WL. Therefore, we compare the area that each SA requires for a yield of 100%. Here, the yield is evaluated for $R_{\rm AP}$ data located at WL0, which is the most difficult case to read, and the area is estimated from the width and length of the transistors used.

Fig. 12. $R_{\rm AP}$ read yield of OCCS-SA and PCSA according to SA area.


The OCCS-SA of the proposed CIM achieves an $R_{\rm AP}$ read yield of 100% with an area of 2 $\mu$m$^{2}$, as shown in Fig. 12. In contrast, the PCSA shows severe read failures, with a 0% yield even at 11 $\mu$m$^{2}$ (about 5.5 times 2 $\mu$m$^{2}$), and requires 21 $\mu$m$^{2}$, about 10.5 times the area of the OCCS-SA, to achieve a 100% yield. This is because the PCSA senses through the Elmore delay, as described above. Increasing the SA size reduces $R_{\rm SA}$ and increases $C_{\rm SA}$, so the term $C_{\rm SA}(R_{\rm SA} + R_{\rm BL} + R_{\rm MTJ} + R_{\rm SL} + R_{\rm SLMUX})$ carries more weight in the Elmore delay equation. The difference between $d_{WL0}$ and $d_{WL127}$ is $C R_{\rm MTJ}$; that is, it is set by the BL (SL) capacitance and $R_{\rm MTJ}$. Therefore, for accurate sensing, the PCSA must satisfy $C_{\rm BL}$ ($C_{\rm SL}$) $\ll C_{\rm SA}$, which requires a large SA. As a result, the PCSA needs a large area overhead to reach a 100% yield, which can lead to power consumption and delay problems and makes it unsuitable for the proposed CIM. By using the OCCS-SA instead of the PCSA, the proposed CIM increases reliability while keeping the area overhead small.

Fig. 13. Transistor-level schematic and transistor size of the OCCS-SA used.


IV. SIMULATION

1. Simulation Conditions

The proposed CIM simulations use 28 nm process technology and HSPICE. Fig. 3 shows a simplified array structure of the proposed CIM. A transistor-level schematic and transistor size of the SA are shown in Fig. 13. In other circuits, the NMOS has a width of 2 $\mu$m and length of 0.03 $\mu$m, and PMOS has a width of 4 $\mu$m and length of 0.03 $\mu$m. $R_{\rm P}$ is 3 k$\Omega$ and $R_{\rm AP}$ is 9 k$\Omega$ with a TMR of 200%. A 4% variation in the MTJ process was added for reliability verification. In this simulation, the PCSA for comparison was designed to be 8 $\mu$m${}^{2}$, similar to the OCCS-SA area used (7.84 $\mu$m${}^{2}$).

Fig. 14. Memory read yield of the proposed CIM when TMR is 200%.


2. Simulation Results

Fig. 14 shows the read yield of the proposed CIM as a function of the WL of the data when $V_{\rm DD} = 1$ V, the gate voltage of the clamp NMOS ($V_{\rm G\_clamp}$) $= 0.8$ V, the gate voltage of the access transistor ($V_{\rm access}$) $= 1.2$ V, and the sensing time ($t_{\rm sen}$) $= 10$ ns. Compared to the previous CIM (Fig. 10), where the yield rapidly decreases as the data approaches WL0, the proposed CIM maintains a 100% yield all the way to WL0. Even when all conditions are kept the same and only the TMR is reduced to 100% ($R_{\rm P} = 3$ k$\Omega$, $R_{\rm AP} = 6$ k$\Omega$), the proposed CIM shows a yield of 100%, as shown in Fig. 15, while the previous CIM suffers serious read failures, with an $R_{\rm AP}$ (state 0) yield of 0% from WL90.

Fig. 15. Memory read yield when TMR is 100%. (a) Previous CIM. (b) Proposed CIM.


The worst cases for sensing the CIM AND and OR operations are AND(1,1) and OR(0,1), respectively. This is because the difference between $R_{\rm ref}$ and $R_{\rm data}$ is smallest for these cases, as depicted in Fig. 5. Fig. 16 shows the worst-case yield of the previous and proposed CIMs. The proposed CIM has a 100% yield in all cases, including AND(1,1) and OR(0,1). In contrast, the previous CIM shows high failure rates in AND(1,0), AND(0,0), and OR(0,0), where the output should be zero, rather than in AND(1,1) and OR(1,0), where the resistance difference is small. This is because its output is determined by the Elmore delay, which depends on the WL of the data, and not by the difference between $R_{\rm ref}$ and $R_{\rm data}$, as described in Section III-3. Consequently, the previous CIM outputs 1 unconditionally when the data approaches WL0; thus, its result is unreliable even though it provides a 100% yield for AND(1,1), OR(1,0), and OR(1,1).
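The worst-case operand combinations can be reproduced with a short resistance-margin calculation in Python. The sketch assumes that the two operand cells on neighboring WLs appear in parallel on the sensed path, so that $R_{\rm data}$ is the parallel combination of the two cell resistances, and that a data resistance below the reference is read as 1. Both assumptions are illustrative simplifications that ignore access-transistor and line resistances.

```python
from itertools import product

R_P, R_AP = 3e3, 9e3  # state 1 (low) and state 0 (high) resistances in ohms


def parallel(r_a: float, r_b: float) -> float:
    """Equivalent resistance of two resistors in parallel."""
    return r_a * r_b / (r_a + r_b)


def cell_resistance(bit: int) -> float:
    """State 1 is stored as R_P and state 0 as R_AP."""
    return R_P if bit == 1 else R_AP


r_ref = 2 * parallel(R_P, R_AP)
references = {"AND": parallel(r_ref, R_P), "OR": parallel(r_ref, R_AP)}
expected = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b}

margins = {}  # (operation, a, b) -> |R_data - R_ref| in ohms
sensed_ok = True
for op, r_op_ref in references.items():
    for a, b in product((0, 1), repeat=2):
        r_data = parallel(cell_resistance(a), cell_resistance(b))  # ASSUMED model
        margins[(op, a, b)] = abs(r_data - r_op_ref)
        sensed_ok &= int(r_data < r_op_ref) == expected[op](a, b)

worst = {op: min((k for k in margins if k[0] == op), key=margins.get)
         for op in references}
print("ideal comparison matches the truth tables:", sensed_ok)
for op, key in worst.items():
    print(f"{op}: smallest margin {margins[key]:.0f} ohm at inputs {key[1:]}")
```

Under these assumptions, the smallest margins occur at AND(1,1) and at OR with one operand in state 1, which is consistent with the worst cases named above. OR(0,1) and OR(1,0) give the same margin by symmetry.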

Fig. 16. Worst case CIM yield. (a) Previous CIM. (b) Proposed CIM.


V. COMPARISON

Table 2 presents a brief comparison of the previous and proposed CIMs. As described earlier, both CIMs require more signals than a typical memory to perform general memory and CIM operations, and the total number of signals required is 14. When the reference voltage or current required for the CIM operation is generated using the MCR scheme, the proposed CIM requires one fewer reference WL than the previous CIM. Therefore, the proposed CIM is more area-optimized and has the advantage of increasing the memory capacity. When the memory read yields at WL0, WL30, WL60, WL90, and WL127 are simulated and averaged using a TMR of 200%, the proposed CIM shows a read yield of 100% for both $R_{\rm P}$ and $R_{\rm AP}$. In contrast, the previous CIM shows an average $R_{\rm AP}$ read yield of only 41.8%, although its $R_{\rm P}$ read yield is 100%. In addition, the AND(0,1) operation of the previous CIM yields 0% at all WLs, indicating that the previous CIM does not work properly, whereas the proposed CIM shows a 100% yield in all CIM operations. As shown in Fig. 12, the PCSA area (21 $\mu$m$^{2}$) required for an $R_{\rm AP}$ yield of 100% is approximately 2.7 times the area (7.84 $\mu$m$^{2}$) of the OCCS-SA used. Since AND/OR operations are harder to sense than READ and require a larger SA area, the PCSA area needed for a CIM yield of 100% is even larger.
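The area ratios quoted in Sections III-3 and V can be recomputed from the reported SA areas. The Python sketch below uses only the area values stated in the text; the variable names are illustrative.

```python
# SA areas reported in the text, in square micrometers.
AREA_OCCS_MIN = 2.0    # OCCS-SA area at which the R_AP yield reaches 100%
AREA_PCSA_FAIL = 11.0  # PCSA area at which the R_AP yield is still 0%
AREA_PCSA_MIN = 21.0   # PCSA area required for an R_AP yield of 100%
AREA_OCCS_USED = 7.84  # OCCS-SA area used in the simulations

ratio_fail = AREA_PCSA_FAIL / AREA_OCCS_MIN
ratio_min = AREA_PCSA_MIN / AREA_OCCS_MIN
ratio_used = AREA_PCSA_MIN / AREA_OCCS_USED

print(f"PCSA (0% yield) vs. minimum OCCS-SA:   {ratio_fail:.1f}x")
print(f"PCSA (100% yield) vs. minimum OCCS-SA: {ratio_min:.1f}x")
print(f"PCSA (100% yield) vs. OCCS-SA used:    {ratio_used:.2f}x")
```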

Table 2. Comparison of the previous and proposed CIMs.

| | Previous CIM (TMAG'21 [9]) | Proposed CIM |
|---|---|---|
| Total number of signals | 14 | 14 |
| Required number of reference WLs 1) | 3 (Ref-WL, $L_{2}$, $L_{3}$) | 2 (Ref-WL, $L_{C}$) |
| Type of SA | PCSA | OCCS-SA |
| Memory read yield 2) | $R_{\rm P}$: 100%, $R_{\rm AP}$: 41.8% | $R_{\rm P}$: 100%, $R_{\rm AP}$: 100% |
| Worst case CIM read yield | 0% 3) | 100% |

1) Using the MCR scheme.

2) Average memory read yield when data is located at WL0, WL30, WL60, WL90, and WL126. TMR $= 200\%$ ($R_{\rm P} = 3$ k$\Omega$, $R_{\rm AP} = 9$ k$\Omega$) is used.

3) AND(0,1) fails on all WLs.

VI. CONCLUSION

In this study, we proposed an area-optimized CIM platform that reduces the required number of reference WLs and increases the data capacity. Compared with the previous CIM, which performs AND/OR operations with three reference WLs and uses the PCSA, the proposed CIM uses two reference WLs and the OCCS-SA, resulting in a more area-optimized and reliable CIM. In addition, we briefly discussed the Elmore delay, which is the reason the PCSA cannot be used in the proposed CIM. Because fewer reference WLs are required, the proposed CIM architecture leaves more of the array available for data storage. Therefore, it is expected to help in developing area-efficient large-scale memory for the AI market, where data storage and processing volumes are rapidly increasing.

ACKNOWLEDGMENTS

This work was supported by Incheon National University Research Grant in 2023. The EDA tool was supported by the IC Design Education Center (IDEC), Korea.

References

[1] L. Wilson, International Technology Roadmap for Semiconductors (ITRS). Washington, DC, USA: Semiconductor Industry Association, 2013. [Online]. Available: https://www.semiconductors.org
[2] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, ``Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories,'' Proc. DAC, pp. 1-6, Jun. 2016.
[3] T. Na, ``Ternary output binary neural network with zero-skipping for MRAM-based digital in-memory computing,'' IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 70, no. 7, pp. 2655-2659, Jul. 2023.
[4] P.-H. Lee et al., ``33.1 A 16nm 32Mb embedded STT-MRAM with a 6ns read-access time, a 1M-cycle write endurance, 20-year retention at 150$^\circ$C and MTJ-OTP solutions for magnetic immunity,'' Proc. of 2023 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, pp. 494-496, 2023.
[5] Z. Xiao et al., ``Device variation-aware adaptive quantization for MRAM-based accurate in-memory computing without on-chip training,'' Proc. of 2022 International Electron Devices Meeting (IEDM), pp. 10.5.1-10.5.4, 2022.
[6] M. Hosomi et al., ``A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-RAM,'' IEEE IEDM Tech. Dig., pp. 459-462, Dec. 2005.
[7] S. Ikeda et al., ``A perpendicular-anisotropy CoFeB-MgO magnetic tunnel junction,'' Nature Mater., vol. 9, pp. 721-724, Jul. 2010.
[8] R. Takemura et al., ``A 32-Mb SPRAM with 2T1R memory cell, localized bi-directional write driver and `1'/`0' dual-array equalized reference scheme,'' IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 869-879, Apr. 2010.
[9] C. Wang, Z. Wang, G. Wang, Y. Zhang, and W. Zhao, ``Design of an area-efficient computing in memory platform based on STT-MRAM,'' IEEE Transactions on Magnetics, vol. 57, no. 2, pp. 1-4, Feb. 2021.
[10] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, ``Computing in memory with spin-transfer torque magnetic RAM,'' IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 3, pp. 470-483, Mar. 2018.
[11] J. Yu, G. Lee, and T. Na, ``High-performance sum operation with charge saving and sharing circuit for MRAM-based in-memory computing,'' IEIE J. Semicond. Technol. Sci. (JSTS), vol. 22, no. 2, pp. 111-121, Apr. 2024.
[12] T. Na, J. Kim, J. P. Kim, S. H. Kang, and S.-O. Jung, ``Reference-scheme study and novel reference scheme for deep submicrometer STT-RAM,'' IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 61, no. 12, pp. 3376-3385, Dec. 2014.
[13] T. Na, B. Song, J. P. Kim, S. H. Kang, and S.-O. Jung, ``Offset-canceling current-sampling sense amplifier for resistive nonvolatile memory in 65 nm CMOS,'' IEEE Journal of Solid-State Circuits, vol. 52, no. 2, pp. 496-504, Feb. 2017.
[14] G. P. Devaraj, R. Kabilan, J. Z. Gabriel, U. Muthuraman, N. Muthukumaran, and R. Swetha, ``Design and analysis of modified pre-charge sensing circuit for STT-MRAM,'' Proc. of 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, pp. 507-511, 2021.
[15] T. Na, S. H. Kang, and S.-O. Jung, ``STT-MRAM sensing: A review,'' IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 1, pp. 12-18, Jan. 2021.
[16] G. Wang et al., ``Compact modeling of perpendicular-magnetic-anisotropy double-barrier magnetic tunnel junction with enhanced thermal stability recording structure,'' IEEE Transactions on Electron Devices, vol. 66, no. 5, pp. 2431-2436, May 2019.
[17] T. Na, J. P. Kim, S. H. Kang, and S.-O. Jung, ``Multiple-cell reference scheme for narrow reference resistance distribution in deep submicrometer STT-RAM,'' IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 9, pp. 2993-2997, Sep. 2016.

Dasom Ahn

Dasom Ahn received her B.S. degree in electronics engineering from Incheon National University, Incheon, Korea, in 2024. She is currently pursuing an M.S. degree in intelligent semiconductor engineering from Incheon National University, Incheon, Korea. Her current research interests include PVT variation tolerant and low-power circuit designs for memory, microcontroller unit, and neuromorphic SoC.

Seongmin Ahn

Seongmin Ahn received his B.S. degree in electronics engineering from Incheon National University, Incheon, Korea, in 2023. He is currently pursuing an M.S. degree in electronics engineering from Incheon National University, Incheon, Korea. His current research interests include PVT variation tolerant and low-power circuit designs for memory, microcontroller unit, and neuromorphic SoC.

Taehui Na

Taehui Na received his B.S. and Ph.D. degrees in electrical & electronic engineering from Yonsei University, Seoul, Korea, in 2012 and 2017, respectively. From 2017 to 2019, he was with Samsung Electronics Co., Ltd., Hwasung, Korea, where he worked on phase-change random access memory (PRAM) and high-performance NAND (ZNAND) core circuit designs. Since 2019, he has been a professor at Incheon National University, Incheon, Korea. His current research interests are focused on process-voltage-temperature variation tolerant and low-power circuit designs for memory, microcontroller unit, and neuromorphic SoC.