
The Journal of Semiconductor Technology and Science (JSTS) is an international, peer-reviewed, and open-access journal that is published bimonthly.
- Scope: semiconductor processes, devices, circuits, and MEMS.
- Editor-in-Chief: Prof. Woo Young Choi (ECE, Seoul National University)
- Indexed within Science Citation Index Expanded (SCIE), SCOPUS, Korea Citation Index (KCI), and other databases.

Research article

JSTS (Journal of Semiconductor Technology and Science), IEIE, Vol. 24, No. 2, pp. 111-121

ISSN (print): 1598-1657

ISSN (online): 2233-4866

Received: 28 Jun 2023; Revised: 14 Nov 2023; Accepted: 25 Dec 2023

DOI: https://doi.org/10.5573/JSTS.2024.24.2.111

High-performance Sum Operation with Charge Saving and Sharing Circuit for MRAM-based In-memory Computing

Jangseok Yu¹, Geonwoo Lee¹, Taehui Na†

(Department of EE, Incheon National University, Incheon 22012, Korea)

† E-mail: taehui.na@inu.ac.kr

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. (www.theieie.org)

Abstract

In the era of big data, von Neumann architectures, with their separation of processor and memory, face limitations in terms of bandwidth and data movement overhead. MRAM-based in-memory computing (IMC) is a promising approach to address these issues, leveraging MRAM to perform simple logical operations directly within memory. However, implementing an n-bit full adder (FA) using a pre-charge sense amplifier requires “n + 1” stages. Although carry-lookahead adders can reduce the number of stages, they incur significant area overhead, which makes them unsuitable for IMC applications. Therefore, it is important to explore alternatives that can minimize the number of stages. In this paper, we propose a high-performance multi-bit FA utilizing a charge saving and sharing (CSS) circuit that acquires a carry every 4 bits and performs the sum operations in parallel, 4 bits at a time. The CSS circuit-based FA reduces the number of stages to “n/4 + 5” while minimizing the associated area overhead.

Index Terms

Charge saving and sharing circuit, in-memory computing, full adder, MRAM

I. INTRODUCTION

Over the past few decades, the volume of data being processed and stored has increased significantly. One of the most severe bottlenecks in conventional von Neumann computer architectures is the limited data bandwidth between the processor and memory [1-3]. Furthermore, data transfer between the processor and memory incurs high latency and energy consumption, which significantly degrades system performance and efficiency. This situation has resulted in memory bandwidth limitations, known as the “memory wall,” and has increased the data movement overhead and leakage current [4]. In-memory computing (IMC), an idea proposed several decades ago, aims to address these challenges by incorporating processing units directly into the memory itself [5]. The fundamental concept is to preprocess data within the memory and provide only intermediate results to the processor [2]. Such a computer architecture not only reduces data transfer bandwidth and power overhead but also enhances performance by executing simple logical operations within the memory [1].

In recent years, the emergence of new non-volatile memories (NVMs), such as resistive random access memory (RRAM), phase-change random access memory (PRAM), and spin-transfer torque magnetic random access memory (STT-MRAM), has opened up new possibilities for efficient implementation of IMC [6]. The resistance-based storage mechanism of these NVM devices offers unique processing capabilities, enabling energy-efficient logical computing within the memory itself. In this scenario, logical operations can be performed and the results can be stored in a non-volatile format on the memory chip [7]. Among these NVMs, STT-MRAM has garnered significant attention, with various prototype demonstrations and early commercial products [2]. Extensive research efforts have been dedicated to improving the efficiency of STT-MRAM at the device, circuit, and architectural levels [6, 8-10]. In this paper, we explore IMC based on STT-MRAM.

Numerous STT-MRAM-based IMC approaches have been proposed at the architectural level [2,11]. The capability to simultaneously activate multiple word lines (WLs) within a memory array can be leveraged to execute various arithmetic, logic, and vector operations [12,13]. The concurrent activation of memory cells enables the AND and OR operations in a single stage by utilizing a pre-charge sense amplifier (PCSA) [11]. Furthermore, a full adder (FA) can also be implemented by integrating a logic tree into the PCSA [11]. However, a multi-bit FA requires “n + 1” stages to perform an n-bit operation. Although digital circuits such as carry-lookahead adders (e.g., Kogge-Stone adder (KSA), Brent-Kung adder, Sklansky adder) can significantly reduce the number of stages, they entail significant area overhead and are unsuitable for memory arrays. Therefore, to reduce the number of stages while keeping the overhead within a memory array small, analog circuits are preferred over digital circuits.

In this study, we propose a high-performance multi-bit FA that incorporates a charge saving and sharing (CSS) circuit, which operates in the analog domain [14]. Similar to the carry-skip adder, we pre-compute the carry for every 4 bits to enable parallel computation of the 4-bit sum operations [15]. To compute the carry for every 4 bits, we employ the CSS circuit, while the 4-bit sum operation is performed using the PCSA with an integrated logic tree [11]. As a result, the proposed method reduces the required number of stages from “n + 1” to “n/4 + 5” while minimizing the area overhead.

The remainder of this paper is structured as follows: Section II provides the background information on STT-MRAM and PCSA; Section III describes the implementation of the state-of-the-art multi-bit FA and the proposed multi-bit FA using the CSS circuit; Section IV presents the simulation results; and finally, Section V offers the conclusion.

II. BACKGROUND

1. STT-MRAM

Fig. 1(a) illustrates a magnetic tunnel junction (MTJ), which serves as the fundamental storage element of STT-MRAM. The MTJ comprises a free layer, a tunnel barrier, and a pinned layer. Commonly employed materials for the tunnel barrier include AlOx and MgO, while the free layer is typically composed of CoFeB, Ru, CoFe, PtMn, and similar substances [16].

Fig. 1(b) demonstrates two states, namely parallel (P) and anti-parallel (AP), which are determined by the magnetization direction of the free layer. Owing to the tunneling magneto-resistance (TMR) effect, the MTJ exhibits one of two resistance states depending on whether it is in the P or AP state [17]. The TMR ratio is defined as

(1)

$ \mathrm{TMR} = \frac{R_{H} - R_{L}}{R_{L}} \times 100\% $

In the case of the P state, it is represented by low resistance (R_L), which corresponds to the data ‘1’. On the other hand, the AP state is indicated by high resistance (R_H), representing the data ‘0’. Fig. 1(c) depicts a single bit-cell configuration, known as 1T-1MTJ, in STT-MRAM. During a write operation, the ‘1’ data can be written by allowing current to flow from the bit-line (BL) to the source line (SL), while the ‘0’ data can be written by allowing the current to flow from SL to BL.
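As a small numerical illustration of Eq. (1), the following Python sketch evaluates the TMR ratio for a pair of resistances. The function name `tmr_percent` and the resistance values are assumptions chosen only for illustration; they are not device parameters from this work.

```python
def tmr_percent(r_h: float, r_l: float) -> float:
    """Return the TMR ratio of Eq. (1) in percent."""
    if r_l <= 0 or r_h < r_l:
        raise ValueError("expected 0 < R_L <= R_H")
    return (r_h - r_l) / r_l * 100.0


# Hypothetical resistances in ohms (illustrative only, not from the paper).
R_L = 3000.0  # parallel (P) state, data '1'
R_H = 6000.0  # anti-parallel (AP) state, data '0'

print(f"TMR = {tmr_percent(R_H, R_L):.1f} %")
```

A larger TMR ratio widens the gap between R_L and R_H, which leaves more room for placing a reference resistance between the two states.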

Fig. 1. (a) MTJ; (b) Two states of MTJ; (c) 1T-1MTJ bit-cell structure of STT-MRAM.

2. PCSA

The PCSA depicted in Fig. 2 enables the execution of read, AND/OR, carry, and sum operations [11]. The logic tree within the PCSA is utilized specifically for the FA operation. According to Table 1, L0 and L1 are held high during all operations except the sum (i.e., FA) operation.

Fig. 2. PCSA with the addition of a logic tree [11].

Table 1. Control signals for read, AND, OR, carry, and sum operations [11]

Operation	L0	/L1	L1	/L0	L2	L3
Read	1	0	1	0	0	0
AND	1	0	1	0	1	0
OR	1	0	1	0	0	1
Carry	1	0	1	0	/C_IN	C_IN
Sum	C_IN	/C_OUT	C_OUT	/C_IN	C_IN	/C_IN

A. Read Operation [13,18]

Fig. 3(a) demonstrates the read behavior when L2 and L3 are deactivated, as indicated in Table 1. During this read operation, the selected data cell (R_L or R_H) is compared to the reference cell (R_REF), and read by the PCSA. R_REF has a resistance value between R_L and R_H, as depicted in Fig. 4(a). The outcome of the read operation, as read by the PCSA, is shown in Fig. 5(a).

Fig. 3. (a) Circuit for read operation; (b) Circuit for AND, OR operation [19,20].

Fig. 4. (a) Resistance distribution of R_L, R_H, and R_REF [21,22]; (b) Resistance distribution when R_L, R_H, and R_REF are connected in parallel [11].

Fig. 5. (a) Results of read operation according to MTJ state; (b) Result of AND operation according to MTJ 'A' and 'B' states; (c) Results of OR operation based on MTJ 'A' and 'B' states [23].

B. AND and OR Operations [1,24]

A key approach for performing bit logic operations in an STT-MRAM macro involves organizing and distinguishing resistor combinations. In Fig. 2, by enabling two WLs simultaneously, the resistive state can be extended by connecting two resistors in parallel, as demonstrated in Fig. 3(b). Fig. 4(b) illustrates the resistance distribution of $R_L \parallel R_L$, $R_H \parallel R_L$, and $R_H \parallel R_H$ when two MTJs are connected in parallel, along with the reference resistors that distinguish the three resistance values. These resistance combinations are then connected to the PCSA, and the resulting OUT indicates an AND operation when only L2 is activated on the reference branch. Conversely, when only L3 is activated, OUT represents an OR operation.
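The way two parallel MTJs map onto AND and OR can be sketched numerically. The Python example below is a simplified behavioral model, not the PCSA circuit itself: the resistance values are hypothetical, each reference is assumed to lie midway between two adjacent parallel combinations, and the sense amplifier is assumed to output 1 when the data branch has the lower resistance.

```python
# Hypothetical MTJ resistances in ohms (illustrative only).
R_L = 3000.0  # P state, data '1'
R_H = 6000.0  # AP state, data '0'


def parallel(r1: float, r2: float) -> float:
    """Equivalent resistance of two resistors connected in parallel."""
    return r1 * r2 / (r1 + r2)


def cell_resistance(bit: int) -> float:
    """Map a stored bit to its MTJ resistance ('1' -> R_L, '0' -> R_H)."""
    return R_L if bit == 1 else R_H


# Assumed references, placed midway between adjacent combinations.
R_REF_AND = (parallel(R_L, R_L) + parallel(R_L, R_H)) / 2  # separates '11' from the rest
R_REF_OR = (parallel(R_L, R_H) + parallel(R_H, R_H)) / 2   # separates '00' from the rest


def sense(a: int, b: int, r_ref: float) -> int:
    """Idealized sensing: output 1 if the data branch resistance is below the reference."""
    return 1 if parallel(cell_resistance(a), cell_resistance(b)) < r_ref else 0


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", sense(a, b, R_REF_AND), "OR:", sense(a, b, R_REF_OR))
```

In this model, choosing the reference plays the role that activating L2 or L3 plays on the reference branch of the PCSA.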

III. MULTI-BIT FA

1. State-of-the-art Multi-bit FA [11]

Several papers have proposed the use of the PCSA for sum operations [11-13]. The sum operation, as proposed by Wang et al. [11], can be executed by utilizing the PCSA equipped with the logic tree illustrated in Fig. 2.

A. Carry Operation

Fig. 6(a) shows the single-bit carry operation. The carry result, denoted as C_OUT, is determined by the MAJ(A, B, C_IN) function, where MAJ(A, B, 0) represents the AND operation (i.e., AND(A, B)) and MAJ(A, B, 1) represents the OR operation (i.e., OR(A, B)). In the figure, the red and blue paths correspond to the AND and OR operations, respectively.

Fig. 6. Single-bit FA using PCSA: (a) Carry operation (red path when C_IN = 0 and blue path when C_IN = 1); (b) Sum operation when C_IN = 0 (red path when C_OUT = 0 and blue path when C_OUT = 1); (c) Sum operation when C_IN = 1 (red path when C_OUT = 0 and blue path when C_OUT = 1).

B. Sum Operation

The sum result is determined by the five-input majority function MAJ(A, B, C_IN, /C_OUT, /C_OUT), with the control signals set as shown in Table 1: L0, /L1, L1, and /L0 correspond to C_IN, /C_OUT, C_OUT, and /C_IN, respectively. In Fig. 6(b), the red path represents the case where MAJ(A, B, 0, 1, 1) becomes OR(A, B) and the blue path represents the case where MAJ(A, B, 0, 0, 0) evaluates to zero. Fig. 6(c) shows the case where the red path of MAJ(A, B, 1, 1, 1) yields 1 and the blue path of MAJ(A, B, 1, 0, 0) yields AND(A, B). This sum result can be achieved using the logic tree or by reusing the AND and OR operations. Because the sum operation requires the C_OUT value, C_OUT must be obtained in the preceding stage so that the sum result can be obtained in the next stage.
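The majority-based formulation of the carry and sum can be checked exhaustively. The Python sketch below is a purely logical model (the helper names `maj`, `carry_bit`, and `sum_bit` are introduced here for illustration); it confirms that MAJ(A, B, C_IN) and MAJ(A, B, C_IN, /C_OUT, /C_OUT) reproduce the carry and sum of a one-bit full adder for all eight input combinations.

```python
from itertools import product


def maj(*bits: int) -> int:
    """Majority of an odd number of bits: 1 if more than half of them are 1."""
    return 1 if sum(bits) > len(bits) // 2 else 0


def carry_bit(a: int, b: int, c_in: int) -> int:
    """C_OUT = MAJ(A, B, C_IN); reduces to AND for C_IN = 0 and OR for C_IN = 1."""
    return maj(a, b, c_in)


def sum_bit(a: int, b: int, c_in: int) -> int:
    """S = MAJ(A, B, C_IN, /C_OUT, /C_OUT), using the carry from the previous stage."""
    c_out = carry_bit(a, b, c_in)
    return maj(a, b, c_in, 1 - c_out, 1 - c_out)


for a, b, c_in in product((0, 1), repeat=3):
    print(a, b, c_in, "-> C_OUT =", carry_bit(a, b, c_in), "S =", sum_bit(a, b, c_in))
```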

Fig. 7(a) shows the schematic of the state-of-the-art multi-bit FA [11]. Fig. 7(b) illustrates the SAE signal for the PCSA. As Fig. 7(c) shows, the sum operation for the current bit and the carry operation for the subsequent bit are executed concurrently. The final outcome of the sum operation, S_n, is therefore obtained at stage “n + 1”.

Fig. 7. (a) Schematic of multi-bit FA [11]; (b) SAE signal for the PCSA; (c) Result of multi-bit FA according to the number of stages.
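The staging of the state-of-the-art multi-bit FA can be illustrated with a short behavioral model. The sketch below assumes the schedule described for Fig. 7(c): in each stage the carry of one bit is computed while the sum of the preceding bit is computed concurrently. The function `ripple_fa` and its integer encoding of the operands are assumptions made for this illustration.

```python
def maj(*bits: int) -> int:
    """Majority of an odd number of bits."""
    return 1 if sum(bits) > len(bits) // 2 else 0


def ripple_fa(a: int, b: int, c_in: int, n: int):
    """Behavioral model of an n-bit majority-based adder.

    Returns (sum, carry_out, stages). Bit i of a and b has weight 2**i.
    """
    c = [c_in] + [0] * n  # c[i] is the carry into bit i; c[n] is the final carry
    s = 0
    stages = 0
    for stage in range(1, n + 2):
        stages = stage
        if stage >= 2:
            # Sum of bit (stage - 2), using the carry-out found in the previous stage.
            i = stage - 2
            ai, bi = (a >> i) & 1, (b >> i) & 1
            s |= maj(ai, bi, c[i], 1 - c[i + 1], 1 - c[i + 1]) << i
        if stage <= n:
            # Carry of bit (stage - 1), computed concurrently with the sum above.
            i = stage - 1
            ai, bi = (a >> i) & 1, (b >> i) & 1
            c[i + 1] = maj(ai, bi, c[i])
    return s, c[n], stages


result, carry_out, stages = ripple_fa(0b1011, 0b0110, 1, 4)
print(f"sum = {result:04b}, carry = {carry_out}, stages = {stages}")
```

Because each carry depends on the carry of the bit below it, the loop cannot be shortened, which is the origin of the “n + 1” stage count.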

2. Proposed Multi-bit FA using CSS Circuit

Fig. 8(a) shows the array structure of the proposed multi-bit FA. Depending on the switch state, inputs A and B can be read simultaneously (switch closed) or separately (switch open). Fig. 8(b) shows the schematic of the CSS circuit, which stores charge in the capacitors and then shares it by closing the switches.

Fig. 8. (a) Array structure for the proposed multi-bit FA; (b) Schematic of the CSS circuit.

Fig. 9. (a) Stage 1 operation; (b) Stage 1.5 operation; (c) Stage 2 operation; (d) Result of SA as a function of stage; (e) SAE signal.

To obtain C_OUT(x+3) from A(x+3)A(x+2)A(x+1)A(x) + B(x+3)B(x+2)B(x+1)B(x) + C_IN, the voltages V_CIN, V_A(x), V_B(x), V_A(x+1), V_B(x+1), V_A(x+2), V_B(x+2), V_A(x+3), and V_B(x+3) are stored as V_CAP1, V_CAP2, V_CAP3, V_CAP4, V_CAP5, V_CAP6, V_CAP7, V_CAP8, and V_CAP9, respectively. The size of each capacitor in the CSS circuit is determined by the weight of the corresponding digit.

(2)

$ CAP1=CAP2=CAP3 $

(3)

$ CAP4=CAP5=2CAP1 $

(4)

$ CAP6=CAP7=2^{2}CAP1 $

(5)

$ CAP8=CAP9=2^{3}CAP1 $

Relative to CAP1, CAP2, and CAP3, which store C_IN and the least significant bits, the capacitors for the second, third, and fourth bits are 2x, 4x, and 8x as large, respectively. Charge sharing occurs when all the switches are closed, so that all the capacitors settle to the same voltage. This voltage is V_CSS.

(6)

$ V_{CSS} = \frac{\sum_{i=1}^{9} \left( V_{CAPi} \times CAPi \right)}{\sum_{i=1}^{9} CAPi} $

(7)

$ V_{REF} = \mathrm{VDD} \times \frac{CAP2 + CAP4 + CAP6 + CAP8}{\sum_{i=1}^{9} CAPi} $

Fig. 10. (a) FA operation in parallel by 4 bits; (b) 4-bit adder; (c) Result as per stage.

V_REF is the reference voltage against which V_CSS is compared to produce the output, OUT, of the SA. The value of C_OUT(x+3) can be read using the latch-type SA [25,26], as depicted in Fig. 8(b).
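Equations (2)-(7) can be exercised with an idealized model of the charge sharing. The Python sketch below makes several assumptions that are not stated in the text: a stored '1' charges its capacitor to VDD and a stored '0' leaves it at 0 V, the capacitors are perfectly matched, and the SA is an ideal comparator that outputs 1 only when V_CSS is strictly greater than V_REF. Exact fractions are used so that the boundary case is not blurred by rounding.

```python
from fractions import Fraction

# Capacitor sizes CAP1..CAP9 in units of CAP1, following Eqs. (2)-(5).
WEIGHTS = [1, 1, 1, 2, 2, 4, 4, 8, 8]


def css_carry(a: int, b: int, c_in: int, vdd: Fraction = Fraction(1)):
    """Idealized 4-bit group carry via charge sharing.

    a and b are 4-bit integers (bit k has weight 2**k). Returns (carry, V_CSS, V_REF).
    """
    # Stored bits in the order C_IN, A(x), B(x), A(x+1), B(x+1), ..., A(x+3), B(x+3).
    bits = [c_in]
    for k in range(4):
        bits += [(a >> k) & 1, (b >> k) & 1]
    total_cap = sum(WEIGHTS)
    # Eq. (6): charge-weighted average of the capacitor voltages.
    v_css = sum(vdd * bit * w for bit, w in zip(bits, WEIGHTS)) / total_cap
    # Eq. (7): reference built from CAP2, CAP4, CAP6, and CAP8.
    v_ref = vdd * (WEIGHTS[1] + WEIGHTS[3] + WEIGHTS[5] + WEIGHTS[7]) / total_cap
    carry = 1 if v_css > v_ref else 0  # assumed ideal comparator
    return carry, v_css, v_ref


carry, v_css, v_ref = css_carry(0b1111, 0b0000, 1)
print(f"carry = {carry}, V_CSS = {v_css} VDD, V_REF = {v_ref} VDD")
```

Under these assumptions the total capacitance is 31 units and V_REF equals 15/31 of VDD, so adjacent outcomes are separated by only 1/31 of VDD. A group whose weighted sum is exactly 15 lands on V_REF itself, which this model resolves as no carry.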

Fig. 9 illustrates the process of calculating C_OUT for every 4 bits. In Fig. 9(a), which represents stage 1, A1-A4 and B1-B4 are read using the PCSA, and the read values, along with C_IN, are stored in the capacitors of the CSS circuit. Fig. 9(b) corresponds to stage 1.5. At this stage, the switches in the CSS circuit are closed to obtain V_CSS, the shared voltage across the capacitors. Fig. 9(c) depicts the behavior during stage 2. Using the V_CSS obtained in stage 1.5, C_OUT4 (= C4, the carry-out of the fourth bit) is obtained using the SA. At the same time, A5-A8 and B5-B8 are read using the PCSA and stored in the CSS circuit along with C_OUT4. By repeating this process for each group of 4 bits, the final result shown in Fig. 9(d) is obtained.

Once the C_OUT values for every 4 bits are obtained through the CSS circuit, the 4-bit adder depicted in Fig. 10(a) and (b) performs the sum operation in parallel, processing 4 bits at a time. The resulting sum values can be observed in Fig. 10(c). Notably, all the sum operations are accomplished within a total of only “n/4 + 5” stages.
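The two stage-count expressions quoted in the text can be compared directly. The sketch below simply evaluates “n + 1” and “n/4 + 5” for several word lengths; the function names are introduced here for illustration, and n is assumed to be a multiple of 4.

```python
def stages_conventional(n: int) -> int:
    """Stage count of the state-of-the-art multi-bit FA: n + 1."""
    return n + 1


def stages_css(n: int) -> int:
    """Stage count of the CSS-based multi-bit FA: n/4 + 5 (n a multiple of 4)."""
    if n <= 0 or n % 4 != 0:
        raise ValueError("n must be a positive multiple of 4")
    return n // 4 + 5


for n in (4, 8, 16, 32, 64):
    conv, css = stages_conventional(n), stages_css(n)
    print(f"n = {n:2d}: conventional = {conv:2d}, CSS = {css:2d}, ratio = {conv / css:.2f}")
```

For n = 16 the expressions give 17 and 9 stages, and for n = 64 they give 65 and 21 stages. For n = 4 the fixed term of 5 outweighs the saving (5 versus 6 stages), so the benefit appears only at larger word lengths.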

IV. SIMULATION

The efficiency of the proposed MRAM-based IMC platform was evaluated by Cadence Spectre simulations with industry-compatible 28-nm model parameters.

Fig. 11 shows the read yield as a function of MTJ variation when reading STT-MRAM with the PCSA. The read yield decreases sharply as the MTJ variation increases. The proposed CSS circuit can be utilized with SAs other than the PCSA; therefore, to increase the read yield, an offset-canceling current-sampling SA [27], a single-cap offset-cancelled SA [28], or an offset-canceling single-ended SA [29] can be used, or a sensing circuit (SC) can be used as a pre-amplifier for the STT-MRAM. Examples of SCs include the source-degeneration SC [30] and the body-voltage SC [31].

Fig. 11. Read yield based on MTJ variation.

Capacitance mismatch can affect the accuracy of the calculation results. As shown in Table 2, the CSS circuit produces the correct result for capacitance mismatches of up to 8%, whereas the result is inverted for mismatches of 9% or more.

Fig. 12 shows the performance as a function of the number of bits in the adder. As n increases, the performance advantage over the state-of-the-art multi-bit FA [11] grows; in particular, when n = 64, the number of stages is reduced by more than a factor of 3. According to Table 3, compared to the state-of-the-art multi-bit FA [11], the proposed multi-bit FA using the CSS circuit increases the area by about 2 times and the energy by about 1.6 times. Therefore, it has an advantage over the state-of-the-art multi-bit FA [11] starting from 16 bits, where the number of stages is about half.

Fig. 12. Ratio of the stage count of the state-of-the-art multi-bit FA [11] to the stage count of the proposed multi-bit FA using the CSS circuit, as a function of the number of bits.

The 16-bit values of A (A16-A1), B (B16-B1), and C_IN are set to “1011 0111 1010 1100”, “0100 0011 0111 1001”, and “1”, respectively. Fig. 13 shows the results of the state-of-the-art multi-bit FA [11], while the results of the proposed multi-bit FA using the CSS circuit are shown in Fig. 14. Both sets of results are correct. The state-of-the-art multi-bit FA [11] required 17 stages to perform the operation, whereas the proposed multi-bit FA using the CSS circuit completed it in only 9 stages. Thus, for a 16-bit design, incorporating the CSS circuit into the existing multi-bit FA roughly halves the number of required stages, from 17 to 9.

Fig. 13. 16-bit results from the state-of-the-art multi-bit FA [11]. “1011 0111 1010 1100” (A16-A1) + “0100 0011 0111 1001” (B16-B1) + “1” (C_IN) = “0 1111 1011 0010 0110” (C16 S16-S1).

Fig. 14. 16-bit results of the proposed multi-bit FA using the CSS circuit. “1011 0111 1010 1100” (A16-A1) + “0100 0011 0111 1001” (B16-B1) + “1” (C_IN) = “0 1111 1011 0010 0110” (C16 S16-S1).
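The 16-bit test vector of Figs. 13 and 14 can be traced through an idealized model of the proposed flow. In the sketch below, the group carry uses the fact that, for ideal capacitors, comparing V_CSS with V_REF is equivalent to testing whether C_IN plus the two 4-bit operands exceeds 15; the 4-bit sums are modeled with plain integer arithmetic rather than the PCSA logic tree. Both simplifications, and the function names, are assumptions for illustration.

```python
def group_carry(a4: int, b4: int, c_in: int) -> int:
    """Ideal CSS comparison: V_CSS > V_REF is equivalent to c_in + a4 + b4 > 15."""
    return 1 if c_in + a4 + b4 > 15 else 0


def css_add(a: int, b: int, c_in: int, n: int = 16):
    """Return (sum, carry_out, group_carries) for an n-bit addition, n a multiple of 4."""
    groups = n // 4
    carries = [c_in]
    # Step 1: obtain the carry of each 4-bit group in turn (analog domain in the paper).
    for g in range(groups):
        a4, b4 = (a >> 4 * g) & 0xF, (b >> 4 * g) & 0xF
        carries.append(group_carry(a4, b4, carries[g]))
    # Step 2: with all group carries known, every 4-bit sum can be formed independently.
    total = 0
    for g in range(groups):
        a4, b4 = (a >> 4 * g) & 0xF, (b >> 4 * g) & 0xF
        total |= ((a4 + b4 + carries[g]) & 0xF) << 4 * g
    return total, carries[-1], carries[1:]


A = 0b1011_0111_1010_1100
B = 0b0100_0011_0111_1001
total, c16, group_carries = css_add(A, B, 1)
print(f"C16 = {c16}, S16-S1 = {total:016b}, group carries (C4, C8, C12, C16) = {group_carries}")
```

The model reproduces the result stated in the figure captions, with carries generated out of the two lower groups and none out of the two upper groups.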

Table 3 compares the performance, energy consumption, and area of the three multi-bit FAs on a 16-bit basis. The evaluation parameters include the number of stages, the number and size of PCSAs with logic trees, the number and size of additional transistors, the number of memory read operations, and the energy consumption. The state-of-the-art multi-bit FA [11] demonstrates superior area efficiency and low energy consumption; however, it suffers from a high number of stages (poor performance). Although the use of a KSA significantly reduces the number of stages, its large area overhead prevents it from being incorporated into the memory array. Other digital circuits, such as carry-lookahead adders, carry-select adders, and carry-skip adders, face similar area overhead challenges, which likewise prevents their inclusion in the memory array. To address this issue, it is necessary to limit the overhead while improving the performance by leveraging the analog domain instead of the digital domain [34]. Compared to the state-of-the-art multi-bit FA, the proposed multi-bit FA with the CSS circuit requires approximately half the number of stages. Additionally, it employs fewer transistors than the multi-bit FA with KSA. However, the proposed circuit requires more memory read operations than the other two multi-bit FAs (56 versus 32), and this increase in read operations is the reason for its higher energy consumption. The energy consumed by the capacitors is 22.56 fJ, which accounts for only 2.3% of the total energy consumption. In summary, the proposed multi-bit FA utilizing the analog domain offers intermediate performance between the other two FAs while effectively addressing the area overhead problem associated with the digital domain. Nevertheless, there is still a need to reduce energy consumption.

Table 2. CSS circuit operation result as a function of capacitance mismatch¹⁾

Capacitance mismatch	0%	1%	2%	3%	4%	5%
Result	Pass	Pass	Pass	Pass	Pass	Pass

Capacitance mismatch	6%	7%	8%	9%	10%	11%
Result	Pass	Pass	Pass	Fail	Fail	Fail

1) For the worst case, “1111” (A4-A1) + “0000” (B4-B1) + “1” (C_IN), the mismatch was applied so that the capacitors storing a 1 decrease in size and the capacitors storing a 0 increase in size.

Table 3. Comparison of 16-bit sum operation between state-of-the-art multi-bit FA, multi-bit FA using KSA, and proposed multi-bit FA using CSS circuit

	State-of-the-art multi-bit FA [11]	Multi-bit FA using KSA [32,33]	Proposed multi-bit FA using CSS circuit
Computing domain	Digital	Digital	Analog + Digital
Number of computing stages (performance)	17	1 + t_pg + 4*t_AO + t_xor	9
PCSA count (size¹⁾)	16 (2.92 um²)	32 (5.84 um²)	32 (5.84 um²)
Additional transistor count	0	2982	104.5
Additional size¹⁾	0 um²	7.16 um²	0.25 um²
Total size¹⁾ (area overhead)	2.92 um²	13 um²	6.09 um²
Memory read operation count	32	32	56
Energy consumption	598.7 fJ	755.2 fJ	969.3 fJ

1) The size is the size for the pre-layout and is the sum of the width*length of the transistor.
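The relative figures quoted in the discussion of Table 3 follow from the table entries by simple division. The short sketch below recomputes them; the variable names are introduced here for illustration, and the numbers are copied from Table 3 and the accompanying text.

```python
# Values copied from Table 3 (16-bit sum operation).
area_sota_um2 = 2.92      # total size, state-of-the-art multi-bit FA
area_css_um2 = 6.09       # total size, proposed multi-bit FA using CSS circuit
energy_sota_fj = 598.7    # energy, state-of-the-art multi-bit FA
energy_css_fj = 969.3     # energy, proposed multi-bit FA using CSS circuit
energy_cap_fj = 22.56     # energy consumed by the capacitors (from the text)
stages_sota, stages_css = 17, 9

area_ratio = area_css_um2 / area_sota_um2
energy_ratio = energy_css_fj / energy_sota_fj
cap_share_percent = energy_cap_fj / energy_css_fj * 100
stage_ratio = stages_sota / stages_css

print(f"area ratio     = {area_ratio:.2f}")
print(f"energy ratio   = {energy_ratio:.2f}")
print(f"capacitor share = {cap_share_percent:.1f} %")
print(f"stage ratio    = {stage_ratio:.2f}")
```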

V. CONCLUSIONS

In this paper, we propose a multi-bit FA designed specifically for high-performance sum operations in STT-MRAM-based IMC systems. The proposed multi-bit FA uses the CSS circuit in the analog domain to generate C_OUT for every 4 bits, followed by parallel 4-bit sum operations in the digital domain. Our circuit architecture requires only “n/4 + 5” stages for an n-bit operation, compared to the conventional “n + 1” stages. Moreover, it significantly reduces the area overhead compared to digital-domain multi-bit FAs, making it feasible for integration within a memory array. However, the proposed circuit, while effectively reducing the number of stages, requires twice the number of PCSAs and additional circuits compared to the state-of-the-art multi-bit FA. Its energy consumption is also higher. Our future work will therefore focus on minimizing both the area overhead and the energy consumption of the proposed circuit.

ACKNOWLEDGMENTS

This work was supported by Incheon National University Research Grant in 2022. The EDA tool was supported by the IC Design Education Center (IDEC), Korea.

References

[1] C. Wang et al., "Computing-in-memory paradigm based on STT-MRAM with synergetic read/write-like modes," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1-5.

[2] S. Jain et al., "Computing in memory with spin-transfer torque magnetic RAM," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 3, pp. 470-483, Mar. 2018.

[3] T. Na, "Ternary output binary neural network with zero-skipping for MRAM-based digital in-memory computing," IEEE Trans. Circuits Syst. II, Exp. Briefs, 2023.

[4] Z. He et al., "Exploring STT-MRAM based in-memory computing paradigm with application of image edge extraction," in 2017 IEEE International Conference on Computer Design (ICCD), Nov. 2017, pp. 439-446.

[5] H. S. Stone, "A logic-in-memory computer," IEEE Trans. Comput., vol. C-19, no. 1, pp. 73-78, Jan. 1970.

[6] T. Na et al., "STT-MRAM sensing: a review," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 1, pp. 12-18, Jan. 2021.

[7] M. Zabihi et al., "In-memory processing on the spintronic CRAM: From hardware design to application mapping," IEEE Trans. Comput., vol. 68, no. 8, pp. 1159-1173, Aug. 2019.

[8] D. Apalkov et al., "Spin-transfer torque magnetic random access memory (STT-MRAM)," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 9, no. 2, pp. 1-35, May 2013.

[9] R. Bishnoi et al., "Improving write performance for STT-MRAM," IEEE Trans. Magn., vol. 52, no. 8, pp. 1-11, Aug. 2016.

[10] L. Zhang et al., "Addressing the thermal issues of STT-MRAM from compact modeling to design techniques," IEEE Trans. Nanotechnol., vol. 17, no. 2, pp. 345-352, Mar. 2018.

[11] C. Wang et al., "Design of an area-efficient computing in memory platform based on STT-MRAM," IEEE Trans. Magn., vol. 57, no. 2, pp. 1-4, Feb. 2021.

[12] G. Patrigeon et al., "Design and evaluation of a 28-nm FD-SOI STT-MRAM for ultra-low power microcontrollers," IEEE Trans. Magn., vol. 7, no. 9, pp. 4982-4987, Sep. 2019.

[13] S. Angizi et al., "Design and evaluation of a spintronic in-memory processing platform for nonvolatile data encryption," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 9, pp. 1788-1801, Sep. 2018.

[14] H. Yu et al., "An adder using charge sharing and its application in DRAMs," in Proceedings 2000 International Conference on Computer Design, Sep. 2000.

[15] V. Vijay et al., "A Review On N-Bit Ripple-Carry Adder Carry-Select Adder And Carry-Skip Adder," Journal of VLSI Circuits and Systems, vol. 4, no. 01, pp. 27-32, Mar. 2022.

[16] J.-G. Zhu et al., "Magnetic tunnel junctions," Mater. Today, vol. 9, no. 11, pp. 36-45, Nov. 2006.

[17] M. Hosomi et al., "A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-RAM," in IEDM Tech. Dig., Dec. 2005, pp. 459-462.

[18] Y. Luo et al., "A variation robust inference engine based on STT-MRAM with parallel read-out," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020.

[19] S. Ikeda et al., "Magnetic tunnel junctions for spintronic memories and beyond," IEEE Trans. Electron Devices, vol. 54, no. 5, pp. 991-1002, May 2007.

[20] M. Zabihi et al., "Using spin-Hall MTJs to build an energy-efficient in-memory computation platform," in Proc. 20th Int. Symp. Qual. Electron. Design (ISQED), Mar. 2019, pp. 52-57.

[21] E. Deng et al., "Low power magnetic full-adder based on spin transfer torque MRAM," IEEE Trans. Magn., vol. 49, no. 9, pp. 4982-4987, Sep. 2013.

[22] S. Lim et al., "Highly independent MTJ-based PUF system using diode-connected transistor and two-step postprocessing for improved response stability," IEEE Trans. Inf. Forensics Security, vol. 15, pp. 2798-2807, 2020.

[23] W. Zhao et al., "Design considerations and strategies for high-reliable STT-MRAM," Microelectron. Rel., vol. 51, no. 9, pp. 1454-1458, Sep. 2011.

[24] G. P. Devaraj et al., "Design and Analysis of Modified Pre-Charge Sensing Circuit for STT-MRAM," in 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Mar. 2021, pp. 507-511.

[25] T. Na et al., "Comparative study of various latch-type sense amplifiers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 2, pp. 425-429, Feb. 2014.

[26] B. Wicht et al., "Yield and speed optimization of a latch-type voltage sense amplifier," IEEE J. Solid-State Circuits, vol. 39, no. 7, pp. 1148-1158, Jul. 2004.

[27] T. Na et al., "Offset-canceling current-sampling sense amplifier for resistive nonvolatile memory in 65 nm CMOS," IEEE J. Solid-State Circuits, vol. 52, no. 2, pp. 496-504, Feb. 2017.

[28] Q. Dong et al., "A 1-Mb 28-nm 1T1MTJ STT-MRAM with single-cap offset-cancelled sense amplifier and in situ self-write-termination," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 231-239, Jan. 2019.

[29] T. Na et al., "Offset-canceling single-ended sensing scheme with one-bit-line precharge architecture for resistive nonvolatile memory in 65-nm CMOS," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 11, pp. 2548-2555, Nov. 2019.

[30] J. Kim et al., "A novel sensing circuit for deep submicron spin transfer torque MRAM (STT-MRAM)," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 1, pp. 181-186, Jan. 2012.

[31] F. Ren et al., "A body-voltage-sensing-based short pulse reading circuit for spin-torque transfer RAMs (STT-RAMs)," in Proc. Int. Symp. Quality Electron Design (ISQED), 2012, pp. 275-282.

[32] P. Chakali et al., "Design of High Speed Kogge-Stone Based Carry Select Adder," International Journal of Emerging Science and Engineering (IJESE), vol. 1, no. 4, pp. 2319-6378, Feb. 2013.

[33] R. Anjana et al., "Implementation of Vedic multiplier using Kogge Stone adder," in IEEE Int. Conf. on Embedded Sys., Jul. 2014, pp. 28-31.

[34] T. Brächer and P. Pirro, "An analog magnon adder for all-magnonic neurons," J. Appl. Phys., vol. 124, no. 15, Oct. 2018.

Jangseok Yu

Jangseok Yu received the B.S. degree in Electronics Engineering from Incheon National University, Incheon, Republic of Korea, in 2024.

Geonwoo Lee

Geonwoo Lee is currently pursuing the B.S. degree in Electronics Engineering at Incheon National University, Incheon, Republic of Korea.

Taehui Na

Taehui Na received the B.S. and Ph.D. degrees in Electrical & Electronic Engineering from Yonsei University, Seoul, Republic of Korea, in 2012 and 2017, respectively. From 2017 to 2019, he was with Samsung Electronics Co., Ltd., Hwasung, Republic of Korea, where he worked on phase-change random access memory (PRAM) and high-performance NAND (ZNAND) core circuit designs. Since 2019, he has been a professor at Incheon National University, Incheon, Republic of Korea. His current research interests are focused on process-voltage-temperature variation tolerant and low-power circuit designs for memory, microcontroller unit, and neuromorphic SoC.
