YuJangseok1
                     LeeGeonwoo1
                     NaTaehui†
               
                           
                        (Department of EE, Incheon National University, Incheon 22012, Korea)
                        
 
               
             
            
            
            Copyright © The Institute of Electronics and Information Engineers(IEIE)
            
            
            
            
            
               
                  
Index Terms
               
                Charge saving and sharing circuit, in-memory computing, full adder, MRAM
             
            
          
         
            
                  I. INTRODUCTION
               Over the past few decades, the volume of data being processed and stored has
                  increased significantly. One of the most severe bottlenecks in conventional
                  von Neumann computer architectures is the limited data bandwidth between the processor
                  and memory [1-3]. Furthermore, data transfer between the processor and memory incurs high latency
                  and energy consumption, which significantly degrades system performance
                  and efficiency. This memory bandwidth limitation, known
                  as the ``memory wall,'' also increases the data movement overhead and leakage current
                  [4]. In-memory computing (IMC), an idea proposed several decades ago, aims to address
                  these challenges by incorporating processing units directly into the memory itself
                  [5]. The fundamental concept revolves around preprocessing data and providing only intermediate
                  results to the processor [2]. Such a computer architecture not only reduces data transfer bandwidth and power
                  overhead but also enhances performance by executing simple logical operations within
                  the memory [1].
               
               In recent years, the emergence of new non-volatile memories (NVMs), such as resistive
                  random access memory (RRAM), phase-change random access memory (PRAM), and spin-transfer
                  torque magnetic random access memory (STT-MRAM), has opened up new possibilities for
                  efficient implementation of IMC [6]. The resistance-based storage mechanism of these NVM devices offers unique processing
                  capabilities, enabling energy-efficient logical computing within the memory itself.
                  In this scenario, logical operations can be performed, and the results can be stored
                  in a non-volatile format on the memory chip [7]. Among these NVMs, STT-MRAM has garnered significant attention, with various prototype
                  demonstrations and early commercial products [2]. Extensive research efforts have been dedicated to improving the efficiency of STT-MRAM
                  at the device, circuit, and architectural levels [6, 8-10]. In this paper, we explore IMC based on STT-MRAM.
               
               Numerous STT-MRAM-based IMC approaches have been proposed at the architectural level
                  [2,11]. The capability to simultaneously activate multiple word lines (WLs) within a memory
                  array can be leveraged to execute various arithmetic, logic, and vector operations
                  [12,13]. The concurrent activation of memory cells enables the AND and OR operations in a
                  single stage by utilizing a pre-charge sense amplifier (PCSA) [11]. Furthermore, a full adder (FA) can be implemented by integrating a logic tree into
                  the PCSA [11]. However, a multi-bit FA requires ``n + 1'' stages to perform
                  an n-bit operation. Although digital circuits such as carry-lookahead adders (e.g., the Kogge-Stone
                  adder (KSA), Brent-Kung adder, and Sklansky adder) can significantly reduce the number
                  of stages, they entail significant area overhead and are unsuitable for memory arrays.
                  Therefore, to reduce the number of stages while keeping the overhead within a memory
                  array small, analog circuits are preferable to digital circuits.
               
               In this study, we propose a high-performance multi-bit FA that incorporates a charge
                  saving and sharing (CSS) circuit, which operates in the analog domain [14]. Similar to the carry skip adder, we pre-compute the carry for every 4 bits to enable
                  parallel computation of the 4-bit sum operation [15]. To compute the carry for every 4 bits, we employ the CSS circuit, while the 4-bit
                  sum operation is performed using the PCSA with an integrated logic tree [11]. As a result, the proposed method utilizing the CSS circuit successfully reduces
                  the required number of stages from ``n + 1'' to ``n/4 + 5'' stages, while minimizing
                  the area overhead.
               
               The remainder of this paper is structured as follows: Section II provides the background
                  information on STT-MRAM and PCSA; Section III describes the implementation of the
                  state-of-the-art multi-bit FA and the proposed multi-bit FA using the CSS circuit;
                  Section IV presents the simulation results; and finally, Section V offers the conclusion.
               
             
            
                  II. BACKGROUND
               
                     1. STT-MRAM
                  Fig. 1(a) illustrates a magnetic tunnel junction (MTJ), which serves as the fundamental storage
                     element of STT-MRAM. The MTJ comprises a free layer, a tunnel barrier, and a pinned
                     layer. Commonly employed materials for the tunnel barrier include AlOx and MgO, while
                     the free layer is typically composed of CoFeB, Ru, CoFe, PtMn, and similar substances
                     [16].
                  
                  Fig. 1(b) demonstrates two states, namely parallel (P) and anti-parallel (AP), which are determined
                     by the magnetization direction of the free layer. The MTJ can exhibit two resistance
                     states, attributed to the tunneling magneto-resistance (TMR) effect, depending on
                     whether it is in the P or AP state [17].
                  
                  
                  The P state exhibits low resistance (RL), which corresponds to the data ‘1’, whereas the AP state exhibits
                     high resistance (RH), representing the data ‘0’. Fig. 1(c) depicts a single bit-cell configuration, known as 1T-1MTJ, in STT-MRAM. During a
                     write operation, the data ‘1’ is written by allowing current to flow from the
                     bit-line (BL) to the source line (SL), while the data ‘0’ is written by allowing
                     the current to flow from SL to BL.
                  
                  
                        Fig. 1. (a) MTJ; (b) Two states of MTJ; (c) 1T-1MTJ bit-cell structure of STT-MRAM.
 
                
               
                     2. PCSA
                  The PCSA depicted in Fig. 2 enables the execution of read, AND/OR, carry, and sum operations [11]. The logic tree within the PCSA is utilized specifically for the FA operation. According
                     to Table 1, L0 and L1 are held at a high level during all operations except the sum (i.e.,
                     FA) operation.
                  
                  
                        Fig. 2. PCSA with the addition of a logic tree [11].
 
                  
                        Table 1. Control signals for read, AND, OR, carry, and sum operations [11]

| Operation | L0 | /L1 | L1 | /L0 | L2 | L3 |
|---|---|---|---|---|---|---|
| Read | 1 | 0 | 1 | 0 | 0 | 0 |
| AND | 1 | 0 | 1 | 0 | 1 | 0 |
| OR | 1 | 0 | 1 | 0 | 0 | 1 |
| Carry | 1 | 0 | 1 | 0 | /CIN | CIN |
| Sum | CIN | /COUT | COUT | /CIN | CIN | /CIN |
                   
                  A. Read Operation [13,18]
                  Fig. 3(a) demonstrates the read behavior when L2 and L3 are deactivated, as indicated in Table 1. During this read operation, the selected data cell (RL or RH) is compared to the reference cell (RREF), and read by the PCSA. RREF has a resistance value between RL and RH, as depicted in Fig. 4(a). The outcome of the read operation, as read by the PCSA, is shown in Fig. 5(a).
                  
                  
                        Fig. 3. (a) Circuit for read operation; (b) Circuit for AND and OR operations [19,20].
 
                  
                        Fig. 4. (a) Resistance distribution of RL, RH, and RREF [21,22]; (b) Resistance distribution when RL, RH, and RREF are connected in parallel [11].
 
                  
                        Fig. 5. (a) Results of read operation according to MTJ state; (b) Results of AND operation according to MTJ 'A' and 'B' states; (c) Results of OR operation based on MTJ 'A' and 'B' states [23].
 
                  B. AND and OR Operations [1,24]
                  A key approach to performing bitwise logic operations in an STT-MRAM macro involves organizing
                     and distinguishing resistor combinations. In Fig. 2, by enabling two WLs simultaneously, the resistive state can be extended by connecting
                     two resistors in parallel, as demonstrated in Fig. 3(b). Fig. 4(b) illustrates the resistance distribution of RL${\parallel}$RL, RH${\parallel}$RL, and RH${\parallel}$RH when two MTJs are connected in parallel, along with a reference resistor that distinguishes
                     the three resistance values. Then, these resistance combinations are connected to
                     the PCSA, and the resulting OUT indicates an AND operation when only L2 is activated
                     on the reference branch. Conversely, when only L3 is activated, the OUT represents
                     an OR operation.
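
                  The way the two reference levels separate the three parallel combinations can be sketched numerically. The snippet below is only an illustration: the values of R_L and R_H and the midpoint placement of the two references are assumptions made for this example and are not parameters taken from the paper. As in Section II.1, R_L stores '1' and R_H stores '0'.

```python
# Illustrative sketch only: R_L, R_H and the midpoint references are assumed values.
R_L = 3_000.0   # ohms, P state, data '1' (assumed)
R_H = 6_000.0   # ohms, AP state, data '0' (assumed)


def parallel(r1, r2):
    """Equivalent resistance of two resistors connected in parallel."""
    return r1 * r2 / (r1 + r2)


def cell(bit):
    """Resistance of an MTJ that stores the given bit."""
    return R_L if bit else R_H


# References placed midway between adjacent combinations (assumption).
R_REF_AND = (parallel(R_L, R_L) + parallel(R_L, R_H)) / 2
R_REF_OR = (parallel(R_L, R_H) + parallel(R_H, R_H)) / 2


def sense(a, b, r_ref):
    """Return 1 when the two-cell combination is less resistive than r_ref."""
    return int(parallel(cell(a), cell(b)) < r_ref)


def and_op(a, b):
    """Only RL||RL lies below the AND reference."""
    return sense(a, b, R_REF_AND)


def or_op(a, b):
    """RL||RL and RH||RL both lie below the OR reference."""
    return sense(a, b, R_REF_OR)
```

                  With these assumed values, the same pair of activated cells yields AND or OR depending only on which reference is selected, which mirrors the role of L2 and L3 in Table 1.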
                  
                
             
            
                  III. MULTI-BIT FA
               
                     1. State-of-the-art Multi-bit FA [11]
                  Several papers have proposed the use of PCSA for sum operations [11-13]. The sum operation, as proposed by Wang et al. [11], can be executed by utilizing the PCSA equipped with the logic tree illustrated in
                     Fig. 2.
                  
                  A. Carry Operation
                  Fig. 6(a) shows the single-bit carry operation. The carry result, denoted as COUT, is determined by the MAJ(A, B, CIN) function, where MAJ(A, B, 0) represents the AND operation (i.e., AND(A, B)) and
                     MAJ(A, B, 1) represents the OR operation (i.e., OR(A, B)). In the figure, the red
                     and blue paths correspond to the AND and OR operations, respectively.
                  
                  
                        Fig. 6. Single-bit FA using PCSA: (a) Carry operation (red path when CIN = 0 and blue path when CIN = 1); (b) Sum operation when CIN = 0 (red path when COUT = 0 and blue path when COUT = 1); (c) Sum operation when CIN = 1 (red path when COUT = 0 and blue path when COUT = 1).
 
                  B. Sum Operation
                  The sum result is determined by the MAJ(A, B, CIN, /COUT, /COUT), as shown in Table 1. L0, /L1, L1, and /L0 correspond to CIN, /COUT, COUT, and /CIN, respectively. In Fig. 6(b), the red path represents the case where MAJ(A, B, 0, 1, 1) becomes OR(A, B) and the
                     blue path represents the case where MAJ(A, B, 0, 0, 0) evaluates to zero. Fig. 6(c) shows the case where the red path of MAJ(A, B, 1, 1, 1) yields 1 and the blue path
                     of MAJ(A, B, 1, 0, 0) yields AND(A, B). This sum result can be achieved using the
                     logic tree or by reusing the AND and OR operations. Because the sum operation requires
                     the COUT value, it is essential to obtain it in the previous step so that the sum result can
                     be obtained in the next step of the calculation.
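
                  The majority-function identities used for the carry and sum operations can be checked exhaustively. The following is a minimal sketch at the logic level only; the helper names `maj` and `full_adder_maj` are hypothetical and do not model the PCSA circuit itself.

```python
from itertools import product


def maj(*bits):
    """Majority of an odd number of 0/1 inputs."""
    return int(sum(bits) > len(bits) // 2)


def full_adder_maj(a, b, cin):
    """One-bit full adder expressed with majority functions only."""
    cout = maj(a, b, cin)                  # carry step: MAJ(A, B, CIN)
    n_cout = 1 - cout                      # /COUT
    s = maj(a, b, cin, n_cout, n_cout)     # sum step: needs COUT from the carry step
    return s, cout


# Check every input combination against ordinary binary addition.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder_maj(a, b, cin)
    assert 2 * cout + s == a + b + cin
```

                  The dependence of the sum step on `cout` in this sketch is the same ordering constraint described above: the carry must be available one step before the sum can be evaluated.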
                  
                  Fig. 7(a) shows the schematic of the state-of-the-art multi-bit FA [11]. Fig. 7(b) illustrates the SAE signal for the PCSA. As shown in Fig. 7(c), the sum operation for the current bit and the carry operation
                     for the subsequent bit are executed concurrently. The final outcome of the sum operation,
                     Sn, is obtained at stage ``n + 1''.
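
                  The stage schedule of this ripple-style scheme can be sketched as follows. This is a logic-level illustration under the assumption that each stage performs at most one carry and one sum evaluation; the function name `ripple_fa` is hypothetical, and bits are indexed from 0 in the code, whereas the paper indexes them from 1.

```python
def maj(*bits):
    """Majority of an odd number of 0/1 inputs."""
    return int(sum(bits) > len(bits) // 2)


def ripple_fa(a, b, cin, n):
    """Add two n-bit integers stage by stage and count the stages used.

    Stage 1 evaluates the first carry only; each later stage evaluates the sum
    of the previous bit together with the carry of the current bit; the last
    stage (n + 1) evaluates only the final sum bit.
    """
    carries = [cin]              # carries[i] is the carry into bit i
    sums = []
    stages = 0
    for stage in range(1, n + 2):
        stages = stage
        if stage >= 2:           # sum of bit (stage - 2), using an already known carry-out
            i = stage - 2
            ai, bi = (a >> i) & 1, (b >> i) & 1
            n_cout = 1 - carries[i + 1]
            sums.append(maj(ai, bi, carries[i], n_cout, n_cout))
        if stage <= n:           # carry out of bit (stage - 1)
            i = stage - 1
            ai, bi = (a >> i) & 1, (b >> i) & 1
            carries.append(maj(ai, bi, carries[i]))
    total = sum(bit << i for i, bit in enumerate(sums))
    return total, carries[n], stages
```

                  For any n, the returned stage count is n + 1, because the last sum bit can only be evaluated after the last carry is known.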
                  
                  
                        Fig. 7. (a) Schematic of multi-bit FA [11]; (b) SAE signal for the PCSA; (c) Result of multi-bit FA according to the number of stages.
 
                
               
                     2. Proposed Multi-bit FA using CSS Circuit
                  Fig. 8(a) shows the array structure of the proposed multi-bit FA. This structure can be used
                     to read inputs A and B simultaneously by closing a switch, or to read inputs A and
                     B separately by opening a switch. Fig. 8(b) shows the schematic of the CSS circuit, which is responsible for storing charge in
                     the capacitor and sharing the charge by closing the switch. 
                  
                  
                        Fig. 8. (a) Array structure for the proposed multi-bit FA; (b) Schematic of the CSS circuit.
 
                  
                        Fig. 9. (a) 1 stage operation; (b) 1.5 stage operation; (c) 2 stage operation; (d) Result of SA as a function of stage; (e) SAE signal.
 
                  To obtain COUT(x+3) from A(x+3)A(x+2)A(x+1)A(x) + B(x+3)B(x+2)B(x+1)B(x) + CIN, the capacitor voltages VCAP1, VCAP2, VCAP3, VCAP4, VCAP5, VCAP6, VCAP7, VCAP8, and VCAP9 store VCIN, VA(x), VB(x), VA(x+1), VB(x+1), VA(x+2), VB(x+2), VA(x+3), and VB(x+3), respectively. The size of each capacitor in the CSS circuit is determined by the
                     weight of the corresponding digit.
                  
                  
                  
                  
                  
                  Taking CAP1, CAP2, and CAP3, which store CIN and the least significant bits, as the unit size (1x), the capacitors for the second bit have a size of 2x, those for the third bit have a size of 4x, and those for the fourth
                     bit have a size of 8x. Charge sharing occurs when all the switches are closed, so that
                     all the capacitors settle to the same voltage. The voltage at this point is VCSS.
                  
                  
                  
                  
                        Fig. 10. (a) FA operation in parallel by 4 bits; (b) 4-bit adder; (c) Result as per stage.
 
                  VREF represents the reference voltage used for reading the output, OUT, of the SA. The
                     value of COUT(x+3) can be read using the latch-type SA [25,26], as depicted in Fig. 8(b).
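
                  The behavior of the CSS circuit can be sketched with ideal charge conservation, under which the shared voltage is the capacitance-weighted average of the stored voltages, VCSS = (sum of Ci x VCAPi) / (sum of Ci). With the 1x/2x/4x/8x sizing described above, VCSS is proportional to A + B + CIN of the 4-bit group, so the carry-out is 1 exactly when this total is at least 16. The snippet is an illustration only: ideal capacitors, a logic-high level of VDD = 1.0 V, and a reference placed midway between the levels for totals of 15 and 16 are all assumptions of this example, not values from the paper.

```python
VDD = 1.0  # volts; assumed voltage stored on a capacitor for logic '1'

# Relative capacitor sizes for CAP1..CAP9, which hold
# CIN, A(x), B(x), A(x+1), B(x+1), A(x+2), B(x+2), A(x+3), B(x+3).
WEIGHTS = [1, 1, 1, 2, 2, 4, 4, 8, 8]


def v_css(a4, b4, cin, vdd=VDD):
    """Shared voltage after all switches close (ideal capacitors assumed)."""
    bits = [cin]
    for i in range(4):
        bits += [(a4 >> i) & 1, (b4 >> i) & 1]
    charge = sum(w * bit * vdd for w, bit in zip(WEIGHTS, bits))
    return charge / sum(WEIGHTS)


# Assumed reference: midway between the VCSS levels for totals of 15 and 16.
V_REF = 15.5 / sum(WEIGHTS) * VDD


def carry_out4(a4, b4, cin):
    """Carry out of a 4-bit group, decided by comparing VCSS with V_REF."""
    return int(v_css(a4, b4, cin) > V_REF)
```

                  Because the total capacitance is 31 units, adjacent VCSS levels in this idealized model are separated by VDD/31, which is why capacitor matching matters for the accuracy of the comparison.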
                  
                  Fig. 9 illustrates the process of calculating COUT for every 4 bits. In Fig. 9(a), which represents stage 1, A1-A4 and B1-B4 are read using the PCSA, and the read
                     values, along with CIN, are stored in the capacitors of the CSS circuit. Fig. 9(b) corresponds to stage 1.5. At this stage, the switch in the CSS circuit is closed
                     to obtain VCSS, which represents the shared voltage across the capacitors. Fig. 9(c) depicts the behavior during stage 2. Utilizing the VCSS obtained in stage 1.5, COUT4 (= C4, the carry-out bit for the fourth bit) is obtained using the SA. At the same
                     time, A5-A8 and B5-B8 are read using the PCSA and stored in the CSS circuit along
                     with COUT4. By continuing this process, the final result shown in Fig. 9(d) can be obtained by iteratively calculating COUT for each group of 4 bits.
                  
                  Once the COUT values for every 4 bits are obtained through the CSS circuit, the 4-bit adder depicted
                     in Fig. 10(a) and (b) performs the sum operation in parallel, processing 4 bits at a time. The
                     resulting sum values can be observed in Fig. 10(c). Notably, all the sum operations are accomplished within a total of only ``n/4 +
                     5'' stages.
                  
                
             
            
                  IV. SIMULATION
               The efficiency of the proposed MRAM-based IMC platform was evaluated by Cadence Spectre
                  simulations with industry-compatible 28-nm model parameters.
               
               Fig. 11 shows the read yield as a function of MTJ variation when reading STT-MRAM with the PCSA.
                  The read yield decreases sharply as the MTJ variation increases.
                  The proposed CSS circuit can be utilized with SAs other than the PCSA; therefore,
                  an offset-canceling current-sampling SA [27], a single-cap offset-cancelled SA [28], an offset-canceling single-ended SA [29], or a sensing circuit (SC) can be used as a pre-amplifier for the STT-MRAM to increase
                  the read yield. Examples of SCs include the source-degeneration SC [30] and the body-voltage SC [31].
               
               
                     Fig. 11. Read yield based on MTJ variation.
 
               Capacitance mismatch can affect the accuracy of the calculation results. According to Table 2, the accuracy
                  is unaffected by a capacitance mismatch of up to 8%, but the results are inverted once the
                  mismatch reaches 9%.
               
               Fig. 12 shows the performance as a function of the number of bits in the adder. As n increases,
                  the advantage over the state-of-the-art
                  multi-bit FA [11] grows; in particular, when n = 64, the number of stages is reduced by a factor of more than 3.
                  As shown in Table 3, compared to the state-of-the-art multi-bit FA [11], the proposed multi-bit FA using the CSS circuit increases the area by about 2 times
                  and the energy by 1.6 times. Therefore, it has an advantage over the state-of-the-art
                  multi-bit FA [11] starting from 16 bits, where the number of stages is about half.
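
               The stage counts quoted in this paper, ``n + 1'' for the state-of-the-art multi-bit FA and ``n/4 + 5'' for the proposed one, can be tabulated with a few lines of code. This is a simple arithmetic sketch that assumes n is a multiple of 4; the function names are hypothetical.

```python
def stages_state_of_the_art(n):
    """Stage count of the ripple-style multi-bit FA: n + 1."""
    return n + 1


def stages_proposed(n):
    """Stage count of the CSS-based multi-bit FA: n/4 + 5 (n a multiple of 4)."""
    return n // 4 + 5


for n in (4, 8, 16, 32, 64):
    base, prop = stages_state_of_the_art(n), stages_proposed(n)
    print(f"n={n:2d}: {base:2d} vs {prop:2d} stages, ratio {base / prop:.2f}")
```

               For n = 16 this gives 17 versus 9 stages, and for n = 64 it gives 65 versus 21 stages, a ratio of about 3.1.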
               
               
                     Fig. 12. $\frac{\text{stage count of state-of-the-art multi-bit FA [11]}}{\text{stage count of proposed multi-bit FA using CSS circuit}}$ depending on the number of bits.
 
               The 16-bit values of A (A16-A1), B (B16-B1), and CIN are set to ``1011 0111 1010 1100'', ``0100 0011 0111 1001'', and ``1'', respectively.
                  Fig. 13 shows the results of the state-of-the-art multi-bit FA [11], while the results of the proposed multi-bit FA using the CSS circuit are shown in
                  Fig. 14. Both results are correct. The state-of-the-art multi-bit
                  FA [11] required 17 stages to perform the operation, whereas the proposed multi-bit FA using
                  the CSS circuit accomplished the operation in only 9 stages. In conclusion, by incorporating
                  the CSS circuit into the existing multi-bit FA, the number of required stages can
                  be roughly halved, from 17 to 9 stages, for a 16-bit design.
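
               The 16-bit example can be reproduced at the arithmetic level with a group-wise addition that mirrors the proposed flow: one carry per 4-bit group first, then all 4-bit sums independently. In this sketch the group carry is modeled simply as the condition that the group total is at least 16, which stands in for the CSS circuit and SA; the function name `add_by_groups` is hypothetical.

```python
A = int("1011011110101100", 2)   # A16..A1
B = int("0100001101111001", 2)   # B16..B1
CIN = 1


def add_by_groups(a, b, cin, n=16):
    """Add in 4-bit groups: group carries first, then independent group sums."""
    nibbles = [((a >> s) & 0xF, (b >> s) & 0xF) for s in range(0, n, 4)]
    carries = [cin]
    for a4, b4 in nibbles:                       # one carry per 4-bit group
        carries.append(int(a4 + b4 + carries[-1] >= 16))
    total = 0
    for g, (a4, b4) in enumerate(nibbles):       # each sum needs only its own group carry-in
        total |= ((a4 + b4 + carries[g]) & 0xF) << (4 * g)
    return carries[-1], total


cout, s = add_by_groups(A, B, CIN)
print(cout, format(s, "016b"))
```

               Running this prints `0 1111101100100110`, which matches the result ``0 1111 1011 0010 0110'' (C16 S16-S1) reported for both adders.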
               
               
                     Fig. 13. 16-bit results from state-of-the-art multi-bit FA [11]. “1011 0111 1010 1100” (A16-A1) + “0100 0011 0111 1001” (B16-B1) + “1” (CIN) = “0 1111 1011 0010 0110” (C16 S16-S1).
 
               
                     Fig. 14. 16-bit results of the proposed multi-bit FA using CSS circuit. “1011 0111 1010 1100” (A16-A1) + “0100 0011 0111 1001” (B16-B1) + “1” (CIN) = “0 1111 1011 0010 0110” (C16 S16-S1).
 
               Table 3 compares the performance, energy consumption, and area utilization of the three multi-bit
                  FAs on a 16-bit basis. The evaluation parameters include the number of stages, number
                  and size of PCSAs with logic trees, number and size of additional transistors, number
                  of memory read operations, and energy consumption. The state-of-the-art multi-bit
                  FA [11] demonstrates superior area efficiency and low energy consumption; however, it suffers
                  from a high number of stages (poor performance). Although the utilization of KSA significantly
                  reduces the number of stages, its large area overhead prevents it from being incorporated
                  into the memory array. Other digital circuits, such as carry-lookahead adders,
                  carry select adders, and carry skip adders, face the same area overhead problem,
                  which likewise prevents their inclusion in the memory array. To address this issue, it is
                  necessary to optimize the overhead while improving the performance by leveraging the
                  analog domain instead of the digital domain [34]. Compared to the state-of-the-art multi-bit FA, the proposed multi-bit FA with the
                  CSS circuit requires approximately half the number of stages. Additionally, it employs
                  fewer transistors compared to the multi-bit FA with KSA. However, compared to the
                  other two multi-bit FAs, the proposed circuit entails a higher number of memory read
                  operations. In this case, the capacitors (CAP) consume 22.56 fJ, which accounts
                  for only 2.3% of the total energy consumption. The increase in energy consumption
                  is instead caused by the increase in the number of read operations. In summary, the proposed multi-bit
                  FA utilizing the analog domain offers intermediate performance between the other two
                  FAs while effectively addressing the area overhead problem associated with the digital
                  domain. Nevertheless, there is still a need to reduce energy consumption.
               
               
                     Table 2. CSS circuit operation result due to capacitance mismatch 1)

                     | Capacitance mismatch | 0% | 1% | 2% | 3% | 4% | 5% |
                     | Result | Pass | Pass | Pass | Pass | Pass | Pass |

                     | Capacitance mismatch | 6% | 7% | 8% | 9% | 10% | 11% |
                     | Result | Pass | Pass | Pass | Fail | Fail | Fail |
                  
1) For the worst case, “1111” (A4-A1) + “0000” (B4-B1) + “1” (CIN), the CAP mismatch
                  was simulated such that the CAPs storing 1 decrease in size and the CAPs storing
                  0 increase in size.
                  			
               
 
               
                     Table 3. Comparison of 16-bit sum operation between state-of-the-art multi-bit FA, multi-bit FA using KSA, and proposed multi-bit FA using CSS circuit

                     | | State-of-the-art multi-bit FA [11] | Multi-bit FA using KSA [32,33] | Proposed multi-bit FA using CSS circuit |
                     | Computing domain | Digital | Digital | Analog + Digital |
                     | Number of computing stages (performance) | 17 | 1 + tpg + 4*tAO + txor | 9 |
                     | PCSA count (size 1)) | 16 (2.92 µm²) | 32 (5.84 µm²) | 32 (5.84 µm²) |
                     | Additional transistor count | 0 | 2982 | 104.5 |
                     | Additional size 1) | 0 µm² | 7.16 µm² | 0.25 µm² |
                     | Total size 1) (area overhead) | 2.92 µm² | 13 µm² | 6.09 µm² |
                     | Memory read operation count | 32 | 32 | 56 |
                     | Energy consumption | 598.7 fJ | 755.2 fJ | 969.3 fJ |
                  
1) Sizes are pre-layout estimates, computed as the sum of width × length over all
                  transistors.
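
The area and energy figures in Table 3 can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary keys and function names are hypothetical, and the numbers are transcribed from Table 3 and from the CAP energy of 22.56 fJ reported in the text.

```python
# Values transcribed from Table 3 (16-bit sum operation).
# Keys and helper names are illustrative, not taken from the paper.
TABLE3 = {
    "state_of_the_art": {"stages": 17, "pcsa_um2": 2.92, "extra_um2": 0.00,
                         "reads": 32, "energy_fj": 598.7},
    "ksa": {"stages": None,  # given only symbolically in Table 3
            "pcsa_um2": 5.84, "extra_um2": 7.16,
            "reads": 32, "energy_fj": 755.2},
    "proposed_css": {"stages": 9, "pcsa_um2": 5.84, "extra_um2": 0.25,
                     "reads": 56, "energy_fj": 969.3},
}
CAP_ENERGY_FJ = 22.56  # energy consumed by the CAP in the proposed design


def total_area_um2(name: str) -> float:
    """PCSA area plus additional transistor area (pre-layout W*L sum)."""
    d = TABLE3[name]
    return round(d["pcsa_um2"] + d["extra_um2"], 2)


def cap_energy_share_percent() -> float:
    """Share of the proposed design's total energy consumed by the CAP."""
    return 100.0 * CAP_ENERGY_FJ / TABLE3["proposed_css"]["energy_fj"]


if __name__ == "__main__":
    for name in TABLE3:
        print(f"{name}: total area = {total_area_um2(name):.2f} um^2")
    print(f"CAP energy share = {cap_energy_share_percent():.1f} %")
    sota, prop = TABLE3["state_of_the_art"], TABLE3["proposed_css"]
    print(f"stage ratio  = {prop['stages'] / sota['stages']:.2f}")
    print(f"read ratio   = {prop['reads'] / sota['reads']:.2f}")
    print(f"energy ratio = {prop['energy_fj'] / sota['energy_fj']:.2f}")
```

With these inputs, the computed totals reproduce the 2.92, 13, and 6.09 µm² entries of Table 3, and the CAP share rounds to the 2.3% quoted in the text.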
                  			
               
 
             
            
                  V. CONCLUSIONS
               In this paper, we propose a multi-bit FA designed for high-performance sum operations
                  in STT-MRAM-based IMC systems. The proposed multi-bit FA uses the CSS circuit to
                  generate Cout in parallel for every 4 bits in the analog domain and then performs
                  the 4-bit sum operations in the digital domain. This architecture uses stages more
                  efficiently, requiring only n/4 + 5 stages for an n-bit sum compared to the conventional
                  n + 1 stages. Moreover, it significantly reduces the area overhead compared to digital
                  domain-based multi-bit FAs, making integration within a memory array feasible. However,
                  while the proposed circuit effectively reduces the number of stages, it requires
                  twice as many PCSAs as the state-of-the-art multi-bit FA, along with additional circuits,
                  and its energy consumption is also higher. Our future work will therefore focus on
                  minimizing both the area overhead and the energy consumption of the proposed circuit.
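
The two stage-count expressions above can be compared directly. The following minimal sketch assumes that n is a positive multiple of 4, so that the parallel 4-bit groups divide evenly (the behavior for other widths is not stated here); the function names are hypothetical.

```python
def stages_conventional(n_bits: int) -> int:
    """Conventional multi-bit FA: n + 1 stages for an n-bit sum."""
    if n_bits <= 0:
        raise ValueError("n_bits must be positive")
    return n_bits + 1


def stages_proposed(n_bits: int) -> int:
    """Proposed multi-bit FA with the CSS circuit: n/4 + 5 stages.

    Assumption: n_bits is a positive multiple of 4.
    """
    if n_bits <= 0 or n_bits % 4 != 0:
        raise ValueError("n_bits is assumed to be a positive multiple of 4")
    return n_bits // 4 + 5


if __name__ == "__main__":
    # Print the bit width, conventional stage count, and proposed stage count.
    for n in (8, 16, 32, 64):
        print(n, stages_conventional(n), stages_proposed(n))
```

For n = 16 this gives 9 stages against 17, matching Table 3. The gap widens as n grows because the proposed count increases by one stage per 4 bits rather than one stage per bit.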
               
             
          
         
            
                  ACKNOWLEDGMENTS
               
                  				This work was supported by Incheon National University Research Grant in 2022.
                  The EDA tool was supported by the IC Design Education Center (IDEC), Korea.
                  			
               
             
            
                  
                     References
                  
                     
                        
                        [1] C. Wang et al., "Computing-in-memory paradigm based on STT-MRAM with synergetic read/write-like modes," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1-5.

                        [2] S. Jain et al., "Computing in memory with spin-transfer torque magnetic RAM," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 3, pp. 470-483, Mar. 2018.

                        [3] T. Na, "Ternary output binary neural network with zero-skipping for MRAM-based digital in-memory computing," IEEE Trans. Circuits Syst. II, Exp. Briefs, 2023.

                        [4] Z. He et al., "Exploring STT-MRAM based in-memory computing paradigm with application of image edge extraction," in 2017 IEEE International Conference on Computer Design (ICCD), Nov. 2017, pp. 439-446.

                        [5] H. S. Stone, "A logic-in-memory computer," IEEE Trans. Comput., vol. C-19, no. 1, pp. 73-78, Jan. 1970.

                        [6] T. Na et al., "STT-MRAM sensing: a review," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 1, pp. 12-18, Jan. 2021.

                        [7] M. Zabihi et al., "In-memory processing on the spintronic CRAM: From hardware design to application mapping," IEEE Trans. Comput., vol. 68, no. 8, pp. 1159-1173, Aug. 2019.

                        [8] D. Apalkov et al., "Spin-transfer torque magnetic random access memory (STT-MRAM)," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 9, no. 2, pp. 1-35, May 2013.

                        [9] R. Bishnoi et al., "Improving write performance for STT-MRAM," IEEE Trans. Magn., vol. 52, no. 8, pp. 1-11, Aug. 2016.

                        [10] L. Zhang et al., "Addressing the thermal issues of STT-MRAM from compact modeling to design techniques," IEEE Trans. Nanotechnol., vol. 17, no. 2, pp. 345-352, Mar. 2018.

                        [11] C. Wang et al., "Design of an area-efficient computing in memory platform based on STT-MRAM," IEEE Trans. Magn., vol. 57, no. 2, pp. 1-4, Feb. 2021.

                        [12] G. Patrigeon et al., "Design and evaluation of a 28-nm FD-SOI STT-MRAM for ultra-low power microcontrollers," IEEE Trans. Magn., vol. 7, no. 9, pp. 4982-4987, Sep. 2019.

                        [13] S. Angizi et al., "Design and evaluation of a spintronic in-memory processing platform for nonvolatile data encryption," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 9, pp. 1788-1801, Sep. 2018.

                        [14] H. Yu et al., "An adder using charge sharing and its application in DRAMs," in Proceedings 2000 International Conference on Computer Design, Sep. 2000.

                        [15] V. Vijay et al., "A Review On N-Bit Ripple-Carry Adder Carry-Select Adder And Carry-Skip Adder," Journal of VLSI Circuits and Systems, vol. 4, no. 01, pp. 27-32, Mar. 2022.

                        [16] J.-G. Zhu et al., "Magnetic tunnel junctions," Mater. Today, vol. 9, no. 11, pp. 36-45, Nov. 2006.

                        [17] M. Hosomi et al., "A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-RAM," in IEDM Tech. Dig., Dec. 2005, pp. 459-462.

                        [18] Y. Luo et al., "A variation robust inference engine based on STT-MRAM with parallel read-out," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020.

                        [19] S. Ikeda et al., "Magnetic tunnel junctions for spintronic memories and beyond," IEEE Trans. Electron Devices, vol. 54, no. 5, pp. 991-1002, May 2007.

                        [20] M. Zabihi et al., "Using spin-Hall MTJs to build an energy-efficient in-memory computation platform," in Proc. 20th Int. Symp. Qual. Electron. Design (ISQED), Mar. 2019, pp. 52-57.

                        [21] E. Deng et al., "Low power magnetic full-adder based on spin transfer torque MRAM," IEEE Trans. Magn., vol. 49, no. 9, pp. 4982-4987, Sep. 2013.

                        [22] S. Lim et al., "Highly independent MTJ-based PUF system using diode-connected transistor and two-step postprocessing for improved response stability," IEEE Trans. Inf. Forensics Security, vol. 15, pp. 2798-2807, 2020.

                        [23] W. Zhao et al., "Design considerations and strategies for high-reliable STT-MRAM," Microelectron. Rel., vol. 51, no. 9, pp. 1454-1458, Sep. 2011.

                        [24] G. P. Devaraj et al., "Design and Analysis of Modified Pre-Charge Sensing Circuit for STT-MRAM," in 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Mar. 2021, pp. 507-511.

                        [25] T. Na et al., "Comparative study of various latch-type sense amplifiers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 2, pp. 425-429, Feb. 2014.

                        [26] B. Wicht et al., "Yield and speed optimization of a latch-type voltage sense amplifier," IEEE J. Solid-State Circuits, vol. 39, no. 7, pp. 1148-1158, Jul. 2004.

                        [27] T. Na et al., "Offset-canceling current-sampling sense amplifier for resistive nonvolatile memory in 65 nm CMOS," IEEE J. Solid-State Circuits, vol. 52, no. 2, pp. 496-504, Feb. 2017.

                        [28] Q. Dong et al., "A 1-Mb 28-nm 1T1MTJ STT-MRAM with single-cap offset-cancelled sense amplifier and in situ self-write-termination," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 231-239, Jan. 2019.

                        [29] T. Na et al., "Offset-canceling single-ended sensing scheme with one-bit-line precharge architecture for resistive nonvolatile memory in 65-nm CMOS," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 11, pp. 2548-2555, Nov. 2019.

                        [30] J. Kim et al., "A novel sensing circuit for deep submicron spin transfer torque MRAM (STT-MRAM)," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 1, pp. 181-186, Jan. 2012.

                        [31] F. Ren et al., "A body-voltage-sensing-based short pulse reading circuit for spin-torque transfer RAMs (STT-RAMs)," in Proc. Int. Symp. Quality Electron Design (ISQED), 2012, pp. 275-282.

                        [32] P. Chakali et al., "Design of High Speed Kogge-Stone Based Carry Select Adder," International Journal of Emerging Science and Engineering (IJESE), vol. 1, no. 4, pp. 2319-6378, Feb. 2013.

                        [33] R. Anjana et al., "Implementation of Vedic multiplier using Kogge Stone adder," in IEEE Int. Conf. on Embedded Sys., Jul. 2014, pp. 28-31.

                        [34] T. Brächer and P. Pirro, "An analog magnon adder for all-magnonic neurons," J. Appl. Phys., vol. 124, no. 15, Oct. 2018.

 
                      
                   
                
             
            
            
               			Jangseok Yu  received the B.S. degree in Electronics Engineering from Incheon National
               University, Incheon, Republic of Korea, in 2024.
               		
            
            
            
               			Geonwoo Lee is currently pursuing the B.S. degree in Electronics Engineering at
               Incheon National University, Incheon, Republic of Korea.
               		
            
            
            
               			Taehui Na  received the B.S. and Ph.D. degrees in Electrical & Electronic Engineering
               from Yonsei University, Seoul, Republic of Korea, in 2012 and 2017, respectively.
               From 2017 to 2019, he was with Samsung Electronics Co., Ltd., Hwasung, Republic of
               Korea, where he worked on phase-change random access memory (PRAM) and high-performance
               NAND (ZNAND) core circuit designs. Since 2019, he has been a professor at Incheon
               National University, Incheon, Republic of Korea. His current research interests are
               focused on process-voltage-temperature variation tolerant and low-power circuit designs
               for memory, microcontroller unit, and neuromorphic SoC.