I. INTRODUCTION
               With the emergence of the backpropagation algorithm [8] and multilayer perceptron [3], deep neural networks (DNNs) have demonstrated outstanding performance in various
                  fields [1,2,110]. However, they face the challenge of exponentially increasing computational load
                  as the number of learnable parameters grows. This poses a significant obstacle to
                  the practical implementation of DNN models in terms of processing speed and power
                  consumption [113]. To tackle these issues, parallel processing devices such as graphics processing
                  units (GPUs) and neural processing units (NPUs) [4] are being utilized, and researchers are actively exploring optimized acceleration
                  algorithms for each device [5]. However, modern computer architectures based on the von Neumann architecture still
                  have limitations regarding DNN processing. Specifically, a substantial portion of
                  the power consumption, up to 75%, is attributed to loading parameters for DNN operations
                  (e.g., feature maps and weights) from external memory, such as dynamic random-access
memory (DRAM), to the processor, or storing them back to memory [9,10,115]. To address this issue, the processing-in-memory (PIM) architecture has emerged as a promising
                  technology [6]. By integrating computing and memory units at the processing element (PE) level,
                  PIM significantly reduces latency associated with data transmission and enhances data
                  processing efficiency [112]. This integrated architecture has the potential to significantly reduce energy consumption
                  during memory access, thereby enhancing the efficiency of applications that require
                  high-performance computing [7].
               
               This survey explores diverse PIM architectures and methodologies for enhancing PIM
                  performance in different memory types. It analyzes the characteristics of various
                  DNN models, including convolutional neural networks (CNNs), graph neural networks
                  (GNNs), recurrent neural networks (RNNs), and transformer models. The focus is on
                  optimizing data mapping and dataflows within the context of PIM, providing valuable
                  insights into efficient handling of DNNs. This comprehensive study aims to deepen
                  researchers' understanding of the connection between DNNs and PIM, opening up new
                  avenues for future AI research and advancements.
               
Section II provides the background of this work. Section III presents the PIM architectures
                  for DNNs, and Section IV concludes this paper.
               
             
            
                  III. PIM FOR DEEP NEURAL NETWORKS
               
                     1. Technologies and Representative Architectures Needed for PIM
                  PIM fundamentally offers high throughput because it minimizes data transfer with the
                     host processor by integrating data processing logic directly into memory, thus resolving
the associated bottleneck [28,29]. In the DNN inference process, the multiply-accumulate (MAC) operations, which are performed most frequently, are executed
                     in the PIM core to achieve high energy efficiency. In addition, during the DNN training
                     process, PIM can reduce both processing time and power consumption by performing the
                     computations necessary for weight updates directly within the memory [88,94]. However, not all functions benefit from the application of PIM. For instance, it
can be burdensome to process functions with high computational complexity and high data
                     reusability using in-memory logic. Therefore, to determine where a specific function
                     should be computed, it is necessary to establish appropriate metrics and analyze them
using a benchmark simulator. DAMOV [30] is a memory simulator that combines the widely used Ramulator [31] with the zsim CPU simulator [32]. It extracts memory traces for each workload [117] using the Intel VTune profiler [33]. From the extracted traces, it computes the temporal/spatial locality and classifies the causes of memory bottlenecks into six classes using three indicators: the last-to-first miss-ratio (LFMR), last-level-cache misses per kilo-instruction (LLC MPKI), and arithmetic intensity. Moreover, an experimental analysis of 77 K functions demonstrated its reliability and applicability across various research areas.
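To make the role of these indicators concrete, the following Python sketch classifies a profiled function from the three metrics described above. The threshold values and class labels are illustrative placeholders, not those used by DAMOV [30].

# Illustrative sketch (not DAMOV's actual code): classifying a workload
# function from the three indicators described above. The thresholds and
# class names here are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class FunctionProfile:
    flops: int            # arithmetic operations executed
    dram_bytes: int       # bytes moved to/from DRAM
    instructions: int     # retired instructions
    llc_misses: int       # last-level-cache misses
    l1_misses: int        # first-level-cache misses

def classify(p: FunctionProfile) -> str:
    # Arithmetic intensity: operations per byte of DRAM traffic.
    ai = p.flops / max(p.dram_bytes, 1)
    # LLC misses per kilo-instruction.
    llc_mpki = 1000.0 * p.llc_misses / max(p.instructions, 1)
    # Last-to-first miss ratio: fraction of L1 misses that also miss the LLC,
    # i.e. how little the deeper cache levels help.
    lfmr = p.llc_misses / max(p.l1_misses, 1)

    if ai > 10.0:                       # plenty of compute per byte moved
        return "compute-bound: keep on the host processor"
    if llc_mpki > 10.0 and lfmr > 0.7:  # deeper caches barely help
        return "DRAM-bandwidth-bound: good PIM offload candidate"
    if llc_mpki > 10.0:
        return "LLC-capacity-bound: may benefit from PIM"
    return "cache-friendly: keep on the host processor"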
                  
Current PIM research is largely divided into commercially accessible DRAM-based PIM research [52-59, 81-85] and research utilizing next-generation memory [90-99], and the two directions are advancing in competition with each other. Unlike academic research, mass-producible PIM products fundamentally utilize the bank-level parallelism of DRAM for computation. In addition, they aim to maximize compatibility with existing mass-produced products and prioritize cost aspects, such as minimizing the area occupied by the operation logic and addressing heat-dissipation issues. HBM-PIM [58] adds PIM functionality to the high-bandwidth memory (HBM) architecture and is designed to increase memory bandwidth and energy efficiency by performing computational processing within the memory. It proposes not only a hardware architecture but also a software stack. The software stack supports FP16 operations, MAC, general matrix-matrix product (GEMM), and activation functions, with the operation logic loaded onto the HBM using a lookup table (LUT). In addition, it allows programmers to write PIM microkernels using PIM commands to maximize performance. The hardware architecture was implemented in 20 nm DRAM technology and integrated with an unmodified commercial processor to demonstrate its practicality and effectiveness at the system level. Furthermore, it is designed as a drop-in replacement because it is compatible with existing HBM. With the proposed PIM architecture, memory-bound neural network kernels ran 11.2${\times}$ faster and applications 3.5${\times}$ faster, while the energy consumed per bit transferred was reduced by 3.5${\times}$, improving the system's overall energy efficiency by 3.2${\times}$ when running applications.
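As an illustration of the LUT-based approach to activation functions mentioned above, the following Python sketch evaluates an activation by table lookup with linear interpolation. The table size, input range, choice of GELU, and interpolation scheme are assumptions for illustration and are not taken from the HBM-PIM design [58].

# Minimal sketch of LUT-based activation evaluation. Table size, input
# range, and the use of linear interpolation are assumptions.
import numpy as np

def build_gelu_lut(lo=-8.0, hi=8.0, entries=256):
    xs = np.linspace(lo, hi, entries, dtype=np.float32)
    # tanh approximation of GELU, precomputed once into the table
    ys = 0.5 * xs * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (xs + 0.044715 * xs**3)))
    return xs, ys.astype(np.float32)

def lut_activation(x, xs, ys):
    # Clamp to the table range, then linearly interpolate between entries.
    x = np.clip(x, xs[0], xs[-1])
    return np.interp(x, xs, ys).astype(np.float32)

xs, ys = build_gelu_lut()
out = lut_activation(np.array([-1.0, 0.0, 2.5], dtype=np.float32), xs, ys)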
                  
Newton’s architecture [59] was designed as an accelerator-in-memory (AiM) for DNNs. In this design, a minimum number of computing units is placed in the DRAM to satisfy the area constraints, which are a major concern in PIM hardware design. The computing units consist of MAC units and buffers. Newton also uses a DRAM-like interface so that the host can issue commands for PIM computing. The design matches the internal DRAM bandwidth and speed, captures input reuse, and uses a global input-vector buffer to amortize the buffer area cost across all channels. Three optimization techniques proposed by Newton help the PIM-host interface overcome bottlenecks: 1) grouping multiple computational tasks within banks and bank groups; 2) supporting complex, multistep computing commands so that multiple stages of operations can be processed at once; and 3) strengthening the internal low-dropout (LDO) regulator and DC-DC pump driver to allow higher current and faster voltage recovery. As a result, Newton applied to HBM2E achieves an average speedup of 10${\times}$ over a system that is assumed to ideally use the external DRAM bandwidth without PIM, and 54${\times}$ over a GPU.
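The following behavioral sketch (in Python, not Newton's actual hardware description) illustrates the input-reuse idea behind the global input-vector buffer: the same input vector is broadcast once to all banks, and each bank's MAC units process only the weight rows it stores locally. The row-to-bank striping shown is an assumption for illustration.

# Behavioral sketch of input reuse: one global input-vector buffer is
# broadcast to every bank, and each bank's MAC units compute partial dot
# products on the weight rows stored locally.
import numpy as np

def banked_gemv(weight: np.ndarray, x: np.ndarray, n_banks: int) -> np.ndarray:
    """weight: (rows, cols); x: (cols,). Rows are striped across banks."""
    rows = weight.shape[0]
    out = np.zeros(rows, dtype=weight.dtype)
    # The same input vector x is reused by every bank (global buffer),
    # so it is fetched over the host interface only once.
    for b in range(n_banks):
        for r in range(b, rows, n_banks):    # rows resident in bank b
            out[r] = np.dot(weight[r], x)    # in-bank MAC units
    return out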
                  
The UPMEM PIM architecture [52] was the first commercialized PIM architecture, combining conventional DRAM memory arrays with a general-purpose processing core, the DRAM processing unit (DPU). DPUs were introduced by UPMEM to perform operations within the memory chips. Each DPU has exclusive access to a 64 MB DRAM bank, known as the main random-access memory (MRAM), 24 KB of instruction memory, and 64 KB of scratchpad memory, called the working random-access memory (WRAM). This allows programmers to write code that executes on the DPUs and processes data within the memory. It also means that data transfers between the host processor and the DPUs can be explicitly controlled, allowing the programmer to choose between parallel and sequential processing.
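The sketch below illustrates, at a purely conceptual level, the kind of host-side partitioning such a programming model implies: a large input is split into per-DPU chunks that fit within the 64 MB MRAM, launched on the DPUs, and gathered back. UPMEM's actual SDK is a C library; none of the names below belong to it, and launch/gather are hypothetical callbacks.

# Hypothetical host-side sketch (not UPMEM's API): split data across DPUs,
# respecting the per-DPU MRAM capacity, run the kernels, gather results.
MRAM_BYTES = 64 * 1024 * 1024   # per-DPU MRAM capacity

def partition(data: bytes, n_dpus: int):
    """Split data into one chunk per DPU, checking each fits in MRAM."""
    chunk = max(1, (len(data) + n_dpus - 1) // n_dpus)
    assert chunk <= MRAM_BYTES, "working set exceeds per-DPU MRAM"
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]

def run_on_dpus(data: bytes, n_dpus: int, launch, gather):
    chunks = partition(data, n_dpus)
    handles = [launch(dpu_id, c) for dpu_id, c in enumerate(chunks)]  # copy in + start kernels
    return [gather(h) for h in handles]                               # copy results back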
                  
                  On the other hand, the most commonly used next-generation memory in PIM architecture
                     is ReRAM [36, 39, 91, 95, 100, 103, 107]. The ReRAM crossbar array consists of cells arranged in rows and columns. This array
                     can be used for memory purposes and can efficiently perform computations such as the
                     general matrix-vector product (GEMV), composed of MAC operations. In addition, the
use of a crossbar array can significantly reduce the overhead and energy related to data movement. In particular, as a pioneering study on ReRAM-based PIM, PRIME [91] divides the internal array space of a bank into a memory subarray (MemS), a full function subarray (FFS), and a buffer subarray. The MemS stores only data. The FFS allows the crossbar to be used for both memory and operation logic, achieving minimal area overhead. To enable this, multiple voltage sources are added to provide an accurate input voltage, the column multiplexer is extended with an analog subtraction unit and a nonlinear threshold unit, and the sense amplifier (SA) is modified to achieve high precision.
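The following idealized numerical model captures how such a crossbar performs GEMV: the matrix is stored as cell conductances, the input vector is applied as word-line voltages, and each bit-line current accumulates one column's dot product. DAC/ADC quantization and device non-idealities are deliberately ignored here.

# Idealized model of a ReRAM crossbar GEMV: Ohm's law per cell plus
# Kirchhoff's current law along each bit line.
import numpy as np

def crossbar_gemv(conductance: np.ndarray, voltages: np.ndarray) -> np.ndarray:
    """conductance: (rows, cols) cell conductances G_ij (siemens);
       voltages:    (rows,) word-line voltages V_i (volts).
       Returns bit-line currents I_j = sum_i G_ij * V_i (amperes)."""
    return conductance.T @ voltages

G = np.array([[1e-6, 2e-6],
              [3e-6, 4e-6]])       # weights programmed as conductances
V = np.array([0.2, 0.1])          # inputs encoded as voltages
I = crossbar_gemv(G, V)           # -> [5.0e-7, 8.0e-7]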
                  
                
               
                     2. PIM for CNN
Numerous PIM studies primarily support the MAC operations required by CNNs [46, 47, 52-59]. However, this section focuses on PIM research that employs the data mapping methods and dataflow necessary for CNN operations. Efficient data handling in the CNN inference
                     process is crucial, with particular emphasis on maximizing the reuse of weights as
                     well as the input and output feature maps used between layers.
                  
                  
                        1) Inference Phase
Peng et al. [45] proposed a ReRAM-based PIM accelerator that adapts the data-mapping technique proposed by Fey et al. [44] to the CONV layer. It reduces the use of interconnects and buffers by reusing the input data and weights. As shown in Fig. 4(a), a 3D kernel of size K${\times}$K${\times}$D is arranged along vertical columns, and the input feature map (IFM) is arranged in a similar manner as K${\times}$K submatrices of 1${\times}$1${\times}$D vectors. As shown in Fig. 4(b), each ReRAM subarray operates as a single PE during computation.
                        This method is designed to maximize the reuse of IFMs and weights as the kernel (i.e.,
                        weights) slides over them during computation. Consequently, this study achieved a
                        2.1${\times}$ increase in speed and 17% improvement in energy efficiency (measured
                        in TOPS/W) during the inference phase with the VGG-16 model compared with [92].
                     
                     
                           Fig. 4. Processing-in-Memory for CNN proposed in [45]: (a) A basic mapping method of input and weight data, with kernel moving in multiple cycles; (b) An example of IFMs transferred among PEs and how the kernel slides over the input.
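The mapping in Fig. 4 can be read as an im2col-style transformation, sketched below: each K${\times}$K${\times}$D kernel is unrolled into one column of the in-memory array and each K${\times}$K${\times}$D input patch into an input vector, so sliding the kernel becomes a sequence of vector-matrix products that reuse the stored weights. The exact subarray partitioning and scheduling of [45] are not reproduced.

# Sketch of the mapping: kernels as stored columns, IFM patches as input
# vectors, one in-memory GEMV per kernel position.
import numpy as np

def conv_as_columns(ifm: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """ifm: (H, W, D); kernels: (N, K, K, D) -> output (H-K+1, W-K+1, N)."""
    H, W, D = ifm.shape
    N, K, _, _ = kernels.shape
    # Each kernel becomes one column of the weight matrix stored in memory.
    weight_cols = kernels.reshape(N, K * K * D).T          # (K*K*D, N)
    out = np.empty((H - K + 1, W - K + 1, N), ifm.dtype)
    for y in range(H - K + 1):
        for x in range(W - K + 1):
            patch = ifm[y:y + K, x:x + K, :].reshape(-1)   # unrolled input patch
            out[y, x] = patch @ weight_cols                # one in-memory GEMV
    return out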
 
                   
                  
                        2) End-to-End Training Phase
                     Backpropagation in CNNs requires a significant amount of computation because it involves
                        computing the gradients for each layer and updating the weights to train the model.
It is considered memory-bound because it requires storing and tracking the intermediate features and gradients of all the layers, which is more demanding than CONV-layer inference. Therefore, higher efficiency can be expected by optimizing the training
                        process in the PIM.
                     
                     T-PIM [88] is a DRAM-based PIM study considering the end-to-end training of CNN models. Fig. 5 represents the data mapping of T-PIM that reduces the overhead caused by data rearrangement
                        in DRAM and optimizes the data access to weights. Fig. 5(a) and (b) show the data mapping methods during the forward pass (FWP) and backward
                        pass (BWP) within the MLP layer, respectively. To maximize the utilization of DRAM's
                        cell array without rearranging data, the size of the tile is set to $M_{t}\times N_{t}$
                        and each weight is mapped to DRAM's column addresses. During the FWP process, the
                        input vector is flattened to size $M_{t}$ (Input$_{\mathrm{L}}$ $\left(M_{t}\right))$
                        and multiplied with the weights arranged in DRAM. Each column is then accumulated
                        into an output buffer of size $N_{t}$ (Output$_{\mathrm{L}}$ $\left(N_{t}\right)$).
                        For the BWP process, to use the weights aligned in the FWP process without additional
rearrangement, the loss (Error$_{\mathrm{L}}$ ($N_{t}$)) is flattened into $N_{t}$ elements and multiplied with the weights through vector operations. Each row is then accumulated into an output buffer of size $M_{t}$ (Output$_{\mathrm{L}}$ $\left(M_{t}\right)$). Fig. 5(c) and (d) represent the data mapping methods used during the FWP and BWP in the CONV layer, respectively. Similar to the MLP layer, the weights (Weight$_{\mathrm{L}}$) are arranged along column addresses by kernel size ($W_{k}\times H_{k}$), so the weights can be reused without the need for data rearrangement. T-PIM achieves a high efficiency of 0.84-7.59 TOPS/W for 8-bit input data and 0.25-2.21 TOPS/W for 16-bit input data when training the VGG16 model, using its non-zero computing and powering-off computing methods.
                     
                     
Fig. 5. Data mapping of T-PIM: (a) FWP, MLP layer; (b) BWP, MLP layer; (c) FWP, CONV layer; (d) BWP, CONV layer (Reprinted from [88] with permission).
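The tile-level dataflow described above can be summarized with the following sketch: a weight tile of size $M_{t}\times N_{t}$ is mapped once, the FWP accumulates along its columns, and the BWP accumulates along its rows (i.e., effectively multiplies by the transpose), so no rearrangement is needed between the two passes. Bit-serial details and T-PIM's non-zero and powering-off computing schemes are omitted.

# Sketch of the FWP/BWP reuse of a single weight tile mapped once.
import numpy as np

def fwp_tile(weight_tile: np.ndarray, input_vec: np.ndarray) -> np.ndarray:
    """weight_tile: (M_t, N_t); input_vec: (M_t,) -> output buffer (N_t,)."""
    out = np.zeros(weight_tile.shape[1], dtype=weight_tile.dtype)
    for m in range(weight_tile.shape[0]):          # column-wise accumulation
        out += input_vec[m] * weight_tile[m, :]
    return out

def bwp_tile(weight_tile: np.ndarray, error_vec: np.ndarray) -> np.ndarray:
    """weight_tile: (M_t, N_t); error_vec: (N_t,) -> output buffer (M_t,)."""
    out = np.zeros(weight_tile.shape[0], dtype=weight_tile.dtype)
    for n in range(weight_tile.shape[1]):          # row-wise accumulation
        out += error_vec[n] * weight_tile[:, n]
    return out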
 
                   
                
               
                     3. PIM for GCN
The processing steps of a graph convolutional network (GCN) (e.g., aggregation, combination, embedding, message passing, and readout) are mostly low in operational complexity, data-dependent, and performed repetitively. Among these, aggregation must process large amounts of data to combine the information of each node with that of its neighboring nodes. Moreover, the required combination of operations differs depending on the relationship between each node and its neighbors. These characteristics demand a large amount of computation and high memory bandwidth, and such drawbacks can be effectively mitigated using PIM. PIM for GCNs has mostly been approached by using ReRAM crossbars to perform the computations in an analog manner [36].
                  
                  Two representative techniques are the MAC crossbar and content addressable memory
(CAM) crossbar [37]. Of the two, the CAM crossbar performs content-based searches: broadcasting the search key across multiple rows enables a parallel associative search. It also enables more data to be stored in the same chip area; TCAM [38] showed that a 2-transistor-2-resistor ReRAM cell can achieve 3${\times}$ higher density than the conventional 8-transistor SRAM cell. The MAC crossbar can efficiently perform vector-matrix multiplication (VMM) with low energy consumption through bit-line current accumulation. This process can be described in three steps. 1) The matrix elements are mapped onto the crossbar, with each cell's resistance precisely adjusted to correspond to its element value. 2) The vector elements are converted to voltages and applied on the word lines. 3) The current on each bit line is measured; the sum of the currents of all cells connected to a bit line yields the product of the corresponding column and the vector.
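The following behavioral sketch mimics the CAM-crossbar search: the search key is broadcast to all rows, every row compares its stored word against the key in parallel, and the indices of the matching rows are returned. A vectorized comparison stands in for what the hardware performs in a single associative lookup.

# Behavioral sketch of a CAM associative search.
import numpy as np

def cam_search(stored_rows: np.ndarray, key: np.ndarray) -> np.ndarray:
    """stored_rows: (rows, width) bit matrix; key: (width,) bit vector.
       Returns the indices of rows whose contents equal the key."""
    matches = np.all(stored_rows == key, axis=1)   # all rows compared "in parallel"
    return np.flatnonzero(matches)

rows = np.array([[1, 0, 1, 1],
                 [0, 1, 1, 0],
                 [1, 0, 1, 1]])
print(cam_search(rows, np.array([1, 0, 1, 1])))    # -> [0 2]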
                  
                  Fig. 6 shows the overall architecture of PIM-GCN [39], which consists of a central controller, a search engine, and two computing engines.
                     Each of these comprises a CAM crossbar and a MAC crossbar, and the two computing engines
                     operate in a typical ping-pong architecture, alternately performing aggregation and
                     combination. The central controller initially loaded the graph data and finally exported
                     the GCN results back to the external DRAM. It also generates the necessary control
                     logic for the CAM crossbar, the MAC crossbar, and the special function unit (SFU).
                     The SFU, composed of a shift-and-add (S&A) unit and scalar arithmetic and logic (sALU)
                     units, processes the partial results derived from the MAC crossbar. PIM-GCN introduces
                     not only a hardware architecture that can maximize inter-vertex parallelism, but also
                     a technique for optimizing node grouping without violating independence, providing
                     scheduling for these groups to operate independently at each layer. It also proposes
                     a timing strategy to reduce idle time owing to differences in read/write latency.
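The ping-pong operation of the two computing engines can be sketched as follows: while one engine aggregates the features of vertex group g, the other combines (applies the weight matrix to) the already-aggregated group g-1, and the roles swap every step. The functions aggregate and combine are hypothetical stand-ins for the crossbar operations in [39].

# Illustrative ping-pong scheduling of two compute engines.
def ping_pong_schedule(vertex_groups, aggregate, combine):
    results, pending = [], None          # `pending` = aggregated but not yet combined
    for step, group in enumerate(vertex_groups):
        engine_a, engine_b = (0, 1) if step % 2 == 0 else (1, 0)
        aggregated = aggregate(engine_a, group)         # engine A: aggregation
        if pending is not None:
            results.append(combine(engine_b, pending))  # engine B: combination
        pending = aggregated
    if pending is not None:
        results.append(combine(0, pending))             # drain the last group
    return results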
                  
GCIM [40] is an accelerator study that presents a software-hardware co-design approach and is the first to enable efficient GCN data processing in 3D-stacked memory. From a hardware design perspective, GCIM proposes a logic-in-memory (LIM) die that integrates lightweight computing units near the DRAM banks, fully utilizing the bandwidth and parallelism at the bank level. GCIM offloads memory-bound aggregation operations onto the LIM die. Each LIM bank group is equipped with an LLU consisting of a MAC array, a vertex feature buffer (VFB), a look-ahead FIFO, a CAM, and a controller to accelerate the aggregation phase. The MAC array executes the aggregation operations.
                     VFB is used to buffer the output features during the aggregation phase. Look-ahead
                     FIFO is a special edge buffer implemented as a scratch-pad memory that processes the
                     frontmost edge upon receiving a signal from the controller. The CAM provides key-value
                     storage that records the ID of nonlocal vertices and the local addresses where their
replicas are buffered. The controller is a data-driven control unit that processes the aggregation operations of local vertices. On the software side, GCIM proposes a locality-aware data-mapping algorithm. It balances the workload by splitting the input graph into subgraphs according to the connection strength between nodes; the connection between two vertices is considered strong if the edge weight between them is large or if multiple paths exist between them. The resulting subgraphs are assigned to vaults and mapped to LIM bank groups. This mapping is optimized to utilize the high internal bandwidth and reduce unnecessary data movement, significantly improving computational efficiency while preventing
                     redundant calculations. In addition, the GCIM adopts a sequential mapping strategy
                     to maximize data locality and minimize the processing delay of the aggregation. This
                     optimization technique uses dynamic programming [41], a mechanism that saves the optimal solution of a subproblem and reuses it to determine
the optimal solution of the entire problem. In the reported experiments, GCIM demonstrated a remarkable improvement in inference speed over other designs, achieving speedups of 580.02${\times}$ over HyGCN [42], 275.37${\times}$ over CIM-HyGCN, and 272.01${\times}$ over PyG-CPU [43].
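The locality-aware grouping idea can be sketched as follows, with the caveat that this greedy heuristic is only an approximation of GCIM's actual algorithm: edges are visited in order of decreasing connection strength, strongly connected vertices are pulled into the same subgraph, and subgraph sizes are capped so that the workload assigned to each LIM bank group stays balanced.

# Hedged sketch of locality-aware graph partitioning (not GCIM's algorithm).
def partition_graph(edges, num_vertices, max_group_size):
    """edges: list of (u, v, weight). Returns a list of vertex groups."""
    # Visit edges in order of decreasing connection strength (edge weight here).
    edges = sorted(edges, key=lambda e: -e[2])
    group_of, groups = {}, []
    for u, v, _w in edges:
        gu, gv = group_of.get(u), group_of.get(v)
        if gu is None and gv is None and max_group_size >= 2:
            groups.append({u, v})                       # start a new subgraph
            group_of[u] = group_of[v] = len(groups) - 1
        elif gu is not None and gv is None and len(groups[gu]) < max_group_size:
            groups[gu].add(v); group_of[v] = gu
        elif gv is not None and gu is None and len(groups[gv]) < max_group_size:
            groups[gv].add(u); group_of[u] = gv
    # Any untouched vertex gets its own group.
    for v in range(num_vertices):
        if v not in group_of:
            groups.append({v}); group_of[v] = len(groups) - 1
    return groups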
                  
Although the two studies discussed above are based on PIM hardware architectures built on different memory technologies, they both proposed algorithms for grouping and mapping graph nodes in a memory-friendly manner and effectively handled the GCN aggregation and
                     combination operations.
                  
                  
                        Fig. 6. PIMGCN architecture overview (Reprinted from [39] with permission).
 
                
               
                     4. PIM for RNN
RNN and long short-term memory (LSTM) structures can be effectively accelerated with PIM owing to their similarity to CONV layers and their ability to reuse feature maps and weights. ERA-LSTM [103] is a PuM architecture that uses ReRAM crossbars. It optimizes the RNN weight precision and the digital-to-analog converters (DACs) of the PIM architecture of Long et al. [100] and applies a systolic dataflow to improve computing efficiency and
                     performance. Fig. 7(a) shows the overall structure of ERA-LSTM. The VMM unit in Fig. 7(b) stores the weights of the four LSTM gates and uses a digital-to-analog converter
                     to deliver the input data and hidden states from the I/O buffer to the analog ReRAM
                     crossbar. The computational results of the VMM unit are transmitted to an element-wise
                     (EW) unit. The EW unit enables EW operation of the LSTM cell in the three feedforward
layers. In addition, the VMM and EW units efficiently handle each of the four gate weights (e.g., $W_{f}$, $W_{i}$, $W_{g}$, $W_{o}$) by splitting each weight into four sub-weights (e.g., $W_{00}$-$W_{11}$) and mapping each sub-weight onto a tile for computation. Furthermore, the NN operation uses an approximator to minimize the overhead caused by analog-to-digital converters, achieving a 6.1${\times}$ higher computational efficiency compared with Long et al. [100].
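The weight-splitting scheme described above can be illustrated with the sketch below, which assumes a simple 2${\times}$2 split: each gate weight (e.g., $W_{f}$) is divided into four sub-weights $W_{00}$-$W_{11}$, each of which would occupy its own crossbar tile, and the tile-level partial products are summed to recover the full gate pre-activation.

# Sketch of splitting one LSTM gate weight into four tiles (2x2 assumed).
import numpy as np

def split_2x2(W: np.ndarray):
    r, c = W.shape[0] // 2, W.shape[1] // 2
    return {"W00": W[:r, :c], "W01": W[:r, c:],
            "W10": W[r:, :c], "W11": W[r:, c:]}

def tiled_gate(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    tiles, c = split_2x2(W), W.shape[1] // 2
    x_top, x_bot = x[:c], x[c:]
    # Each tile is one crossbar VMM; partial sums are accumulated per half.
    top = tiles["W00"] @ x_top + tiles["W01"] @ x_bot
    bot = tiles["W10"] @ x_top + tiles["W11"] @ x_bot
    return np.concatenate([top, bot])

W_f = np.arange(16, dtype=float).reshape(4, 4)
x = np.ones(4)
assert np.allclose(tiled_gate(W_f, x), W_f @ x)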
                  
PSB-RNN [104] is another PuM architecture that uses a ReRAM crossbar. PSB-RNN transforms the MAC operations required by the RNN model into operations on a single weight matrix using the fast Fourier transform (FFT). The real ($Re$) and imaginary ($Im$) components of the resulting matrix are mapped onto the ReRAM crossbar, so that the complex-valued results can be recovered from the individual PE results. This method yields a computational efficiency 17${\times}$ higher than that of Long et al. [100] for the LSTM model. Although this approach requires additional operations and tasks beyond the data mapping of the traditional LSTM model, it provides an effective method for ReRAM-crossbar PIM by mapping the data for the complex-number operations required by the MACs and exploiting the corresponding dataflow.
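The complex-number mapping can be made concrete with the following sketch, which shows one standard way a complex matrix-vector product (such as those arising from the FFT-domain formulation) decomposes into real-valued crossbar MACs on the $Re$ and $Im$ parts; the block-circulant compression itself and PSB-RNN's specific crossbar layout are not reproduced.

# Complex matrix-vector product carried out with real-valued GEMVs only.
import numpy as np

def complex_mv_via_real_macs(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """W: complex (m, n); x: complex (n,). Uses only real GEMVs."""
    Wr, Wi = W.real, W.imag            # mapped onto separate crossbar regions
    xr, xi = x.real, x.imag
    yr = Wr @ xr - Wi @ xi             # real part: two crossbar GEMVs
    yi = Wr @ xi + Wi @ xr             # imaginary part: two crossbar GEMVs
    return yr + 1j * yi

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4)) + 1j * rng.normal(size=(3, 4))
x = rng.normal(size=4) + 1j * rng.normal(size=4)
assert np.allclose(complex_mv_via_real_macs(W, x), W @ x)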
                  
                  
Fig. 7. ERA-LSTM: (a) architecture overview; (b) mapping an LSTM cell to multiple tiles.
 
                
               
                     5. PIM for Transformer
TransPIM [106] is an HBM-based PnM design for efficient transformer processing. An arithmetic control unit (ACU) is allocated to each bank for computation, and a token-based data sharding scheme is proposed to allow parallel processing by dividing the data required for the computation across the HBM's stacked banks. The study also adopts a token-based transformer execution method, which enables independent operations between tokens, in contrast to the conventional layer-by-layer transformer execution.
Fig. 8(a) illustrates the encoder process of TransPIM. The input token size is L${\times}$D, where L is the number of tokens and D is the embedding dimension. Input tokens $I_{1}$, $I_{2}$, and $I_{3}$ are allocated to individual banks using a technique that distributes the input tokens across the N banks. Based on this, the embedding values $Q_{i}$, $K_{i}$, and $V_{i}$ corresponding to each input token are computed and assigned to the same bank, followed by a self-attention operation. For the multi-head attention (MHA), $K_{i}$ and $V_{i}$ are sequentially transferred to bank $i+1$ and passed on to the other banks for computation using the ring-broadcast technique, enabling computation with minimal data transmission between banks. Fig. 8(b) shows a decoder block, where K and V are received from the encoder for reuse, and only the last bank obtains new Q, K, and V vectors for the fully connected layer computation. The new $Q_{new}$ is broadcast to all other banks to calculate the attention scores, and $K_{new}$ and $V_{new}$ are concatenated with the previous $K_{i}$ and $V_{i}$ of the last bank. Each bank stores the weights for Q, K, and
                     V during this time, and the ring broadcast technique is employed to reuse the stored
                     weights and Q, K, and V values in the other banks, facilitating the efficient processing
                     of repeated NN operations. To this end, this study incorporates the ACU onto the banks
                     of HBM memory and adds a ring broadcast unit between the banks. This allows for a
                     reduction of more than 30.8% in the data movement overhead on average compared with
                     the existing transformer, with only 4% additional area overhead relative to the original
                     DRAM. This study ensured that the PIM power remained below the DRAM power budget of
                     60 W.
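The token-sharded dataflow with ring broadcast can be sketched as follows: bank i holds $Q_{i}$, $K_{i}$, and $V_{i}$ for its own tokens, and in each of N steps every bank processes the K/V shard it currently holds and then forwards it to its neighbor, so every bank eventually sees all keys and values while only communicating with its neighbor. Softmax scaling and the exact scheduling of the real design are simplified.

# Illustrative ring-broadcast attention over token shards.
import numpy as np

def ring_attention(Q_shards, K_shards, V_shards):
    n = len(Q_shards)
    # Each bank starts with its own K/V shard; shards rotate around the ring.
    k_cur, v_cur = list(K_shards), list(V_shards)
    scores = [[] for _ in range(n)]
    values = [[] for _ in range(n)]
    for _ in range(n):
        for b in range(n):                              # all banks work in parallel
            scores[b].append(Q_shards[b] @ k_cur[b].T)  # partial Q_b K^T block
            values[b].append(v_cur[b])
        # ring step: every bank forwards its current K/V shard to bank b+1
        k_cur = [k_cur[(b - 1) % n] for b in range(n)]
        v_cur = [v_cur[(b - 1) % n] for b in range(n)]
    outs = []
    for b in range(n):
        s = np.concatenate(scores[b], axis=1)           # (tokens_b, total_tokens)
        v = np.concatenate(values[b], axis=0)           # (total_tokens, d)
        p = np.exp(s - s.max(axis=1, keepdims=True))
        outs.append((p / p.sum(axis=1, keepdims=True)) @ v)
    return outs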
                  
ReTransformer [107] proposes and applies optimization techniques to accelerate the GEMV operations of the transformer inference process and to implement softmax with low power in ReRAM-based PIM. The study follows a similar direction to existing ReRAM-based PIM work targeting transformer workloads: MatMul operations are implemented inside the ReRAM and optimization techniques are applied so that the latency of the computation can be reduced. Specifically, the paper proposes decomposing the computation into two consecutive multiplication steps to resolve the compute-write-compute dependency that arises when implementing the MatMul between Q and $K^{T}$ in ReRAM during transformer inference. Consequently, the latency of writing intermediate results into the ReRAM crossbar can be eliminated. In addition, a modified hybrid softmax formulation that maximizes the utilization of the ReRAM crossbar arrangement was proposed and applied to the softmax operation; as a result, the softmax implementation consumes only 0.691 mW, compared with 1.023 mW for the conventional softmax operation. Finally, this study achieved a 23.21${\times}$ improvement in computational efficiency and a 1,086${\times}$ reduction in power consumption compared with an NVIDIA TITAN RTX GPU.
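The dependency and its removal can be illustrated with the following sketch: computing Q and K first forces K to be written into the crossbar before Q$K^{T}$ can start, whereas regrouping Q$K^{T}$ = X($W_{Q}W_{K}^{T}$)$X^{T}$ turns it into two consecutive multiplications whose operands are already resident. This regrouping is one decomposition consistent with the description above; the exact scheme used in [107] may differ.

# Hedged sketch of removing the compute-write-compute dependency.
import numpy as np

def naive_attention_scores(X, W_Q, W_K):
    Q = X @ W_Q
    K = X @ W_K          # intermediate result must be written back first
    return Q @ K.T

def regrouped_attention_scores(X, W_Q, W_K):
    M = W_Q @ W_K.T      # weights are static, so M can be pre-programmed
    return (X @ M) @ X.T # two consecutive multiplications, no write of K

rng = np.random.default_rng(1)
X, W_Q, W_K = rng.normal(size=(5, 8)), rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
assert np.allclose(naive_attention_scores(X, W_Q, W_K),
                   regrouped_attention_scores(X, W_Q, W_K))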
                  
                  
                        Fig. 8. Token-based data sharding scheme and the dataflow of Transformer: (a) encoder; (b) decoder in TransPIM (Reprinted from [106] with permission).
 
                
               
                     6. Discussions
PIM is a new architecture that integrates processing and memory units into the
                     PE, thereby enabling efficient data processing. However, owing to the integration
                     of computational functions into memory cells, PIM may be limited in handling complex
                     operations, and can cause performance degradation when computationally intensive operations
                     are required. Moreover, PIM's complex control structure and limited memory capacity
limit the full and effective handling of increasingly large AI workloads. For
                     PIM cores to be effectively applied to AI workloads, clear criteria are required to
                     determine whether operands should be computed in the host processor or the PIM core.
                     These criteria are typically derived by statistically analyzing the results measured
at the functional level using benchmark simulators [34,35]. In addition, the PIM design process must carefully incorporate the mapping of operations and parameters, as well as dataflows that account for complex operations.
                     In previous PIM studies, these considerations were designed heuristically. However,
                     with increasingly diverse PIM architectures and algorithms, there is an urgent need
                     for research on compilers that can automatically optimize workload functions, data
                     mapping, and data flow.