
  1. (Department of Electronic Engineering, Sogang University, Seoul, South Korea.)
  2. (Department of System Semiconductor Engineering at Sogang University, Seoul, Republic of Korea)



Molecular dynamics, LAMMPS, off-chip communication, hardware accelerator, data compression

I. INTRODUCTION

Molecular dynamics (MD) simulations are essential tools for modeling the physical behavior of materials at the atomic and molecular scale. In particular, the semiconductor device industry depends on accurate predictions of atomic-level interactions as devices are aggressively scaled to deep submicrometer technologies. MD simulations play a critical role in understanding phenomena such as thin-film growth [2], secondary ion mass spectrometry (SIMS) [3], and the physical properties of nanotechnological devices [4], contributing to the development of new materials and device designs.

One of the key requirements for accurate modeling of these atomic-level interactions is the use of appropriate interatomic potentials, such as the Tersoff potential [5]. The Tersoff potential is a widely used empirical potential specifically designed for simulating covalently bonded materials including silicon and carbon, which are fundamental constituents of semiconductor devices. It incorporates bond order and angular dependence, enabling realistic simulation of bond formation and breaking, as well as capturing the mechanical and thermal properties of materials [6]. By utilizing the Tersoff potential, MD simulations can more precisely predict the behavior of semiconductor materials under various conditions, which is critical for device design and analysis.

To perform MD simulations, the forces between atoms must be computed. These forces are derived from the gradients of the interatomic potentials, thereby capturing the interactions between atoms. Once the forces are determined, the evolution of the system over time is obtained by solving Newton's equations of motion. One of the most popular integration schemes for the equations of motion in MD simulations is the Verlet algorithm [7], owing to its simplicity and numerical stability; it efficiently integrates the equations of motion by updating atomic positions and velocities based on the calculated forces. This allows for precise modeling of atomic trajectories, enabling simulations to accurately capture the dynamic behavior of materials at the atomic level.

However, MD simulations are computationally intensive and require large-scale data processing, which leads to a fundamental performance problem. LAMMPS suffers performance degradation in message passing interface (MPI) [8]-based parallel computation due to the overhead of inter-chip communication. For example, when using 2 MPI tasks with 256K atoms, the forward communication of each process involves 22,282 atoms; under the same conditions, increasing the number of atoms to 512K raises this to 46,066 atoms per process. If this data is transferred via PCIe with a bandwidth of 8 GB/s, as shown in Fig. 2(a), the communication time is 16,711.5 ns for 256K atoms and 34,549.5 ns for 512K atoms. As the volume of simulation data grows, these communication overheads increase the overall computation time and energy consumption, reducing the efficiency of large-scale simulations.
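As a quick sanity check of these figures (a back-of-the-envelope estimate assuming three FP16 coordinates, i.e., 6 bytes, per communicated atom):

$ t_{256\mathrm{K}}=\frac{22{,}282\times 6\ \mathrm{B}}{8\ \mathrm{GB/s}}\approx 16{,}711.5\ \mathrm{ns},\qquad t_{512\mathrm{K}}=\frac{46{,}066\times 6\ \mathrm{B}}{8\ \mathrm{GB/s}}\approx 34{,}549.5\ \mathrm{ns}. $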

To address this problem, various parallelization packages such as OPENMP and KOKKOS have been proposed [1]. OPENMP supports parallel processing through multithreading in shared-memory systems, and KOKKOS provides flexible parallelization strategies for efficient computation across different hardware platforms. While these approaches are effective at improving computational performance, they do little to reduce the communication overhead of multi-core and multi-chip data transfers. In this study, we propose a compressed molecular dynamics (CMD) architecture to reduce the communication overhead in large-scale MD simulations, thereby improving system-level energy efficiency, as follows.

1) We analyze the communication data patterns of the floating-point (FP) number representation, focusing on the sign and exponent fields.

2) For the most frequent patterns, we use a compressed data format with a reduced bit-width for the sign and exponent fields.

3) Based on these methods, we introduce a CMD architecture that minimizes energy consumption.

4) Finally, to the best of our knowledge, this is the first work to reduce the amount of data communication in fast, parallel MD computation.

II. PRELIMINARIES

1. Verlet Algorithm

MD simulations rely on computational methods that numerically solve Newton's equations of motion to predict the trajectories of atoms over time. Accurate force calculations are essential in these simulations, as they determine how atoms interact within a system.

The Verlet algorithm is one of the most widely used time-integration methods in MD simulations, tracking the dynamic behavior of the system by updating the positions and velocities of atoms. In this algorithm, the new position of each atom is calculated from its previous and current positions together with its acceleration. To obtain the acceleration, the forces acting on the atom must be accurately calculated, which requires considering the interactions with neighboring atoms.
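For reference, the position update described above can be written in the standard Verlet form (with timestep $\Delta t$ and acceleration $\mathbf{a}_i$; a textbook statement rather than a quotation from [7]):

$ \mathbf{r}_i(t+\Delta t)=2\mathbf{r}_i(t)-\mathbf{r}_i(t-\Delta t)+\mathbf{a}_i(t)\,\Delta t^2. $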

The core of the force calculation is to collect the list of neighboring atoms around each atom and then evaluate their interactions. To do this, a cutoff radius is used, which serves as a distance threshold for the interactions to be considered. Each atom identifies its neighbors within the cutoff radius based on its current position and then calculates the forces using the distance and directional information between them.

Through this procedure, atoms interact in the following ways:

1) Neighbor list generation: Each atom must identify its neighboring atoms within the cutoff radius based on its position. To improve the indexing performance, spatial decomposition techniques (e.g., cell lists or grid methods) are used to reduce the search range.

2) Position information exchange: In a parallel computing environment, atoms may be distributed across different processors or nodes. If a neighboring atom within the cutoff radius is included in a different processor, communication between processors is necessary to retrieve the position information of that atom.

3) Force calculation and update: After receiving the positions of the neighboring atoms, each atom calculates the forces resulting from their interactions. The calculated forces determine the acceleration of the atom, which is then used to update its new position and velocity. In Fig. 1, the labels 0, 1, 2, 3, and 4 indicate the MPI process ranks. The blue and red regions represent the cut-off distance areas for each process. Fig. 1 illustrates the communication of atomic information, referred to as "ghost" atoms in LAMMPS, between Process 0 and neighboring Processes 1 to 4, and the computation of forces resulting from the interactions between atoms. A ghost atom is not a physical atom, but a copied atom used to handle boundary conditions, playing a crucial role in accurately calculating interactions at the boundary region. Process 0 sends the position or force information of atoms within the cut-off distance to Process 2. Subsequently, Process 0 receives atomic information within the cut-off distance from Process 2.

Fig. 1. Overview of MPI communication for parallel MD computation.


Fig. 2. Dataflow of (a) baseline off-chip communication and (b) our FP data compression approach. MD chip: Conventional molecular dynamics simulation chip, CMD chip: Compressed molecular dynamics simulation chip.


2. Tersoff Potential

The forces involved in this calculation are generally derived from interatomic potentials, which are mathematical functions that describe the potential energy between atoms based on their positions. Among these potentials, the Tersoff potential [5] is particularly significant for modeling covalently bonded materials such as silicon and carbon, which are fundamental to semiconductor devices. The total energy E of the system is calculated as follows [9]:

$ E= \frac{1}{2}\sum_i \sum_{j\neq i}f_c(r_{ij})[f_R(r_{ij})+b_{ij}f_A(r_{ij})]. $

In this equation, $r_{ij}$ represents the distance between particle pair $(i,j)$. The term $f_R(r_{ij})$ is the repulsive function, and $f_A(r_{ij})$ is the attractive function, which is multiplied by the bond order parameter $b_{ij}$. The total energy is computed as the summation of the above terms, where the result is further multiplied by the cutoff function $f_c(r_{ij})$, ensuring that interactions beyond a certain distance are neglected.

The repulsive function represents the short-range repulsive forces between atoms. It is defined as

$ f_R(r_{ij})=A\exp(-{\lambda }_1r_{ij}), $

where $A$ is a two-body interaction parameter representing the strength of the repulsive interaction, and ${\lambda }_1$ is a decay constant that determines how quickly the repulsive force decreases with distance. The attractive function represents the attractive forces driving bond formation between atoms. It is given by

$ f_A(r_{ij})=-B\exp(-{\lambda }_2r_{ij}), $

where $B$ is a two-body interaction parameter that determines the strength of the attractive interaction, and ${\lambda }_2$ is a decay constant for the attractive potential.

The cutoff function smoothly reduces the interaction to zero beyond a certain distance, since including long-range pairs would make the calculations prohibitively complex to handle in MD simulations [9]. It ensures that only nearby atoms within a specified cutoff radius $R$ contribute significantly to the total energy.

The cutoff function is defined as

$ f_c(r_{ij}) = \begin{cases} 1, & \text{if } r_{ij} < R - D \\[2ex] \frac{1}{2} + \frac{1}{2} \cos\left[ \frac{\pi (r_{ij} - R + D)}{2D} \right], & \text{if } R - D \leq r_{ij} \leq R + D \\[2ex] 0, & \text{if } r_{ij} > R + D \end{cases} $

where $D$ is a smoothing parameter that ensures continuity of the potential and its first derivative at the cutoff boundary.
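As an illustration, the following is a minimal C++ sketch of these pairwise terms (illustrative only, not the LAMMPS implementation; the material parameters A, B, lambda1, lambda2, R, and D must be supplied, e.g., from the parameterizations in [5,9]):

#include <cmath>

static const double PI = std::acos(-1.0);

// Smooth cutoff f_c(r): 1 inside R-D, cosine taper over [R-D, R+D], 0 beyond R+D.
double f_cut(double r, double R, double D) {
    if (r < R - D) return 1.0;
    if (r > R + D) return 0.0;
    return 0.5 + 0.5 * std::cos(PI * (r - R + D) / (2.0 * D));
}

double f_rep(double r, double A, double lambda1) { return  A * std::exp(-lambda1 * r); } // f_R
double f_att(double r, double B, double lambda2) { return -B * std::exp(-lambda2 * r); } // f_A

// Energy contribution of one (i, j) pair with bond order b_ij.
double pair_energy(double r, double b_ij, double A, double B,
                   double lambda1, double lambda2, double R, double D) {
    return f_cut(r, R, D) * (f_rep(r, A, lambda1) + b_ij * f_att(r, B, lambda2));
}

Summing pair_energy() over all pairs within the cutoff, with the 1/2 prefactor, reproduces the total energy expression above.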

III. PROPOSED COMPRESSED MOLECULAR DYNAMICS (CMD) ARCHITECTURE

In the previous section, we observed that the positional information of atoms must be exchanged with other processors or nodes during the communication steps of the Verlet algorithm in a parallel computing environment. The energy consumption of inter-chip communication becomes significantly high in MPI-based parallelized multi-chip computations on large-scale simulation data. To mitigate this problem, we propose a CMD architecture in this section. The key idea behind our work is to count the sign and exponent cases of the communicated atoms and to encode the most frequent cases into a 1-bit representation.

1. Motivation: Expensive Board-to-Board Communication

Fig. 2 compares the board-to-board communication of the baseline and our work. For simplicity of explanation, we assume an MPI-based parallelization with two processors, referred to as board 0 and board 1. For inter-chip communication, board 0 reads the position information of atoms from the DRAM of board 0 and transmits this data to board 1 via the data bus (PCI-Express). Simultaneously, board 1 reads the position information of atoms from the DRAM of board 1 and sends it to board 0 over the same data bus. Upon receiving the position data, board 0 writes the data from board 1 into its DRAM, while board 1 similarly writes the data received from board 0 into its DRAM. This process involves a massive movement of positional information over both the DRAM interface and the data bus.

As shown in Fig. 2(a), the bandwidth of the DRAM is much larger than the bandwidth of PCIe, resulting in a bottleneck at the PCIe interface. Additionally, the energy consumption per bit for PCIe is higher than that of DRAM, making the board-to-board communication particularly critical in terms of energy efficiency.

To address this issue, as shown in Fig. 2(b), we reduce the data width of the atom information sent to the other processor in order to reduce energy consumption. The sign and exponent cases of the data read from the DRAM of board 0 are counted, and the most frequently used case is compressed into a 1-bit representation. The compressed data is then transmitted to board 1 via the data bus, and board 1 decompresses the received data and writes it to the DRAM of board 1. Board 1 performs the same operations in parallel with board 0 (the method is explained in the following sections).

2. Case Count

Fig. 3 illustrates the process of case counting with an example. In step 1 of the case count process, when atom data in a conventional floating-point (FP) format is input, the sign and exponent fields of the input are used to analyze the frequency of each sign-and-exponent value. If the sign and exponent fields match an entry in the case count SRAM, the count value at the corresponding address is incremented by 1. If an incremented count (count $+1$) belonging to neither of the top entries (top1/top2) exceeds the current second-highest count (top2), it replaces the top2 entry. Similarly, if top2 exceeds the current highest count (top1), top2 is promoted to top1. This process is repeated for all atoms involved in communication.

Once the counts for all atoms have been completed, as shown in step 2 of Fig. 3, a lookup table is generated, containing the count values for the frequent sign and exponent cases of the input. The two cases with the highest count values are then passed to the compressor. For the Verlet algorithm, atomic positions and velocities are updated using the Initial_integrate() function. In this step, the sign and exponent cases of the position values are counted. Therefore, no additional iteration over the data is required for counting, and the associated latency overhead is negligible.
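For clarity, a minimal software sketch of the counting and top-2 selection is given below (illustrative; the hypothetical count_cases() helper uses a hash map where the hardware uses the case count SRAM, and assumes the data are IEEE-754 FP16 words reinterpreted as uint16_t):

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Count sign+exponent patterns (top 6 bits of an FP16 word) and return the two most
// frequent patterns (top1, top2), mirroring step 1 and step 2 of Fig. 3.
std::pair<uint16_t, uint16_t> count_cases(const std::vector<uint16_t> &fp16_words) {
    std::unordered_map<uint16_t, uint32_t> count;          // stands in for the case count SRAM
    for (uint16_t w : fp16_words)
        ++count[static_cast<uint16_t>(w >> 10)];           // 1-bit sign + 5-bit exponent

    uint16_t top1 = 0, top2 = 0;
    uint32_t c1 = 0, c2 = 0;
    for (const auto &kv : count) {                         // step 2: select the top-2 entries
        if (kv.second > c1)      { top2 = top1; c2 = c1; top1 = kv.first; c1 = kv.second; }
        else if (kv.second > c2) { top2 = kv.first; c2 = kv.second; }
    }
    return std::make_pair(top1, top2);
}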

Fig. 3. The detailed illustration of the proposed compression approach. For the compressed floating-point format, 2 stages including case count and compressor are used.


3. Compressor

When atom data in a conventional FP format is input, the sign and exponent of the input are evaluated to determine whether they correspond to one of the two cases identified in the previous case counting procedure. If the input belongs to a non-frequent case, the data is sent to another board in the vanilla FP format without compression. However, if the input matches the most frequent case, it is compressed into a 1-bit representation. As shown in Fig. 3, the decimal values in the conventional FP format and the proposed most frequently used case (MFC) format remain unchanged. This indicates that compression does not affect the accuracy of the data.
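A minimal sketch of the corresponding compress and decompress steps is shown below (illustrative only; it assumes FP16 words and the two patterns top1/top2 selected by the case count, with MfcWord as a hypothetical container, so the on-wire packing in hardware may differ):

#include <cstdint>

// 'compressed' corresponds to the compression-map bit described in Section III-4.
struct MfcWord { bool compressed; uint16_t code; };

MfcWord compress(uint16_t fp16, uint16_t top1, uint16_t top2) {
    uint16_t se       = fp16 >> 10;                // sign + exponent (6 bits)
    uint16_t mantissa = fp16 & 0x03FF;             // 10-bit mantissa, kept verbatim
    if (se == top1) return {true, static_cast<uint16_t>((1u << 10) | mantissa)};  // M1 -> '1'
    if (se == top2) return {true, static_cast<uint16_t>((0u << 10) | mantissa)};  // M2 -> '0'
    return {false, fp16};                          // non-frequent case: sent uncompressed
}

uint16_t decompress(MfcWord w, uint16_t top1, uint16_t top2) {
    if (!w.compressed) return w.code;                               // already a full FP16 word
    uint16_t se = (w.code >> 10) ? top1 : top2;                     // restore sign + exponent
    return static_cast<uint16_t>((se << 10) | (w.code & 0x03FF));   // reattach the mantissa
}

Because the mantissa is carried verbatim and the sign/exponent are restored exactly from top1/top2, the round trip is lossless, which is the sense in which compression does not affect accuracy.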

Fig. 4. Distribution of sign and exponent patterns in FP16 communication data during the simulation of 512K atoms with 2 MPI tasks. (a) Forward communication. (b) Reverse communication.


Fig. 4 illustrates the distribution of sign and exponent patterns in FP16 communication data when simulating 512K atoms using 2 MPI tasks.

Fig. 4(a) shows the pattern distribution of the 1-bit sign and 5-bit exponent during forward communication, while Fig. 4(b) presents the distribution during reverse communication. When using two patterns for compression, 70% of the total communication data can be compressed. As the number of patterns increases, the portion of data that can be compressed also increases. However, as the number of target patterns grows from 2 to 4 to 8, the compressed word requires one additional selector bit at each doubling. As a result, the overhead grows faster than the gain in compressible data. Therefore, we chose to use only two patterns for compression.

To analyze this, we performed a simulation with 512K FP16-format atoms using 2 MPI tasks. In the baseline approach that transfers all data without compression, 270K 16-bit data points were transmitted. When applying the 2-pattern compression technique, 210K data points were transmitted as 11-bit compressed data and 60K data points were transmitted as uncompressed 16-bit data. Based on the PCIe energy consumption of 7.5 pJ per bit, the proposed compression method reduced the total communication energy from 32.4 μJ to 24.5 μJ. As a result, we confirmed that our method achieved a 23.6% reduction in energy consumption compared to the baseline approach. Under the same simulation conditions, compression using 4 patterns results in the transmission of 250K data points as 12-bit data and 20K data points as uncompressed 16-bit data. In this case, the total communication energy is 24.9 μJ, corresponding to a 22.6% reduction compared to the baseline, which is smaller than that of the 2-pattern scheme.
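These figures follow directly from the bit counts, using the 7.5 pJ/bit PCIe energy from Section IV:

$ E_{\mathrm{base}}=270\mathrm{K}\times 16\ \mathrm{bit}\times 7.5\ \mathrm{pJ/bit}\approx 32.4\ \mu\mathrm{J}, $

$ E_{\mathrm{2pat}}=(210\mathrm{K}\times 11+60\mathrm{K}\times 16)\ \mathrm{bit}\times 7.5\ \mathrm{pJ/bit}\approx 24.5\ \mu\mathrm{J},\qquad E_{\mathrm{4pat}}=(250\mathrm{K}\times 12+20\mathrm{K}\times 16)\ \mathrm{bit}\times 7.5\ \mathrm{pJ/bit}\approx 24.9\ \mu\mathrm{J}. $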

4. Verlet Algorithm with Compression

Algorithm 1 presents the pseudo-code of the Verlet algorithm in the proposed method. In each timestep, the Initial_integrate() function updates the positions and velocities of the atoms. The count value for the sign and exponent case of the updated position (updated PSE case) is read, incremented by 1, and written back, and this process is repeated over the entire amount of communicated atoms (EA). After the counting is completed, the position (P) is input for communication. If the PSE case matches the most frequently used case (M1), the PSE is compressed into a single bit of value 1 and concatenated with the mantissa of the position (PM). If the PSE case matches the second most frequently used case (M2), the PSE is compressed into a single bit of value 0 and concatenated with PM in the same manner. If the PSE matches neither M1 nor M2, no compression is performed. Once all compression is completed, off-chip communication with the other processor exchanges the position (P) values of ghost atoms within the cut-off distance using the Forward_comm() function.

In the Force_compute() step, forces (F) for all atoms are computed based on $f_R\left(r_{ij}\right)$ and $f_A\left(r_{ij}\right)$, as described in Section II. The force (F) is processed in the same manner as the previous steps, with counting and compression repeated for EA. The compressed force data is then exchanged with another processor through off-chip communication using the Reverse_comm() function. Finally, the Final_integrate() function updates the velocity of the atoms. This procedure is repeated for timestep iteration (N).
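The following is a compact structural sketch of one timestep as described above (not the paper's Algorithm 1 verbatim; the kernel and MPI functions are stubs named after the routines in the text, and count_cases()/compress_all() are the hypothetical helpers sketched earlier):

#include <cstdint>
#include <utility>
#include <vector>

struct MfcWord { bool compressed; uint16_t code; };   // see the compressor sketch above

// Stubs standing in for the kernels and MPI calls referenced in the text (bodies omitted).
static void Initial_integrate(std::vector<uint16_t>&, std::vector<uint16_t>&) {}
static void Force_compute(const std::vector<uint16_t>&, std::vector<uint16_t>&) {}
static void Final_integrate(std::vector<uint16_t>&, const std::vector<uint16_t>&) {}
static std::pair<uint16_t, uint16_t> count_cases(const std::vector<uint16_t>&) { return {0, 0}; }
static std::vector<MfcWord> compress_all(const std::vector<uint16_t>&, uint16_t, uint16_t) { return {}; }
static void Forward_comm(const std::vector<MfcWord>&) {}
static void Reverse_comm(const std::vector<MfcWord>&) {}

// One CMD timestep: count and compress around each communication phase.
void run_timesteps(std::vector<uint16_t> &pos, std::vector<uint16_t> &vel,
                   std::vector<uint16_t> &frc, int N) {
    for (int step = 0; step < N; ++step) {
        Initial_integrate(pos, vel);                                  // update positions, velocities
        std::pair<uint16_t, uint16_t> m_pos = count_cases(pos);       // M1/M2 of position sign+exponent
        Forward_comm(compress_all(pos, m_pos.first, m_pos.second));   // exchange compressed ghost positions
        Force_compute(pos, frc);                                      // Tersoff force evaluation
        std::pair<uint16_t, uint16_t> m_frc = count_cases(frc);       // M1/M2 of force sign+exponent
        Reverse_comm(compress_all(frc, m_frc.first, m_frc.second));   // exchange compressed ghost forces
        Final_integrate(vel, frc);                                    // final velocity update
    }
}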

To clearly distinguish between compressed data and uncompressed FP16 values during decoding, we employ a compression map. This map is implemented as a 1-bit stream, introducing an overhead of 6.25% over the original FP16 data. As shown in Fig. 4(a), if 71.3% of the total elements are compressed, the average number of bits per element is reduced to 13.4 bits. Therefore, the proposed compression method reduces the communication volume by 16% and the overhead introduced by the compression map becomes negligible.
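The 13.4-bit figure is the weighted average including the 1-bit compression map:

$ \bar{b}=0.713\times 11+0.287\times 16+1\approx 13.4\ \mathrm{bits},\qquad 1-\frac{13.4}{16}\approx 16\%. $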

5. Update Unit

The position update unit performs a function equivalent to Initial_integrate() in the Verlet algorithm. In the position update unit, Newton's second law $F=ma$ is used to update the velocity $V$ and position $P$ from the input force $F$ and velocity $V$. Two parameters are used in this process: $dtfm$ and $dtv$. The $dtfm$ parameter converts the force into a velocity increment ($F/m$, i.e., the acceleration, integrated over the timestep), and the $dtv$ parameter converts the velocity into a position increment (the velocity integrated over the timestep). The input $F$, multiplied by the $dtfm$ parameter, is added to the velocity $V$, while the input $V$, multiplied by the $dtv$ parameter, is added to the position $P$. This update is performed for all atoms, and once communication and the force computation are completed, the velocity update unit is triggered.

The velocity update unit performs a function equivalent to Final_integrate() in the Verlet algorithm. In this unit, the velocity $V$ is updated based on the input force $F$: the input force $F$, multiplied by the $dtfm$ parameter, is added to the velocity $V$. As in the position update unit, this update is performed for all atoms.
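A minimal arithmetic sketch of the two update units is shown below (illustrative, not the RTL; it assumes $dtfm = \Delta t/(2m)$ and $dtv = \Delta t$ following the velocity-Verlet convention, where the exact scaling of $dtfm$ depends on the unit system):

// Minimal sketch of the position and velocity update units.
struct AtomState { double p[3], v[3], f[3]; };

// Position update unit (~ Initial_integrate): v += dtfm * F, then p += dtv * v.
void position_update(AtomState &a, double dtfm, double dtv) {
    for (int d = 0; d < 3; ++d) {
        a.v[d] += dtfm * a.f[d];
        a.p[d] += dtv  * a.v[d];
    }
}

// Velocity update unit (~ Final_integrate): v += dtfm * F with the newly computed force.
void velocity_update(AtomState &a, double dtfm) {
    for (int d = 0; d < 3; ++d)
        a.v[d] += dtfm * a.f[d];
}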

6. Compute Unit

The compute unit contains four calculation units: the repulsive unit, zeta unit, force_zeta unit, and attractive unit, which together perform a function equivalent to the Tersoff pair computation (pair_tersoff()) invoked within the Verlet loop. The compute unit is responsible for calculating forces for both two-atom and three-atom interactions. The unit iterates over the neighbor list of every atom. It receives the positions of atom $i$ and atom $j_1$, computes the distance $r$ between them, and activates the repulsive unit only if the distance is less than the cutoff distance, generating a repulsive force value through the repulsive unit. The process is repeated for each neighboring atom $j_1$ of atom $i$. At each iteration, the force value from the repulsive interaction is subtracted from the existing force for atom $j_1$.

Once the force generation for neighboring atom $j_1$ is complete, the positions of atom $j_2$ and atom $k_1$ are received. The distance between atom $i$ and atom $j_2$ is calculated, and if it is less than the cutoff distance, the distance between atom $i$ and atom $k_1$ is also computed. If this distance is also below the cutoff distance, the zeta unit is triggered, accumulating the zeta value for the $ij$ pair. Zeta is a parameter used to adjust the bond-order.

After repeating the zeta unit process, the force_zeta unit is executed to generate force values. Finally, the positions of a new atom $k_2$ are received, and the attractive unit is activated if the distance between atoms is below the cutoff distance.

During the repetitions for $k_2$ iterations, the forces from the attractive unit for atoms $i$, $j_2$, and $k_2$ are accumulated. After the loop for $k_2$ ends, the forces for atom $j_2$ are accumulated during the repetitions for $j_2$ iterations. Finally, once the loop for $j_2$ is complete, the forces for atom $i$ are accumulated during the repetitions for $i$ iterations. The accumulated or subtracted forces up to this point are written back to the global buffer at the end of each iteration.
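To make the loop structure concrete, the following is a structural sketch of the compute-unit flow (illustrative only: the four helper kernels stand in for the repulsive, zeta, force_zeta, and attractive units and are left as stubs, forces are reduced to scalars, and the detailed per-atom force bookkeeping of the actual Tersoff implementation is omitted):

#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

static double dist(const Vec3 &a, const Vec3 &b) {
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx*dx + dy*dy + dz*dz);
}

// Stubs standing in for the four calculation units (bodies omitted).
static double repulsive(double r)                          { return 0.0; }
static double zeta_term(double rij, double rik)            { return 0.0; }
static double force_zeta(double rij, double z)             { return 0.0; }
static double attractive(double rij, double rik, double z) { return 0.0; }

void compute_unit(const std::vector<Vec3> &pos,
                  const std::vector<std::vector<int>> &neigh,
                  std::vector<double> &force, double rcut) {
    for (std::size_t i = 0; i < pos.size(); ++i) {
        // Two-body pass: the repulsive unit fires for each neighbor j1 inside the cutoff.
        for (int j1 : neigh[i]) {
            double r = dist(pos[i], pos[j1]);
            if (r < rcut) force[j1] -= repulsive(r);
        }
        // Three-body pass: accumulate zeta over k1, then force_zeta and attractive terms.
        for (int j2 : neigh[i]) {
            double rij = dist(pos[i], pos[j2]);
            if (rij >= rcut) continue;
            double z = 0.0;
            for (int k1 : neigh[i]) {
                double rik = dist(pos[i], pos[k1]);
                if (k1 != j2 && rik < rcut) z += zeta_term(rij, rik);            // zeta unit
            }
            force[j2] += force_zeta(rij, z);                                     // force_zeta unit
            for (int k2 : neigh[i]) {
                double rik = dist(pos[i], pos[k2]);
                if (k2 != j2 && rik < rcut) force[i] += attractive(rij, rik, z); // attractive unit
            }
        }
    }
}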

7. Hardware Design

Fig. 5 illustrates the operation of the main components in the CMD architecture and the top-level structure with MPI-based parallelization. The top-level architecture consists of multiple boards (K) connected by a data bus (PCIe). In this stage, we assume that each MPI process is mapped to one board. Each board is composed of DRAM and a CMD chip.

The CMD chip includes an integrate unit, a compute unit, a communication unit, 9 global buffers of 128 KB each, and a global controller. The 9 global buffers consist of 3 position buffers (PB), 3 velocity buffers (VB), and 3 force buffers (FB), where the 3 buffers of each type correspond to the $x$, $y$, and $z$ dimensions, respectively. These global buffers distribute the required data to the integrate unit, compute unit, and communication unit.

The integrate unit comprises the velocity update unit and the position update unit. The compute unit contains the repulsive unit, attractive unit, force_zeta unit, and zeta unit. Additionally, the CMD architecture includes a case_counter and 8 KB of case_count SRAM.

The communication unit consists of a compression unit, a decompression unit, and the MPI_receive and MPI_send units used for off-chip communication. The position and velocity information is received from the global buffers and used to update the system. As described in Sections III-2 and III-4, the counting process for the cases is conducted. The case counter then stores the sign and exponent information along with the count value in the case count SRAM. As explained in Section III-3, the most frequently used case is sent to the compression unit, and the communicated atom data is received from the global buffers for compression. The compressed data is sent to the other processor via the MPI_send unit, and the atom information from the other processor is received through the MPI_receive unit. The received information is then decompressed and sent to the global buffers.

The force compute unit receives input from the global buffer, performs force calculations, and the resulting force data is counted for force cases through the case_counter.

Similar to the position compression, the most frequent case of the force is sent to the compression unit, and the communicated force data from the global buffers is compressed. The off-chip communication of the force data with the other processor is performed via the MPI_send and MPI_receive units, and the processed data is stored in the global buffers. The force and velocity information stored in the global buffers is transmitted to the velocity update unit, where the received force and velocity are used to update the velocity. The updated velocity is then written back to the global buffers.

Fig. 5. Top-level architecture of our proposed CMD approach. The number of CMD chip/board can be changed depending on the target MPI parallelism.


Fig. 6. The energy consumption comparison between the baseline and the proposed architecture across various conditions.


IV. RESULTS

1. Experimental Setup

In this section, we evaluate the performance of the CMD architecture in terms of energy consumption during data communication over the PCIe protocol. We set the vanilla communication method in LAMMPS as the baseline and compared it with the proposed compact communication scheme. We then varied the total number of atoms, the floating-point format, and the number of MPI tasks for the evaluation. In this design, we used the PCIe 4.0 protocol with 4 lanes ($\times 4$), with the maximum bandwidth set to 8 GB/s and energy per bit of 7.5 pJ/bit.

As the DRAM model, we used DDR4, with a maximum bandwidth of 25.6 GB/s and energy per bit of 5.1 pJ/bit [10]. The CMD architecture was designed using Verilog HDL, and gate-level synthesis was performed with Synopsys Design Compiler. The area of the top-level architecture was obtained through PrimeTime.

2. Results

1) Normalized Communication Energy

In the FP64 format with the 2-MPI configuration, the proposed compression scheme reduces the board-to-board communication by 11.64% in the 256K-atom case. For lower bit widths, the energy efficiency becomes higher because the mantissa occupies fewer bits relative to the sign and exponent fields, so compressing the latter saves a larger fraction of each word.

As a result, the energy consumption is reduced by 16.3% compared to the baseline. Under the same conditions, the 16-bit floating-point format shows even lower energy consumption than the 32-bit format, with a 23.2% reduction. This result indicates that reducing the bit width of the floating-point format leads to lower energy consumption. When comparing the energy consumption of the proposed communication method with a total of 256K atoms in the 16-bit floating-point format, 4 MPI tasks achieve a 23.67% energy reduction compared to the baseline, which is more efficient than 2 MPI tasks. Under the same conditions, using 8 MPI tasks results in a 24.6% energy reduction compared to the baseline, leading to further energy savings.

2) Area breakdown

Fig. 7 presents the area breakdown of the CMD architecture. The memory components consist of the input SRAM and the case_count SRAM, while the logic components include force_compute, initial_integrate, final_integrate, and the additional compressor and case_counter modules for the compressed MPI communication. The input SRAM and force_compute module occupy 97.93% of the total area. The area overhead of the compression logic, comprising the case_count SRAM, compressor, and case_counter, accounts for 0.43% of the total area. The proposed CMD architecture therefore reduces energy consumption by up to 24.6% with negligible area overhead.

Fig. 7. The area breakdown of the proposed CMD architecture.


V. CONCLUSION

In this paper, we introduced a method for reducing board-to-board communication in fast, parallel molecular dynamics simulations through a novel data compression approach called CMD. We first analyze the frequency of the sign and exponent field values in the communication phase, and the frequently used data patterns are then compressed into a 1-bit format. Across various floating-point bit widths and MPI task configurations, the compression method reduces energy consumption by up to 24.6%, while the area overhead is only 0.48%.

ACKNOWLEDGMENTS

This work was partly supported by a Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0017011, HRD Program for Industrial Innovation, 50%) and partly by the National R&D Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (2021M3H2A1038042, 50%). The EDA tools were supported by the IC Design Education Center (IDEC), Korea.

References

[1] A. P. Thompson, H. M. Aktulga, R. Berger, D. S. Bolintineanu, W. M. Brown, P. S. Crozier, P. J. in 't Veld, A. Kohlmeyer, S. G. Moore, T. D. Nguyen, R. Shan, M. J. Stevens, J. Tranchida, C. Trott, and S. J. Plimpton, ``LAMMPS - a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales,'' Computer Physics Communications, vol. 271, 108171, 2022.

[2] H. Haberland, Z. Insepov, and M. Moseler, ``Molecular-dynamics simulation of thin-film growth by energetic cluster impact,'' Physical Review B, vol. 51, no. 16, 11061, 1995.

[3] T. Aoki, S. Chiba, J. Matsuo, I. Yamada, and J. Biersack, ``Molecular dynamics and Monte Carlo simulation of sputtering and mixing by ion irradiation,'' Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms, vol. 180, no. 1-4, pp. 312-316, 2001.

[4] D. Srivastava, M. Menon, and K. Cho, ``Computational nanotechnology with carbon nanotubes and fullerenes,'' Computing in Science & Engineering, vol. 3, no. 4, pp. 42-55, 2001.

[5] J. Tersoff, ``New empirical approach for the structure and energy of covalent systems,'' Physical Review B, vol. 37, no. 12, 6991, 1988.

[6] S. Munetoh, T. Motooka, K. Moriguchi, and A. Shintani, ``Interatomic potential for Si-O systems using Tersoff parameterization,'' Computational Materials Science, vol. 39, no. 2, pp. 334-339, 2007.

[7] L. Verlet, ``Computer `experiments' on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules,'' Physical Review, vol. 159, no. 1, 98, 1967.

[8] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, ``A high-performance, portable implementation of the MPI message passing interface standard,'' Parallel Computing, vol. 22, no. 6, pp. 789-828, 1996.

[9] J. Tersoff, ``Modeling solid-state chemistry: Interatomic potentials for multicomponent systems,'' Physical Review B, vol. 39, no. 8, 5566, 1989.

[10] W. Zhao, Y. Du, M. Zhang, M. Liu, K. Jin, and R. Ausavarungnirun, ``Application-oriented data migration to accelerate in-memory database on hybrid memory,'' Micromachines, vol. 13, no. 1, 52, 2021.
Seongmin Ki

Seongmin Ki received his B.S. degree in electronic engineering from Soongsil University, Seoul, South Korea, in 2022. He is currently pursuing an M.S. degree with the Department of Electronic Engineering, Sogang University, Seoul, South Korea. His current research interests include deep neural network accelerators, processing-in-memory, and hardware accelerators for variable bit precision.

Sungju Ryu

Sungju Ryu (S'16-M'21) is currently an Assistant Professor in the Department of System Semiconductor Engineering at Sogang University, Seoul, Republic of Korea. Before joining Sogang, he was an Assistant Professor in the School of Electronic Engineering and the Department of Next Generation Semiconductor at Soongsil University from 2021 to 2023. In 2021, he was a Staff Researcher in the AI&SW Research Center of the Samsung Advanced Institute of Technology (SAIT), Suwon, Republic of Korea. At SAIT, he focused on computer architecture design. He received his B.S. degree in electrical engineering from Pusan National University, Busan, Republic of Korea, in 2015, and a Ph.D. degree in creative IT engineering from Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea, in 2021. His current research interests include energy-efficient hardware accelerators for AI applications, high-performance computing, and in-/near-memory computing.