In the previous section, we observed that the positional information of atoms must be exchanged between processors or nodes during Verlet communication in a parallel computing environment. In large-scale simulations, MPI-based parallelization across multiple chips makes the energy consumption of inter-chip communication significantly high. To mitigate this problem, we propose a CMD architecture in this section. The key idea behind our work is to count the sign and exponent cases of communicated atoms and encode the most frequent cases into a 1-bit representation.
1. Motivation: Expensive Board-to-Board Communication
Fig. 2 compares board-to-board communication in the baseline and in our work. For simplicity, we assume MPI-based parallelization with two processors, referred to as board 0 and board 1.
For inter-chip communication, board 0 reads the position
information of atoms from DRAM of board 0 and transmits
this data to board 1 via the data bus (PCI-Express).
Simultaneously, board 1 reads the position information of
atoms from DRAM of board 1 and sends it to board 0
over the same data bus (PCI-Express). Upon receiving the
position data, board 0 writes the data from board 1 into
DRAM of board 0, while board 1 similarly writes the data
received from board 0 into DRAM of board 1. This process involves a massive movement of positional information across both the DRAM interface and the data bus.
As shown in Fig. 2(a), the bandwidth of the DRAM
is much larger than the bandwidth of PCIe, resulting in
a bottleneck at the PCIe interface. Additionally, the energy
consumption per bit for PCIe is higher than that of
DRAM, making the board-to-board communication particularly
critical in terms of energy efficiency.
To address this issue, as shown in Fig. 2(b), we narrow the data width of the atom information sent to the other processor, thereby reducing energy consumption. The sign and exponent cases of the data read from the DRAM of board 0 are counted, and the most frequently used case is compressed into a 1-bit representation. The compressed data is then transmitted to board 1 via the data bus, and board 1 decompresses the received data and writes it to its DRAM. Board 1 performs the same operations in parallel with board 0 (the method is explained in the following subsections).
Fig. 2. Dataflow of (a) the baseline off-chip communication and (b) our FP data compression approach. MD chip: conventional molecular dynamics simulation chip; CMD chip: compressed molecular dynamics simulation chip.
2. Case Count
Fig. 3 illustrates the process of case counting with an
example. In step 1 of the case count process, when atom
data in a conventional floating-point (FP) format is input,
the sign and exponent of the input are used to analyze the
frequency of each sign/exponent value. If the sign and exponent fields hit an entry in the case-count SRAM, the count value at the corresponding address is incremented by 1. If the incremented count (count + 1) of an entry outside the top entries (top1/top2) exceeds the current second-highest count (top2), that entry replaces top2. Similarly, if the top2 count comes to exceed the current highest count (top1), the top2 entry is promoted to top1. This process is repeated for all atoms involved in communication.
Once the counts for all atoms have been completed, as
shown in step 2 of Fig. 3, a lookup table is generated, containing
the count values for the frequent sign and exponent
cases of the input. The two cases with the highest count
values are then passed to the compressor. For the Verlet algorithm,
atomic positions and velocities are updated using
the Initial_integrate() function. In this step, the sign and
exponent cases of the position values are counted. Therefore,
no additional iteration over the data is required for
counting, and the associated latency overhead is negligible.
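The top1/top2 bookkeeping described above can be modeled in a few lines of Python (a software sketch for clarity; in hardware the counts live in the case-count SRAM, and all names here are illustrative):

```python
def count_cases(sign_exp_stream, num_top=2):
    """Model of the case-count step: tally each (sign, exponent)
    pattern and track the two most frequent entries (top1/top2)."""
    counts = {}   # models the case-count SRAM: pattern -> count
    top = []      # [(pattern, count)], sorted descending, len <= num_top
    for pattern in sign_exp_stream:
        counts[pattern] = counts.get(pattern, 0) + 1
        # Re-rank: promote this pattern into top1/top2 if it qualifies.
        top = [(p, c) for p, c in top if p != pattern]
        top.append((pattern, counts[pattern]))
        top.sort(key=lambda e: e[1], reverse=True)
        top = top[:num_top]
    return [p for p, _ in top]   # [M1, M2]
```

For example, `count_cases([3]*5 + [2]*3 + [63])` returns `[3, 2]`: pattern 3 becomes M1 and pattern 2 becomes M2.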
Fig. 3. Detailed illustration of the proposed compression approach. The compressed floating-point format is produced in two stages: case count and compressor.
3. Compressor
When atom data in a conventional FP format is input,
the sign and exponent of the input are evaluated to determine
whether they correspond to one of the two cases
identified in the previous case counting procedure. If the
input belongs to a non-frequent case, the data is sent to another
board in the vanilla FP format without compression.
However, if the input matches the most frequent case, it is
compressed into a 1-bit representation. As shown in Fig.
3, the decimal values in the conventional FP format and
the proposed most frequently used case (MFC) format remain
unchanged. This indicates that compression does not
affect the accuracy of the data.
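A minimal software model of the compressor and its inverse, assuming FP16 words are held as 16-bit integers and that a 1-bit compression map marks which words were compressed (the function names and the tuple-based payload are illustrative, not the paper's implementation):

```python
def compress(words, m1, m2):
    """2-pattern compressor model for FP16 words (ints 0..65535).
    Returns (comp_map, payload): comp_map has one flag bit per word;
    compressed words carry 1 case bit plus the 10-bit mantissa."""
    comp_map, payload = [], []
    for w in words:
        se = w >> 10        # 1-bit sign + 5-bit exponent
        mant = w & 0x3FF    # 10-bit mantissa
        if se == m1:
            comp_map.append(1)
            payload.append((1, mant))   # case bit 1 -> M1
        elif se == m2:
            comp_map.append(1)
            payload.append((0, mant))   # case bit 0 -> M2
        else:
            comp_map.append(0)
            payload.append(w)           # uncompressed 16-bit word
    return comp_map, payload

def decompress(comp_map, payload, m1, m2):
    """Inverse of compress(): rebuilds the original FP16 words."""
    out = []
    for flag, item in zip(comp_map, payload):
        if flag:
            case, mant = item
            se = m1 if case == 1 else m2
            out.append((se << 10) | mant)
        else:
            out.append(item)
    return out
```

The round trip is exact, reflecting that the MFC encoding is lossless: only the representation of the sign/exponent fields changes, never their value.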
Fig. 4 illustrates the distribution of sign and exponent
patterns in FP16 communication data when simulating
512K atoms using 2 MPI tasks.
Fig. 4(a) shows the pattern distribution of the 1-bit sign
and 5-bit exponent during forward communication, while
Fig. 4(b) presents the distribution during reverse communication.
When using two patterns for compression, 70%
of the total communication data can be compressed. As
the number of patterns increases, the portion of data that
can be compressed also increases. However, each doubling of the number of target patterns (2, 4, 8) adds one bit to the compressed representation, so the per-element overhead grows faster than the gain in compressible data. Therefore, we chose to use only two patterns for compression.
To analyze this, we performed a simulation of 512K atoms in FP16 format using 2 MPI tasks. In the baseline approach that transfers all data without compression, 270K
16-bit data points were transmitted. When applying the
2-pattern compression technique, 210K data points were
transmitted as 11-bit compressed data and 60K data points
were transmitted as uncompressed 16-bit data. Based on
the PCIe energy consumption of 7.5 pJ per bit, the proposed compression method reduced the total energy consumption from 32.4 µJ to 24.5 µJ. As a result, we confirmed that our method achieved a 23.6% reduction in energy consumption compared to the baseline approach. Under the same simulation conditions, compression using 4 patterns results in the transmission of 250K data points as 12-bit data and 20K data points as uncompressed 16-bit data. In this case, the total energy consumption is 24.9 µJ. Consequently, the 4-pattern method achieves a 22.6% reduction in energy consumption compared to the baseline.
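As a consistency check, the reported totals follow from a straightforward bit count when the per-bit PCIe energy is taken as 7.5 pJ (which yields totals in microjoules):

```python
PJ_PER_BIT = 7.5e-12   # assumed PCIe energy per transferred bit

baseline = 270_000 * 16 * PJ_PER_BIT                  # all words at 16 bits
two_pat  = (210_000 * 11 + 60_000 * 16) * PJ_PER_BIT  # 2-pattern compression
four_pat = (250_000 * 12 + 20_000 * 16) * PJ_PER_BIT  # 4-pattern compression

print(f"baseline : {baseline * 1e6:.1f} uJ")   # 32.4 uJ
print(f"2-pattern: {two_pat * 1e6:.1f} uJ")    # 24.5 uJ
print(f"4-pattern: {four_pat * 1e6:.1f} uJ")   # 24.9 uJ
```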
Fig. 4. Distribution of sign and exponent patterns in FP16 communication data during the simulation of 512K atoms with 2 MPI tasks. (a) Forward communication. (b) Reverse communication.
4. Verlet Algorithm with Compression
Algorithm 1 presents the pseudo-code for the Verlet algorithm
in the proposed method. In each timestep, the Initial_
integrate() function updates the position and velocity
of the atoms. The count value for the frequent sign
and exponent case of the updated position (Updated PSE
case) is read. The read value is incremented by 1 and
written back; this process is repeated over the entire set of communicated atoms (EA). After the counting is completed, the position (P) is input for communication.
If the PSE case matches the most frequently used case
(M1), the PSE is compressed into 1-bit as 1 and concatenated
with the mantissa of the position (PM). If the
PSE case matches the second most frequently used case
(M2), the PSE is also compressed into 1-bit as 0 and concatenated
with PM in the same manner. If the PSE does
not match M1 or M2, no compression is performed. Once
all compression is completed, off-chip communication occurs
with another processor to exchange the position (P)
values of ghost atoms within the cut-off distance using the
Forward_comm() function.
In the Force_compute() step, forces (F) for all atoms
are computed based on $f_R(r_{ij})$ and $f_A(r_{ij})$, as described
in Section II. The force (F) is processed in the same manner
as the previous steps, with counting and compression
repeated for EA. The compressed force data is then
exchanged with another processor through off-chip communication
using the Reverse_comm() function. Finally,
the Final_integrate() function updates the velocity of the
atoms. This procedure is repeated for timestep iteration
(N).
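The control flow of one timestep can be summarized as an ordered stage list (stage names follow Algorithm 1; this is a reading aid, and the stage bodies are performed in hardware, not modeled here):

```python
def timestep_stages():
    """One timestep of the compressed Verlet loop, as an ordered
    stage list. Names mirror the Algorithm 1 function names."""
    return [
        "Initial_integrate",  # update P and V; count updated PSE cases
        "compress(P)",        # MFC-encode positions against M1/M2
        "Forward_comm",       # exchange ghost-atom positions off-chip
        "Force_compute",      # evaluate f_R(r_ij), f_A(r_ij) pair terms
        "compress(F)",        # count and MFC-encode the force data
        "Reverse_comm",       # exchange ghost-atom forces off-chip
        "Final_integrate",    # update V from the exchanged forces
    ]
```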
To clearly distinguish between compressed data and uncompressed
FP16 values during decoding, we employ a
compression map. This map is implemented as a 1-bit
stream, introducing an overhead of 6.25% over the original
FP16 data. As shown in Fig. 4(a), if 71.3% of the total
elements are compressed, the average number of bits per
element is reduced to 13.4 bits. Therefore, the proposed
compression method reduces the communication volume
by 16% and the overhead introduced by the compression
map becomes negligible.
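The 13.4-bit figure follows directly from the quoted fractions (1 case bit plus 10 mantissa bits for compressed elements, plus 1 map bit for every element):

```python
# 71.3% of elements compress to 11 bits (1 case bit + 10 mantissa
# bits); the remaining 28.7% stay at 16 bits; every element also
# carries 1 compression-map bit.
compressed_frac = 0.713
avg_bits = compressed_frac * 11 + (1 - compressed_frac) * 16 + 1
print(round(avg_bits, 2))           # ~13.4 bits per element
print(round(1 - avg_bits / 16, 3))  # ~0.16, i.e., a 16% volume reduction
```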
5. Update Unit
The position update unit performs a function equivalent
to Initial_integrate() in the Verlet algorithm. In the position
update unit, Newton’s second law F = ma is used to update the velocity V and position P based on the input
force F and velocity V. Two parameters are used in this process: dtfm and dtv. The dtfm parameter scales the force by the timestep divided by the mass, so that dtfm × F gives the change in velocity V (acceleration integrated over time). The dtv parameter is the timestep applied to the velocity, so that dtv × V gives the change in position P. The input force F, multiplied by the dtfm parameter, is added to the velocity V, and the updated velocity V, multiplied by the dtv parameter, is added to the position P. This update is performed for all atoms,
and once communication and the force computation are
completed, the velocity update unit is triggered.
The velocity update unit performs a function equivalent
to Final_integrate() in the Verlet algorithm. In this unit,
the velocity V is updated based on the input force F. The input force F, multiplied by the dtfm parameter, is added to the velocity V. Similar to the position update unit, updates
are performed for all atoms.
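Per coordinate, the two update units reduce to the following sketch (a scalar Python model assuming dtfm and dtv are precomputed constants; the function names mirror the Verlet routines but the code is illustrative):

```python
def initial_integrate(x, v, f, dtv, dtfm):
    """Position update unit: one half-step per atom coordinate.
    dtfm plays the role of dt/m applied to the force, and dtv is
    the timestep applied to the velocity."""
    v = v + dtfm * f   # V += F * dtfm  (acceleration integrated over time)
    x = x + dtv * v    # P += V * dtv   (velocity integrated over time)
    return x, v

def final_integrate(v, f, dtfm):
    """Velocity update unit: second half-step after force compute."""
    return v + dtfm * f   # V += F * dtfm
```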
6. Compute Unit
The Compute Unit contains four calculation units:
the repulsive unit, zeta unit, force_zeta unit, and attractive
unit, each performing functions equivalent to the
pair_tersoff() in the Verlet algorithm. The Compute Unit
is responsible for calculating forces for both two-atom and
three-atom interactions. The unit iterates over the neighbor
list of all atoms. It receives the positions of atom $i$ and atom $j_1$, computes the distance $r$ between them, and activates the repulsive unit only if the distance is less than the cutoff distance, generating a repulsive force value. This process is repeated for each neighboring atom $j_1$ of atom $i$. At each iteration, the force value from the repulsive interaction is subtracted from the existing force on atom $j_1$.
Once the force generation for neighboring atom $j_1$ is complete, the positions of atom $j_2$ and atom $k_1$ are received. The distance between atom $i$ and atom $j_2$ is calculated, and if it is less than the cutoff distance, the distance between atom $i$ and atom $k_1$ is also computed. If this distance is also below the cutoff distance, the zeta unit is triggered, accumulating the zeta value for the $i$-$j$ pair. Zeta is a parameter used to adjust the bond order.
After repeating the zeta unit process, the force_zeta unit
is executed to generate force values. Finally, the positions
of a new atom $k_2$ are received, and the attractive unit is
activated if the distance between atoms is below the cutoff
distance.
During the repetitions for $k_2$ iterations, the forces from
the attractive unit for atoms i, $j_2$, and $k_2$ are accumulated.
After the loop for $k_2$ ends, the forces for atom $j_2$ are accumulated
during the repetitions for $j_2$ iterations. Finally,
once the loop for $j_2$ is complete, the forces for atom i are
accumulated during the repetitions for i iterations. The accumulated
or subtracted forces up to this point are written
back to the global buffer at the end of each iteration.
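The nested-loop structure above can be sketched as a simplified 1D software model (the Tersoff functional forms are replaced by caller-supplied placeholders, and the three-body force accumulation is reduced to the $i$ and $j$ atoms; everything here is illustrative, not the hardware dataflow):

```python
def tersoff_like_pass(pos, neigh, cutoff, repulsive, zeta_term, attractive):
    """Loop structure of the Compute Unit over 1D positions.
    `repulsive`, `zeta_term`, and `attractive` stand in for the
    hardware repulsive/zeta/force_zeta/attractive units."""
    forces = [0.0] * len(pos)
    for i in range(len(pos)):
        # Two-body pass: repulsive term for every in-range neighbor j.
        for j in neigh[i]:
            r = abs(pos[i] - pos[j])
            if r < cutoff:
                fr = repulsive(r)
                forces[i] += fr
                forces[j] -= fr   # repulsive force subtracted for atom j
        # Three-body pass: accumulate zeta over k, then the i-j force.
        for j in neigh[i]:
            rij = abs(pos[i] - pos[j])
            if rij >= cutoff:
                continue
            z = 0.0
            for k in neigh[i]:
                if k == j:
                    continue
                rik = abs(pos[i] - pos[k])
                if rik < cutoff:
                    z += zeta_term(rij, rik)   # bond-order accumulation
            fa = attractive(rij, z)
            forces[i] += fa
            forces[j] -= fa
    return forces
```

With placeholder force laws on a 3-atom chain, the accumulated forces sum to zero, as Newton's third law requires of the pairwise add/subtract pattern.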
7. Hardware Design
Fig. 5 illustrates the operation of the main components of the CMD architecture and the top-level structure with MPI-based parallelization. The top-level architecture consists of multiple boards (K) connected by a data bus (PCIe). In this work, we assume that each MPI process is dedicated to one board. Each board is composed of DRAM
and a CMD chip.
The CMD chip includes an integrate unit, a compute unit, a communication unit, nine 128 KB global buffers, and a global controller. The nine global buffers consist of 3 position buffers (PB), 3 velocity buffers (VB), and 3 force buffers (FB); within each group, the 3 buffers correspond to the x, y, and z dimensions, respectively.
These global buffers distribute the required data to
the integrate unit, compute unit, and communication unit.
The integrate unit comprises the velocity update unit and the position update unit. The compute unit contains the repulsive, attractive, force_zeta, and zeta units. Additionally, the CMD architecture includes a case_counter and 8 KB of case_count SRAM.
The communication unit consists of a compression unit,
a decompression unit, and MPI_receive and MPI_send
units used for off-chip communication. The position and
velocity information are received from the global buffer
and used to update the system. As mentioned in Section
III-2 and III-4, the counting process for the cases is conducted.
The case counter then stores the sign and exponent
information along with the count value in the case count
SRAM. As explained in Section III-3, the most frequently
used case is sent to the compression unit, and the communicated
atom data is received from the global buffer for
compression. The compressed data is sent to another processor
via the MPI_send unit, and the atom information from the other processor is received through the MPI_receive unit.
The information received is then decompressed and sent
to the global buffer.
The force compute unit receives input from the global
buffer, performs force calculations, and the resulting force
data is counted for force cases through the case_counter.
Similar to the position compression, the most frequent
case of force is sent to the compression unit, and the
communication force data from the global buffer is compressed.
The off-chip communication of the force data
with another processor is performed via the MPI_send and MPI_receive units, and the processed data is stored in the
global buffer. The force and velocity information stored
in the global buffer are transmitted to the velocity update
unit, where the received force and velocity are used to update
the velocity. The updated velocity is then written back
to the global buffer.
Fig. 5. Top-level architecture of our proposed CMD approach. The number of CMD chips/boards can be changed depending on the target MPI parallelism.
The energy consumption comparison between the baseline and the proposed architecture across various conditions.