I. INTRODUCTION
               Nowadays, cryptographic primitives are ubiquitously used in hardware and software
                  systems and continuously evolve through community efforts and standardization competitions.
                  The National Institute of Standards and Technology (NIST) has initiated the standardization
                  of Post-Quantum Cryptography (PQC) [1].
               
               Hardware-software (HW/SW) co-design is a System-on-Chip (SoC) methodology involving
                  both software design on microprocessors such as ARM or RISC-V, and hardware design
                  on Field-Programmable Gate Arrays (FPGAs) [2-4] or Application-Specific Integrated Circuits (ASICs) [5]. HW/SW co-design leverages the advantages of both hardware and software platforms.
                  Specifically, parallel and pipelined architectures can be utilized to accelerate the
                  critical components in hardware, while the non-critical parts can be implemented in
                  software with short development time. Therefore, HW/SW co-design offers a shorter
                  time-to-market than pure hardware designs [6,7] and achieves better performance than pure software implementations [8].
               
               In addition, the limited hardware resources of FPGAs make HW/SW co-design an effective
                  approach for efficient system implementations and a timely emergency solution for
                  hardware deployment. Systems that do not fit entirely within the desired FPGA can
                  be implemented as HW/SW co-designs with reduced cost.
               
Some existing symmetric encryption schemes (especially AES-GCM) and hash functions (particularly SHA-2 and SHA-3) are still considered secure in post-quantum scenarios, although their effective security level may be reduced [9,10]. Therefore, the development of public-key cryptographic systems that combine traditional cryptographic algorithms with PQC is becoming increasingly important. Moreover, the size of encryption keys plays a critical role in post-quantum security, as longer keys resist factorization and search algorithms more effectively [11].
               
               High-Level Synthesis (HLS) is a design tool that accepts an algorithm written in C/C++
                  and generates a corresponding FPGA implementation. HLS offers a faster path to FPGA
                  realization compared to traditional VHDL/Verilog-based approaches. It enables cost-effective
                  exploration of the hardware design space and allows IP packages to be reused in other
                  projects without re-synthesis or reconfiguration. As such, HLS is a useful tool in
                  developing hardware accelerators. HLS has been widely used to implement NIST PQC competition
                  candidates, including lattice-based KEMs [11], the Classic McEliece code-based KEM [12], and comprehensive implementations of both lattice-based KEMs and digital signature
                  schemes [13]. HW/SW co-design approaches utilizing HLS for hardware accelerator design have also
                  been successfully applied to Classic McEliece [12] and BIKE [14].
               
               In 2024, a system combining three algorithms—classical, quantum, and PQC—was implemented
                  on an FPGA. This system integrated a pre-quantum Key Exchange scheme (KEX), a post-quantum
                  Key Encapsulation Mechanism (KEM), and a Quantum Key Distribution (QKD) algorithm
                  [15].
               
ML-KEM is the key encapsulation mechanism that emerged from NIST's PQC standardization process [1]. It is constructed from the Module Learning With Errors (MLWE) problem, whose hardness is related to that of finding short vectors in lattices. More details can be found in the official technical documentation [1].
               
               The main contributions of this paper can be summarized as follows:
    • We propose a HW/SW co-design architecture for a Post-Quantum Cryptography (PQC) system and implement it with the HLS tool on the Xilinx Zynq UltraScale+ ZCU104 FPGA platform, enabling fast data encryption and decryption. The proposed design supports efficient hardware resource reuse and adapts easily to various data types (text, images, and video) for IoT applications.
               
    • HLS source code (written in C++) for AES-GCM-256 and ML-KEM is developed for simulation, synthesis, and IP block generation. The ML-KEM (formerly CRYSTALS-Kyber) IP core supports post-quantum key encapsulation operations, including key generation, encapsulation, and decapsulation, for all ML-KEM parameter sets. This approach reduces pipeline-induced redundancy, resulting in high throughput and lower resource usage on the Xilinx Zynq UltraScale+ FPGA. The proposed system also presents an optimized hardware architecture for AES-GCM operations and the ML-KEM scheme, whose parameter sets and sizes of the encapsulation key (ek), decapsulation key (dk), and ciphertext (ct) are listed in Table 1.
               
                   • Finally, we evaluate the proposed architecture on the UltraScale+ ZCU104 FPGA
                  platform and compare it with state-of-the-art works. Experimental results show that
                  the proposed design achieves comparable throughput while using fewer hardware resources
                  than existing studies.
               
               The remainder of this paper is organized as follows. Section II introduces the background
                  and related work. Section III presents the system design and the HLS-based HW/SW co-design
                  implementation of our PQC system. Section IV provides experimental results and a comparative
                  analysis of performance to evaluate the effectiveness of our approach. Section V concludes
                  the paper with a summary of our proposed PQC system.
               
               
                     
                     
Table 1. Parameter set for ML-KEM.
                   
             
            
                  II. BACKGROUND AND RELATED WORK
               
                     1. Module Lattice-based Key Encapsulation Mechanisms
NIST has been running a public project to standardize quantum-safe algorithms, including key encapsulation mechanisms and digital signatures. At the end of Round 3, NIST selected CRYSTALS-Kyber as the first Key Encapsulation Mechanism (KEM) for standardization [1]. In 2024, CRYSTALS-Kyber was formally standardized as the Module-Lattice-based Key Encapsulation Mechanism (ML-KEM) [1].
                  
                  
                        1.1 Polynomial multiplication using NTT
ML-KEM is derived from the Round 3 version of Kyber [7]. Polynomial multiplication over $\mathbb{Z}_{3329}[x]/(x^{256} + 1)$ is a fundamental operation in ML-KEM. Because $n \mid (q - 1)$ holds for $n = 256$ and $q = 3329$, ML-KEM employs an incomplete Number Theoretic Transform (NTT) to accelerate this operation. The NTT is a special form of the Discrete Fourier Transform (DFT) over finite fields. It is commonly employed to reduce the computational complexity of polynomial multiplication [16].
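The divisibility condition behind the incomplete NTT can be checked directly. The short Python sketch below uses $\zeta = 17$, the primitive 256-th root of unity that FIPS 203 specifies for $q = 3329$; the variable names are illustrative only.

```python
# Why ML-KEM uses an "incomplete" NTT: n divides q - 1, but 2n does not.
Q = 3329   # ML-KEM modulus
N = 256    # polynomial degree
ZETA = 17  # primitive 256-th root of unity mod Q (value given in FIPS 203)

has_nth_root = (Q - 1) % N == 0          # a primitive 256-th root exists
has_2nth_root = (Q - 1) % (2 * N) == 0   # a primitive 512-th root would be needed for a full NTT

# ZETA has order exactly 256: ZETA^128 = -1 and ZETA^256 = 1 (mod Q).
order_ok = pow(ZETA, N // 2, Q) == Q - 1 and pow(ZETA, N, Q) == 1
```

Because no primitive 512-th root of unity exists modulo 3329, $x^{256} + 1$ splits only into 128 quadratic factors, so the transform stops one layer early and the point-wise step operates on pairs of coefficients.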
                     
Polynomial multiplication involves convolving the coefficients of two polynomials, which can be computationally intensive, especially for larger parameters. The NTT leverages the algebraic structure of the underlying finite field to enable more efficient polynomial multiplication. In ML-KEM, the results of the multiplication are reduced modulo $q$ using Montgomery and Barrett reductions [18].
                     
In mathematics, general polynomials $a(x)$ and $b(x)$ in $R_q = \mathbb{Z}_q[x]/(x^N + 1)$ can be represented as follows:

$$a(x) = \sum_{i=0}^{N-1} a_i x^i \tag{1}$$

$$b(x) = \sum_{i=0}^{N-1} b_i x^i \tag{2}$$
                     
                     
                     
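We can express the result of the multiplication of $a(x)$ and $b(x)$ as $c(x)$, as follows:

$$c(x) = a(x) \cdot b(x) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_i b_j x^{i+j} \bmod (x^N + 1) \tag{3}$$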
                     
                     
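The traditional schoolbook method for polynomial multiplication has a computational complexity of $O(n^2)$, but by utilizing the NTT, this can be reduced to $O(n \log n)$. Eq. (4) defines the NTT, and Eq. (5) defines the INTT.

$$\hat{a}_k = \sum_{i=0}^{n-1} a_i \omega^{ik} \bmod q, \quad k = 0, \ldots, n-1 \tag{4}$$

$$a_i = n^{-1} \sum_{k=0}^{n-1} \hat{a}_k \omega^{-ik} \bmod q, \quad i = 0, \ldots, n-1 \tag{5}$$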
                     
                     
                     
                     where $\omega$ is the twiddle factor (TF) and also the primitive $n$-th root of unity,
                        and $q$ is the modulus.
                     
By utilizing the NTT and INTT, Eq. (3) can be transformed into Eq. (6). In this context, the $\circ$ operator denotes point-wise multiplication.

$$c = \mathrm{INTT}(\mathrm{NTT}(a) \circ \mathrm{NTT}(b)) \tag{6}$$
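As a small illustration of transform-based multiplication, the Python sketch below multiplies two polynomials in $\mathbb{Z}_q[x]/(x^n + 1)$ both by the schoolbook method and through the NTT. It is a toy under stated assumptions, not the ML-KEM transform: the parameters $n = 8$, $q = 17$ and the root $\psi = 3$ are chosen for readability, the transform is evaluated directly in $O(n^2)$ instead of with butterflies, and the inputs are pre-weighted by powers of $\psi$ so that the cyclic convolution produced by the transform becomes the negacyclic one required by $x^n + 1$.

```python
# Toy negacyclic multiplication in Z_q[x]/(x^n + 1) via the NTT.
# Assumed toy parameters (not ML-KEM's): n = 8, q = 17, psi = 3.
N, Q = 8, 17
PSI = 3                 # primitive 2n-th root of unity mod q (3 has order 16 mod 17)
OMEGA = PSI * PSI % Q   # primitive n-th root of unity

def ntt(a, root):
    """Direct O(n^2) evaluation of the transform; a fast version uses butterflies."""
    return [sum(a[i] * pow(root, i * k, Q) for i in range(N)) % Q for k in range(N)]

def intt(a_hat):
    """Inverse transform: same sums with omega^-1, scaled by n^-1."""
    n_inv = pow(N, -1, Q)
    omega_inv = pow(OMEGA, -1, Q)
    return [n_inv * x % Q for x in ntt(a_hat, omega_inv)]

def schoolbook(a, b):
    """Reference O(n^2) product, using x^n = -1 for the wrap-around terms."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q
    return c

def ntt_multiply(a, b):
    # Weight by psi^i so that the cyclic convolution becomes negacyclic.
    aw = [a[i] * pow(PSI, i, Q) % Q for i in range(N)]
    bw = [b[i] * pow(PSI, i, Q) % Q for i in range(N)]
    cw = intt([x * y % Q for x, y in zip(ntt(aw, OMEGA), ntt(bw, OMEGA))])
    psi_inv = pow(PSI, -1, Q)
    return [cw[i] * pow(psi_inv, i, Q) % Q for i in range(N)]
```

Both routes return the same coefficient list for any pair of inputs, which is a convenient self-check when porting such a kernel to HLS.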
                     
                     
                     ML-KEM uses four SHA-3 hash functions: SHA3-256, SHA3-512, SHAKE128, and SHAKE256.
                        For more details, readers can refer to the FIPS 203 draft [1].
                     
                     According to the FIPS 203 document, three primary algorithms are defined as KEM operations:
                        key generation, encapsulation, and decapsulation.
                     
                   
                  
                        1.2 Modular reduction for ML-KEM
During the evolution of Kyber into ML-KEM, the prime modulus $q$ was changed from $7681$ to $3329$, which consequently affects the ciphertext and key sizes. In [18], two efficient modular reduction algorithms are given for this modulus: Montgomery reduction (Algorithm 1), used in the NTT, and Barrett reduction (Algorithm 2), used in the INTT.
                     
                     
                           
                           
Algorithm 1: Montgomery reduction for ML-KEM [18].
                           
                         
                     
                           
                           
Algorithm 2: Barrett Reduction for ML-KEM [18].
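To make the two reductions concrete, the following Python sketch mirrors the constants used in the public CRYSTALS-Kyber reference code for $q = 3329$. This is an assumption on our part: Algorithms 1 and 2 of [18] may differ in detail, and the helper `to_int16` only emulates the 16-bit wrap-around that C or HLS integer types provide for free.

```python
Q = 3329
QINV = -3327  # q^-1 mod 2^16, written as a signed 16-bit value

def to_int16(x: int) -> int:
    """Emulate truncation to a signed 16-bit integer."""
    x &= 0xFFFF
    return x - 0x10000 if x >= 0x8000 else x

def montgomery_reduce(a: int) -> int:
    """Return r with r = a * 2^-16 (mod Q) and |r| < Q, for |a| < Q * 2^15."""
    t = to_int16(a * QINV)
    return (a - t * Q) >> 16   # a - t*Q is an exact multiple of 2^16

def barrett_reduce(a: int) -> int:
    """Return the centered representative of a mod Q for a 16-bit signed input."""
    v = ((1 << 26) + Q // 2) // Q   # precomputed approximation of 2^26 / Q
    t = (v * a + (1 << 25)) >> 26   # nearest integer to a / Q
    return a - t * Q
```

Montgomery reduction returns the input scaled by $2^{-16}$, so operands such as twiddle factors are usually stored pre-multiplied by $2^{16} \bmod q$; Barrett reduction needs no such scaling.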
                           
                         
                   
                  
                        1.3 Key generation (Key gen)
                     Fig. 1 illustrates the public key $ek$ and private key $dk$ generated by a probabilistic
                        key generation algorithm. The encryption key ($ek$) is generated using the matrix
                        $A$, the value $\hat{s}$, and the value $\hat{e}$, following the equation in [1]:
                     
                     
                           
                           
Fig. 1. A general view of a system using ML-KEM and AES-GCM for real-time video encryption
                              and decryption. The Key Generation (Key Gen) algorithm in ML-KEM generates a key pair
                              ($ek$, $dk$), where $ek$ (encapsulation key) and $dk$ (decapsulation key) are used
                              in the Encapsulation and Decapsulation algorithms, respectively.
                           
                         
                     
where $\hat{A} \in R_{q}^{k \times k}$ is the public matrix, $\hat{s} \in R_{q}^{k}$ is the secret vector, and $\hat{e} \in R_{q}^{k}$ is the noise vector. Both $\hat{s}$ and $\hat{e}$ are derived from short polynomials whose coefficients are sampled from the centered binomial distribution $\beta_{\eta}$. The encoder sub-module is responsible for serializing a polynomial into a byte array.
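For reference, in the notation of FIPS 203 [1] the key-generation relation described above can be written as follows. The symbols $\hat{t}$ (the NTT-domain public vector), $\rho$ (the 32-byte seed from which $\hat{A}$ is expanded), and $\mathrm{ByteEncode}_{12}$ are taken from the standard and are assumptions with respect to this paper's own notation.

$$\hat{t} = \hat{A} \circ \hat{s} + \hat{e}, \qquad ek = \mathrm{ByteEncode}_{12}(\hat{t}) \,\|\, \rho$$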
                     
                     The encapsulation key is simply the encryption key.
                     
Then, the transformed value $\hat{s}$ is used to generate the decryption key $dk$, as shown in Eq. (9).
                     
                     
                     The decapsulation key comprises the decryption key, the encapsulation key, a hash
                        of the encapsulation key, and a random 32-byte value $z$.
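The key composition above determines the byte sizes of $ek$, $dk$, and $ct$. As a minimal sketch, the Python snippet below derives them from the parameters $(k, d_u, d_v)$ of the three FIPS 203 parameter sets; the dictionary and function names are our own, and the computation assumes 12-bit coefficient encoding and 32-byte seeds and hashes as in the standard.

```python
# Byte sizes of ML-KEM keys and ciphertexts derived from (k, du, dv), per FIPS 203.
PARAMS = {
    "ML-KEM-512":  {"k": 2, "du": 10, "dv": 4},
    "ML-KEM-768":  {"k": 3, "du": 10, "dv": 4},
    "ML-KEM-1024": {"k": 4, "du": 11, "dv": 5},
}

def sizes(k: int, du: int, dv: int) -> dict:
    poly_bytes = 256 * 12 // 8       # one polynomial with 12-bit coefficients = 384 bytes
    ek = k * poly_bytes + 32         # encoded public vector plus the 32-byte matrix seed
    dk_pke = k * poly_bytes          # encoded secret vector (the decryption key)
    dk = dk_pke + ek + 32 + 32       # decryption key || ek || hash of ek || z
    ct = 256 * (du * k + dv) // 8    # compressed u (k polynomials) and v (one polynomial)
    return {"ek": ek, "dk": dk, "ct": ct}

table = {name: sizes(**p) for name, p in PARAMS.items()}
```

For ML-KEM-768, for instance, this yields an 1184-byte encapsulation key, a 2400-byte decapsulation key, and a 1088-byte ciphertext.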
                     
                     
                   
                  
                        1.4 Encapsulation (Encaps)
                     A probabilistic encapsulation algorithm takes as input an encapsulation key $ek$ and
                        outputs a ciphertext $ct$ and a shared secret key $ss$.
                     
                     First, random bytes $m$ and the encapsulation key $ek$ are used in a hash sampler
                        sub-module to generate the shared secret key $ss$ and a sub-random value $r$.
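The hash step above can be sketched in software with Python's standard `hashlib`. Following FIPS 203, $H$ is SHA3-256 and $G$ is SHA3-512 with its output split into two 32-byte halves. The function names and the random placeholder used for $ek$ are assumptions for illustration; a real $ek$ would come from key generation.

```python
import hashlib
import os

def H(data: bytes) -> bytes:
    """H in FIPS 203: SHA3-256, 32-byte output."""
    return hashlib.sha3_256(data).digest()

def G(data: bytes):
    """G in FIPS 203: SHA3-512, split into two 32-byte halves."""
    digest = hashlib.sha3_512(data).digest()
    return digest[:32], digest[32:]

def derive_ss_and_r(m: bytes, ek: bytes):
    """(ss, r) = G(m || H(ek)): shared secret and encryption randomness."""
    return G(m + H(ek))

# Placeholder inputs: random bytes stand in for a real encapsulation key.
ek_placeholder = os.urandom(800)   # 800 bytes is the ML-KEM-512 ek length
m = os.urandom(32)
ss, r = derive_ss_and_r(m, ek_placeholder)
```

Hashing $ek$ into the derivation binds the shared secret to the specific public key that was used.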
                     
                     
                     The decoder sub-module converts the byte arrays $ek$ and $m$ into polynomial representations
                        and transforms them into the required format. Then, using the converted $ek$, the
                        matrix $A^T$ is generated via the hash sampler and rejection sampler sub-modules.
                        In this step, the internal padding order is applied in the reverse direction compared
                        to the key generation process.
                     
                     Next, the encryption randomness $r$ is used to generate noise values $y$, $e_1$ and
                        $e_2$ using the hash sampler and binomial sampler sub-modules, as specified in Algorithm
                        16 of the FIPS-203 standard document [1].
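A software model of the hash sampler and binomial sampler pair can look like the sketch below. It follows the centered binomial sampling procedure of FIPS 203, with SHAKE256 as the pseudorandom function; the function names, the all-zero seed, and the list-based bit handling are illustrative choices and say nothing about how the hardware sub-modules are organized.

```python
import hashlib

Q = 3329

def prf(eta: int, seed: bytes, nonce: int) -> bytes:
    """PRF_eta(s, b): SHAKE256 of s || b, truncated to 64 * eta bytes."""
    return hashlib.shake_256(seed + bytes([nonce])).digest(64 * eta)

def bytes_to_bits(data: bytes) -> list:
    """Expand bytes into bits, least significant bit of each byte first."""
    return [(byte >> j) & 1 for byte in data for j in range(8)]

def sample_poly_cbd(eta: int, data: bytes) -> list:
    """Return 256 coefficients mod Q, each a centered binomial value in [-eta, eta]."""
    assert len(data) == 64 * eta
    bits = bytes_to_bits(data)
    coeffs = []
    for i in range(256):
        x = sum(bits[2 * i * eta + j] for j in range(eta))
        y = sum(bits[2 * i * eta + eta + j] for j in range(eta))
        coeffs.append((x - y) % Q)
    return coeffs

r = bytes(32)                            # all-zero seed, for illustration only
y0 = sample_poly_cbd(2, prf(2, r, 0))    # first noise polynomial for eta = 2
```

Each coefficient consumes $2\eta$ bits, so the sampler maps naturally onto a small adder tree fed by the hash output stream.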
                     
                     The noise vector $y$ is transformed into the NTT domain as $\hat{y}$ through the NTT
                        sub-module. This prepares the noise vector in the domain required for the encryption
                        process.
                     
                     Using the generated values, the intermediate ciphertext components $u$ and $v$ are
                        computed as follows in Eqs. (12) and (13):
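In the notation of FIPS 203 [1], the two ciphertext components referred to above take the following form. Here $\hat{t}$ denotes the NTT-domain vector decoded from $ek$ and $\mu$ denotes the polynomial obtained by decoding and decompressing the message $m$; both symbols follow the standard and are assumptions with respect to this paper's notation.

$$u = \mathrm{NTT}^{-1}(\hat{A}^{T} \circ \hat{y}) + e_1$$

$$v = \mathrm{NTT}^{-1}(\hat{t}^{T} \circ \hat{y}) + e_2 + \mu$$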
                     
                     
                     
                     The values $u$ and $v$, which comprise the ciphertext, are then compressed into $c_1$
                        and $c_2$ through a compression and encoding process using the corresponding compress
                        and encode sub-modules.
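The compression step is lossy, and its error bound is what allows decapsulation to recover the message. The Python sketch below implements the $\mathrm{Compress}_d$ and $\mathrm{Decompress}_d$ maps of FIPS 203 in integer arithmetic and measures the worst-case round-trip error; the helper `max_roundtrip_error` is our own addition for illustration.

```python
Q = 3329

def compress(x: int, d: int) -> int:
    """Compress_d(x) = round(2^d * x / Q) mod 2^d, in integer arithmetic."""
    return (((x << (d + 1)) + Q) // (2 * Q)) % (1 << d)

def decompress(y: int, d: int) -> int:
    """Decompress_d(y) = round(Q * y / 2^d), rounding halves up."""
    return (Q * y + (1 << (d - 1))) >> d

def max_roundtrip_error(d: int) -> int:
    """Largest distance (on the circle Z_Q) between x and Decompress(Compress(x))."""
    worst = 0
    for x in range(Q):
        diff = (decompress(compress(x, d), d) - x) % Q
        worst = max(worst, min(diff, Q - diff))
    return worst
```

The error never exceeds $q / 2^{d+1}$ rounded to the nearest integer, for example 2 for $d = 10$ and 104 for $d = 4$.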
                     
                   
                  
                        1.5 Decapsulation (Decaps)
                     A decapsulation process takes as input a decapsulation key $dk$ and a ciphertext $ct$,
                        and performs decryption. The ciphertext $ct$ is separated into $u$ and $v$ through
                        the Decode and Decompress sub-module processes. Additionally, the decapsulation key
                        $dk$ is restored as $\hat{s}$ using the Decode sub-module, as described in Algorithm
                        17 in FIPS 203 [1].
                     
The restored $u$ value is transformed into the NTT domain through the NTT module, followed by point-wise multiplication with $\hat{s}$. The resulting value is then converted back to the base domain using the $NTT^{-1}$ sub-module. Subsequently, an operation similar to Eq. (15) is performed to obtain the message $m'$ [1].
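In FIPS 203 notation [1], the recovery of $m'$ described above can be summarized as follows, where $u$ and $v$ are the decompressed ciphertext components and $w$ is an intermediate polynomial. The symbols $w$, $\mathrm{ByteEncode}_1$, and $\mathrm{Compress}_1$ come from the standard and are assumptions with respect to this paper's notation.

$$w = v - \mathrm{NTT}^{-1}(\hat{s}^{T} \circ \mathrm{NTT}(u)), \qquad m' = \mathrm{ByteEncode}_1(\mathrm{Compress}_1(w))$$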
                     
                     
                     
                     The decapsulation algorithm then computes a candidate shared secret key $ss'$ using
                        $m'$ and a part of $dk$, following the same procedure used in the encapsulation process.
                     
                     
We briefly summarize the CRYSTALS-Kyber public-key cryptosystem. CRYSTALS-Kyber is a PQC KEM published in [17] and has been standardized by NIST under the name Module-Lattice-Based Key Encapsulation Mechanism (ML-KEM), as mentioned above. ML-KEM enables a sender and receiver to securely establish a shared secret key. ML-KEM consists of three algorithms: Key Generation (KeyGen), Encapsulation (Encaps), and Decapsulation (Decaps).
                     
                     First, a sender executes the KeyGen algorithm to generate an encapsulation key ($ek$)
                        and a decapsulation key ($dk$), and securely distributes $ek$ to the receiver. The
                        receiver then uses this $ek$ in Encaps to generate a shared secret key ($ss$) and
                        ciphertext ($ct$). The receiver sends this $ct$ to the sender. Finally, the sender
                        uses Decaps with its own $dk$ to decrypt $ct$ and obtain the shared secret key $ss$.
                     
Fig. 1 illustrates, from left to right, our case study for video encryption, which combines symmetric and asymmetric encryption techniques for robust data security. In this approach, the asymmetric scheme establishes the shared secret key, held as $ss$ by one party and as $ss'$ by the other, and the data is then protected using symmetric encryption under this key. Since both the sender and receiver possess the shared secret key $ss$, they can use AES-GCM to encrypt or decrypt and exchange data securely.
                     
                     In summary, ML-KEM is similar to Public Key Encryption (PKE) in that one party encrypts
                        a message using the public key $ek$, and the other decrypts it using the private key
                        $dk$.
                     
                   
                
               
                     2. AES-GCM
The Advanced Encryption Standard (AES) specifies a FIPS-approved cryptographic algorithm that can be used to protect digital data. AES is a symmetric block cipher that can encrypt (encipher) and decrypt (decipher) information, and the Galois/Counter Mode (GCM) adds authentication on top of it. Encryption converts data into an unintelligible form called ciphertext, while decryption converts the ciphertext back into its original form, known as plaintext. The AES-GCM algorithm supports cryptographic keys of 128, 192, and 256 bits to encrypt and decrypt data in blocks of 128 bits [23]. For our implementation, we select the highest security level, AES-GCM-256 (hereafter AES-GCM).
                  
                
               
                     3. High-Level Synthesis
                  Vivado HLS tools do not support all C/C++ functions for hardware implementation. For
                     example, dynamic memory allocation, function recursion, system calls, and file input/output
                     (I/O) operations are not supported by HLS tools. For this reason, there are two different
                     design approaches with HLS synthesis tools:
                  
                  First, a previously developed C reference design can be ported to a hardware implementation.
                     In this case, the developer starts with existing C code, and the role of the designer
                     is to modify the code sections that are not synthesizable and then optimize the design
                     by adding optimization directives to meet the design goals. The top-down design flow
                     is more practical in this situation because the design can be accelerated as a single
                     function.
                  
                  Second, the design can be developed from scratch, where the designer writes the code
                     with full awareness that it will be synthesized for hardware implementation. Both
                     top-down and bottom-up design styles are feasible. In the bottom-up approach, the
                     designer can begin by accelerating sub-functions in the application, with the flexibility
                     to later expand the accelerator to include additional functions.
                  
                
               
                     4. Xilinx PYNQ Platform
PYNQ is an open-source Xilinx project that facilitates easy programming and interaction with Zynq FPGA devices through Python [21]. The Zynq UltraScale+ architecture integrates a hard-core ARM Cortex-A53 processing system (PS) with the FPGA programmable logic (PL) on the same device.

By leveraging the Python language and its libraries, developers can write host code in Python to be executed in the PS domain. This allows collaborative design between the ARM processor in the PS domain and the accelerators in the PL domain, and enables the implementation of various embedded systems on a single FPGA. Fig. 2 shows the PYNQ framework that runs on the CPU (ARM Cortex-A53) in the PS domain when the PYNQ boot image is loaded and the Zynq UltraScale+ ZCU104 board is booted in SD-card boot mode.
                  
                  
                        
                        
Fig. 2. Architecture of PYNQ framework.
                      
                  The framework consists of a PYNQ overlay, which is a firmware design incorporating
                     the blocks described in Section III and built using Vivado, and a Jupyter Notebook
                     written entirely in Python. The notebook server runs on an Ubuntu-based Linux kernel
                     in the PS domain and can be used to program and configure the PL, move data through
                     the PL, and visualize the results.
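A host-side notebook cell for such an overlay might resemble the following sketch. It uses the `Overlay`, `allocate`, and DMA channel calls of the PYNQ Python API, but the bitstream path, the DMA instance name, and the helper functions are hypothetical placeholders; the actual overlay of this work may expose different names and may require additional register writes to start the IP.

```python
def pack_to_words(payload: bytes, word_bytes: int = 8) -> list:
    """Zero-pad payload and split it into little-endian words for a 64-bit AXI stream."""
    padded = payload + bytes(-len(payload) % word_bytes)
    return [int.from_bytes(padded[i:i + word_bytes], "little")
            for i in range(0, len(padded), word_bytes)]

def run_accelerator(bitstream: str, dma_name: str, payload: bytes, out_words: int) -> list:
    """Send payload through one DMA-attached IP and return the raw output words.

    Needs a PYNQ board; `bitstream` and `dma_name` are hypothetical placeholders.
    """
    import numpy as np                   # imported here so pack_to_words runs anywhere
    from pynq import Overlay, allocate

    overlay = Overlay(bitstream)         # programs the PL with the overlay
    dma = getattr(overlay, dma_name)     # AXI DMA instance inside the block design

    words = pack_to_words(payload)
    in_buf = allocate(shape=(len(words),), dtype=np.uint64)
    out_buf = allocate(shape=(out_words,), dtype=np.uint64)
    in_buf[:] = np.array(words, dtype=np.uint64)

    dma.sendchannel.transfer(in_buf)     # memory -> stream (towards the IP)
    dma.recvchannel.transfer(out_buf)    # stream -> memory (from the IP)
    dma.sendchannel.wait()
    dma.recvchannel.wait()
    return out_buf.tolist()
```

The packing helper is kept separate from the board-specific part so that it can be exercised on a workstation without the PYNQ runtime.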
                  
                  
                        
                        
Fig. 3. High-level synthesis for ML-KEM and AES-GCM IPs.
                      
                
             
            
                  III. HARDWARE/SOFTWARE CO-DESIGN
               The internal hardware implementations of the ML-KEM (key generation, encapsulation,
                  and decapsulation) processes are illustrated in Fig. 4. These diagrams present the modular dataflow and integration of arithmetic blocks
                  such as NTT, INTT, samplers, and hash logic, all optimized for FPGA implementation.
                  Each module was synthesized using Vivado HLS and designed to optimize the trade-off
                  between throughput and resource utilization.
               
               
                     
                     
Fig. 4. HLS-based ML-KEM block diagrams for (a) key generation, (b) encapsulation,
                        and (c) decapsulation.
                     
                   
As shown in Fig. 3, the design process began with C source-level simulation, where a testbench was developed to validate the functional correctness of the C logic and the AXI interface. As part of the high-level synthesis process, optimization directives such as PIPELINE and UNROLL were explicitly applied, particularly to loop-intensive operations such as the Number Theoretic Transform (NTT) and Inverse NTT (INTT), to exploit parallelism and improve performance. Among the various modules, the NTT and INTT blocks were identified as primary performance bottlenecks, as they are used in all core operations of ML-KEM. To mitigate this issue, we manually applied loop unrolling to the outermost loop of the three-level nested loop structure of the NTT/INTT to increase parallel execution. Additionally, the HASH module was optimized by applying the PIPELINE directive to its internal round function.
               
               
                     
                     
Fig. 5. HLS-based AES-256-GCM hardware block diagram.
                   
               
                     
                     
Fig. 6. Hardware/software co-design flow of the proposed ML-KEM and AES-GCM architecture
                        in Vivado HLS.
                     
                   
               As a result of these optimizations, the overall latency of the ML-KEM operation was
                  reduced by approximately 80,000 clock cycles compared to the initial implementation,
                  resulting in significant improvements in efficiency and execution time.
               
The hardware architecture of AES-256 encryption and decryption is depicted in Fig. 5. The diagrams illustrate the structured dataflow and composition of fundamental cryptographic components, including SubBytes, ShiftRows, MixColumns, and AddRoundKey, all tailored for efficient FPGA realization. The design flow is illustrated in Fig. 3, following the same methodology as that of ML-KEM. The ARRAY_PARTITION directive was applied to two-dimensional arrays such as encrypt_block, GF, state, and expanded key, allowing each array dimension to be mapped to individual registers for parallel processing. Furthermore, the internal KeyExpansion function was optimized through loop unrolling and pipelining of the shiftrow128, xor128, and GF_mult128 operations. As a result of the applied optimizations, the overall latency of the AES-256 module was reduced by approximately 53,000 clock cycles compared to the baseline implementation, leading to substantial improvements in performance and processing efficiency.
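For readers unfamiliar with the Galois-field multiplication at the heart of GCM, the sketch below gives a bit-serial software model of the 128-bit multiply defined in NIST SP 800-38D. It is our own illustration and not the authors' HLS source; the function name merely echoes the operation mentioned above, and blocks are represented as Python integers.

```python
R = 0xE1 << 120  # reduction constant for x^128 + x^7 + x^2 + x + 1 in GCM bit order

def gf_mult128(x: int, y: int) -> int:
    """Multiply two 128-bit blocks in GF(2^128) as defined for GCM.

    Blocks are integers whose most significant bit is the coefficient of x^0.
    """
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:                  # bit i of x, counting from the left
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1     # multiply v by x and reduce
    return z

ONE = 1 << 127  # multiplicative identity in this bit ordering
```

The loop body is a conditional XOR followed by a shift, which is why unrolling and pipelining this operation in hardware is attractive.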
               
               The AXI-Stream interface was also integrated to generate the corresponding RTL code.
                  The synthesized RTL was then verified through C/RTL co-simulation using waveform-based
                  simulation in Vivado to ensure functional equivalence between the C and RTL implementations.
                  The synthesized IP cores were mapped to the Programmable Logic (PL) region of the
                  Zynq UltraScale+ FPGA, finalizing the hardware system implementation.
               
               The top-level hardware/software co-design architecture is shown in Fig. 6. Our proposed system is implemented on the Xilinx ZCU104 FPGA board, which includes
                  two parts: the PS and the PL. The Advanced eXtensible Interface (AXI) standard is
                  used to interconnect the PS and PL. The PYNQ framework provides a software application
                  interface that runs on the ARM processor in the PS, as mentioned above, while the
                  designed hardware accelerators for ML-KEM and AES-GCM run on the reconfigurable logic
                  in the PL.
               
               The camera is used to generate video frames or capture videos with a resolution of
                  640×480. To interface with the user, the PYNQ framework is installed on the ARM Cortex-A53
                  CPU and is used to control the camera to record video data. The capture module converts
                  native video to the AXI stream protocol. This module samples the incoming data at
                  200 MHz and generates appropriate AXI stream signals that integrate easily with other
                  parts of the architecture using the same protocol.
               
               In addition, each accelerator exposes an AXI stream interface that connects to dedicated
                  DMA controllers (DMA0 for ML-KEM and DMA1 for AES-GCM), which provide high-bandwidth
                  access to local memory. The DMA controllers are connected to memory via an AXI interface,
                  while the CPU accesses their initialization, status, and management registers through
                  AXI4-Lite. The details of the values in these registers are shown in Table 2.
               
               
                     
                     
Table 2. HLS AXI-LITE register functions.
                   
               On the PS side, the processor accesses data in the DDR for computation. The processor
                  includes a cache to store temporary data for acceleration. The HPM0 and HPM1 ports
                  are high-performance interfaces that connect to the DDR controller through the AXI
                  interconnect block. They can read and write large volumes of data in memory using
                  the AXI protocol.
               
On the PL side, the DMA serves as the intermediary for data communication with the DDR and is connected to the HP port using the AXI stream protocol [22]. The DMA interacts with the hardware accelerator through input and output FIFOs.
                  The read and write interrupt signals of the DMA pass to the IRQ port through the concat
                  IP. The processor controls the DMA data transfer and passes configured parameters
                  via the AXI HP ports using the AXI-Lite protocol. The AXI interconnect and AXI DMA
                  act as intermediaries between the endpoint IPs and the PS. The AXI DMA controller
                  supports memory copy (memcpy) and memory initialization (memset) functions, both of
                  which can operate on byte, half-word, and word granularity. The AXI stream data transmission
                  in this design uses a 64-bit bus, while the AXI-Lite control signal uses a 32-bit
                  bus.
               
For operation, the ML-KEM block is executed first. After key generation, the shared keys $ss$ and $ss'$ are produced by the Encapsulation and Decapsulation sub-blocks, respectively, as shown in Fig. 6. The shared key $ss$ is then sent through DMA0 to the AES-GCM IP block for encryption, while $ss'$ is used in the decryption algorithm.
               
Each IP module works independently. All modules interface with the same input and output FIFOs using the AXI DMA protocol. The module control logic is implemented as an arbiter designed to transmit control information between the PS and the various acceleration modules. The system's configurability is achieved through control registers, which transmit control signals and design parameters. Four control registers are defined, as shown in Table 2. Register0 uses 3 bits to control the startup of four modules, while the remaining registers are used to convey parameter settings for the various modules.
               
               Fig. 9 shows the basic operation of DMA0, which transfers the generated shared key to the
                  AES-GCM IP block for re-encryption. The AXI stream handshake signals of the ML-KEM
                  IP block—clock ($clk$), last ($kem\_TLAST$), keep ($kem\_TKEEP$), ready ($kem\_TREADY$),
                  valid ($kem\_TVALID$), and data ($kem\_TDATA$)—are required for the IP block interface.
                  Specifically, the $kem\_TLAST$ signal is required to indicate the end of a frame.
                  In general, stream signals primarily facilitate handshake mechanisms. To clarify this,
                  we categorize them into three groups: $kem\_TVALID$, $kem\_TREADY$, and all other
                  signals grouped under $kem\_TDATA$.
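
               The handshake described above can be illustrated with a small cycle-level model. This
                  is only a sketch under stated assumptions, not the implemented RTL: the names Beat,
                  pack_frame, and simulate_stream are hypothetical, the payload is assumed to be packed
                  into 64-bit words with little-endian byte lanes, and the source is assumed to hold
                  TVALID high until the frame ends. A beat is transferred on a clock edge only when
                  TVALID and TREADY are both high, and TLAST marks the final beat of the frame.

```python
from typing import Iterable, List, NamedTuple, Tuple

BUS_BYTES = 8  # 64-bit AXI stream data bus, as stated in the text


class Beat(NamedTuple):
    """One stream transfer; everything except VALID/READY travels with the data."""
    tdata: int   # 64-bit data word
    tkeep: int   # one bit per valid byte lane
    tlast: bool  # True only on the final beat of a frame


def pack_frame(payload: bytes) -> List[Beat]:
    """Split a payload into 64-bit beats (little-endian lanes are an assumption)."""
    beats = []
    for offset in range(0, len(payload), BUS_BYTES):
        chunk = payload[offset:offset + BUS_BYTES]
        tdata = int.from_bytes(chunk, "little")
        tkeep = (1 << len(chunk)) - 1          # mark only the bytes actually present
        tlast = offset + BUS_BYTES >= len(payload)
        beats.append(Beat(tdata, tkeep, tlast))
    return beats


def simulate_stream(beats: List[Beat],
                    ready_pattern: Iterable[bool]) -> Tuple[List[Beat], int]:
    """Return the beats accepted by the sink and the number of clock cycles used.

    ready_pattern gives the sink's TREADY value per cycle; once it is
    exhausted the sink is assumed to stay ready.
    """
    received: List[Beat] = []
    ready_iter = iter(ready_pattern)
    cycles = 0
    index = 0
    while index < len(beats):
        tvalid = True                    # source always has data until the frame ends
        tready = next(ready_iter, True)
        cycles += 1                      # one rising clock edge
        if tvalid and tready:            # handshake: the beat is consumed this cycle
            received.append(beats[index])
            index += 1
    return received, cycles


frame = pack_frame(bytes(range(20)))     # 20 bytes -> three beats
accepted, used_cycles = simulate_stream(frame, [True, False, True, False, False, True])
```

               In this model, a sink that deasserts TREADY only delays the beats; their order and
                  the position of TLAST are unchanged.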
               
               After encryption/decryption, the result is transmitted via DMA1 using the PYNQ framework.
                  The timing signals of the DMA1 protocol are depicted in Figs. 8 and 9.
               
               
                     
                     
Fig. 7. Flow chart for hybrid mode of the proposed HW/SW co-design.
                   
               
                     
                     
Fig. 8. Timing diagram of the signals involved in the encapsulation algorithm, with
                        the shared key from the ML-KEM block and the encrypted data from the AES-GCM block.
                     
                   
               
                     
                     
Fig. 9. Temporal trends of the signals involved in the encapsulation algorithm; encrypted
                        data output occurs on each rising edge of the clock.
                     
                   
               
                     
                     
Fig. 10. ML-KEM based hybrid cryptosystem architecture on FPGA.
                   
             
            
                  IV. EXPERIMENTAL RESULTS
                Table 3 shows a comparison of FPGA resource utilization for the AES-GCM implementation, while
                  Table 4 provides a comparison of hardware resource consumption for the ML-KEM and CRYSTALS-Kyber
                  implementations. Additional results from our case studies, not previously reported,
                  are summarized in Table 5 for the complete combination of the ML-KEM and AES-GCM IP cores.
               
               
                     
                     
Table 3.  Performance comparison for AES-GCM architecture, with latency (Cycle): Lat.(CC),
                        frequency: Freq., and Encryption/Decryption: Enc/Dec.
                     
                   
               
                     
                     
Table 4.  Performance comparison for ML-KEM/CRYSTALS-Kyber architecture, with latency
                        (Cycle): Lat.(CC), frequency: Freq., and Key generation/Encapsulation/Decapsulation:
                        K/E/D.
                     
                   
               
                     
                     
Table 5. Performance comparison for ML-KEM based hybrid cryptosystem architecture,
                        with 320KB file, latency (Cycle): Lat.(CC), frequency: Freq.
                     
                   
               We evaluated our design using Xilinx Vivado HLS on the Zynq UltraScale+ Evaluation
                  Platform (ZCU104 board, xczu7ev-ffvc1156-2-e) with the open-source PYNQ framework
                  [21]. For the ML-KEM hardware, we selected the parameter set specified in the ML-KEM standard
                  with $n = 256$, $Q = 3329$, and $k = 2$. These parameters are summarized in Table 1.
               
               As shown in Table 4, the proposed ML-KEM IP, which supports key generation, encapsulation, and decapsulation,
                  reduces execution time by a factor of approximately $2.84$ compared to the software implementation
                  in [26]. It also achieves higher throughput than the CRYSTALS-Kyber HLS design reported in
                  [27], improving the area-time product (ATP) by approximately 37.1%.
               
               The hardware export file (.bit file) was synthesized in Vivado and loaded onto the
                  FPGA using the PYNQ Overlay class. Input data streams are passed to the AES-GCM block
                  in packets using the Xilinx DMA tool and the PYNQ DMA driver. The resulting data can
                  then be processed and visualized in the Jupyter notebook using standard Python tools.
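
               The transfer step above can be sketched as follows. This is a minimal illustration,
                  not the authors' notebook: the helper name stream_through_dma, the bitstream name,
                  the DMA instance name, and the buffer size in the comments are all hypothetical. The
                  helper relies only on the sendchannel/recvchannel transfer() and wait() methods of
                  the PYNQ DMA driver.

```python
def stream_through_dma(dma, in_buf, out_buf):
    """Send in_buf to the programmable logic and collect the result in out_buf.

    `dma` is expected to behave like a PYNQ DMA driver object, i.e. to expose
    `sendchannel` and `recvchannel`, each with transfer() and wait().
    """
    dma.sendchannel.transfer(in_buf)    # start memory -> stream transfer
    dma.recvchannel.transfer(out_buf)   # start stream -> memory transfer
    dma.sendchannel.wait()              # block until the input has been sent
    dma.recvchannel.wait()              # block until the output has arrived
    return out_buf


# On the board (cannot run off-target); file, IP, and size values are assumptions:
#   import numpy as np
#   from pynq import Overlay, allocate
#   overlay = Overlay("hybrid_kem_aes.bit")            # hypothetical bitstream name
#   dma = overlay.axi_dma_0                            # hypothetical DMA instance name
#   in_buf = allocate(shape=(512,), dtype=np.uint64)   # 64-bit words for the stream bus
#   out_buf = allocate(shape=(512,), dtype=np.uint64)
#   result = stream_through_dma(dma, in_buf, out_buf)
```

               Both channels are started before either wait() call so that the receive side is
                  already armed while the input packet is still being sent.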
               
               Throughput is defined as the number of bits processed per second across all three operations
                  (key generation, encapsulation, and decapsulation). The hardware latency of ML-KEM is
                  presented in Table 4.
               
               The throughput, $T_p$, achieved by the proposed ML-KEM design can be calculated using
                  the following equation:
               
               
               Here, $T_p$ represents the throughput in kilobits per second (kbps). The throughput
                  reported in [27] is approximately 8.6 kbps, whereas the proposed design achieves a throughput of approximately
                  16.4 kbps, representing an improvement of nearly $1.91\times$.
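
               The stated improvement can be checked directly from the two quoted throughput values.
                  The sketch below also includes a generic cycle-based throughput helper; its form
                  (bits processed divided by latency in cycles over clock frequency) is an assumption
                  based on the definition given above, and the function names are hypothetical.

```python
def throughput_kbps(bits_processed: int, latency_cycles: int, freq_hz: float) -> float:
    """Throughput in kbps for a block that needs latency_cycles at freq_hz.

    Assumed general form: bits / (cycles / frequency), scaled to kilobits.
    """
    if latency_cycles <= 0 or freq_hz <= 0:
        raise ValueError("latency_cycles and freq_hz must be positive")
    seconds = latency_cycles / freq_hz
    return bits_processed / seconds / 1e3


def improvement(proposed: float, baseline: float) -> float:
    """Ratio of the proposed throughput to the baseline throughput."""
    return proposed / baseline


# Both values are the approximate figures quoted in the text.
ratio = improvement(16.4, 8.6)
print(f"throughput ratio = {ratio:.2f}x")  # prints "throughput ratio = 1.91x"
```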
               
             
            
                  V. CONCLUSION
               We implemented, compared, and analyzed the performance of HLS-based optimization and
                  the design results of a proposed post-quantum cryptographic system. Xilinx tools,
                  Vivado HLS (2022.2) and Vivado (2022.2), were utilized, with the Zynq FPGA (ZCU104
                  board) set as the target platform.
               
               As shown in Fig. 6, the HLS design was implemented on the Zynq FPGA. The developed module, enhanced
                  through directive-based optimizations in Vivado HLS, achieved reduced latency compared
                  to both the original NIST reference source code and a previous HLS design. Experimental
                  results confirm improvements in performance and area-time product (ATP). In particular,
                  for 320 KB files, the encryption/decryption time was reduced by a factor of approximately $11.3$
                  compared to the software implementation used in the reference source code. The implemented
                  PQC design is promising and can be applied to a wide range of security-critical applications.
               
             
          
         
            
                  ACKNOWLEDGMENTS
               
                  				This work was supported by an INHA UNIVERSITY Research Grant.
                  			
               
             
            
                  
                     References
                  
                     
                        
                        National Institute of Standards and Technology, ``Module-lattice-based key encapsulation
                           mechanism standard,''  Department of Commerce, Washington, D.C., Federal Information
                           Processing Standards Publication (FIPS), NIST FIPS 203, 2024. [Online] Available:
                           https://doi.org/10.6028/NIST.FIPS.203

 
                     
                        
                        D. T. Nguyen, V. B. Dang and K. Gaj, ``A high-level synthesis approach to the software/hardware
                           codesign of NTT-based post-quantum cryptography algorithms,'' Proc. of International
                           Conference on Field-Programmable Technology (ICFPT), Tianjin, China, pp. 371-374,
                           2019.

 
                     
                        
                        J. P. Smith, J. I. Bailey, J. Tuthill, L. Stefanazzi, G. Cancelo, K. Treptow, and
                           B. A. Mazin, ``A high-throughput oversampled polyphase filter bank using vivado HLS
                           and PYNQ on a RFSoC,'' IEEE Open Journal of Circuits and Systems, vol. 2, pp. 241-252,
                           2021.

 
                     
                        
                        K. Haeublein, W. Brueckner, S. Vaas, S. Rachuj, M. Reichenbach, and D. Fey, ``Utilizing
                           PYNQ for accelerating image processing functions in ADAS applications,'' in Proceedings
                           of the 32nd International Conference on Architecture of Computing Systems (ARCS Workshop
                           2019), Copenhagen, Denmark, pp. 1-8, 2019.

 
                     
                        
                        S. Morioka, T. Isshiki, S. Obana, Y. Nakamura, and K. Sako, ``Flexible architecture
                           optimization and ASIC implementation of group signature algorithm using a customized
                           HLS methodology,'' in Proceedings of the 2011 IEEE International Symposium on Hardware-Oriented
                           Security and Trust (HOST), San Diego, CA, USA, 2011.

 
                     
                        
                        J. Kokila, N. Ramasubramanian, and S. Indrajeet, ``A survey of hardware and software
                           co-design issues for system on chip design,'' Advanced Computing and Communication
                           Technologies, Springer, Singapore, 2016.

 
                     
                        
                        Y. Zhang, Y. Zhao, J. Hu, and W. Zhang, ``AutoAI2C: an automated hardware generator
                           for DNN acceleration on both FPGA and ASIC,'' IEEE Transactions on Computer-Aided
                           Design of Integrated Circuits and Systems, 2024.

 
                     
                        
                        S. Srilakshmi and G. L. Madhumati, ``A comparative analysis of HDL and HLS for developing
                           CNN accelerators,'' in Proceedings of the 2023 Third International Conference on Artificial
                           Intelligence and Smart Energy (ICAIS), Coimbatore, India, pp. 1060-1065, 2023.

 
                     
                        
                        T. Takaki, Y. Li, K. Sakiyama, S. Nashimoto, D. Suzuki, and T. Sugawara, ``An optimized
                           implementation of AES-GCM for FPGA acceleration using high-level synthesis,'' in Proceedings
                           of the 2020 IEEE 9th Global Conference on Consumer Electronics (GCCE), Kobe, Japan, pp. 176-180, 2020.

 
                     
                        
                        H. S. Jacinto, L. Daoud, and N. Rafla, ``High-level synthesis using Vivado HLS for
                           optimizations of SHA-3,'' in Proceedings of the 2017 IEEE 60th International Midwest
                           Symposium on Circuits and Systems (MWSCAS), Boston, MA, USA, pp. 563-566, 2017.

 
                     
                        
                        E. Homsirikamol, K. Gaj, and R. R. L. Pareschi, ``C vs. VHDL: benchmarking CAESAR
                           candidates using high-level synthesis and register-transfer level methodologies,''
                           in Directions in Authenticated Ciphers (DIAC), 2015.

 
                     
                        
                        V. Kostalabros, J. Ribes-González, O. Farràs, M. Moretó, and C. Hernandez, ``HLS-based
                           HW/SW co-design of the post-quantum Classic McEliece cryptosystem,'' in Proceedings
                           of the 2021 31st International Conference on Field-Programmable Logic and Applications
                           (FPL), Dresden, Germany, pp. 52-59, 2021.

 
                     
                        
                        Z. Zhou, D. He, Z. Liu, M. Luo, and K.-K. Choo, ``A software/hardware co-design of
                           CRYSTALS-Dilithium signature scheme,'' ACM Transactions on Reconfigurable Technology
                           and Systems, vol. 14, no. 2, Art. 11, 2021.

 
                     
                        
                        G. Montanaro, A. Galimberti, E. Colizzi, and D. Zoni, ``Hardware-software co-design
                           of BIKE with HLS-generated accelerators,'' Proc. of the 2022 29th IEEE International
                           Conference on Electronics, Circuits and Systems (ICECS), Glasgow, United Kingdom,
                           pp. 1-4, 2022.

 
                     
                        
                        S. Ricci, P. Dobias, L. Malina, J. Hajny, and P. Jedlicka, ``Hybrid keys in practice:
                           combining classical, quantum and post-quantum cryptography,'' IEEE Access, vol. 12,
                           pp. 23206-23219, 2024.

 
                     
                        
                        J. W. Cooley and J. W. Tukey, ``An algorithm for the machine calculation of complex
                           Fourier series,'' Mathematics of Computation, vol. 19, no. 90, pp. 297-301, Jan. 1965.

 
                     
                        
                        National Institute of Standards and Technology (NIST), ``NIST post-quantum cryptography
                           round 1 submissions,'' 2017. [Online]. Available: https://csrc.nist.gov/Projects/PostQuantum-Cryptography/Round-1-Submissions

 
                     
                        
                        P. Nannipieri, S. Di Matteo, L. Zulberti, F. Albicocchi, S. Saponara, and L. Fanucci,
                           ``A RISC-V post-quantum cryptography instruction set extension for number theoretic
                           transform to speed up CRYSTALS algorithms,'' IEEE Access, vol. 9, pp. 150798-150808,
                           2021.

 
                     
                        
                        Xilinx, Inc., UG902: Vivado High-Level Synthesis Guide, Version 2021, Xilinx, Inc.,
                           2021.

 
                     
                        
                        Xilinx, Inc., UG1270: Vivado HLS Optimization Methodology Guide, Version 2017, Xilinx,
                           Inc., 2017. [Online]. Available: https://usermanual.wiki/Document/ug1270vivadohlsoptmethodologyguide.880892326.pdf.

 
                     
                        
                        PYNQ Open-Source Framework, ``SD card image version 2.7,'' Mar. 2024. [Online]. Available:
                           https://www.pynq.io/.

 
                     
                        
                        T. N. Tan, P. Duong-Ngoc, T. X. Pham, and H. Lee, ``Novel performance evaluation approach
                           of AMBA AXI-based SoC design,'' in Proceedings of the 2021 18th International SoC
                           Design Conference (ISOCC), Jeju Island, Republic of Korea, pp. 403-404, 2021.

 
                     
                        
                        E. Karacan, A. Karakaya, and S. Akleylek, ``Quantum secure communication between service
                           provider and SIM,'' IEEE Access, vol. 10, pp. 69135-69146, 2022.

 
                     
                        
                        L. Daoud, F. Hussein, and N. Rafla, ``Optimization of advanced encryption standard
                           (AES) using Vivado high-level synthesis (HLS),'' Proc. of the 34th International Conference
                           on Computers and Their Applications (CATA 2019), vol. 58, pp. 36-44, 2019.

 
                     
                        
                        E. Homsirikamol and K. Gaj, ``Toward a new HLS-based methodology for FPGA benchmarking
                           of candidates in cryptographic competitions: the CAESAR contest case study,'' in Proceedings
                           of the 2017 International Conference on Field Programmable Technology (ICFPT), Melbourne,
                           VIC, Australia, pp. 120-127, 2017.

 
                     
                        
                        National Institute of Standards and Technology (NIST), ``Post-quantum cryptography
                           round 3 submissions,'' last modified Jun. 14, 2021. [Online]. Available: https://csrc.nist.gov/projects/post-quantum-cryptography/round-3-submissions
                           [Accessed: Oct. 18, 2021].

 
                     
                        
                        C.-H. Lee, J. Kim, H.-S. Park, and J.-W. Han, ``HLS-based HW/SW co-design and hybrid
                           HLS-RTL design for post-quantum cryptosystem,'' Journal of Semiconductor Technology
                           and Science, vol. 24, no. 3, pp. 191-198, 2024.

 
                   
                
             
            
            
               			Kyungkyun Kang received the B.S. degree in Information and Communication Engineering
               from Inha University, Incheon, South Korea, in 2024. He is currently pursuing an M.S.
               degree in engineering in the Department of Electrical and Computer Engineering at Inha
               University, Incheon, South Korea. His research interests include system
               on chip design, digital system design, digital integrated circuits, hardware acceleration,
               and post-quantum cryptography.
               		
            
            
            
               			Seulbee Yang received the B.S. degree in Information and Communication Engineering
               from Inha University, Incheon, South Korea, in 2025. She is currently pursuing an
               M.S. degree in engineering in the Department of Electrical and Computer Engineering at
               Inha University, Incheon, South Korea. Her research interests include
               digital system design, post-quantum cryptography, and FPGA-based demonstration of
               designed IP modules.
               		
            
            
            
               			Giang Truong Le received the B.E. degree in Electronics and Telecommunication Engineering
               from Ho Chi Minh City University of Technology, Ho Chi Minh, Vietnam, in 2011 and
               the M.S. degree in engineering from the Department of Electronic Engineering,
               Pukyong National University, Busan, Korea, in 2016. His research interests
               include RFID hardware system design, Internet of Things (IoT) applications, and digital
               integrated circuits.
               		
            
            
            
               			Hanho Lee (S’97-M’98-SM’13) received the M.Sc. and Ph.D. degrees, both in Electrical
               and Computer Engineering, from the University of Minnesota, Minneapolis, USA, in 1996
               and 2000, respectively. In 1999, he was a Member of Technical Staff-1 at Lucent Technologies,
               Bell Labs, Holmdel, New Jersey, USA. From April 2000 to August 2002, he was a Member
               of Technical Staff (MTS) at Lucent Technologies (Bell Labs Innovations), Allentown,
               USA, where he was involved in the design of DSP multi-processor architecture. From
               August 2002 to August 2004, he was an Assistant Professor at the Department of Electrical
               and Computer Engineering, University of Connecticut, USA. He has been a faculty member
               at Inha University, Incheon, South Korea, since September 2004, initially in the Department
               of Information and Communication Engineering and, since 2025, in the Department of
               Electrical and Electronic Engineering, where he is currently a Full Professor. He
               leads the Digital Integrated Systems Lab and is the Director of Artificial Intelligence
               System on Chip (AI-SoC) Research Center, Inha University. He was a Visiting Researcher
               with the Electronics and Telecommunications Research Institute (ETRI), South Korea,
               in 2005. He was a Visiting Scholar with Bell Labs, Alcatel-Lucent, Murray Hill, USA,
               from 2010 to 2011, and a Visiting Professor with The University of Texas at Dallas,
               USA, from 2017 to 2018. His research interests include algorithm and VLSI architecture
               design for post-quantum cryptography, homomorphic encryption, artificial intelligence,
               forward error correction coding, and digital signal processing. He served as a General
               Chair for ISICAS and Technical Program Chair for ISCAS and APCCAS. He was a Chair
               of the IEEE Circuits and Systems for Communications Technical Committee (CASCOM).
               He was a member of the Board of Governors (BoG) of the IEEE Circuits and Systems Society (CASS)
               from 2020 to 2023. He is the Vice President of Technical Activities of the IEEE CASS.