The Journal of Semiconductor Technology and Science (JSTS) is an international, peer-reviewed, and open-access journal that is published bimonthly.
- Scope: semiconductor processes, devices, circuits, and MEMS.
- Editor-in-Chief: Prof. Woo Young Choi (ECE, Seoul National University)
- Indexed within Science Citation Index Expanded (SCIE), SCOPUS, Korea Citation Index (KCI), and other databases.

[Research article]

JSTS(Journal of Semiconductor Technology and Science)

IEIE Vol. 25, No. 05, p.530-541

ISSN (print): 1598-1657

ISSN (online): 2233-4866

Received: 9 Apr. 2025; Revised: 9 Jul. 2025; Accepted: 4 Aug. 2025

DOI: https://doi.org/10.5573/JSTS.2025.25.5.530

HLS-based Hardware/Software Co-design of ML-KEM Post-quantum Cryptosystem for Real-time Video Encryption

Kyungkyun Kang^*, Seulbee Yang^**, Giang Truong Le^***, and Hanho Lee^1,†

(Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Korea)

^† Corresponding author: Hanho Lee (hhlee@inha.ac.kr)

E-mail: ^*kyun7415953@gmail.com, ^**sarena0824@gmail.com, ^***letruonggiang2211@gmail.com

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.(www.theieie.org).

Abstract

CRYSTALS-Kyber is a lattice-based post-quantum cryptosystem that is designed to resist attacks by quantum computers and was selected for standardization at the end of round 3 of the NIST PQC process. In 2023, NIST published the draft Federal Information Processing Standard (FIPS) 203 for ML-KEM, which includes a set of algorithms (Key Generation, Encapsulation, and Decapsulation) as the next version of CRYSTALS-Kyber. However, the performance and design flexibility of ML-KEM still need to be evaluated. This paper presents a high-performance HW/SW co-design implementation of ML-KEM based on the NIST PQC round-3 parameters using the Vivado HLS tool. HLS tools offer various optimization benefits through directives that accelerate hardware modules. Point-wise multiplication, addition, and parallelism are incorporated in the design to accelerate time-consuming operations in both the AES-GCM IP and the ML-KEM IP. All hardware modules are parameterized, enabling full support for runtime configuration to increase versatility.

Moreover, the proposed HW/SW architecture and tightly coupled operational workflows reduce data transmission overhead between the processor and hardware modules. The hardware accelerator is implemented using reconfigurable logic on an FPGA and is integrated with a high-performance ARM Cortex-A53 processor in the Xilinx Zynq UltraScale+ architecture, supported by the PYNQ framework.

To evaluate the performance of the proposed HW/SW system for ML-KEM at NIST security levels 1, 3, and 5, we used various data types, including video (AVI, H.264), images (8-bit and 24-bit color), and text files.

For a fixed input size of 320 kB, the proposed hybrid cryptosystem based on ML-KEM PQC achieved an average $11.3\times$ improvement in execution time over the software implementation, with runtimes of 605 ms and 6,894 ms, respectively.

Index Terms

Post-quantum cryptography, high-level synthesis, HW/SW co-design, hybrid HLS-RTL design, FPGA, video streaming applications, hardware architectures

I. INTRODUCTION

Nowadays, cryptographic primitives are ubiquitously used in hardware and software systems and continuously evolve through community efforts and standardization competitions. The National Institute of Standards and Technology (NIST) has initiated the standardization of Post-Quantum Cryptography (PQC) ^[1].

Hardware-software (HW/SW) co-design is a System-on-Chip (SoC) methodology involving both software design on microprocessors such as ARM or RISC-V, and hardware design on Field-Programmable Gate Arrays (FPGAs) ^[2-4] or Application-Specific Integrated Circuits (ASICs) ^[5]. HW/SW co-design leverages the advantages of both hardware and software platforms. Specifically, parallel and pipelined architectures can be utilized to accelerate the critical components in hardware, while the non-critical parts can be implemented in software with short development time. Therefore, HW/SW co-design offers a shorter time-to-market than pure hardware designs ^[6,7] and achieves better performance than pure software implementations ^[8].

In addition, the limited hardware resources of FPGAs make HW/SW co-design an effective approach for efficient system implementations and a practical fallback when a full hardware deployment is not feasible. Systems that do not fit entirely within the target FPGA can be implemented as HW/SW co-designs at reduced cost.

Some existing symmetric encryption systems (especially AES-GCM) and hash functions (particularly SHA-2 and SHA-3) are considered secure for PQC scenarios, although their effectiveness may be reduced ^[9,10]. Therefore, the development of public-key cryptographic systems that combine traditional cryptographic algorithms with PQC is becoming increasingly important. Moreover, the size of encryption keys plays a critical role in post-quantum security, as longer keys resist factorization and search algorithms more effectively ^[11].

High-Level Synthesis (HLS) is a design tool that accepts an algorithm written in C/C++ and generates a corresponding FPGA implementation. HLS offers a faster path to FPGA realization compared to traditional VHDL/Verilog-based approaches. It enables cost-effective exploration of the hardware design space and allows IP packages to be reused in other projects without re-synthesis or reconfiguration. As such, HLS is a useful tool in developing hardware accelerators. HLS has been widely used to implement NIST PQC competition candidates, including lattice-based KEMs ^[11], the Classic McEliece code-based KEM ^[12], and comprehensive implementations of both lattice-based KEMs and digital signature schemes ^[13]. HW/SW co-design approaches utilizing HLS for hardware accelerator design have also been successfully applied to Classic McEliece ^[12] and BIKE ^[14].

In 2024, a system combining three algorithms—classical, quantum, and PQC—was implemented on an FPGA. This system integrated a pre-quantum Key Exchange scheme (KEX), a post-quantum Key Encapsulation Mechanism (KEM), and a Quantum Key Distribution (QKD) algorithm ^[15].

ML-KEM is the key encapsulation mechanism that emerged from NIST's PQC standardization process. Its security rests on the hardness of the Module Learning With Errors (MLWE) problem, which is related to finding short vectors in lattices. More details can be found in the official technical documentation ^[1].

The main contributions of this paper can be summarized as follows:

• We propose a HW/SW hybrid architecture and implement a fast architecture for a Post-Quantum Cryptography (PQC) system on the Xilinx Zynq UltraScale+ ZCU104 FPGA platform, enabling data encryption and decryption using the HLS tool. The proposed design supports efficient hardware resource reuse and can easily adapt to various data types (texts, images, and videos) for IoT applications.

• HLS source code (written in C++) for AES-GCM-256 and ML-KEM is developed for simulation, synthesis, and IP block generation. The ML-KEM (formerly CRYSTALS-Kyber) IP core supports post-quantum key encapsulation operations, including key generation, encapsulation, and decapsulation, for all parameter sets listed in Table 1, which also gives the sizes of the encapsulation key (ek), decapsulation key (dk), and ciphertext (ct). This approach reduces pipeline-induced redundancy, resulting in high throughput and lower resource usage on the Xilinx Zynq UltraScale+ FPGA. The proposed system also presents an optimized hardware architecture for AES-GCM operations and the ML-KEM scheme.

• Finally, we evaluate the proposed architecture on the UltraScale+ ZCU104 FPGA platform and compare it with state-of-the-art works. Experimental results show that the proposed design achieves comparable throughput while using fewer hardware resources than existing studies.

The remainder of this paper is organized as follows. Section II introduces the background and related work. Section III presents the system design and the HLS-based HW/SW co-design implementation of our PQC system. Section IV provides experimental results and a comparative analysis of performance to evaluate the effectiveness of our approach. Section V concludes the paper with a summary of our proposed PQC system.

Table 1. Parameter set for ML-KEM.
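
As an illustrative cross-check of the sizes that Table 1 refers to, the following minimal Python sketch derives the byte lengths of $ek$, $dk$, and $ct$ from the FIPS 203 parameter sets ^[1]. The compression widths `du` and `dv` and all identifier names are not defined in this paper; they are taken from FIPS 203 or chosen for illustration.

```python
# Byte sizes of ML-KEM keys and ciphertexts, derived from the FIPS 203
# parameter sets. PARAMS and sizes() are illustrative names.
N = 256  # coefficients per polynomial

PARAMS = {
    "ML-KEM-512": {"k": 2, "du": 10, "dv": 4},
    "ML-KEM-768": {"k": 3, "du": 10, "dv": 4},
    "ML-KEM-1024": {"k": 4, "du": 11, "dv": 5},
}


def sizes(k, du, dv):
    """Return the byte lengths of ek, dk, and ct for one parameter set."""
    poly12 = 12 * N // 8             # one polynomial packed at 12 bits/coeff
    ek = k * poly12 + 32             # Encode_12(t_hat) || 32-byte seed
    dk = k * poly12 + ek + 32 + 32   # dk_pke || ek || H(ek) || z
    ct = (du * k + dv) * N // 8      # compressed u (k polys) and v (1 poly)
    return {"ek": ek, "dk": dk, "ct": ct}


for name, p in PARAMS.items():
    print(name, sizes(**p))
```

For $k = 2$ this gives 800, 1632, and 768 bytes for $ek$, $dk$, and $ct$, respectively.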

II. BACKGROUND AND RELATED WORK

1. Module Lattice-based Key Encapsulation Mechanisms

NIST has been working on a public project to standardize quantum-safe algorithms, including key encapsulation and digital signatures. At the end of Round 3, NIST selected CRYSTALS-Kyber as the first Key Encapsulation Mechanism (KEM) for standardization. In 2024, it was formally standardized as the Module-Lattice-based Key Encapsulation Mechanism (ML-KEM) in FIPS 203 ^[1].

1.1 Polynomial multiplication using NTT

ML-KEM is derived from the Round 3 version of Kyber ^[7]. Polynomial multiplication over $\mathbb{Z}_{3329}[x]/(x^{256} + 1)$ is a fundamental operation in ML-KEM. Because $n \mid (q - 1)$ but $2n \nmid (q - 1)$ for $n = 256$ and $q = 3329$ (since $q - 1 = 2^8 \cdot 13$), ML-KEM employs an incomplete Number Theoretic Transform (NTT) to accelerate this operation. NTT is a special form of the Discrete Fourier Transform (DFT) over finite fields. It is commonly employed to reduce the computational complexity of polynomial multiplications ^[16].

Polynomial multiplication involves convolving the coefficients of two polynomials, which can be computationally intensive, especially for larger parameters. NTT leverages the algebraic structure of the underlying finite field to enable more efficient polynomial multiplication. The results of the multiplication are computed modulo $Q$ using Montgomery and Barrett reductions in ML-KEM ^[18].

Two polynomials $a(x)$ and $b(x)$ in $ R_q = \mathbb{Z}_q[x]/(x^N + 1) $ can be represented as follows:

(1)

$ a(x)=a_{0}x^0+a_{1}x^1+a_{2}x^2+\cdots +a_{N-1}x^{N-1}, $

(2)

$ b(x)=b_{0}x^0+b_{1}x^1+b_{2}x^2+\cdots +b_{N-1}x^{N-1}. $

We can express the result of the multiplication of $a(x)$ and $b(x)$ as $c(x)$, as follows:

(3)

$ c(x)=a(x)\cdot b(x)=\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}a_{i}b_{j}x^{i+j} \bmod (x^N + 1). $

The traditional schoolbook method for polynomial multiplication has a computational complexity of $O(N^2)$, but by utilizing the NTT, this can be reduced to $O(N \log N)$. Eq. (4) defines the NTT, and Eq. (5) defines the inverse NTT (INTT).

(4)

$ A_k = \sum_{i=0}^{N-1} a_{i} \omega_N^{ik} \, \bmod \, q, $

(5)

$ a_i = N^{-1} \sum_{k=0}^{N-1} A_k \omega_N^{-ik} \, \bmod \, q, $

where $\omega_N$ is the twiddle factor (TF), a primitive $N$-th root of unity modulo $q$, and $q$ is the modulus.

By utilizing the NTT and INTT, Eq. (3) can be transformed into Eq. (6). In this context, the $\circ$ operator denotes point-wise multiplication.

(6)

$ c(x) = a(x) \cdot b(x) = NTT^{-1}(NTT(a(x)) \circ NTT(b(x))). $
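
Eqs. (3)-(6) can be checked numerically with a small sketch. The code below evaluates Eqs. (4) and (5) directly (without fast butterflies, so it still costs $O(N^2)$) and compares Eq. (6) with the schoolbook product. It makes two assumptions that go beyond the text: it uses a complete transform of length $N = 128$ rather than the incomplete 256-point NTT of ML-KEM, and it pre-scales the coefficients by powers of $\psi = 17$, a primitive $2N$-th root of unity modulo 3329, so that the point-wise product corresponds to multiplication modulo $x^N + 1$. All function names are illustrative.

```python
import random

Q = 3329
N = 128                  # toy length (assumption): a complete NTT exists for it
PSI = 17                 # primitive 2N-th (256th) root of unity mod Q
OMEGA = PSI * PSI % Q    # primitive N-th root of unity mod Q


def ntt(a):
    """Direct evaluation of Eq. (4)."""
    return [sum(a[i] * pow(OMEGA, i * k, Q) for i in range(N)) % Q
            for k in range(N)]


def intt(A):
    """Direct evaluation of Eq. (5)."""
    n_inv = pow(N, -1, Q)
    w_inv = pow(OMEGA, -1, Q)
    return [n_inv * sum(A[k] * pow(w_inv, i * k, Q) for k in range(N)) % Q
            for i in range(N)]


def schoolbook(a, b):
    """Eq. (3): product reduced modulo x^N + 1, using x^N = -1."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] = (c[i + j] + a[i] * b[j]) % Q
            else:
                c[i + j - N] = (c[i + j - N] - a[i] * b[j]) % Q
    return c


def ntt_multiply(a, b):
    """Eq. (6), with psi-weighting so the wrap-around picks up a minus sign."""
    aw = [a[i] * pow(PSI, i, Q) % Q for i in range(N)]
    bw = [b[i] * pow(PSI, i, Q) % Q for i in range(N)]
    cw = intt([x * y % Q for x, y in zip(ntt(aw), ntt(bw))])
    psi_inv = pow(PSI, -1, Q)
    return [cw[i] * pow(psi_inv, i, Q) % Q for i in range(N)]


rng = random.Random(0)
a = [rng.randrange(Q) for _ in range(N)]
b = [rng.randrange(Q) for _ in range(N)]
print(ntt_multiply(a, b) == schoolbook(a, b))
```

A hardware NTT replaces the two direct sums with $\log_2 N$ butterfly stages, which is where the $O(N \log N)$ cost comes from.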

ML-KEM uses four SHA-3 functions: the hash functions SHA3-256 and SHA3-512 and the extendable-output functions SHAKE128 and SHAKE256. For more details, readers can refer to FIPS 203 ^[1].

According to the FIPS 203 document, three primary algorithms are defined as KEM operations: key generation, encapsulation, and decapsulation.

1.2 Modular reduction for ML-KEM

The prime modulus $Q$ of Kyber was changed from $7681$ to $3329$, and ML-KEM retains this value, which consequently affects the ciphertext and key sizes. In ^[18], two efficient modular reduction algorithms were proposed: Montgomery reduction (Algorithm 1) for use in the NTT and Barrett reduction (Algorithm 2) for use in the INTT.

Algorithm 1: Montgomery reduction for ML-KEM ^[18].

Algorithm 2: Barrett Reduction for ML-KEM ^[18].
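
The bodies of Algorithms 1 and 2 are given in ^[18]. As a rough illustration of the two techniques, the following Python sketch models signed 16-bit Montgomery and Barrett reduction for $Q = 3329$ in the style of the public CRYSTALS-Kyber reference software. It is an assumption that this matches the algorithms of ^[18] in detail, and the function names are illustrative.

```python
Q = 3329
R_BITS = 16
QINV = pow(Q, -1, 1 << R_BITS)          # q^-1 mod 2^16
BARRETT_V = ((1 << 26) + Q // 2) // Q   # round(2^26 / q)


def to_int16(x):
    """Keep the low 16 bits of x and interpret them as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def montgomery_reduce(a):
    """Return r with r * 2^16 = a (mod q) and -q < r < q, for |a| < q * 2^15."""
    t = to_int16(a * QINV)       # t = a * q^-1 mod 2^16 (signed)
    return (a - t * Q) >> R_BITS  # a - t*q is an exact multiple of 2^16


def barrett_reduce(a):
    """Return the centered representative of a mod q for a 16-bit input a."""
    t = (BARRETT_V * a + (1 << 25)) >> 26   # approximately round(a / q)
    return a - t * Q


# Modular multiplication via Montgomery form: b is pre-scaled by 2^16.
a, b = 1234, 567
b_mont = (b << R_BITS) % Q
print(montgomery_reduce(a * b_mont) % Q == (a * b) % Q)
```

Both routines replace a division by $Q$ with multiplications and shifts, which is what makes them attractive for pipelined butterfly units.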

1.3 Key generation (Key gen)

Fig. 1 illustrates the public key $ek$ and private key $dk$ generated by a probabilistic key generation algorithm. The encryption key ($ek$) is generated using the matrix $A$, the value $\hat{s}$, and the value $\hat{e}$, following the equation in ^[1]:

Fig. 1. A general view of a system using ML-KEM and AES-GCM for real-time video encryption and decryption. The Key Generation (Key Gen) algorithm in ML-KEM generates a key pair ($ek$, $dk$), where $ek$ (encapsulation key) and $dk$ (decapsulation key) are used in the Encapsulation and Decapsulation algorithms, respectively.

(7)

$ \hat{t} := \hat{A} \circ \hat{s} + \hat{e}, $

where $\hat{A} \in R_{q}^{k \times k}$ is the public matrix, $\hat{s} \in R_{q}^{k}$ is the secret vector, and $\hat{e} \in R_{q}^{k}$ is the noise vector. Both $\hat{s}$ and $\hat{e}$ are the NTT-domain representations of short polynomials whose coefficients are sampled from the centered binomial distribution $\beta_{\eta}$. The encoder sub-module is responsible for serializing a polynomial into a byte array.
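
To make the sampling of the short polynomials concrete, the following Python sketch follows the centered binomial sampler and the SHAKE256-based pseudorandom function described in FIPS 203 ^[1]. It is a software model under that assumption, not the binomial sampler sub-module of the proposed hardware, and the function names are illustrative.

```python
import hashlib

Q = 3329
N = 256


def bytes_to_bits(data):
    """Little-endian bit expansion: bit j of byte i lands at index 8*i + j."""
    return [(byte >> j) & 1 for byte in data for j in range(8)]


def sample_poly_cbd(data, eta):
    """Map 64*eta bytes to 256 coefficients in [-eta, eta], stored mod Q."""
    if len(data) != 64 * eta:
        raise ValueError("expected 64 * eta input bytes")
    bits = bytes_to_bits(data)
    coeffs = []
    for i in range(N):
        x = sum(bits[2 * i * eta + j] for j in range(eta))
        y = sum(bits[2 * i * eta + eta + j] for j in range(eta))
        coeffs.append((x - y) % Q)   # difference of two eta-bit popcounts
    return coeffs


def prf(seed, nonce, eta):
    """Expand a 32-byte seed and a one-byte nonce into 64*eta bytes."""
    return hashlib.shake_256(seed + bytes([nonce])).digest(64 * eta)


eta = 2
s0 = sample_poly_cbd(prf(bytes(32), 0, eta), eta)
centered = [c if c <= Q // 2 else c - Q for c in s0]
print(min(centered), max(centered))
```

Each coefficient needs only $2\eta$ random bits and two small popcounts, so the sampler maps naturally onto a fully unrolled hardware loop.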

The encapsulation key is simply the encryption key.

(8)

$ ek := ek_{pke} = Encode_{12}((\hat{A} \circ \hat{s}+\hat{e})\,\bmod^+ q)\parallel \rho, $

where $\rho$ is the 32-byte public seed from which $\hat{A}$ is generated. Then, the transformed value $\hat{s}$ is used to generate the decryption key $dk_{pke}$, as shown in Eq. (9).

(9)

$ dk_{pke} := Encode_{12}(\hat{s} \,\bmod^+ q). $

The decapsulation key comprises the decryption key, the encapsulation key, a hash of the encapsulation key, and a random 32-byte value $z$.

(10)

$ dk := (dk_{pke} || ek || sha3\text{-}256(ek) || z). $
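
The layout in Eq. (10) can be illustrated with a short Python sketch. The byte strings below are placeholders rather than real key material, the lengths $384k$ and $384k + 32$ follow from the 12-bit encoding in Eqs. (8) and (9), and `assemble_dk` is an illustrative name.

```python
import hashlib


def assemble_dk(dk_pke, ek, z):
    """Concatenate dk_pke || ek || SHA3-256(ek) || z as in Eq. (10)."""
    return dk_pke + ek + hashlib.sha3_256(ek).digest() + z


k = 2
dk_pke = bytes(i % 256 for i in range(384 * k))          # placeholder
ek = bytes((7 * i) % 256 for i in range(384 * k + 32))    # placeholder
z = bytes(32)                                             # placeholder

dk = assemble_dk(dk_pke, ek, z)

# Field offsets inside dk for a given k.
ek_start = 384 * k
h_start = 768 * k + 32
z_start = 768 * k + 64
print(len(dk), ek_start, h_start, z_start)
```

The stored hash occupies bytes $768k + 32$ to $768k + 63$ of $dk$, which is the slice that decapsulation reads back.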

1.4 Encapsulation (Encaps)

A probabilistic encapsulation algorithm takes as input an encapsulation key $ek$ and outputs a ciphertext $ct$ and a shared secret key $ss$.

First, random bytes $m$ and the encapsulation key $ek$ are used in a hash sampler sub-module to generate the shared secret key $ss$ and a sub-random value $r$.

(11)

$ (ss, r) := sha3\text{-}512(m || sha3\text{-}256(ek)). $

The decoder sub-module converts the byte arrays $ek$ and $m$ into polynomial representations and transforms them into the required format. Then, using the converted $ek$, the matrix $A^T$ is generated via the hash sampler and rejection sampler sub-modules. In this step, the internal padding order is applied in the reverse direction compared to the key generation process.

Next, the encryption randomness $r$ is used to generate noise values $y$, $e_1$ and $e_2$ using the hash sampler and binomial sampler sub-modules, as specified in Algorithm 16 of the FIPS-203 standard document ^[1].

The noise vector $y$ is transformed into the NTT domain as $\hat{y}$ through the NTT sub-module. This prepares the noise vector in the domain required for the encryption process.

Using the generated values, the intermediate ciphertext components $u$ and $v$ are computed as follows in Eqs. (12) and (13):

(12)

$ u := NTT^{-1}(\hat{A}^T \circ \hat{y}) + e_1, $

(13)

$ v := NTT^{-1}(\hat{t}^T \circ \hat{y}) + e_2 + decompress_q(m). $

The values $u$ and $v$, which comprise the ciphertext, are then compressed into $c_1$ and $c_2$ through a compression and encoding process using the corresponding compress and encode sub-modules.
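
The compression and decompression steps can be sketched in Python as follows. The definitions follow the rounding-based Compress and Decompress functions of FIPS 203 ^[1] with a bit width $d$; treating the message step in Eq. (13) as the case $d = 1$ is an assumption based on that standard, and the function names are illustrative.

```python
Q = 3329


def compress(x, d):
    """round(2^d / Q * x) mod 2^d, computed with integers only."""
    return (((x << d) + Q // 2) // Q) % (1 << d)


def decompress(y, d):
    """round(Q / 2^d * y), computed with integers only."""
    return (Q * y + (1 << (d - 1))) >> d


# A message bit of 1 is embedded near Q/2, a bit of 0 near 0.
one = decompress(1, 1)
zero = decompress(0, 1)

# Moderate noise added to the embedded bit is removed again by compress.
noise = 800
recovered = [compress((one + noise) % Q, 1), compress((one - noise) % Q, 1),
             compress((zero + noise) % Q, 1), compress((zero - noise) % Q, 1)]
print(one, zero, recovered)
```

This noise tolerance is what allows decapsulation to recover the message even though $e_1$, $e_2$, and the compression of $u$ and $v$ perturb the ciphertext.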

1.5 Decapsulation (Decaps)

A decapsulation process takes as input a decapsulation key $dk$ and a ciphertext $ct$, and performs decryption. The ciphertext $ct$ is separated into $u$ and $v$ through the Decode and Decompress sub-module processes. Additionally, the decapsulation key $dk$ is restored as $\hat{s}$ using the Decode sub-module, as described in Algorithm 17 in FIPS 203 ^[1].

The restored $u$ value is transformed into the NTT domain through the NTT module, followed by point-wise multiplication with $\hat{s}$. The resulting value is then converted back to the base domain using the $NTT^{-1}$ sub-module. Subsequently, the operations in Eqs. (14) and (15) are performed to obtain the message $m'$ ^[1].

(14)

$ w := v - NTT^{-1}(\hat{s}^T \circ NTT(u)) , $

(15)

$ m' := Encode_1(Compress_q(v - NTT^{-1}(\hat{s}^T \circ NTT(u)))). $

The decapsulation algorithm then computes a candidate shared secret key $ss'$ using $m'$ and a part of $dk$, following the same procedure used in the encapsulation process.

(16)

$ (ss', r') := sha3\text{-}512(m' || dk[768k+32:768k+64]). $
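
Eqs. (11) and (16) use the same derivation, which the following Python sketch makes explicit with the standard `hashlib` module. The key material is a placeholder, the split of the 64-byte digest into two 32-byte halves follows FIPS 203 ^[1], and the function names are illustrative.

```python
import hashlib


def G(data):
    """SHA3-512, split into a 32-byte shared secret and 32-byte randomness."""
    digest = hashlib.sha3_512(data).digest()
    return digest[:32], digest[32:]


def encaps_derive(m, ek):
    """Eq. (11): (ss, r) from the message and the hash of ek."""
    return G(m + hashlib.sha3_256(ek).digest())


def decaps_derive(m_prime, dk, k):
    """Eq. (16): (ss', r') from m' and the hash of ek stored inside dk."""
    return G(m_prime + dk[768 * k + 32:768 * k + 64])


k = 2
ek = bytes((3 * i) % 256 for i in range(384 * k + 32))   # placeholder
dk = bytes(384 * k) + ek + hashlib.sha3_256(ek).digest() + bytes(32)
m = bytes(range(32))                                     # placeholder message

ss, r = encaps_derive(m, ek)
ss_prime, r_prime = decaps_derive(m, dk, k)
print(ss == ss_prime)
```

When $m' = m$, both sides obtain the same pair, so $ss' = ss$. FIPS 203 additionally re-encrypts $m'$ with $r'$ and compares the result with the received ciphertext before accepting $ss'$, a check that this sketch omits.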

We briefly summarize the CRYSTALS-Kyber public-key cryptosystem. CRYSTALS-Kyber is a PQC KEM published in ^[17] and has been standardized by NIST under the name Module-Lattice-Based Key Encapsulation Mechanism (ML-KEM), as mentioned above. ML-KEM enables a sender and receiver to securely establish a shared secret key. ML-KEM consists of three algorithms: Key Generation (KeyGen), Encapsulation (Encaps), and Decapsulation (Decaps).

First, a sender executes the KeyGen algorithm to generate an encapsulation key ($ek$) and a decapsulation key ($dk$), and securely distributes $ek$ to the receiver. The receiver then uses this $ek$ in Encaps to generate a shared secret key ($ss$) and ciphertext ($ct$). The receiver sends this $ct$ to the sender. Finally, the sender uses Decaps with its own $dk$ to decrypt $ct$ and obtain the shared secret key $ss$.

Fig. 1 illustrates, from left to right, our case study for video encryption, which combines symmetric and asymmetric encryption techniques for robust data security. In this approach, the asymmetric scheme (ML-KEM) establishes a matching pair of shared secret keys ($ss$ on the encapsulating side and $ss'$ on the decapsulating side), which then key the symmetric cipher that encrypts and decrypts the data. Since both the sender and receiver possess the shared secret key, they can use AES-GCM to encrypt or decrypt and exchange data securely.

In summary, ML-KEM is similar to Public Key Encryption (PKE) in that one party encrypts a message using the public key $ek$, and the other decrypts it using the private key $dk$.

2. AES-GCM

The Advanced Encryption Standard (AES) is a FIPS-approved symmetric block cipher that can be used to protect digital data by encrypting (enciphering) and decrypting (deciphering) information. Encryption converts data into an unintelligible form called ciphertext, while decryption converts the ciphertext back into its original form, known as plaintext. Galois/Counter Mode (GCM) is a mode of operation that adds authentication to this encryption. AES-GCM supports cryptographic keys of 128, 192, and 256 bits and processes data in blocks of 128 bits ^[23]. For our implementation, we select the highest security level, AES-GCM-256 (hereafter AES-GCM).

3. High-Level Synthesis

Vivado HLS tools do not support all C/C++ functions for hardware implementation. For example, dynamic memory allocation, function recursion, system calls, and file input/output (I/O) operations are not supported by HLS tools. For this reason, there are two different design approaches with HLS synthesis tools:

First, a previously developed C reference design can be ported to a hardware implementation. In this case, the developer starts with existing C code, and the role of the designer is to modify the code sections that are not synthesizable and then optimize the design by adding optimization directives to meet the design goals. The top-down design flow is more practical in this situation because the design can be accelerated as a single function.

Second, the design can be developed from scratch, where the designer writes the code with full awareness that it will be synthesized for hardware implementation. Both top-down and bottom-up design styles are feasible. In the bottom-up approach, the designer can begin by accelerating sub-functions in the application, with the flexibility to later expand the accelerator to include additional functions.

4. Xilinx PYNQ Platform

PYNQ is an open-source Xilinx project that facilitates easy programming and interaction with Zynq devices through Python ^[21]. The Zynq UltraScale+ architecture combines a hard-core ARM Cortex-A53 processing system (PS) with FPGA programmable logic (PL) on the same device.

By leveraging the Python language and its libraries, developers can write host code in Python to be executed in the PS domain. This allows collaborative design between the PS domain (the ARM processor on the FPGA board) and the PL domain, and enables the implementation of various embedded systems on a single FPGA. Fig. 2 shows the PYNQ framework, which runs on the CPU (ARM Cortex-A53) in the PS domain once the PYNQ boot image is loaded and the Zynq UltraScale+ ZCU104 board is booted in SD-card boot mode.

Fig. 2. Architecture of PYNQ framework.

The framework consists of a PYNQ overlay, which is a firmware design incorporating the blocks described in Section III and built using Vivado, and a Jupyter Notebook written entirely in Python. The notebook server runs on an Ubuntu-based Linux kernel in the PS domain and can be used to program and configure the PL, move data through the PL, and visualize the results.

Fig. 3. High-level synthesis for ML-KEM and AES-GCM IPs.

III. HARDWARE/SOFTWARE CO-DESIGN

The internal hardware implementations of the ML-KEM (key generation, encapsulation, and decapsulation) processes are illustrated in Fig. 4. These diagrams present the modular dataflow and integration of arithmetic blocks such as NTT, INTT, samplers, and hash logic, all optimized for FPGA implementation. Each module was synthesized using Vivado HLS and designed to optimize the trade-off between throughput and resource utilization.

Fig. 4. HLS-based ML-KEM block diagrams for (a) key generation, (b) encapsulation, and (c) decapsulation.

As shown in Fig. 3, the design process began with C source-level simulation, where a testbench was developed to validate the functional correctness of the C logic and the AXI interface. As part of the high-level synthesis process, optimization directives such as PIPELINE and UNROLL were explicitly applied—particularly to loop-intensive operations such as the Number Theoretic Transform (NTT) and Inverse NTT (INTT)—to exploit parallelism and improve performance. Among the various modules, the NTT and INTT blocks were identified as primary performance bottlenecks, as they are used in all core operations of ML-KEM. To mitigate this issue, we manually applied loop unrolling to the outermost loop in the three-dimensional NTT/INTT structure to increase parallel execution. Additionally, the HASH module was optimized by applying the PIPELINE directive to its internal round function.

Fig. 5. HLS-based AES-256-GCM hardware block diagram.

Fig. 6. Hardware/software co-design flow of the proposed ML-KEM and AES-GCM architecture in Vivado HLS.

As a result of these optimizations, the overall latency of the ML-KEM operation was reduced by approximately 80,000 clock cycles compared to the initial implementation, resulting in significant improvements in efficiency and execution time.

The hardware architecture of AES-256 encryption and decryption is depicted in Fig. 5. The diagrams illustrate the structured dataflow and composition of fundamental cryptographic components, including SubBytes, ShiftRows, MixColumns, and AddRoundKey, all tailored for efficient FPGA realization. The design flow is illustrated in Fig. 3, following the same methodology as that of ML-KEM. The ARRAY_PARTITION directive was applied to two-dimensional arrays such as encrypt_block, GF, state, and expanded key, allowing each array dimension to be mapped to individual registers for parallel processing. Furthermore, the internal KeyExpansion function was optimized through loop unrolling and pipelining of the shiftrow128, xor128, and GF_mult128 operations. As a result of the applied optimizations, the overall latency of the AES-256 module was reduced by approximately 53,000 clock cycles compared to the baseline implementation, leading to substantial improvements in performance and processing efficiency.

The AXI-Stream interface was also integrated to generate the corresponding RTL code. The synthesized RTL was then verified through C/RTL co-simulation using waveform-based simulation in Vivado to ensure functional equivalence between the C and RTL implementations. The synthesized IP cores were mapped to the Programmable Logic (PL) region of the Zynq UltraScale+ FPGA, finalizing the hardware system implementation.

The top-level hardware/software co-design architecture is shown in Fig. 6. Our proposed system is implemented on the Xilinx ZCU104 FPGA board, which includes two parts: the PS and the PL. The Advanced eXtensible Interface (AXI) standard is used to interconnect the PS and PL. The PYNQ framework provides a software application interface that runs on the ARM processor in the PS, as mentioned above, while the designed hardware accelerators for ML-KEM and AES-GCM run on the reconfigurable logic in the PL.

The camera is used to generate video frames or capture videos with a resolution of 640×480. To interface with the user, the PYNQ framework is installed on the ARM Cortex-A53 CPU and is used to control the camera to record video data. The capture module converts native video to the AXI stream protocol. This module samples the incoming data at 200 MHz and generates appropriate AXI stream signals that integrate easily with other parts of the architecture using the same protocol.

In addition, each accelerator exposes an AXI stream interface that connects to dedicated DMA controllers (DMA0 for ML-KEM and DMA1 for AES-GCM), which provide high-bandwidth access to local memory. The DMA controllers are connected to memory via an AXI interface, while the CPU accesses their initialization, status, and management registers through AXI4-Lite. The details of the values in these registers are shown in Table 2.

Table 2. HLS AXI-LITE register functions.

On the PS side, the processor accesses data in the DDR for computation. The processor includes a cache to store temporary data for acceleration. The HPM0 and HPM1 ports are high-performance interfaces that connect to the DDR controller through the AXI interconnect block. They can read and write large volumes of data in memory using the AXI protocol.

On the PL side, the DMA serves as the intermediary for data communication with the DDR and is connected to the HP port using the AXI stream protocol ^[22]. The DMA interacts with the hardware accelerator through input and output FIFOs. The read and write interrupt signals of the DMA pass to the IRQ port through the concat IP. The processor controls the DMA data transfer and passes configured parameters via the AXI HP ports using the AXI-Lite protocol. The AXI interconnect and AXI DMA act as intermediaries between the endpoint IPs and the PS. The AXI DMA controller supports memory copy (memcpy) and memory initialization (memset) functions, both of which can operate on byte, half-word, and word granularity. The AXI stream data transmission in this design uses a 64-bit bus, while the AXI-Lite control signal uses a 32-bit bus.

For operation, the ML-KEM block is executed first: after key generation (KeyGen), the encapsulation and decapsulation operations produce the pair of shared keys $ss$ and $ss'$, as shown in Fig. 6. The shared key $ss$ is then sent to the AES-GCM IP block through DMA0, while $ss'$ is used in the decryption algorithm.

Each IP module works independently. All modules interface with the same input and output FIFOs through the AXI DMA. The module control logic is implemented as an arbiter that relays control information between the PS and the acceleration modules. The system's configurability is achieved through control registers, which carry control signals and design parameters. Four control registers are defined, as shown in Table 2. Register 0 uses 3 bits to control the startup of the four modules, while the remaining registers convey parameter settings for the various modules.

Fig. 9 shows the basic operation of DMA0, which transfers the generated shared key to the AES-GCM IP block for re-encryption. The AXI stream handshake signals of the ML-KEM IP block—clock ($clk$), last ($kem\_TLAST$), keep ($kem\_TKEEP$), ready ($kem\_TREADY$), valid ($kem\_TVALID$), and data ($kem\_TDATA$)—are required for the IP block interface. Specifically, the $kem\_TLAST$ signal is required to indicate the end of a frame. In general, stream signals primarily facilitate handshake mechanisms. To clarify this, we categorize them into three groups: $kem\_TVALID$, $kem\_TREADY$, and all other signals grouped under $kem\_TDATA$.

After encryption/decryption, the result is transmitted via DMA1 using the PYNQ framework. The timing signals of the DMA1 protocol are depicted in Figs. 8 and 9.

Fig. 7. Flow chart for hybrid mode of the proposed HW/SW co-design.

Fig. 8. Timing diagram of the signals involved in the encapsulation algorithm, with the shared key from ML-KEM block, and encrypted data from AES-GCM block.

Fig. 9. Timing behavior of the signals involved in the encapsulation algorithm; encrypted data is output on each rising edge of the clock.

Fig. 10. ML-KEM based hybrid cryptosystem architecture on FPGA.

IV. EXPERIMENT RESULTS

Table 3 shows a comparison of FPGA resource utilization for the AES-GCM implementation, while Table 4 provides a comparison of hardware resource consumption for the ML-KEM and CRYSTALS-Kyber implementations. Additional results from our case studies, which were not previously included, are summarized in Table 5, showing the complete combination of ML-KEM and AES-GCM IP cores.

Table 3. Performance comparison for AES-GCM architecture, with latency (Cycle): Lat.(CC), frequency: Freq., and Encryption/Decryption: Enc/Dec.

Table 4. Performance comparison for ML-KEM/CRYSTALS-Kyber architecture, with latency (Cycle): Lat.(CC), frequency: Freq., and Key generation/Encapsulation/Decapsulation: K/E/D.

Table 5. Performance comparison for ML-KEM based hybrid cryptosystem architecture, with a 320 KB file, latency (Cycle): Lat.(CC), and frequency: Freq.

We evaluated our design using Xilinx Vivado HLS on the Zynq UltraScale+ Evaluation Platform (ZCU104 board, xczu7ev-ffvc1156-2-e) with the open-source PYNQ framework ^[21]. For the ML-KEM hardware, we selected the parameter set specified in the ML-KEM standard with $n = 256$, $Q = 3329$, and $k = 2$. These parameters are summarized in Table 1.

As shown in Table 4, the proposed ML-KEM IP, which supports key generation, encapsulation, and decapsulation, reduces execution time by approximately $2.84\times$ compared to the software implementation in ^[26]. It also achieves higher throughput than the CRYSTALS-Kyber HLS design reported in ^[27], improving the area-time product (ATP) by approximately 37.1%.

The hardware export file (.bit file) was synthesized in Vivado and loaded onto the FPGA using the PYNQ Overlay class. Input data streams are passed to the AES-GCM block in packets using the Xilinx DMA tool and the PYNQ DMA driver. The resulting data can then be processed and visualized in the Jupyter notebook using standard Python tools.

Throughput is defined as the number of bits processed per second over the complete operation (key generation, encapsulation, and decapsulation). The hardware latency of ML-KEM is presented in Table 4.

The throughput, $T_p$, achieved by the proposed ML-KEM design can be calculated using the following equation:

(17)

$ T_p =\frac{Frequency \times bits}{Cycles} $

Here, $T_p$ represents the throughput in kilobits per second (kbps). The throughput reported in ^[27] is approximately 8.6 kbps, whereas the proposed design achieves a throughput of approximately 16.4 kbps, representing an improvement of nearly $1.91\times$.
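
Eq. (17) and the reported improvement factor can be reproduced with a few lines of Python. The clock frequency, bit count, and cycle count passed to the function below are hypothetical values chosen only to show the unit handling; they are not measurements from Table 4. Only the two throughput figures (8.6 kbps and 16.4 kbps) come from the text.

```python
def throughput_kbps(freq_hz, bits, cycles):
    """Eq. (17): bits processed per second, reported in kbps."""
    return freq_hz * bits / cycles / 1e3


# Hypothetical operating point (illustration only, not a measured result).
example = throughput_kbps(freq_hz=200e6, bits=256, cycles=3_200_000)

# Improvement factor from the two throughput figures quoted in the text.
improvement = 16.4 / 8.6

print(f"{example:.1f} kbps, improvement {improvement:.2f}x")
```

Because the frequency is given in Hz, Eq. (17) yields bits per second, and the division by $10^3$ converts the result to kbps.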

V. CONCLUSION

We implemented, compared, and analyzed the performance of HLS-based optimization and the design results of a proposed post-quantum cryptographic system. Xilinx tools, Vivado HLS (2022.2) and Vivado (2022.2), were utilized, with the Zynq FPGA (ZCU104 board) set as the target platform.

As shown in Fig. 6, the HLS design was implemented on the Zynq FPGA. The developed module, enhanced through directive-based optimizations in Vivado HLS, achieved lower latency than both the original NIST reference source code and the previous HLS design. Experimental results confirm improvements in performance and area-time product (ATP). In particular, for 320 KB files, the encryption/decryption time was reduced by approximately $11.3\times$ compared to the software implementation used in the reference source code. The implemented PQC design is promising and can be applied to a wide range of security-critical applications.

ACKNOWLEDGMENTS

This work was supported by INHA UNIVERSITY Research Grant.

References

[1] National Institute of Standards and Technology, “Module-lattice-based key encapsulation mechanism standard,” Department of Commerce, Washington, D.C., Federal Information Processing Standards Publication (FIPS), NIST FIPS 203, 2024. [Online]. Available: https://doi.org/10.6028/NIST.FIPS.203

[2] D. T. Nguyen, V. B. Dang, and K. Gaj, “A high-level synthesis approach to the software/hardware codesign of NTT-based post-quantum cryptography algorithms,” Proc. of International Conference on Field-Programmable Technology (ICFPT), Tianjin, China, pp. 371-374, 2019.

[3] J. P. Smith, J. I. Bailey, J. Tuthill, L. Stefanazzi, G. Cancelo, K. Treptow, and B. A. Mazin, “A high-throughput oversampled polyphase filter bank using Vivado HLS and PYNQ on a RFSoC,” IEEE Open Journal of Circuits and Systems, vol. 2, pp. 241-252, 2021.

[4] K. Haeublein, W. Brueckner, S. Vaas, S. Rachuj, M. Reichenbach, and D. Fey, “Utilizing PYNQ for accelerating image processing functions in ADAS applications,” in Proceedings of the 32nd International Conference on Architecture of Computing Systems (ARCS Workshop 2019), Copenhagen, Denmark, pp. 1-8, 2019.

[5] S. Morioka, T. Isshiki, S. Obana, Y. Nakamura, and K. Sako, “Flexible architecture optimization and ASIC implementation of group signature algorithm using a customized HLS methodology,” in Proceedings of the 2011 IEEE International Symposium on Hardware-Oriented Security and Trust (HOST), San Diego, CA, USA, 2011.

[6] J. Kokila, N. Ramasubramanian, and S. Indrajeet, “A survey of hardware and software co-design issues for system on chip design,” Advanced Computing and Communication Technologies, Springer, Singapore, 2016.

[7] Y. Zhang, Y. Zhao, J. Hu, and W. Zhang, “AutoAI2C: an automated hardware generator for DNN acceleration on both FPGA and ASIC,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024.

[8] S. Srilakshmi and G. L. Madhumati, “A comparative analysis of HDL and HLS for developing CNN accelerators,” in Proceedings of the 2023 Third International Conference on Artificial Intelligence and Smart Energy (ICAIS), Coimbatore, India, pp. 1060-1065, 2023.

[9] T. Takaki, Y. Li, K. Sakiyama, S. Nashimoto, D. Suzuki, and T. Sugawara, “An optimized implementation of AES-GCM for FPGA acceleration using high-level synthesis,” 2020 IEEE 9th Global Conference on Consumer Electronics (GCCE), Kobe, Japan, pp. 176-180, 2020.

[10] H. S. Jacinto, L. Daoud, and N. Rafla, “High-level synthesis using Vivado HLS for optimizations of SHA-3,” in Proceedings of the 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA, USA, pp. 563-566, 2017.

[11] E. Homsirikamol, K. Gaj, and R. R. L. Pareschi, “C vs. VHDL: benchmarking CAESAR candidates using high-level synthesis and register-transfer level methodologies,” in Directions in Authenticated Ciphers (DIAC), 2015.

[12] V. Kostalabros, J. Ribes-González, O. Farràs, M. Moretó, and C. Hernandez, “HLS-based HW/SW co-design of the post-quantum Classic McEliece cryptosystem,” in Proceedings of the 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, pp. 52-59, 2021.

[13] Z. Zhou, D. He, Z. Liu, M. Luo, and K.-K. Choo, “A software/hardware co-design of CRYSTALS-Dilithium signature scheme,” ACM Transactions on Reconfigurable Technology and Systems, vol. 14, no. 2, 11, 2021.

[14] G. Montanaro, A. Galimberti, E. Colizzi, and D. Zoni, “Hardware-software co-design of BIKE with HLS-generated accelerators,” Proc. of the 2022 29th IEEE International Conference on Electronics, Circuits and Systems (ICECS), Glasgow, United Kingdom, pp. 1-4, 2022.

[15] S. Ricci, P. Dobias, L. Malina, J. Hajny, and P. Jedlicka, “Hybrid keys in practice: combining classical, quantum and post-quantum cryptography,” IEEE Access, vol. 12, pp. 23206-23219, 2024.

[16] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297-301, Jan. 1965.

[17] National Institute of Standards and Technology (NIST), “NIST post-quantum cryptography round 1 submissions,” 2017. [Online]. Available: https://csrc.nist.gov/Projects/PostQuantum-Cryptography/Round-1-Submissions

[18] P. Nannipieri, S. Di Matteo, L. Zulberti, F. Albicocchi, S. Saponara, and L. Fanucci, “A RISC-V post-quantum cryptography instruction set extension for number theoretic transform to speed up CRYSTALS algorithms,” IEEE Access, vol. 9, pp. 150798-150808, 2021.

[19] Xilinx, Inc., UG902: Vivado High-Level Synthesis Guide, Version 2021, Xilinx, Inc., 2021.

[20] Xilinx, Inc., UG1270: Vivado HLS Optimization Methodology Guide, Version 2017, Xilinx, Inc., 2017. [Online]. Available: https://usermanual.wiki/Document/ug1270vivadohlsoptmethodologyguide.880892326.pdf

[21] PYNQ Open-Source Framework, “SD card image version 2.7,” Mar. 2024. [Online]. Available: https://www.pynq.io/

[22] T. N. Tan, P. Duong-Ngoc, T. X. Pham, and H. Lee, “Novel performance evaluation approach of AMBA AXI-based SoC design,” in Proceedings of the 2021 18th International SoC Design Conference (ISOCC), Jeju Island, Republic of Korea, pp. 403-404, 2021.

[23] E. Karacan, A. Karakaya, and S. Akleylek, “Quantum secure communication between service provider and SIM,” IEEE Access, vol. 10, pp. 69135-69146, 2022.

[24] L. Daoud, F. Hussein, and N. Rafla, “Optimization of advanced encryption standard (AES) using Vivado high-level synthesis (HLS),” Proc. of the 34th International Conference on Computers and Their Applications (CATA 2019), vol. 58, pp. 36-44, 2019.

[25] E. Homsirikamol and K. G. George, “Toward a new HLS-based methodology for FPGA benchmarking of candidates in cryptographic competitions: the CAESAR contest case study,” in Proceedings of the 2017 International Conference on Field Programmable Technology (ICFPT), Melbourne, VIC, Australia, pp. 120-127, 2017.

[26] National Institute of Standards and Technology (NIST), “Post-quantum cryptography round 3 submissions,” last modified Jun. 14, 2021. [Online]. Available: https://csrc.nist.gov/projects/post-quantum-cryptography/round-3-submissions [Accessed: Oct. 18, 2021].

[27] C.-H. Lee, J. Kim, H.-S. Park, and J.-W. Han, “HLS-based HW/SW co-design and hybrid HLS-RTL design for post-quantum cryptosystem,” Journal of Semiconductor Technology and Science, vol. 24, no. 3, pp. 191-198, 2024.

Kyungkyun Kang

Kyungkyun Kang received the B.S. degree in Information and Communication Engineering from Inha University, Incheon, South Korea, in 2024. He is currently pursuing an M.S. degree in engineering in the Department of Electrical and Computer Engineering at Inha University, Incheon, South Korea. His research interests include system-on-chip design, digital system design, digital integrated circuits, hardware acceleration, and post-quantum cryptography.

Seulbee Yang

Seulbee Yang received the B.S. degree in Information and Communication Engineering from Inha University, Incheon, South Korea, in 2025. She is currently pursuing an M.S. degree in engineering in the Department of Electrical and Computer Engineering at Inha University, Incheon, South Korea. Her research interests include digital system design, post-quantum cryptography, and FPGA-based demonstration of designed IP modules.

Giang Truong Le

Giang Truong Le received the B.E. degree in Electronics and Telecommunication Engineering from Ho Chi Minh City University of Technology, Ho Chi Minh, Vietnam, in 2011, and the M.S. degree in engineering from the Department of Electronic Engineering, Pukyong National University, Busan, Korea, in 2016. His research interests include RFID hardware system design, Internet of Things (IoT) applications, and digital integrated circuits.

Hanho Lee

Hanho Lee (S’97-M’98-SM’13) received the M.Sc. and Ph.D. degrees, both in Electrical and Computer Engineering, from the University of Minnesota, Minneapolis, USA, in 1996 and 2000, respectively. In 1999, he was a Member of Technical Staff-1 at Lucent Technologies, Bell Labs, Holmdel, New Jersey, USA. From April 2000 to August 2002, he was a Member of Technical Staff (MTS) at Lucent Technologies (Bell Labs Innovations), Allentown, USA, where he was involved in the design of DSP multi-processor architecture. From August 2002 to August 2004, he was an Assistant Professor at the Department of Electrical and Computer Engineering, University of Connecticut, USA. He has been a faculty member at Inha University, Incheon, South Korea, since September 2004, initially in the Department of Information and Communication Engineering and, since 2025, in the Department of Electrical and Electronic Engineering, where he is currently a Full Professor. He leads the Digital Integrated Systems Lab and is the Director of the Artificial Intelligence System on Chip (AI-SoC) Research Center, Inha University. He was a Visiting Researcher with the Electronics and Telecommunications Research Institute (ETRI), South Korea, in 2005. He was a Visiting Scholar with Bell Labs, Alcatel-Lucent, Murray Hill, USA, from 2010 to 2011, and a Visiting Professor with The University of Texas at Dallas, USA, from 2017 to 2018. His research interests include algorithm and VLSI architecture design for post-quantum cryptography, homomorphic encryption, artificial intelligence, forward error correction coding, and digital signal processing. He served as a General Chair for ISICAS and Technical Program Chair for ISCAS and APCCAS. He was a Chair of the IEEE Circuits and Systems for Communications Technical Committee (CASCOM). He was a member of the Board of Governors (BoG) of the IEEE Circuits and Systems Society (CASS) from 2020 to 2023. He is the Vice President of Technical Activities of the IEEE CASS.
