Mobile QR Code QR CODE

  1. (Dept. of Electrical and Computer Engineering, Inha University, Incheon, 22212, Korea)



Post-quantum cryptography, high-level synthesis, HW/SW co-design, hybrid HLS-RTL design, FPGA

I. INTRODUCTION

Artificial intelligence (AI) and cloud services are prominent fields that will gain increased importance in the future. These technologies are expected to be applied in a wide range of fields as advances in network technology enable rapid data communication.

According to the trends in these technologies, experts believe that future services will be in the form of combined cloud services and AI. The traditional approach involves on-demand availability, in which the required data is immediately accessed from cloud servers. However, by leveraging the data provided by AI, it is possible to achieve more diverse and personalized services, thereby enabling more efficient operations between operators and users.

Computer technology has also exhibited rapid progress. Quantum computers, which utilize principles of quantum mechanics such as entanglement and superposition, are being actively researched by global IT companies for the purpose of commercialization. When quantum computers become commercially available, these are expected to provide overwhelmingly superior computational power compared with existing computers, enabling the quick resolution of time-consuming or even unsolvable mathematical problems.

Therefore, it is necessary to establish new secure cryptographic systems for quantum computers in preparation for their commercialization. Existing cryptographic systems (e.g., RSA and ECC) used in AI and cloud technologies are vulnerable to the Shor algorithm [1] in quantum computers, which increases the need for post-quantum cryptography (PQC). To address this issue, the National Institute of Standards and Technology (NIST) organized an algorithm competition [2] for PQC in 2016. Through three rounds of evaluation, the Crystals–Kyber cryptographic algorithm [3], which was included among lattice-based candidates, was selected as the final candidate in the public key encryption category scheduled for July 2022 [4].

Crystals–Kyber is a cryptographic system based on mathematical problems (e.g., shortest vector problem and closest vector problem) [5]. It offers high security and is resistant to quantum computers. The use of the ring learning with errors (Ring-LWE) technique has gained significant attention owing to its relatively short key and ciphertext sizes. In this paper, we propose methodologies for the high-level synthesis (HLS)-based hardware/ software (HW/SW) co-design and hybrid HLS and register transfer level (RTL) design (HLS-RTL) of secure PQC systems. We conducted a comparative analysis of the performance and results to evaluate the effectiveness of these approaches.

II. BACKGROUND

1. Ring Learning with Errors

The Ring-LWE technique was proposed by Oded Regev in 2010 [6] to address the complexity of multiplication operations (e.g., convolution) and the limitations of long encryption keys in the LWE technique. It performs all the operations and encryptions on a polynomial ring in a finite field, R$_{\mathrm{q}}$ = Z$_{\mathrm{q}}$[x] / f(x). Typically, f(x) is defined as f(x) = (X$^{\mathrm{N}}$ + 1), where N is a power of 2 and q is a prime number that satisfies q (1 mod 2N).

(1)

a(x) = a$_{0}$ + a$_{1}$x + a$_{2}$x$^{2}$ + {\ldots} + a$_{\mathrm{N-1}}$x$^{\mathrm{N-1}}$

a$_{\mathrm{n}}$${\in}$ Z / q, q ${\equiv}$ (1 mod 2N)

The Ring-LWE technique adopts a public key encryption scheme with two asymmetric keys: a public key for encryption and a secret key for decryption. A public key is generated during the key-generation process (a, b), where ``a'' is a random number, ``s'' represents the secret key, and ``e'' is the value generated by a discrete Gaussian distribution representing small noise (error). Using the public key (a, b), the process of encrypting a message (m) to obtain (c$_{1}$, c$_{2}$) can be verified through Eq. (2).

(2)

b = a ${\times}$ s + e

c$_{1}$ = a ${\times}$ e$_{1}$ + e$_{2}$

c$_{2}$ = b ${\times}$ e$_{1}$ + e$_{3}$ + m

The decryption process uses the secret key (s) to recover the ciphertext (c$_{1}$, c$_{2}$) from the original message (m). In the key-generation step, the secret key (s) corresponding to the public key (a, b) values used during encryption must also be used to eliminate the small noise (error) inserted during the encryption process and obtain the original message (m) value.

(3)
m = s ${\times}$ c$_{1}$+ c$_{2}$

In the Ring-LWE technique, polynomial multiplication is the key operation, which is typically performed by convolution. However, by leveraging the number theoretic transform (NTT), the convolution operation (*) can be replaced with a point-wise multiplication operation (•) to reduce the overall time complexity from O(N$^{2}$) to O(NlogN) based on Big-O notation.

2. High-level Synthesis

HLS consists of several stages, as shown in Fig. 1. It begins with the C simulation process, in which the testbench code is written to validate the C logic against the C/C++ source code. The next stage is C synthesis, where directives such as UNROLL and PIPELINE provided by the HLS tool are applied for optimization, and the interface (AXI-Stream) is incorporated to generate the RTL source code. The synthesized RTL code is then verified through C/RTL co-simulation using Vivado simulation (waveform) to ensure the equivalence of the C and RTL logics. Replacing the RTL coding and verification steps in the conventional hardware design process with HLS offers the advantage of higher productivity.

Fig. 1. High-level synthesis design flow.
../../Resources/ieie/JSTS.2024.24.3.191/fig1.png

3. Xilinx PYNQ Platform

The Xilinx PYNQ platform is an open-source project provided by Xilinx to facilitate the easy design of embedded systems using the All Programmable Systems on Chip (APSoC) field-programmable gate array (FPGA).

By leveraging the Python language and libraries, developers can write the host code to be executed in the processing system (PS) domain, allowing collaborative design between the ARM processor in the ZYNQ FPGA's PS domain and the programmable logic (PL) domain. This enables the implementation of various embedded systems in a single FPGA.

Fig. 2 shows the PYNQ framework that runs on the CPU in the PS domain when the PYNQ boot image (provided by Xilinx) is loaded onto an SD card, and the ZYNQ 104 FPGA is booted in the SD-card boot mode.

Fig. 2. PYNQ framework.
../../Resources/ieie/JSTS.2024.24.3.191/fig2.png

III. ARCHITECTURE

1. HLS-based HW/SW Co-design

The Xilinx Vitis-HLS tool was used with the ZYNQ 104 FPGA set as the target board for the design process. The tool provides various pragma directives, such as loops, block random-access memory (BRAM), and resource allocation, which are required for optimization. These directives were utilized to design the Crystals–Kyber cryptographic algorithm in HW. The Advanced eXtensible Interface (AXI) Stream was applied to the input and output ports in the top function.

1.1 Optimization Using Directives

The directives primarily utilized in this design are PIPELINE and UNROLL, which optimize the declared loops and assign specific resources to certain operations when synthesizing from C to RTL. These directives were applied within the main functions of Crystals–Kyber.

The "#pragma HLS PIPELINE" directive optimizes the declared functions or loops by introducing concurrency in ongoing tasks and operations, reducing the initial interval (II) [7]. As shown in Fig. 3, based on the specified II value, a new input can be received at each clock cycle to speed up the output generation. Specifically, the core operations of the NTT function are described, and the PIPELINE directive is applied to the loops and internal functions of the hash function for optimization.

Fig. 3. PIPELINE directive.
../../Resources/ieie/JSTS.2024.24.3.191/fig3.png

By default, the "#pragma HLS UNROLL" directive maintains the rolling state of loops in the C/C++ functions, in which the logic for one iteration is synthesized and sequentially executed for each subsequent iteration. However, as shown in Fig. 4, applying the UNROLL directive allows the loop to be unrolled and transformed into multiple independent operations [7]. Consequently, multiple copies of the loop can be generated, enabling parallel computation for all iterations. In addition, partial UNROLL can be applied by specifying a factor value. In this study, the UNROLL directive was applied to the loops in the top function, specifically in sections where the output values during the encryption and decryption stages are transmitted via the interface (AXI-Stream), thus achieving parallelization.

Fig. 4. UNROLL directive.
../../Resources/ieie/JSTS.2024.24.3.191/fig4.png

The "#pragma HLS BIND_OP" directive specifies the resources to be mapped when synthesizing the variables and operations declared within the function. By assigning resources to the desired variables and operations (e.g., mul, add, sub, div), the synthesis process incorporates the specified mappings. If no explicit directive is provided, then the tool automatically determines the resource allocation. The options include digital signal processing (DSP) and Fabric (non-DSP), in which DSP utilization results in a synthesis that utilize DSP resources, while the Fabric designation excludes the use of DSP resources. Notably, the BIND_OP directive was applied to the operations (e.g., multiplication and reduction) within a finite field (Montgomery reduction and Barrett reduction), and polynomial multiplication in the NTT function, enabling synthesis with DSP resource utilization.

1.2 Interface (AXI-Stream)

The AXI (Advanced eXtensible Interface) applied to the input and output ports in the top function of Crystals–Kyber is one of the protocols developed by ARM and designed for arbitrary unidirectional data transfer [7]. Fig. 5 shows the signals and communication channels of the AXI-Stream, demonstrating the utilization of a ready/valid-based handshake communication.

Fig. 5. AXI-Stream signal and communication channel.
../../Resources/ieie/JSTS.2024.24.3.191/fig5.png

To effectively apply the AXI-Stream in HLS, specific headers must be included, such as ap_axi_sdata.h, hls_stream.h, and hls_vector.h. These headers define the dedicated data structures for streams, which to convert variables of the defined stream structure data-type to regular data-types (Interface ${\rightarrow}$ input variables) and vice versa (output variables ${\rightarrow}$ Interface).

Therefore, by utilizing the stream headers and dedicated structures, additional code was added to the top function to establish the connection between the input/output signals of Crystals–Kyber and the Interface, facilitating the synthesis process.

1.3 HW/SW Co-design

The block design in the PL area was configured using the Xilinx Vivado intellectual property (IP) integrator. It incorporates the PS IP of the ZYNQ 104 FPGA, optimized Crystals–Kyber IP designed with HLS, and direct memory access (DMA) IP. The PS and PL areas communicate through two channels: stream-to-memory-mapped (S2MM) and memory-mapped to stream (MM2S). Fig. 6 shows the block diagram of the co-designed HW/SW (PS-PL).

Fig. 6. HW/SW co-design block diagram and PS part design.

../../Resources/ieie/JSTS.2024.24.3.191/fig6-1.png

../../Resources/ieie/JSTS.2024.24.3.191/fig6-2.png

The design of the PS area utilized the PYNQ platform described in Section 3. Accordingly, the ARM core in the PS area operates within the PYNQ framework. The Python host code performs the following steps. It programs the bitstream file of the final HW onto the target board (ZYNQ 104) using the Overlay Library; imports information on the DMA IP using the DMA Library; declares the S2MM and MM2S channels; instantiates the DMA; and assigns the input message to be encrypted to the MM2S channel of the DMA. As a result, the input message is transferred from the PS area to the PL area, where the Crystals–Kyber IP performs encryption and decryption. The resulting output is assigned to the S2MM channel of the DMA and communicated back from the PL area to the PS area. The message used for validation in the PYNQ environment is "Inha_DIS_Lab.''

2. Hybrid HLS-RTL Design

The Verilog HDL code of Crystals–Kyber, generated through the optimization, synthesis, and verification processes in HLS, was imported into Xilinx Vivado. A testbench code for the AXI-Stream was developed, and the simulation waveform was analyzed to explore the modules that consumed a significant number of cycles and exhibited poor data throughput. Specifically, for the complex NTT module responsible for polynomial multiplication in Crystals–Kyber, a hybrid HLS-RTL design was devised to replace it with a hand-coded NTT (RTL) module.

The hybrid design approach was adopted to deal with the complexity of the NTT module and improve its performance. Replacing the original HLS-generated module with a hand-coded RTL implementation is expected to improve the data throughput and reduce the number of cycles required for processing.

2.1 Number Theoretic Transform

The internal structure of the hand-coded NTT module applied in this hybrid design involved storing the twiddle factor values as read-only memory (ROM) to multiply them with the coefficients of the input polynomial. As the size of incoming data increases, Montgomery reduction becomes more advantageous. Montgomery is used after multiplication operations, while Barrett is used after addition or subtraction operations. Modulo operations were performed using Montgomery to reduce the values within the parameter q (3329) of Crystals–Kyber. In addition, the butterfly unit was employed to perform Cooley–Turkey butterfly operations, followed by Barrett. These operations were combined into a single processing element. The structure follows a total of N (2$^{8}$) stages, determined by the parameter N of Crystals-Kyber. To parallelize the process, we designed a 2-parallel MDF structure. A block diagram of this process is shown in Fig. 7.

(4)
Throughput = (Frequency ${\times}$ bits) / Cycles

Eq. (4) calculates the data throughput. Comparing the data throughput of the NTT module generated by the HLS and the hand-coded NTT (RTL) module, the former achieved a throughput of approximately 0.225 Mbps, while the latter achieved a throughput of approximately 3.351 Mbps, representing an approximately 15 times increase. Because the NTT module was used in all stages of key generation, encryption, and decryption within Crystals–Kyber, its optimization and design using HLS improved the overall latency by 40% compared with that of the original Crystals–Kyber design.

Fig. 7. Hand-coded NTT module block diagram.
../../Resources/ieie/JSTS.2024.24.3.191/fig7.png

2.2 Hybrid Design

In the HW generated by the HLS, communication between modules was based on the following signals: ap_clk (clock), ap_rst (reset), ap_start (indicating the execution state and waiting for the next input), ap_idle (indicating the idle state when no execution is in progress), ap_done (indicating the completion of the operation), and ap_ready (indicating the ability to receive the next input). These four signals were utilized to communicate through the ap_ctrl_hs (handshake) interface. In the top code of the NTT module, the input, output, and control signals of the hand-coded NTT (RTL) module were divided into four states: Idle, Read, Run, and Write. Within each state, the necessary operations and logic were declared and connected to design the finite-state machine.

Fig. 8 shows the top diagram of the hybrid design, where the topmost layer represents the key-generation process responsible for generating public and secret keys. The bottom layer consists of the encryption module which takes the message to be encrypted and public key as inputs and the decryption module, which takes the secret key as an input and performs the decryption process. It is worth noting that within the entire set of modules generated by HLS for Crystals–Kyber, the HLS-based NTT module was replaced with a hand-coded Verilog HDL-based NTT module in the hybrid structure.

Fig. 8. Hybrid HLS-RTL design block diagram.
../../Resources/ieie/JSTS.2024.24.3.191/fig8.png

IV. IMPLEMENTATION RESULTS

In this study, we compared and analyzed the performance of HLS-based optimization and the design results of a proposed post-quantum cryptographic system. The Xilinx tools Vitis-HLS (2021.1) and Vivado (2020.1) were utilized, and the target board was set as the ZYNQ 104 FPGA for implementation.

As shown in Table 1, Hand-coded NTT requires more LUTs compared to HLS but offers higher data throughput and lower latency. The HLS-based NTT module does not apply pipeline, parallel, etc. and no optimization using HLS directives, therefore the area may look smaller, but the latency is 22 times worse. If it makes an HLS-based NTT using similar HW resources as an RTL-based NTT, it will perform well enough, but an RTL-based NTT will perform better.

As shown in Table 2, the hybrid HLS-RTL design utilized in the ZYNQ 104 FPGA. By utilizing the hand-coded NTT module, a shorter latency was achieved compared with the results obtained from the HLS-based NTT module, resulting in improved performance in terms of the area ${\times}$ time (ATP) metric. In particular, the productivity benefits of HLS significantly reduced implementation time compared to a Verilog HDL-based implementation that was 100% hand-coded.

These results suggest that the hybrid design has achieved performance gains, particularly in the Key Generation, Encryption, and Decryption phases, a speedup of approximately 1.8x and 2.1x, respectively.

Table 1. Performance comparison between NTT generated with Hand-coded and HLS NTT

Hand-coded

HLS

Throughput (Mbps)

3.351

0.225

LUTs

3,732

232

Latency (μs)

5.1

112.7

Area × Time

(LUTs × Latency)

0.019

0.026

FPGA

ZYNQ104

ZYNQ104

Table 2. Implementation result and performance comparisons (Key + Enc + Dec)

Hybrid design

HLS design

LUTs

39,942 (17.3%)

31,807 (13.8%)

FFs

26,877 (5.8%)

19,820 (4.3%)

BRAMs

30.5 (9.8%)

32 (10.3%)

DSPs

282 (16.3%)

290 (16.8%)

Clock Freq (MHz)

201.2

207.7

Latency (μs) (Key/Enc/Dec)

201.95/ 224.91/ 74.44

363.81/ 410.39/ 157.5

Area × Time

(LUTs × Latency)

20

29.6

FPGA

ZYNQ104

ZYNQ104

V. CONCLUSIONS

In this study, the proposed HLS-based HW/SW co-design allowed us to implement and validate a system that controls and communicates with the Crystals–Kyber HW IP, which was designed using HLS, utilizing the ARM processor (Cortex-A53) and memory in the PS domain of the ZYNQ 104 FPGA on a single board. In the hybrid HLS-RTL design, the specific modules generated by the HLS were replaced with hand-coded RTL modules that offered better data throughput, resulting in the improved overall latency for Crystals–Kyber.

ACKNOWLEDGMENTS

This work was supported by INHA UNIVERSITY Research Grant.

References

1 
P. W. Shor, “Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer,” SIAM J. Comput., vol. 26, pp. 1484-1509, Oct. 1997.URL
2 
National Institute of Standards and Technology, “PQC (Post-Quantum Cryptography) 3rd Round Submissions,” last modified June 14, 2021, accessed https://csrc.nist.gov/projects/post-quantum- cryptography/round-3-, Oct. 18. 2021.URL
3 
R. Avanzi, et. al. “CRYSTALS-Kyber Algorithm Specifications And Supporting Documentation (version 3.0),” NIST PQC Round 3 submission, Oct. 2020.URL
4 
National Institute of Standards and Technology, “Status Report on the Third Round of the NIST Post-Quantum Cryptography Standardization Process,” accessed https://doi.org/10.6028/NIST.IR.8413-upd1, July 2022.DOI
5 
O. Regev, “On lattices, learning with errors, random linear codes, and cryptography,” in Proceedings of the ACM Symposium on Theory of Computing, Baltimore, MD, USA, pp. 84-93, May. 2005.DOI
6 
V. Lyubashevsky, “On ideal lattices and learning with errors over rings,” Annual international conference on the theory and applications of cryptographic techniques, pp. 1-23, 2010.DOI
7 
Vitis High-Level Synthesis, “User Guide,” UG 1399, v2021.2, October 22, 2021.URL
Chang-Hyeon Lee
../../Resources/ieie/JSTS.2024.24.3.191/au1.png

Chang-Hyeon Lee received a B.S. degree in Electronic Engineering from Tech University of Korea in 2020, and an M.S. degree in Electrical and Computer Engineering from Inha University, South Korea, in 2023. He is currently a System-on-Chip designer at Next Chip, South Korea. His research interests include algorithms and architectures for post-quantum cryptography.

Jae-Hyeok Lee
../../Resources/ieie/JSTS.2024.24.3.191/au2.png

Jae-Hyeok Lee received a B.S. degree in Information and Communi-cation Engineering from Inha University, South Korea, in 2021, and an M.S. degree in Electrical and Computer Engineering from Inha University, South Korea, in 2023. He is currently a System-on-chip digital IC designer at Realtek Korea. His research interests include algorithm and architecture for lattice-based homomorphic encryption.

Haesung Jung
../../Resources/ieie/JSTS.2024.24.3.191/au3.png

Haesung Jung received a B.S. degree in Information and Communi-cation Engineering from Inha University, South Korea, in 2023. Since March 2023, he has been pursuing the M.S. degree in Electrical and Computer Engineering from Inha University, South Korea. He currently specializes in post-quantum cryptography hardware architecture design, and his research interests include VLSI architecture design for cryptography.

Hanyoung Lee
../../Resources/ieie/JSTS.2024.24.3.191/au4.png

Hanyoung Lee received a B.S. degree in Information and Communi-cation Engineering from Inha University, South Korea, in 2023. Since March 2023, he has been pursuing the M.S. degree in Electrical and Computer Engineering from Inha University, South Korea. He currently specializes in homomorphic encryption hardware architecture design, and his research interests include VLSI architecture design for cryptography.

Hanho Lee
../../Resources/ieie/JSTS.2024.24.3.191/au5.png

Hanho Lee received M.Sc. and Ph.D. degrees in Electrical and Computer Engineering from the University of Minnesota, MN, USA, in 1996 and 2000, respectively. He was a Technical Staff Member at Lucent Technologies (Bell Laboratories Innovations), Allentown, PA, USA, from April 2000 to August 2002, and an Assistant Professor of the Department of Electrical and Computer Engineering, University of Connecticut, USA, from August 2002 to August 2004. He was a Visiting Scholar at Bell Laboratories, Alcatel Lucent, Murray Hill, NJ, USA, from August 2010 to August 2011. He has been with the Department of Information and Communication Engineering, Inha University, South Korea, since August 2004, where he is currently a professor. His research interests include algorithms and architectures design for post-quantum cryptography, homomorphic encryption, artificial intelligence, forward error correction, and digital signal processing.