Tan Tuy Nguyen1
Nguyen Tram Thi Bao2
Lee Hanho2*
-
(Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam)
-
(Dept. of Information and Communication Eng., Inha University, Incheon, 22212, Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index Terms
Cryptography, multiple-path delay feedback, NTT polynomial multiplier, ring-LWE
I. Introduction
Conventional cryptosystems such as RSA and elliptic curve cryptography (ECC) can be
solved in polynomial time with the algorithms proposed by Shor using a quantum computer
(1). To replace these classic cryptosystems, post-quantum cryptosystems offering higher
levels of security have been researched recently. Among the existing post-quantum
cryptosystems, ring-LWE cryptography is a promising candidate because its security
is based on the worst-case hardness problem (2). The basic concept of ring-LWE cryptography can be found in (2,3). The block diagram of a ring-LWE cryptosystem is shown in Fig. 1. The input message m is encrypted into ciphertext $\left(c_{1}, \quad c_{2}\right)$
using arithmetic computation on public key (a, p) and error polynomials e1, e2, and
e3. Original message m can be recovered from ciphertext $\left(c_{1}, \quad c_{2}\right)$
and private key $r_2$ through the decryption operation. The arithmetic operations
in ring-LWE cryptography include polynomial addition, modulus, and polynomial multiplication,
in which polynomial multiplication is the basic and most computationally intensive
operation (4).
Several architectures to perform polynomial multiplication have been introduced in
literatures. Among the existing approaches used to conduct polynomial multiplication,
the number theoretic transform (NTT) studied in (4-6) is an efficient method. Let $a(x)$ be a polynomial over the ring $\left.R_{q}=Z_{q}
/ \underline{(} x^{n}+1\right)$. Then, the NTT transform of each coefficient of $a(x)$
and its inverse NTT (INTT) with a primitive n-th root of unity ${\omega}$ can be expressed
as follows:
Fig. 2. Block diagram of the proposed NTT polynomial multiplier.
Fig. 3. Timing diagram of the proposed NTT polynomial multiplier.
The multiplication of two polynomials $a(x)$ and $b(x)$ is computed as (3), where ∘ denotes a point-wise multiplication, and two polynomials $a^{\prime}(x)$
and $b^{\prime}(x)$ are constructed as shown in (4).
In (4), authors implemented an n-point NTT-core architecture using a systolic array to speed
up the multiplication process compared with previous studies. However, this architecture
is deployed using a single-path delay feedback (SDF), which offers a low throughput
and efficiency. In addition, the SDF and multiple-path delay commutator (MDC) architectures
deployed in recent researches (5,6) are not sufficient to process the polynomial multiplication.
In this work, a multiple-path delay feedback (MDF) based NTT polynomial multiplier
offering high efficiency with low latency is presented. By employing the NTT process
of two polynomials $a(x)$ and $b(x)$ concurrently, the multiplication latency is reduced
considerably. Additionally, the NTT core is designed using MDF architecture to achieve
a better performance compared with previous polynomial multipliers.
II. Proposed NTT Polynomial Multiplier
The block diagram of the proposed NTT polynomial multiplication is shown in Fig. 2. To conduct polynomial multiplication, the coefficients of input polynomials $a(x)$
and $b(x)$ are initially multiplied by a factor ${\varphi}$, where ${\varphi}$ ${\equiv}$
${\omega}$ mod q, to get the polynomials $a^{\prime}(x)$ and $b^{\prime}(x)$. These
multiplication results are then transformed using NTT cores, followed by a point-wise
multiplication and an INTT computation.
The polynomial multiplication result $c(x)$ is completely computed by multiplying
the output of INTT with the factor ${\varphi}$-1. The entire operations of the multiplier
are directed by a controller. To speed up the multiplication of the two polynomials
$a(x)$ and $b(x)$, negative wrapped convolutions as well as NTT operations are controlled
so that the multiplication latency is minimized. Specifically, the two NTT operations
NTT($a^{\prime}$) and NTT($b^{\prime}$) are processed simultaneously, resulting in
the reduction of one NTT operation compared with the architecture in (5). The negative wrapped convolutions of the two input polynomials $a(x)$ and $b(x)$
are also conducted in parallel to decrease the processing time. Fig. 3 shows the timing diagram of the proposed NTT polynomial multiplier. As can be seen
from Fig. 3, two computations a ${\times}$ ${\varphi}$ and b ${\times}$ ${\varphi}$ are scheduled
to work simultaneously. Two NTT cores transforms NTT($a^{\prime}$) and NTT($b^{\prime}$)
also work in parallel. Compared with the conventional architecture presented in (5), the proposed architecture offers a reduction in the processing time of one NTT computation
and one negative wrapped convolution.
Fig. 4 presents the proposed NTT core using an 8-datapath MDF architecture to process the
input data in parallel. Specifically, the coefficients of the input polynomial $a(x)$
are divided into 8 datapaths, which are processed by different processing elements
(PE1s, PE2s). Each PE consists of modular reduction units (MR1, MR2), constant multiplier,
butterfly unit (BU), and one first-in-first-out (FIFO) register.
III. Results and Comparisons
Table 1. Comparison of the proposed NTT polynomial multiplier architecture with other
works for parameter n=512, q=12,289
|
Proposed
|
[4]
|
[5]
|
Devices
|
Virtex-7
|
Spartan-6
|
Stratix
|
LUTs
|
63,522
|
6,119$^{(1)}$
|
3,750$^{(2)}$
|
2,064
|
Slices
|
13,588
|
2,062$^{(1)}$
|
1,348$^{(2)}$
|
1,819
|
Cycles
|
226
|
3,618$^{(1)}$
|
3,622$^{(2)}$
|
2,601
|
Frequency (MHz)
|
353
|
224
|
254
|
239
|
Time ($\mu$s)
|
0.64
|
16.11$^{(1)}$
|
14.26$^{(2)}$
|
10.85
|
Area-latency product
(slice × $\mu$s)
|
8,696
|
33,218$^{(1)}$
|
19,222$^{(2)}$
|
19,736
|
$^{(1), (2)}$ Design 1 and design 2 in [4].
Fig. 4. (a) Proposed 8-datapath MDF-based NTT core architecture, (b) PE1, (c) PE2
(a)
(b)
(c)
The proposed NTT polynomial multiplier architecture is synthesized and implemented
on a Xilinx Virtex-7 FPGA board. The simulation results and performance comparison
are provided in Table 1. As can be seen from Table 1, the number of clock cycles necessary to complete a polynomial multiplication of
the proposed architecture is only 226 clock cycles, which is about 6.25% and 8.69%
of that of design 1 in (4) and the architecture in (5), respectively. In addition, the proposed architecture can operate at a higher clock
frequency than others, resulting in a reduction in the multiplication latency. To
compare the balance between area and latency of the proposed architecture with that
of others, a parameter named area-latency product described in (4) is used. As presented in Table 1, the proposed polynomial multiplier architecture offers a much lower area-latency
product compared with that of other architectures.
IV. Conclusions
An MDF-based NTT polynomial multiplier for ring-LWE cryptography is presented in this
paper. By processing the NTT operation of the input polynomials concurrently, the
multiplication process is speeded up. The implementation results prove that the proposed
NTT polynomial multiplier achieves better performance than previous approaches. Therefore,
it can be applied in ring-LWE and lattice-based cryptosystems to boost polynomial
multiplication and improve encryption and decryption processes.
ACKNOWLEDGMENTS
This work was supported by the Basic Science Research Program (2019R1F1A1061926) through
the NRF funded by the MSIT and by Inha University, Incheon, Korea.
REFERENCES
Shor P. W., Feb 1997, Polynomial-time algorithms for prime factorization and discrete
logarithms on a quantum computer, SILAM J. Computing, Vol. 26, No. 5, pp. 1484-1509
Regev O., May 2005, On lattices, learning with errors, random linear codes, and cryptography,
Annual ACM Symp. Theory of Computing, pp. 84-93
Tan T. N., Lee H., Feb 2019, High-secure fingerprint authentication system using ring-LWE
crypto-graphy, IEEE Access, Vol. 7, No. 1, pp. 23379-23387
Chen D. D., Jan 2015, High-speed polynomial multiplication architecture for ring-LWE
and SHE cryptosystems, IEEE Trans. Circuits Syst. I, Reg. Papers,, Vol. 62, No. 1,
pp. 157-166
Rentería-Mejía C. P., Velasco-Medina J., 2014, Hardware design of an NTT-based polynomial
multiplier, 2014 IX Southern Conf. Programmable Logic, Buenos Aires, pp. 1-5
Du C., Bai G., 2016, Efficient polynomial multiplier architecture for ring-LWE based
public key cryptosystem, 2016 IEEE Int. Symp. Circuits Syst. (ISCAS2016), Montreal,
QC, pp. 1162-1165
Author
Tuy Nguyen Tan received the M.S. and Ph.D. degrees in Information and Communication
Engineering from Inha University, South Korea, in 2016 and 2019, respectively.
From 2009 to 2013, he had been a Telecommunication engineer at GTel Mobile JSC., Danang
Branch, Vietnam.
Since 2019, he has been a researcher at Faculty of Information Technology, Ton Duc
Thang University, Vietnam.
His research interest includes algorithm and VLSI architecture design for cryptosystems
and error correction codes.
Tram Thi Bao Nguyen received the M.S. and Ph.D. degrees in Information and Communication
Engineering from Inha University, South Korea, in 2016 and 2020, respectively.
Since 2013, she has been with Danang University, Vietnam where she is a lecturer.
Her research interests include VLSI architecture design for digital signal processing
and communication systems.
Hanho Lee received the M.Sc. and Ph.D. degrees in electrical and computer engineering
from the University of Minnesota, Minnea-polis, in 1996 and 2000, respectively.
From 2000 to 2002, he was a Member of the Technical Staff with Lucent Technologies
(Bell Labs Innovations), Allentown.
From 2002 to 2004, he was an Assistant Professor with the Department of Electrical
and Computer Engineering, University of Connecticut, USA.
Since 2004, he has been with the Department of Information and Communication Engineering,
Inha University, where he is currently a Professor.
From 2010 to 2011, he was a Visiting Scholar with Bell Labs, Alcatel-Lucent, Murray
Hill, NJ, USA.
His research interests include algorithm and architecture design for cryptographic,
forward error correction coding, and digital signal processing.