I. INTRODUCTION
Connected and automated vehicles (CAVs) are rapidly emerging as a transformative technology
in the automotive domain, promising significant improvements in traffic safety, energy
efficiency, and user convenience [1-7]. By leveraging advanced sensing, communication, and control systems, CAVs enable
real-time cooperation among vehicles, roadside infrastructure, and cloud-based services.
A critical enabler of this ecosystem is vehicle-to-everything (V2X) communication,
supported by dedicated short-range communications (DSRC) and 5G New Radio V2X (NR-V2X),
which provide high data rates and low latency for exchanging information among dynamic
entities on the road [8-12]. However, the proliferation of wireless connectivity also exposes CAVs to severe
cybersecurity risks. Each vehicle continuously generates and transmits a large volume
of sensitive information, including positional data, control commands, and sensor
streams, all of which are attractive targets for malicious adversaries [13]. The expansion of the attack surface significantly increases the likelihood of unauthorized
access, message injection, replay attacks, and other forms of intrusion [14-16]. Consequently, ensuring secure, reliable, and low-latency communication is paramount
for the safe operation of CAVs. Cryptography provides the fundamental mechanism for
securing vehicular communications. Among the numerous algorithms available, the Advanced
Encryption Standard (AES) and the ChaCha20 stream cipher are the two most widely adopted
solutions in practice [17, 18]. AES, as the de facto standard for symmetric-key encryption, is widely deployed due
to its proven security strength and extensive hardware/software support. ChaCha20,
a modern ARX (Addition-Rotation-XOR) based stream cipher, has been standardized in
IETF protocols and adopted in security frameworks such as TLS 1.3 and WireGuard due
to its efficiency and strong resistance to timing-based side-channel attacks. These
two algorithms constitute the core cryptographic primitives for safeguarding both
internal in-vehicle data exchange and external V2X communication.
Although both AES and ChaCha20 can be implemented purely in software, their software-based
execution on CPUs or GPUs presents limitations in the context of CAVs. Real-time vehicular
communication is characterized by short yet frequent packets, often with strict latency
constraints. Software implementations incur non-negligible per-packet overhead due
to padding, mode initialization, and memory access operations in the case of AES,
or due to repeated ARX operations in ChaCha20. Moreover, when these operations compete
with other computation-heavy automotive workloads, the cumulative cost can impose
substantial burdens on the host processor, leading to increased latency and reduced
energy efficiency. This overhead is particularly problematic in safety-critical scenarios
where even microseconds of delay can impact the responsiveness of autonomous driving
functions. To address these challenges, dedicated hardware acceleration of symmetric-key
ciphers has emerged as a promising direction. Hardware implementations can exploit
parallelism and pipelining at the architectural level, thereby achieving significantly
higher throughput, lower latency, and improved energy efficiency compared with software-only
approaches. Prior studies have explored hardware-oriented realizations of AES, ChaCha20,
and other cryptographic algorithms, demonstrating their ability to sustain high performance
under resource constraints [19-37]. Building upon this trend, the focus of this work is to design an optimized ChaCha20
hardware accelerator tailored for secure and real-time CAV communications.
Several prior studies have explored ChaCha20 hardware implementations with varying
degrees of unrolling and optimization. Early works investigated datapaths instantiating
$1\times$, $4\times$, and $8\times$ quarter-round (QR) units, where the labels denote
spatial parallelism. Henzen et al. reported that the $8\times$QR design achieved 6.78 Gbps, while the $4\times$QR variant
offered a more favorable trade-off between throughput and area efficiency [20]. Nevertheless, these single-state designs preserve the sequential execution of column
and diagonal rounds, leaving the round hardware underutilized during alternating phases.
Other approaches emphasized different design goals. Mozaffari-Kermani et al. prioritized reliability in a 65-nm CMOS implementation, where the baseline ChaCha20
core reached 9.6 Gbps [21]. Serrano et al. presented an area-efficient ChaCha20 core in 180-nm technology, evaluating $1\times$,
$2\times$, and $4\times$QR configurations, with the $4\times$QR design reaching 3.65
Gbps and being integrated into a RISC-V SoC [22]. However, the constraints of single-state scheduling and the use of older process
nodes limited performance scalability. At the system level, Le et al. introduced a reconfigurable crypto accelerator that achieved up to 75.4 Gbps throughput
through resource sharing and multi-core parallelism, albeit at the expense of area
efficiency [23]. At the microarchitectural level, Rashidi et al. improved the ARX datapath using a sparse parallel prefix adder and resource sharing
to reduce area and delay [24]. On reconfigurable platforms, Dani demonstrated an Artix-7 FPGA design achieving
5.6 Gbps using pipelining, though duplicating round logic increased hardware cost
[25]. These studies collectively demonstrate the trade-offs in ChaCha20 hardware design:
while unrolling and pipelining enhance throughput, they often either leave round hardware
idle or demand excessive resources. This motivates the need for an alternative architecture
that simultaneously maintains compact area, avoids idle cycles, and sustains high
throughput. These objectives form the foundation of the dual-state interleaving approach
proposed in this work.
In this paper, we first present three single-state based ChaCha20 architectures, CR-1,
CR-2, and CR-4, which instantiate different degrees of round unrolling and serve as
reference points for performance and efficiency comparisons. These baseline designs
were originally introduced in our prior work [19], where we demonstrated their effectiveness in balancing throughput, area, and power.
Building on this foundation, we now extend the work by proposing a dual-state ChaCha20
(DSCC20) hardware design that interleaves two independent states in an alternating
fashion, thereby eliminating idle cycles and maximizing round utilization. The proposed
design delivers high throughput and area efficiency while maintaining moderate hardware
cost, making it well suited for resource-constrained embedded systems, including latency-sensitive
applications in the automotive domain. In contrast to conventional single-state architectures,
the DSCC20 alternately processes two states with only one additional cycle of overhead,
resulting in significantly improved throughput and area efficiency. The architecture
is described in Verilog HDL, synthesized using 28-nm CMOS technology, and comprehensively
evaluated in terms of area, power, delay, throughput, and area efficiency, with results
compared against both the CR baselines and existing ChaCha20 and AES designs.
In summary, the contributions of this work are outlined as follows:
- We implement three single-state based ChaCha20 architectures (CR-1, CR-2, and CR-4),
representing different levels of round unrolling, to establish a consistent reference
for evaluating throughput, latency, and area efficiency.
- We propose a novel dual-state ChaCha20 architecture that interleaves two independent
states on shared round hardware, thereby eliminating idle cycles between column and
diagonal rounds while preserving a compact datapath.
- We perform a comprehensive evaluation of the proposed DSCC20, comparing it not only
with the CR baselines but also with other ChaCha20 and AES hardware implementations
across various technology nodes, demonstrating its superior throughput and area efficiency
for resource-constrained CAV systems.
II. BACKGROUND
ChaCha20 is a member of the ChaCha family, introduced as a modification of Salsa20
to achieve faster diffusion per round while retaining algorithmic simplicity [18]. It is a symmetric-key stream cipher that generates a 512-bit keystream block from
a $4 \times 4$ matrix of 32-bit words, referred to as the state. The initial state
consists of four components: a 128-bit constant, a 256-bit secret key, a 32-bit block
counter, and a 96-bit nonce. These values are arranged in row-major order, with the
constants in the first row, the key in the second and third rows, the counter in the
first column of the fourth row, and the nonce in the remaining positions. All values
are represented in little-endian format, consistent with the algorithm's internal
word ordering. The ChaCha20 block function transforms the initialized 512-bit state
through 20 rounds of computation, organized into 10 double rounds. Each double round
consists of a ColumnRound() function, which operates on the vertical columns of the
state, followed by a DiagonalRound() function, which mixes the diagonal elements.
Once all rounds are completed, the transformed state is added element-wise to the
preserved initial state to generate the final keystream block. This feed-forward addition
enhances diffusion and ensures that the transformation is not trivially invertible.
Algorithm 1 illustrates how the state $s$ is processed during two consecutive rounds.
The function explicitly takes the 512-bit state $s = s[0], s[1], \dots, s[15]$ as
input and modifies it in place. During the ColumnRound(), four independent QuarterRound()
operations are applied to each of the four columns of the $4 \times 4$ state matrix.
During the DiagonalRound(), another four QuarterRound() operations are applied, this
time to diagonal word groups, ensuring that every word interacts with different neighbors
across rounds. By alternating these two transformations, ChaCha20 guarantees rapid
and uniform diffusion of the key, counter, and nonce across the entire state. At the
core of both ColumnRound() and DiagonalRound() is the QuarterRound() function, shown
in Algorithm 2. This function operates on four 32-bit words of the state, $(a,b,c,d)$,
and performs a fixed sequence of ARX operations. Specifically, $+$ denotes 32-bit
integer addition modulo $2^{32}$, $\oplus$ is bitwise XOR, and $\lll n$ is a left rotation
by $n$ bits. The choice of rotation distances (16, 12, 8, and 7) ensures that different
bit positions are mixed uniformly across rounds. The QuarterRound() function provides
the fundamental nonlinear transformation that propagates changes from one word across
the others, and when applied repeatedly across the state matrix, it ensures strong
diffusion. Over the course of 20 rounds, each word of the state is updated multiple
times through different pairings, which greatly enhances the security margin of the cipher.
Algorithm 1: Two rounds of the ChaCha20 stream cipher.
Algorithm 2: Quarter round of ChaCha20.
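As an illustration, the state layout, quarter round, and double round described in this section can be sketched in Python. This is a minimal sketch based on the standard ChaCha20 definition rather than a transcription of Algorithms 1 and 2; the function names (`quarter_round`, `column_round`, `diagonal_round`, `chacha20_block`) are chosen here for illustration only.

```python
import struct

MASK32 = 0xFFFFFFFF
# The 128-bit constant "expand 32-byte k" as four little-endian 32-bit words.
CONSTANTS = (0x61707865, 0x3320646E, 0x79622D32, 0x6B206574)


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """Apply the ARX quarter round in place to words s[a], s[b], s[c], s[d]."""
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 7)


def column_round(s):
    """Four quarter rounds, one per column of the 4x4 state."""
    for i in range(4):
        quarter_round(s, i, i + 4, i + 8, i + 12)


def diagonal_round(s):
    """Four quarter rounds, one per diagonal of the 4x4 state."""
    quarter_round(s, 0, 5, 10, 15)
    quarter_round(s, 1, 6, 11, 12)
    quarter_round(s, 2, 7, 8, 13)
    quarter_round(s, 3, 4, 9, 14)


def initial_state(key, counter, nonce):
    """Row-major state: constants, key (two rows), then counter and nonce."""
    if len(key) != 32 or len(nonce) != 12:
        raise ValueError("expected a 32-byte key and a 12-byte nonce")
    return (list(CONSTANTS) + list(struct.unpack("<8I", key))
            + [counter & MASK32] + list(struct.unpack("<3I", nonce)))


def chacha20_block(key, counter, nonce):
    """Return one 64-byte keystream block (10 double rounds + feed-forward)."""
    init = initial_state(key, counter, nonce)
    s = list(init)
    for _ in range(10):
        column_round(s)
        diagonal_round(s)
    out = [(x + y) & MASK32 for x, y in zip(s, init)]
    return struct.pack("<16I", *out)  # little-endian serialization
```

For instance, applying `quarter_round` to the words (0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567) yields (0xea2a92f4, 0xcb1cf8ce, 0x4581472e, 0x5881c4bb), which agrees with the quarter-round test vector in RFC 8439.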
ChaCha20's reliance solely on ARX operations makes it highly attractive for accelerator
design. Unlike AES, it does not require substitution boxes (S-boxes), lookup tables,
or Galois field multiplications, thus eliminating the need for complex memory structures
or logic. Its regular matrix structure maps naturally to parallel datapaths and pipelining,
allowing designers to scale throughput without introducing irregular control overhead.
The predictable timing of ARX operations simplifies synthesis and timing closure,
while its side-channel resistance against cache-based attacks is advantageous for
secure embedded implementations. Furthermore, since ChaCha20 is stream-oriented and
does not require padding, it introduces minimal per-packet overhead, which is especially
beneficial in CAV communications where short, frequent messages dominate. In addition,
the lightweight nature of ARX-based processing reduces switching activity and energy
consumption, making it well suited for automotive platforms with stringent power budgets.
These structural advantages also allow ChaCha20 accelerators to be seamlessly integrated
into heterogeneous systems-on-chip (SoC), supporting both dedicated security modules
and reconfigurable cryptographic engines in emerging vehicular architectures.
III. PROPOSED CHACHA20 DESIGN
The conventional ChaCha20 hardware accelerator processes one 512-bit keystream block
at a time and requires multiple clock cycles to complete the 20-round block function
[20]. In a single-state schedule, strict data dependencies between the column and diagonal
rounds enforce sequential execution, which causes only one round module to be active
at any given cycle. As a result, the other round logic remains idle, leading to underutilization
of hardware resources and suboptimal throughput.
To mitigate this inefficiency, our prior work introduced an $N\times$ combinational-round
(CR) design, in which $4 \times N$ QuarterRound units are instantiated as combinational
logic to perform $N$ rounds per cycle [19]. Fig. 1 illustrates the CR-1 baseline architecture. A single round is composed of four combinational
QuarterRound instances, with the ColumnRound and DiagonalRound implemented as distinct
combinational blocks. A round controller time-multiplexes these phases: it activates
either the column or diagonal round block in each cycle and routes the 512-bit state
through a multiplexer (MUX) back into the state register. This mechanism serializes
the two phases over successive cycles, such that one block remains idle while the
other is active. After completing the 20 rounds, the transformed state is word-wise
added to the preserved initial state, and the 512-bit result is serialized in little-endian
order to form the keystream block. Building on the CR-1, we also explored the CR-2
and CR-4 architectures, which reduce the total cycle count by chaining multiple rounds
into a single cycle. Specifically, the CR-2 combines a column and diagonal round into
one cycle, requiring 15 cycles to complete the 20-round block function, while the
CR-4 chains four rounds into one cycle, reducing the latency to 10 cycles. Although
this approach reduces nominal cycle count, it also lengthens the critical path substantially
due to deeper combinational logic, thereby lowering the maximum achievable clock frequency.
Consequently, CR-2 and CR-4 demonstrate higher theoretical throughput but suffer from
degraded timing closure and reduced efficiency in practice. These observations demonstrate
a fundamental trade-off: unrolling more logic per cycle improves cycle count but worsens
frequency scalability, while single-state scheduling leaves hardware resources underutilized.
This motivates the need for an alternative approach that raises utilization without
incurring a critical path penalty.
Fig. 1. CR-1 baseline single-state ChaCha20 hardware architecture with QuarterRound
ARX dataflow and little-endian serializer mapping from 512-bit state to keystream.
To address this challenge, we propose a dual-state ChaCha20 architecture that interleaves
two independent states in an alternating fashion. The key idea is a lightweight interleaving
mechanism that lets the two states take turns on the shared round hardware, so that the
DSCC20 eliminates idle cycles while maintaining the same critical path as CR-1, thereby
improving hardware utilization without additional round unrolling. Fig. 2(a) shows the overall DSCC20 hardware architecture. The inputs are provided as two consecutive
tuples (Key$_1$, Nonce$_1$, Count$_1$) and (Key$_2$, Nonce$_2$, Count$_2$), representing
successive blocks rather than separate physical ports. Compared with the CR-1 baseline
in Fig. 1, the datapath adds only a second 512-bit state register file and a small amount of
steering logic. The round logic itself is unchanged: a single ColumnRound module and
a single DiagonalRound module are shared between the two states. After 20 rounds,
the transformed state is combined with the preserved initial state in the Add-and-Serialize
block, and the resulting keystream is XORed with the message (i.e., Message$_1$ and
Message$_2$) to produce Cipher$_1$ and Cipher$_2$, respectively. The Round Control
block simply selects which state feeds the round logic and where the results are written
back, requiring only a small finite state machine (FSM) and a round counter.
Fig. 2. Proposed ChaCha20 hardware design; (a) dual-state architecture with shared
round logic and duplicated state banks and (b) interleaved execution schedule showing
that ColumnRound and DiagonalRound remain active every cycle.
This interleaved execution achieves several important advantages. By alternating two
states, the design eliminates idle cycles and ensures that the Column and Diagonal
modules are fully utilized in every clock cycle. At the same time, because only one
round is evaluated per cycle as in CR-1, the critical path length remains unchanged,
which allows the maximum operating frequency to be preserved. The additional hardware
overhead is also minimal, since the architecture requires only a second 512-bit state
register, rather than duplicating the round logic. As a result, two keystream blocks
are produced in nearly the same latency as a single block in CR-1, leading to a substantial
improvement in effective throughput. It is worth noting that this design differs fundamentally
from CR-2 and CR-4. Those architectures reduce cycle count by chaining multiple rounds
in a single cycle, which improves nominal latency but significantly lengthens the
critical path, making it difficult to sustain high clock frequencies. In contrast,
the DSCC20 maintains the same per-cycle workload as CR-1, but achieves higher throughput
by eliminating idle cycles through interleaving. This distinction allows DSCC20 to
combine the advantages of high utilization and frequency scalability, resulting in
a balanced trade-off between throughput and hardware cost.
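The interleaving idea can be checked with a small cycle-level model. The sketch below is one plausible schedule consistent with the description above, not the authors' RTL: it assumes that the second state enters one cycle after the first, so that the shared column unit and the shared diagonal unit serve different states in the same cycle. All names are illustrative.

```python
MASK32 = 0xFFFFFFFF
COLUMNS = ((0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15))
DIAGONALS = ((0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14))


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """Standard ChaCha ARX quarter round, applied in place."""
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 7)


def apply_round(state, groups):
    """One full round: four quarter rounds over the given word groups."""
    for a, b, c, d in groups:
        quarter_round(state, a, b, c, d)


def single_state_rounds(state, rounds=20):
    """Reference schedule: one state, one round per cycle."""
    s = list(state)
    for r in range(rounds):
        apply_round(s, COLUMNS if r % 2 == 0 else DIAGONALS)
    return s, rounds  # final state and cycle count


def interleaved_rounds(state_a, state_b, rounds=20):
    """Two states share one column unit and one diagonal unit.

    Assumption: state B starts one cycle after state A, so the two states
    always need different units. Returns both final states, the cycle count,
    and a per-cycle (column_busy, diagonal_busy) usage log.
    """
    states = [list(state_a), list(state_b)]
    done = [0, 0]    # rounds completed per state
    start = [0, 1]   # first active cycle of each state
    usage = []
    cycle = 0
    while min(done) < rounds:
        col_busy = diag_busy = False
        for i in (0, 1):
            if cycle < start[i] or done[i] >= rounds:
                continue
            if done[i] % 2 == 0:
                assert not col_busy   # shared unit is never double-booked
                apply_round(states[i], COLUMNS)
                col_busy = True
            else:
                assert not diag_busy
                apply_round(states[i], DIAGONALS)
                diag_busy = True
            done[i] += 1
        usage.append((col_busy, diag_busy))
        cycle += 1
    return states[0], states[1], cycle, usage
```

Under this assumed schedule, two states complete their 20 rounds in 21 cycles, compared with 20 cycles for a single state, and both units are busy in every cycle except the first and the last. This matches the one-cycle overhead described in the text; initialization, feed-forward addition, and serialization cycles are not modeled.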
Although the proposed architecture focuses on dual-state interleaving, the concept
can be naturally extended to a multi-state design. By introducing additional state
registers, multiple states could be interleaved in a round-robin fashion, keeping
the Column and Diagonal modules continuously occupied across an even larger number
of inputs. Such a multi-state scheme has the potential to further increase throughput
and hardware utilization, while still avoiding the excessive logic replication of
deeply unrolled designs. However, scaling beyond two states also introduces greater
control complexity and additional register overhead, which must be carefully balanced
against the expected performance gains.
IV. EXPERIMENTAL RESULTS
To evaluate the performance of the proposed DSCC20 architecture, we implemented it
along with three baseline ChaCha20 designs, CR-1, CR-2, and CR-4, using Verilog HDL
and synthesized them with Synopsys Design Compiler and a 28-nm CMOS standard-cell
library to obtain area, power, delay, and maximum operating frequency. Throughput
was derived from the number of processed rounds per cycle and the maximum achievable
frequency, while area efficiency was calculated as throughput normalized by the total
gate equivalent (GE) count, where one GE corresponds to the area of a two-input NAND
gate. For a fair comparison, each design was evaluated at its own maximum frequency.
The area was reported in kilo-gate equivalents (kGE), and power consumption was measured
using gate-level switching activity estimated under typical conditions. The proposed
DSCC20 architecture was further benchmarked against three categories of designs: 1)
the CR-1/2/4 baselines to highlight improvements from interleaving, 2) prior ChaCha20
hardware implementations reported in the literature, and 3) representative AES hardware
accelerators across various technology nodes. In addition, the software-based Libsodium
ChaCha20 implementation executed on general-purpose CPUs (x86 and ARM architectures)
was included as a baseline reference to demonstrate the clear performance gap between
hardware and software realizations [38]. Finally, although AES and ChaCha20 employ different cryptographic structures (block
cipher versus stream cipher), the AES hardware designs were considered as comparison
points because both algorithms are widely deployed in CAV security frameworks, making
the evaluation relevant to real-world automotive communication scenarios.
Table 1 summarizes the area, power, throughput, and area efficiency results for the baseline
CR-1/2/4 designs and the proposed DSCC20 synthesized in 28-nm CMOS technology, as
well as software ChaCha20 performance measured on modern CPUs. At 763 MHz, the proposed
DSCC20 achieves 30.06 Gbps, which corresponds to $1.88\times$, $1.87\times$, and $2.32\times$
the throughput of CR-1, CR-2, and CR-4, respectively. In terms of area, DSCC20 is
larger than CR-1 and CR-2 by 17.0% and 10.4%, respectively, but is 37.9% smaller than
CR-4. The power consumption of the DSCC20 is $1.20\times$, $2.01\times$, and $2.62\times$
that of CR-1, CR-2, and CR-4, respectively, primarily due to the additional state
register and control logic. However, the total consumption of 16.08 mW is several
orders of magnitude lower than the operating power of common automotive electronic
control units (ECUs), which typically range from a few watts to hundreds of watts.
As such, this overhead is insignificant and does not hinder real-time V2X operation.
More importantly, DSCC20 delivers the best area efficiency at 904.92 Kbps/GE, reflecting
the benefit of removing idle cycles while keeping a compact datapath. These results
demonstrate that, while CR-2 and CR-4 reduce idle cycles through round unrolling,
such chaining extends the critical path, prevents operation at high frequencies, and
results in lower efficiency. In contrast, DSCC20 leverages dual-state interleaving
to sustain higher operating frequency with balanced area overhead, thereby nearly
doubling throughput compared to single-state architectures.
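The relationship between frequency, cycle count, throughput, and area efficiency can be reproduced with a short calculation. The sketch below is an illustration under stated assumptions: the 15- and 10-cycle figures for CR-2 and CR-4 come from Section III, whereas 25 cycles per block for CR-1 and 26 cycles per two blocks for DSCC20 are not stated explicitly and are inferred here from the reported throughput. Frequencies and areas are the rounded values of Table 1, so small deviations from the reported numbers are expected.

```python
def throughput_gbps(freq_mhz, cycles, blocks=1, block_bits=512):
    """Keystream throughput in Gbps when `blocks` blocks finish every `cycles` cycles."""
    return blocks * block_bits * freq_mhz * 1e6 / cycles / 1e9


def area_efficiency_kbps_per_ge(tput_gbps, area_kge):
    """Throughput normalized by gate count, in Kbps per gate equivalent."""
    return tput_gbps * 1e6 / (area_kge * 1e3)


# name: (frequency in MHz, area in kGE, cycles per batch, blocks per batch)
# The cycle counts for CR-1 (25) and DSCC20 (26 for two blocks) are ASSUMPTIONS
# inferred from the reported throughput, not values stated in the text.
DESIGNS = {
    "CR-1": (781, 28.39, 25, 1),
    "CR-2": (472, 30.10, 15, 1),
    "CR-4": (253, 53.53, 10, 1),
    "DSCC20": (763, 33.22, 26, 2),
}


def summarize(designs=DESIGNS):
    """Return {name: (throughput in Gbps, area efficiency in Kbps/GE)}."""
    rows = {}
    for name, (freq, area, cycles, blocks) in designs.items():
        tput = throughput_gbps(freq, cycles, blocks)
        rows[name] = (tput, area_efficiency_kbps_per_ge(tput, area))
    return rows


if __name__ == "__main__":
    for name, (tput, eff) in summarize().items():
        print(f"{name:7s} {tput:6.2f} Gbps  {eff:7.2f} Kbps/GE")
```

With these inputs the model lands within about 0.02 Gbps and 0.5 Kbps/GE of the values reported in Table 1, which suggests that the gain of DSCC20 comes from finishing two blocks in roughly the cycle budget of one rather than from a higher clock frequency.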
Table 1. Comparison of ChaCha20 hardware accelerators (CR-1, CR-2, CR-4, and DSCC20)
and software baselines on Apple M4 and AMD Threadripper CPUs.
| Design | Frequency (MHz) | Area (kGE) | Power (mW) | Throughput (Gbps) | Area efficiency (Kbps/GE) |
|---|---|---|---|---|---|
| CR-1 | 781 | 28.39 | 13.45 | 16.00 | 563.58 |
| CR-2 | 472 | 30.10 | 8.01 | 16.10 | 534.97 |
| CR-4 | 253 | 53.53 | 6.13 | 12.96 | 242.14 |
| Proposed | 763 | 33.22 | 16.08 | 30.06 | 904.92 |
| Apple M4 | 4400 | 166 mm$^2$ | 40 W | 2.06 | - |
| AMD 5975WX | 4500 | 324 mm$^2$ | 280 W | 9.36 | - |
The software performance of ChaCha20 was evaluated using the Libsodium library on
two CPU platforms: an Apple M4 processor (ARM architecture) and an AMD Threadripper
PRO 5975WX processor (x86 architecture). For a realistic comparison with vehicular
communication traffic, throughput was measured with an 800-byte payload size representative
of V2X messages [39]. To improve timing precision and amortize non-cryptographic overheads, each encryption
was repeated 1,000 times per run, and the average execution time was used to compute
throughput. The Apple M4 achieved 2.06 Gbps and the AMD Threadripper reached 9.36
Gbps, both with significantly higher power budgets (40 W and 280 W, respectively)
compared with the 16.08 mW of DSCC20. By contrast, DSCC20 achieves 30.06 Gbps, which
corresponds to roughly $15\times$ the throughput of the M4 and $3.2\times$ that of
the AMD CPU. These results clearly highlight the decisive advantage of hardware acceleration,
demonstrating that the proposed DSCC20 not only outperforms prior ChaCha20 baseline
hardware but also vastly exceeds software implementations, making it highly suitable
for real-time secure communication in CAV systems.
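The measurement procedure described above (a fixed 800-byte payload, 1,000 repetitions, throughput from the average time) can be sketched as a small timing harness. This is an assumption-laden illustration, not the harness used for Table 1: the `encrypt` callable would be bound to a real ChaCha20 routine such as the one provided by Libsodium, and `xor_stand_in` below is only a placeholder so that the sketch runs with the standard library.

```python
import time


def measure_throughput_gbps(encrypt, payload_len=800, repeats=1000):
    """Average-time throughput of `encrypt` over `repeats` calls, in Gbps."""
    payload = bytes(payload_len)
    start = time.perf_counter()
    for _ in range(repeats):
        encrypt(payload)
    elapsed = time.perf_counter() - start
    avg_seconds = elapsed / repeats
    return payload_len * 8 / avg_seconds / 1e9


def xor_stand_in(payload, _pad=bytes(range(256)) * 4):
    """Placeholder for a real cipher call; NOT ChaCha20 and not secure."""
    return bytes(p ^ k for p, k in zip(payload, _pad))


if __name__ == "__main__":
    print(f"{measure_throughput_gbps(xor_stand_in):.4f} Gbps (stand-in cipher)")
```

Repeating the call amortizes timer resolution and loop overhead over many encryptions, which matters for payloads as short as 800 bytes.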
Table 2 compares the proposed DSCC20 design with previously presented ChaCha20 hardware accelerators
across different technology nodes. All ChaCha20
accelerators evaluated here, including the proposed DSCC20 and previously reported designs,
use the standard 20-round ChaCha20 block function, so that all throughput and
efficiency comparisons are performed under the same cryptographic configuration. Even
though the earlier designs were implemented in various process technologies, the proposed
DSCC20 clearly demonstrates superior performance and efficiency. Operating at 763
MHz in a 28-nm CMOS node, the DSCC20 achieves a throughput of 30.06 Gbps, which is up
to $8.2\times$ that of the earlier designs in [20-22]. Compared to the design in [23], DSCC20 still delivers higher throughput (30.06 versus 23.83 Gbps), confirming
the benefit of sustaining utilization without aggressive pipelining or duplication
of round logic. In terms of area, the proposed design requires 33.22 kGE, which is
comparable to or smaller than many prior works, despite delivering much higher throughput.
For example, the design in [21] consumes 56.5 kGE and the one in [22] requires 25.05 kGE, yet both offer significantly lower performance. Most notably,
the proposed DSCC20 achieves an area efficiency of 904.92 Kbps/GE, which surpasses
all prior ChaCha20 implementations by a wide margin. It improves efficiency by $4.6\times$,
$5.3\times$, and $6.2\times$ compared to the designs in [20], [21], and [22], respectively. Even when compared with [23], which reports relatively high efficiency at 666.57 Kbps/GE, the proposed design
still delivers 36% better efficiency, validating the advantage of dual-state interleaving
over conventional single-state or partially unrolled designs. These comparisons confirm
that the proposed DSCC20 architecture offers the best balance between throughput and
hardware cost among existing ChaCha20 accelerators. By sustaining higher utilization
without incurring deep unrolling or excessive logic replication, DSCC20 demonstrates
a scalable architecture that outperforms prior works in both absolute throughput and
normalized efficiency.
Table 2. Comparison of proposed DSCC20 with existing ChaCha20 hardware implementations.
| | SCS'08 [20] | TECS'16 [21] | ISOCC'22 [22] | MCSoC'23 [23] | IJCTA'24 [24] | Proposed |
|---|---|---|---|---|---|---|
| Technology | 180-nm | 65-nm | 180-nm | 45-nm | 180-nm | 28-nm |
| Frequency (MHz) | 215 | 307 | 150 | 510 | 352 | 763 |
| Throughput (Gbps) | 5.51 | 9.60 | 3.65 | 23.83 | 1.10 | 30.06 |
| Gate count (kGE) | 28.11 | 56.5 | 25.05 | 35.75 | 6.17 | 33.22 |
| Area efficiency (Kbps/GE) | 196.02 | 169.91 | 145.71 | 666.57 | 178.28 | 904.92 |
| Normalized frequency (MHz) | 871 | 458 | 608 | 617 | 1426 | 763 |
| Normalized throughput (Gbps) | 22.31 | 14.31 | 14.78 | 28.85 | 4.45 | 30.06 |
Furthermore, to enable a fair comparison of throughput and frequency across different
process technologies, all ChaCha20 implementations in Table 2 were normalized to an equivalent 28-nm reference node. We employed the DeepScaleTool
framework, which provides voltage-frequency-delay scaling across 7-nm to 130-nm CMOS
nodes [40]. For designs originally implemented in 180-nm technology, which DeepScaleTool does
not directly support, we first applied a linear geometric scaling factor to map the
results to an equivalent 130-nm node, consistent with prior literature [41], and subsequently normalized them to 28-nm using DeepScaleTool. This two-step process
ensures consistent and defensible normalization for all evaluated designs and isolates
architectural effects from technology-dependent frequency variation. The normalized
results in Table 2 show that the performance advantage of DSCC20 remains substantial even after removing
technology-node differences. When scaled to a common 28-nm reference, the throughput
of prior ChaCha20 designs increases proportionally with technology scaling, with several
implementations reaching normalized values in the 14 to 29 Gbps range. Despite this upward
shift, the proposed DSCC20 design still achieves the highest normalized throughput
at 30.06 Gbps, indicating that its performance benefit originates from architectural
efficiency rather than fabrication node. Although our DSCC20 does not exhibit the
highest normalized frequency among all designs, it still has the highest normalized
throughput due to the
elimination of idle cycles through dual-state interleaving. These normalized results
confirm that the proposed architecture provides consistent advantages independent
of process technology, reinforcing the DSCC20 design as one of the most scalable and
resource-efficient ChaCha20 accelerator designs to date.
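The normalized throughput values in Table 2 can be approximated from the reported and normalized frequencies alone. The sketch below assumes that the cycle count per block does not change with the process node, so that throughput scales linearly with clock frequency; it does not invoke DeepScaleTool, and the normalized frequencies are simply taken from Table 2 as given.

```python
def normalized_throughput_gbps(tput_gbps, freq_mhz, norm_freq_mhz):
    """Scale throughput by the ratio of normalized to reported frequency."""
    return tput_gbps * norm_freq_mhz / freq_mhz


# design: (reported Gbps, reported MHz, normalized MHz, normalized Gbps in Table 2)
TABLE2 = {
    "[20]": (5.51, 215, 871, 22.31),
    "[21]": (9.60, 307, 458, 14.31),
    "[22]": (3.65, 150, 608, 14.78),
    "[23]": (23.83, 510, 617, 28.85),
    "[24]": (1.10, 352, 1426, 4.45),
    "Proposed": (30.06, 763, 763, 30.06),
}


def max_deviation(table=TABLE2):
    """Largest absolute gap (Gbps) between the linear model and Table 2."""
    return max(
        abs(normalized_throughput_gbps(t, f, nf) - ref)
        for t, f, nf, ref in table.values()
    )


if __name__ == "__main__":
    for name, (t, f, nf, ref) in TABLE2.items():
        est = normalized_throughput_gbps(t, f, nf)
        print(f"{name:9s} model {est:6.2f} Gbps, Table 2 {ref:6.2f} Gbps")
```

The linear model agrees with every normalized throughput entry in Table 2 to within about 0.03 Gbps; the residual differences are consistent with rounding of the tabulated frequencies.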
To further assess the competitiveness of the proposed DSCC20 architecture, we compare
it against several existing AES hardware accelerators. Although ChaCha20 and AES employ
different cryptographic primitives (ChaCha20 is an ARX-based stream cipher, whereas AES
is an S-box-based block cipher), both are widely adopted in secure automotive and vehicular
communication frameworks. Therefore, evaluating the proposed DSCC20 alongside various
AES designs provides meaningful insight into its viability as a hardware accelerator
in practical CAV deployments.
As shown in Table 3, the proposed DSCC20 achieves 30.06 Gbps throughput with 33.22 kGE, yielding an area
efficiency of 904.92 Kbps/GE. This value is considerably higher than most AES designs,
which typically achieve 100 to 234 Kbps/GE, and it also exceeds the efficiency of the
most competitive prior AES implementations. Even against the optimized design in [30], which reaches 831.10 Kbps/GE, the DSCC20 still delivers 9% higher efficiency while
operating in a more advanced process node. These results highlight that ChaCha20,
when mapped to hardware with interleaving, can rival or even surpass the efficiency
of carefully optimized AES architectures. Throughput comparisons also emphasize this
trend. Certain AES accelerators, such as the designs in [28] and [29], report higher peak throughput values of 35.5 to 42.7 Gbps. However, these designs
incur heavy area costs (182 to 352 kGE), resulting in far lower area efficiency.
In contrast, the proposed DSCC20 provides a more balanced profile: it sustains over
30 Gbps throughput with a compact 33.22 kGE area footprint, demonstrating that interleaving
achieves high utilization without excessive logic replication. This balance is particularly
advantageous in embedded platforms where area and power budgets are strictly constrained.
The structural distinction between ChaCha20 and AES also explains the observed efficiency
gap. AES relies on S-box lookups and Galois field multiplications, which are expensive
in hardware and often limit area efficiency unless aggressively optimized. ChaCha20,
by contrast, consists solely of addition, rotation, and XOR (ARX) operations, which
map naturally onto simple arithmetic and logic units. Our DSCC20 architecture further
exploits this property by ensuring that the round hardware is continuously active,
translating algorithmic simplicity into practical gains in both throughput and efficiency.
Table 3. Comparison of the proposed DSCC20 with existing AES hardware accelerators.
| | TIFS'18 [26] | TVLSI'21 [27] | CECCT'24 [28] | JSTS'25 [29] | TC'19 [30] | Proposed |
|---|---|---|---|---|---|---|
| Technology | 65-nm | 7-nm | 180-nm | 180-nm | 45-nm | 28-nm |
| Frequency (MHz) | 847 | 2550 | 277 | 333 | 787 | 763 |
| Throughput (Gbps) | 8.32 | 29.45 | 35.50 | 42.67 | 10.08 | 30.06 |
| Gate count (kGE) | 51.31 | 131.45 | 352.07 | 182.58 | 12.13 | 33.22 |
| Area efficiency (Kbps/GE) | 162.15 | 224.04 | 100.83 | 233.69 | 831.10 | 904.92 |
| Normalized frequency (MHz) | 1265 | 2017 | 1122 | 1349 | 951 | 763 |
| Normalized throughput (Gbps) | 12.40 | 23.23 | 143.72 | 172.75 | 12.20 | 30.06 |
The normalized results in Table 3 provide a fair comparison of AES accelerators after removing technology-node differences.
When scaled to an equivalent 28-nm reference, several AES designs exhibit significantly
higher normalized frequencies and throughputs, with some exceeding 1.3 GHz in normalized
frequency and two reaching normalized throughputs of roughly 144 and 173 Gbps. These increases
mainly stem from the aggressive round unrolling and deep pipelining strategies typically
applied in high-performance AES cores. However, such throughput scaling comes at the
cost of substantial hardware resources: these two designs require roughly 183 and 352 kGE,
which places them far to the right on the throughput-area trade-off curve. In contrast,
the proposed DSCC20 maintains a normalized throughput of 30.06 Gbps within a compact
33.22 kGE footprint, yielding a normalized efficiency that remains competitive despite
its lower absolute throughput. The normalized comparison therefore confirms that the
DSCC20 design achieves a more favorable throughput-area trade-off, delivering high
sustained performance without the heavy logic duplication characteristic of high-throughput
AES accelerators.
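The relationship between the normalized rows of Table 3 can be checked with a short sketch. It assumes that, for a fixed architecture, throughput scales linearly with clock frequency, so the normalized throughput should equal the reported throughput multiplied by the ratio of normalized to original frequency. The per-node frequency scaling itself is taken as given from Table 3 and is not re-derived here; the names below are illustrative.

```python
# Per design: (frequency MHz, normalized frequency MHz,
#              throughput Gbps, normalized throughput Gbps), from Table 3.
table3 = {
    "TIFS'18 [26]": (847, 1265, 8.32, 12.40),
    "TVLSI'21 [27]": (2550, 2017, 29.45, 23.23),
    "CECCT'24 [28]": (277, 1122, 35.50, 143.72),
    "JSTS'25 [29]": (333, 1349, 42.67, 172.75),
    "TC'19 [30]": (787, 951, 10.08, 12.20),
    "Proposed DSCC20": (763, 763, 30.06, 30.06),
}


def rescale_throughput(throughput_gbps, freq_mhz, norm_freq_mhz):
    """Scale throughput by the frequency ratio (linear-scaling assumption)."""
    return throughput_gbps * norm_freq_mhz / freq_mhz


relative_error = {}
for name, (freq, norm_freq, thr, norm_thr) in table3.items():
    estimate = rescale_throughput(thr, freq, norm_freq)
    relative_error[name] = abs(estimate - norm_thr) / norm_thr
    print(f"{name:16s} estimated {estimate:7.2f} Gbps, reported {norm_thr:7.2f} Gbps")
```

Under this assumption the estimates agree with the reported normalized throughputs to within a fraction of a percent, which suggests that the normalization adjusts only the clock frequency and leaves the gate count unchanged.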
In summary, these comparisons demonstrate that the proposed architecture not only
advances ChaCha20 acceleration beyond prior works but also establishes ChaCha20 as
a compelling alternative to AES in automotive security. By combining lightweight ARX-based
operations with dual-state interleaving, the DSCC20 achieves a superior balance of
throughput, area efficiency, and implementation cost, making it particularly well
suited for latency-sensitive and resource-constrained CAV environments.
To provide a more comprehensive comparison, we perform a joint analysis of throughput
and hardware area, as these two metrics together determine the practicality of a cryptographic
accelerator in embedded systems. High throughput alone may be insufficient if achieved
at excessive silicon cost, while minimal area is of limited value if throughput cannot
meet real-time communication demands. Therefore, plotting throughput against area
reveals the efficiency frontier and highlights which designs achieve the best balance.
Fig. 3 presents the joint throughput-area comparison of the proposed DSCC20, baseline ChaCha20
designs, earlier ChaCha20 accelerators, AES hardware implementations, and software
baselines on the AMD 5975WX and Apple M4. The proposed DSCC20 is clearly positioned
on the Pareto front, achieving the best balance between throughput and area. With
30.06 Gbps throughput at only 33.22 kGE, the DSCC20 lies in the upper-left corner
of the design space, which corresponds to the region of highest efficiency. This position
reflects the effectiveness of dual-state interleaving in keeping the round logic continuously
utilized without the excessive logic replication that characterizes deeply unrolled
designs. Compared with the single-state baselines (CR-1/2/4), the proposed DSCC20
shifts upward significantly in throughput while remaining within a similar area range.
The CR-4, for instance, consumes more than 50 kGE but achieves only 12.96 Gbps, whereas
the DSCC20 delivers over twice the throughput with a smaller area. This demonstrates
that interleaving is superior to unrolling as a means of eliminating idle cycles,
since unrolling quickly inflates gate count and extends the critical path. Relative
to prior ChaCha20 accelerators, our DSCC20 consistently dominates in both axes. Reported
implementations either occupy the lower-left corner with modest throughput (3 ~ 10
Gbps) or scale throughput upward at the expense of large area footprints. The proposed
design, in contrast, demonstrates that high throughput can be achieved within a compact
datapath, establishing a new benchmark for ChaCha20 hardware design. When compared
with AES accelerators, the proposed DSCC20 again demonstrates its balanced efficiency.
Some AES designs, such as [28] and [29], reach raw throughput values exceeding 35 Gbps, but they do so with extremely high
area costs ($\ge 180$ kGE), pushing them far to the right on the plot. Our DSCC20
instead provides nearly comparable throughput while using roughly $5.5\times$ to
$10.6\times$ fewer gates, which translates into substantially higher area efficiency.
Finally, the software baselines demonstrate the stark contrast between hardware and
software execution. Despite operating at 4.5 GHz and consuming 280 W, the AMD 5975WX
achieves only 9.36 Gbps, placing it far below and to the right of dedicated hardware
designs. A similar observation holds for the Apple M4 processor, which delivers only
2.06 Gbps under a 40 W power budget. The DSCC20 delivers more than $3\times$ the
throughput of the faster CPU at milliwatt-scale power and within a fraction of the
area, underscoring that hardware acceleration is necessary to achieve the throughput
and energy efficiency required for CAV communication workloads.
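The hardware-versus-software gap can be quantified with a few lines of arithmetic. The sketch below uses only the rounded figures quoted in this paper (30.06 Gbps at 16.08 mW for the DSCC20, 9.36 Gbps at 280 W for the AMD 5975WX, and 2.06 Gbps at 40 W for the Apple M4). Note that the CPU power values are the budgets quoted in the text rather than measured cipher energy, so the throughput-per-watt figures should be read as a coarse indicator only.

```python
# Rounded figures quoted in the text: throughput in Gbps, power in watts.
platforms = {
    "DSCC20": {"gbps": 30.06, "watts": 16.08e-3},
    "AMD 5975WX": {"gbps": 9.36, "watts": 280.0},
    "Apple M4": {"gbps": 2.06, "watts": 40.0},
}


def speedup_over(baseline, reference="DSCC20"):
    """Throughput of the reference design relative to a baseline."""
    return platforms[reference]["gbps"] / platforms[baseline]["gbps"]


def gbps_per_watt(name):
    """Coarse energy-efficiency indicator: throughput divided by power."""
    entry = platforms[name]
    return entry["gbps"] / entry["watts"]


for cpu in ("AMD 5975WX", "Apple M4"):
    print(f"DSCC20 vs {cpu}: {speedup_over(cpu):.2f}x throughput")

for name in platforms:
    print(f"{name:11s} {gbps_per_watt(name):10.4f} Gbps/W")
```

With these inputs the throughput ratio is a little above 3 for the AMD 5975WX and a little below 15 for the Apple M4, while the throughput-per-watt gap spans several orders of magnitude.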
Fig. 3. Joint throughput-area comparison of the proposed DSCC20, prior ChaCha20 and
AES designs, and software-based ChaCha20 implementations.
Overall, the joint throughput-area analysis confirms that DSCC20 achieves the most
favorable balance among ChaCha20 and AES designs, outperforming both prior hardware
and software baselines. Its location at the Pareto frontier validates the dual-state
interleaving approach as an efficient and scalable solution for high-performance,
resource-constrained secure communication systems.
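The notion of a Pareto front used in this analysis can be illustrated with a small sketch. A design is Pareto-optimal if no other design offers at least as much throughput in no more area while being strictly better in one of the two. The example below applies this test only to the raw (throughput, gate count) pairs of Table 3; Fig. 3 contains additional ChaCha20 and software data points that are not reproduced here, and the function name is illustrative.

```python
# (gate count kGE, throughput Gbps) from Table 3, before normalization.
points = {
    "TIFS'18 [26]": (51.31, 8.32),
    "TVLSI'21 [27]": (131.45, 29.45),
    "CECCT'24 [28]": (352.07, 35.50),
    "JSTS'25 [29]": (182.58, 42.67),
    "TC'19 [30]": (12.13, 10.08),
    "Proposed DSCC20": (33.22, 30.06),
}


def pareto_front(designs):
    """Return designs not dominated in (smaller area, larger throughput)."""
    front = []
    for name, (area, thr) in designs.items():
        dominated = any(
            a <= area and t >= thr and (a < area or t > thr)
            for other, (a, t) in designs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Sort by area so the front reads left to right on the plot.
    return sorted(front, key=lambda n: designs[n][0])


front = pareto_front(points)
print(front)
```

Among these six designs, the front consists of the smallest AES core [30], the proposed DSCC20, and the highest-throughput AES core [29]; the remaining AES designs are each dominated by one of these three.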
V. CONCLUSION
This paper presented a dual-state-based ChaCha20 hardware accelerator tailored for
secure and real-time communication in CAVs. By interleaving two independent states
on shared round hardware, the proposed DSCC20 architecture eliminates idle cycles
while preserving the critical path length, thereby achieving a favorable balance of
throughput, area, and power. When implemented in a 28-nm CMOS technology, the proposed
design achieves 30.06 Gbps throughput with 904.92 Kbps/GE area efficiency, representing
up to a $2.3\times$ improvement in throughput over the baseline single-state architectures
(CR-1/2/4). In addition, the DSCC20 surpasses earlier ChaCha20 designs and demonstrates
competitive performance compared with AES accelerators.
Overall, our DSCC20 establishes ChaCha20 as a practical and efficient cryptographic
primitive for vehicular security, delivering high throughput within a compact and
energy-conscious design. When contrasted with software baselines, the DSCC20 provides
nearly $15\times$ higher throughput than the Apple M4 and over $3\times$ higher throughput
than the AMD 5975WX, while consuming only 16.08 mW of power. The proposed interleaving
strategy can also be naturally extended beyond two states, offering further opportunities
to scale throughput while retaining efficiency. These results demonstrate that the
proposed DSCC20 is well suited for future secure and real-time CAV communication systems.
ACKNOWLEDGMENT
We would like to acknowledge the financial support from the Platform R&D Program of
KIAT (Korea Institute for Advancement of Technology) of the Republic of Korea (No. P0025186)
and the Digital Innovation Hub project supervised by the Daegu Digital Innovation
Promotion Agency (DIP) grant funded by the Korea government (MSIT and Daegu Metropolitan
City) in 2025 (No. 25DIH-18).
REFERENCES
Shladover S. E. , 2017, Connected and automated vehicle systems: Introduction and
overview, Journal of Intelligent Transportation Systems, Vol. 22, No. 3, pp. 190-200

Elliott D. , Keen W. , Miao L. , 2019, Recent advances in connected and automated
vehicles, Journal of Traffic and Transportation Engineering (English Edition), Vol.
6, No. 2, pp. 109-131

Matin A. , Dia H. , 2022, Impacts of connected and automated vehicles on road safety
and efficiency: A systematic literature review, IEEE Transactions on Intelligent Transportation
Systems, Vol. 24, No. 3, pp. 2705-2736

Liu W. , Hua M. , Deng Z. , Meng Z. , Huang Y. , Hu C. , Song S. , Gao L. , Liu C.
, Shuai B. , Khajepour A. , Xiong L. , Xia X. , 2023, A systematic survey of control
techniques and applications in connected and automated vehicles, IEEE Internet of
Things Journal, Vol. 10, No. 24, pp. 21892-21916

Vahidi A. , Sciarretta A. , 2018, Energy saving potentials of connected and automated
vehicles, Transportation Research Part C: Emerging Technologies, Vol. 95, pp. 822-843

Li J. , Fotouhi A. , Liu Y. , Zhang Y. , Chen Z. , 2024, Review on eco-driving control
for connected and automated vehicles, Renewable and Sustainable Energy Reviews, Vol.
189, pp. 114025

Tamang L. D. , Kim B. W. , 2021, Optical camera communication for vehicular applications:
A survey, IEIE Transactions on Smart Processing and Computing, Vol. 10, No. 2, pp.
136-145

Kenney J. B. , 2011, Dedicated short-range communications (DSRC) standards in the
United States, Proc. IEEE, Vol. 99, No. 7, pp. 1162-1182

Garcia M. H. C. , Molina-Galan A. , Boban M. , Gozalvez J. , Coll-Perales B. , Sahin
T. , Kousaridas A. , 2021, A tutorial on 5G NR V2X communications, IEEE Communications
Surveys & Tutorials, Vol. 23, No. 3, pp. 1972-2026

Jellid K. , Mazri T. , 2020, DSRC vs LTE V2X for autonomous vehicle connectivity,
Proc. International Conference on Smart City Applications (SCA), pp. 381-394

Elbery A. , Sorour S. , Hassanein H. , Sediq A. B. , Abou-zeid H. , 2021, To DSRC
or 5G? A safety analysis for connected and autonomous vehicles, Proc. Global Communications
Conference (GLOBECOM), pp. 1-6

Clancy J. , Mullins D. , Deegan B. , Horgan J. , Ward E. , Eising C. , Denny P. ,
Jones E. , Glavin M. , 2024, Wireless access for V2X communications: Research, challenges
and opportunities, IEEE Communications Surveys & Tutorials, Vol. 26, No. 3, pp. 2082-2119

Yelure B. , Patokar A. , Patil S. , Mawale R. , Nemade S. , Gaikwad V. , 2024, Impact
and Analysis of Attacks on Routing Protocols in Vehicular Ad hoc Network (VANET):
Assessing Security Threats, IEIE Transactions on Smart Processing & Computing, Vol.
13, No. 3, pp. 294-302

Gupta S. , Maple C. , Passerone R. , 2023, An investigation of cyber-attacks and security
mechanisms for connected and autonomous vehicles, IEEE Access, Vol. 11, pp. 90641-90669

Wang Z. , Wei H. , Wang J. , Zeng X. , Chang Y. , 2022, Security issues and solutions
for connected and autonomous vehicles in a sustainable city: A survey, Sustainability,
Vol. 14, No. 19, pp. 12409

Sadaf M. , Iqbal Z. , Javed A. R. , Saba I. , Krichen M. , Majeed S. , Raza A. , 2023,
Connected and automated vehicles: Infrastructure, applications, security, critical
challenges, and future aspects, Technologies, Vol. 11, No. 5, pp. 117

Daemen J. , Rijmen V. , 1998, AES proposal: Rijndael, Proc. Advanced Encryption Standard
Candidate Conference

Bernstein D. J. , 2008, ChaCha, a variant of Salsa20, Workshop record of SASC, Vol.
8, No. 1, pp. 3-5

Kwak M. , Lee T. H. , Lee D. H. , Kim T.-H. , Kim Y. , 2025, An Area-Efficient ChaCha20
Hardware Accelerator Design for Secure and Real-Time Communication in CAVs, Proc.
International Conference on Ubiquitous and Future Networks (ICUFN)

Henzen L. , Carbognani F. , Felber N. , Fichtner W. , 2008, VLSI hardware evaluation
of the stream ciphers Salsa20 and ChaCha, and the compression function Rumba, Proc.
International Conference on Signals, Circuits and Systems (SCS), pp. 1-5

Mozaffari-Kermani M. , Azarderakhsh R. , Aghaie A. , 2016, Fault detection architectures
for post-quantum cryptographic stateless hash-based secure signatures benchmarked
on ASIC, ACM Transactions on Embedded Computing Systems, Vol. 16, No. 2, pp. 1-19

Serrano R. , Sarmiento M. , Duran C. , Hoang T. , Pham C. , 2022, A 3.65 Gb/s Area-Efficiency
ChaCha20 Cryptocore, Proc. International SoC Design Conference (ISOCC), pp. 79-80

Le V. T. D. , Pham H. L. , Tran T. H. , Duong T. S. , Nakashima Y. , 2023, High-efficiency
Reconfigurable Crypto Accelerator Utilizing Innovative Resource Sharing and Parallel
Processing, Proc. International Symposium on Embedded Multicore/Many-core Systems-on-Chip
(MCSoC), pp. 576-583

Rashidi B. , 2024, High-Performance Hardware Structure of ChaCha20 Stream Cipher Based
on Sparse Parallel Prefix Adder, International Journal of Circuit Theory and Applications,
Vol. 53, No. 5, pp. 2947-2957

Dani V. , 2023, Implementing ChaCha20: analysis on performance, resource utilization
and side-channel protection

Pammu A. A. , Ho W. , Chong K. , Gwee B. , 2018, A high throughput and secure authentication-encryption
AES-CCM algorithm on asynchronous multicore processor, IEEE Transactions on Information
Forensics and Security, Vol. 14, No. 4, pp. 1023-1036

Nannipieri P. , Matteo S. D. , Baldanzi L. , Crocetti L. , Zulberti L. , Saponara
S. , Fanucci L. , 2021, VLSI design of Advanced-Features AES CryptoProcessor in the
framework of the European Processor Initiative, IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, Vol. 30, No. 2, pp. 177-186

Deng Q. , Li T. , Wang H. , Wang Y. , 2024, A pipelined hardware implementation of
AES with S-box based on right-skewed ECA, Proc. International Conference on Electronics,
Computers and Communication Technology (CECCT), pp. 32-36

Lee Y. , Kang J. , Lee J. , 2025, A Lightweight AES-256 Accelerator Design through
Processing Order Optimization for Low-cost Hardware Security, Journal of Semiconductor
Technology and Science, Vol. 25, No. 4, pp. 406-413

Ueno R. , Morioka S. , Miura N. , Matsuda K. , Nagata M. , Bhasin S. , Mathieu Y.
, Graba T. , Danger J.-L. , Homma N. , 2019, High throughput/gate AES hardware architectures
based on datapath compression, IEEE Transactions on Computers, Vol. 69, No. 4, pp.
534-548

Lee D. , Kwak M. , Lee J. , Kim B. , Kim Y. , 2022, A Light-Weight AES Design using
LFSR-based S-Box for IoT Applications, IEIE Transactions on Smart Processing & Computing,
Vol. 11, No. 2, pp. 140-148

Choi I. , Kim J.-H. , 2016, Area-Optimized Multi-Standard AES-CCM Security Engine
for IEEE 802.15.4/802.15.6, Journal of Semiconductor Technology and Science, Vol.
16, No. 3, pp. 293-299

Baik J. , Kim Y. , 2022, A High-Throughput and Energy-Efficient SHA-256 Design using
Approximate Arithmetic, IEIE Transactions on Smart Processing & Computing, Vol. 11,
No. 5, pp. 385-391

Kong W. , Choi P. , Kim D. K. , 2020, Hardware Implementation of Lightweight Block
Ciphers for IoT Sensors, Journal of Semiconductor Technology and Science, Vol. 20,
No. 4, pp. 391-389

Jeong C. , Kim Y. , 2017, Efficient FPGA Implementation of AES-CCM for IEEE 1609.2
Vehicle Communications Security, IEIE Transactions on Smart Processing & Computing,
Vol. 6, No. 2, pp. 133-139

Yu H. , Kim Y. , 2020, New RSA Encryption Mechanism Using One-Time Encryption Keys
and Unpredictable Bio-Signal for Wireless Communication Devices, Electronics, Vol.
9, No. 2, pp. 1-12

Lee D. , Kim Y. , 2021, Design of a Light-Weight Key Scheduler for AES using LFSR
for IoT Applications, Proc. IEEE International Conference on Consumer Electronics-Asia
(ICCE-Asia), pp. 1-2

2025, Libsodium

Bauder M. , Festag A. , Kubjatko T. , Schweiger H. , 2024, Data accuracy in Vehicle-to-X
cooperative awareness messages: An experimental study for the first commercial deployment
of C-ITS in Europe, Vehicular Communications, Vol. 47, pp. 100744

Sarangi S. , Baas B. , 2021, DeepScaleTool: A tool for the accurate estimation of
technology scaling in the deep-submicron era, Proc. International Symposium on Circuits
and Systems (ISCAS), pp. 1-5

Chen Y. , Chang B. , Yang C. , Chiueh T. , 2021, A high-throughput FPGA accelerator
for short-read mapping of the whole human genome, IEEE Transactions on Parallel and
Distributed Systems, Vol. 32, No. 6

Myeongjin Kwak received his B.S. and M.S. degrees in computer science and engineering
from Kyungpook National University, Daegu, Republic of Korea, in 2021 and 2023, where
he is currently pursuing a Ph.D. degree. His research interests include neuromorphic
computing, quantum computing, and hardware accelerators.
Jaewoong Jeong is currently pursuing a B.S. degree in the School of Computer Science
and Engineering at Kyungpook National University, Daegu, Republic of Korea. His research
interests include computer architecture, cryptographic hardware, and hardware design.
Tae Hee Lee received his B.S. degree in control and measurement engineering from Changwon
National University, Gyeongsangnam-do, Republic of Korea, in 2003, and his M.S. and
Ph.D. degrees in electrical engineering from Kyungpook National University, Daegu,
Republic of Korea, in 2012 and 2019, respectively. He worked as a researcher at PHA,
Daegu, Republic of Korea, from 2003 to 2008, and has been the Director of the Testing
and Evaluation Division at the Korea Intelligent Automotive Parts Promotion Institute
(KIAPI), Daegu, Republic of Korea, since 2008. His current research interests include
autonomous driving and electric vehicle evaluation and certification.
Do Hoon Lee received his M.S. degree in mechanical engineering from Kyungpook National
University, Daegu, Republic of Korea, in 2024. He has been with the Korea Intelligent
Automotive Parts Promotion Institute (KIAPI), Daegu, Republic of Korea, since 2015,
where he is engaged in real-vehicle testing and performance evaluation of autonomous
driving and advanced driver-assistance systems (ADAS).
Tae-Hyoung Kim received his B.S., M.S., and Ph.D. degrees in electrical engineering
from Kyungsung University, Busan, Republic of Korea, in 2003, 2005, and 2009, respectively.
He worked as a Principal Researcher in the Future Vehicle Research Team at the Daegu Mechatronics
& Materials Institute from 2010 to 2022. He has been with the Korea Intelligent Automotive
Parts Promotion Institute (KIAPI), Daegu, Republic of Korea, as a General Manager
in the Autonomous Driving Evaluation Department since 2023. His current research interests
are autonomous mobility system control and evaluation.
Yongtae Kim received his B.S. and M.S. degrees in electrical engineering from
Korea University, Seoul, Republic of Korea, in 2007 and 2009, respectively, and a
Ph.D. degree from the Department of Electrical and Computer Engineering at Texas
A&M University, College Station, TX, in 2013. From 2013 to 2018, he was a software
engineer with Intel Corporation, Santa Clara, CA. Since 2018, he has been with the
School of Computer Science and Engineering at Kyungpook National University, Daegu,
Republic of Korea, where he is currently an Associate Professor. His research interests
are in energy-efficient integrated circuits and systems, particularly, neuromorphic
computing, approximate computing, quantum computing, and new memory devices and architecture.