The Journal of Semiconductor Technology and Science (JSTS) is an international, peer-reviewed, and open-access journal that is published bimonthly.
- Scope: semiconductor processes, devices, circuits, and MEMS.
- Editor-in-Chief: Prof. Woo Young Choi (ECE, Seoul National University)
- Indexed within Science Citation Index Expanded (SCIE), SCOPUS, Korea Citation Index (KCI), and other databases.

[Research article]

JSTS (Journal of Semiconductor Technology and Science)

IEIE Vol. 26, No. 1, pp. 69-80

ISSN (print): 1598-1657

ISSN (online): 2233-4866

Received: 22 Sep. 2025; Revised: 15 Nov. 2025; Accepted: 25 Nov. 2025

DOI: https://doi.org/10.5573/JSTS.2026.26.1.69

An Efficient Dual-State ChaCha20 Accelerator for Secure and Real-time CAV Communications

Myeongjin Kwak¹, Jaewoong Jeong¹, Tae Hee Lee², Do Hoon Lee², Tae-Hyoung Kim², Yongtae Kim¹,*

¹ School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Korea
² Korea Intelligent Automotive Parts Promotion Institute (KIAPI), Daegu 43011, Republic of Korea

* Corresponding author: Yongtae Kim, E-mail: yongtae@knu.ac.kr

License:

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited (www.theieie.org).

Abstract

This paper presents a dual-state hardware accelerator for the ChaCha20 stream cipher, optimized for secure and low-latency communication in connected and automated vehicles (CAVs). The proposed dual-state ChaCha20 (DSCC20) architecture employs an interleaving mechanism that alternately processes two independent states, thereby eliminating idle cycles between column and diagonal rounds and keeping the round hardware fully utilized. With only one additional cycle compared to conventional single-state designs, our DSCC20 achieves substantially higher throughput while incurring minimal hardware overhead. Implemented in Verilog HDL and synthesized in a 28-nm CMOS technology, the DSCC20 delivers 30.06 Gbps throughput and 904.92 Kbps/GE area efficiency at 763 MHz, outperforming the single-state baselines and previously reported ChaCha20 designs in both throughput and area efficiency, and previously reported AES designs in area efficiency. Compared to software execution on general-purpose CPUs, the DSCC20 further demonstrates a decisive advantage in both throughput and area efficiency. These results confirm that the DSCC20 offers a compact and efficient ChaCha20 accelerator suited for secure and real-time communication in resource-constrained automotive systems.

Index Terms

ChaCha20, cipher, connected and automated vehicle (CAV), security, area efficiency

I. INTRODUCTION

Connected and automated vehicles (CAVs) are rapidly emerging as a transformative technology in the automotive domain, promising significant improvements in traffic safety, energy efficiency, and user convenience ^[1-7]. By leveraging advanced sensing, communication, and control systems, CAVs enable real-time cooperation among vehicles, roadside infrastructure, and cloud-based services. A critical enabler of this ecosystem is vehicle-to-everything (V2X) communication, supported by dedicated short-range communications (DSRC) and 5G New Radio V2X (NR-V2X), which provide high data rates and low latency for exchanging information among dynamic entities on the road ^[8-12]. However, the proliferation of wireless connectivity also exposes CAVs to severe cybersecurity risks. Each vehicle continuously generates and transmits a large volume of sensitive information, including positional data, control commands, and sensor streams, all of which are attractive targets for malicious adversaries ^[13]. The expansion of the attack surface significantly increases the likelihood of unauthorized access, message injection, replay attacks, and other forms of intrusion ^[14-16]. Consequently, ensuring secure, reliable, and low-latency communication is paramount for the safe operation of CAVs.

Cryptography provides the fundamental mechanism for securing vehicular communications. Among the numerous algorithms available, the Advanced Encryption Standard (AES) and the ChaCha20 stream cipher are the two most widely adopted solutions in practice ^[17, 18]. AES, as the de facto standard for symmetric-key encryption, is widely deployed due to its proven security strength and extensive hardware/software support. ChaCha20, a modern ARX (Addition-Rotation-XOR) based stream cipher, has been standardized in IETF protocols and adopted in security frameworks such as TLS 1.3 and WireGuard due to its efficiency and strong resistance to timing-based side-channel attacks. These two algorithms constitute the core cryptographic primitives for safeguarding both internal in-vehicle data exchange and external V2X communication.

Although both AES and ChaCha20 can be implemented purely in software, their software-based execution on CPUs or GPUs presents limitations in the context of CAVs. Real-time vehicular communication is characterized by short yet frequent packets, often with strict latency constraints. Software implementations incur non-negligible per-packet overhead due to padding, mode initialization, and memory access operations in the case of AES, or due to repeated ARX operations in ChaCha20. Moreover, when these operations compete with other computation-heavy automotive workloads, the cumulative cost can impose substantial burdens on the host processor, leading to increased latency and reduced energy efficiency. This overhead is particularly problematic in safety-critical scenarios where even microseconds of delay can impact the responsiveness of autonomous driving functions. To address these challenges, dedicated hardware acceleration of symmetric-key ciphers has emerged as a promising direction. Hardware implementations can exploit parallelism and pipelining at the architectural level, thereby achieving significantly higher throughput, lower latency, and improved energy efficiency compared with software-only approaches. Prior studies have explored hardware-oriented realizations of AES, ChaCha20, and other cryptographic algorithms, demonstrating their ability to sustain high performance under resource constraints ^[19-37]. Building upon this trend, the focus of this work is to design an optimized ChaCha20 hardware accelerator tailored for secure and real-time CAV communications.

Several prior studies have explored ChaCha20 hardware implementations with varying degrees of unrolling and optimization. Early works investigated datapaths instantiating $1\times$, $4\times$, and $8\times$ quarter-round (QR) units, where the labels denote spatial parallelism. Henzen et al. reported that the $8\times$QR design achieved 6.78 Gbps, while the $4\times$QR variant offered a more favorable trade-off between throughput and area efficiency ^[20]. Nevertheless, these single-state designs preserve the sequential execution of column and diagonal rounds, leaving the round hardware underutilized during alternating phases. Other approaches emphasized different design goals. Mozaffari-Kermani et al. prioritized reliability in a 65-nm CMOS implementation, where the baseline ChaCha20 core reached 9.6 Gbps ^[21]. Serrano et al. presented an area-efficient ChaCha20 core in 180-nm technology, evaluating $1\times$, $2\times$, and $4\times$QR configurations, with the $4\times$QR design reaching 3.65 Gbps and being integrated into a RISC-V SoC ^[22]. However, the constraints of single-state scheduling and the use of older process nodes limited performance scalability. At the system level, Le et al. introduced a reconfigurable crypto accelerator that achieved up to 75.4 Gbps throughput through resource sharing and multi-core parallelism, albeit at the expense of area efficiency ^[23]. At the microarchitectural level, Rashidi et al. improved the ARX datapath using a sparse parallel prefix adder and resource sharing to reduce area and delay ^[24]. On reconfigurable platforms, Dani demonstrated an Artix-7 FPGA design achieving 5.6 Gbps using pipelining, though duplicating round logic increased hardware cost ^[25]. These studies collectively demonstrate the trade-offs in ChaCha20 hardware design: while unrolling and pipelining enhance throughput, they often either leave round hardware idle or demand excessive resources. 
This motivates the need for an alternative architecture that simultaneously maintains compact area, avoids idle cycles, and sustains high throughput; these objectives form the foundation of the dual-state interleaving approach proposed in this work.

In this paper, we first present three single-state based ChaCha20 architectures, CR-1, CR-2, and CR-4, which instantiate different degrees of round unrolling and serve as reference points for performance and efficiency comparisons. These baseline designs were originally introduced in our prior work ^[19], where we demonstrated their effectiveness in balancing throughput, area, and power. Building on this foundation, we now extend the work by proposing a dual-state ChaCha20 (DSCC20) hardware design that interleaves two independent states in an alternating fashion, thereby eliminating idle cycles and maximizing round utilization. The proposed design delivers high throughput and area efficiency while maintaining moderate hardware cost, making it well suited for resource-constrained embedded systems, including latency-sensitive applications in the automotive domain. In contrast to conventional single-state architectures, the DSCC20 alternately processes two states with only one additional cycle of overhead, resulting in significantly improved throughput and area efficiency. The architecture is described in Verilog HDL, synthesized using 28-nm CMOS technology, and comprehensively evaluated in terms of area, power, delay, throughput, and area efficiency, with results compared against both the CR baselines and existing ChaCha20 and AES designs.

In summary, the contributions of this work are outlined as follows:

- We implement three single-state based ChaCha20 architectures (CR-1, CR-2, and CR-4), representing different levels of round unrolling, to establish a consistent reference for evaluating throughput, latency, and area efficiency.
- We propose a novel dual-state ChaCha20 architecture that interleaves two independent states on shared round hardware, thereby eliminating idle cycles between column and diagonal rounds while preserving a compact datapath.
- We perform a comprehensive evaluation of the proposed DSCC20, comparing it not only with the CR baselines but also with other ChaCha20 and AES hardware implementations across various technology nodes, demonstrating its superior throughput and area efficiency for resource-constrained CAV systems.

II. BACKGROUND

ChaCha20 is a member of the ChaCha family, introduced as a modification of Salsa20 to achieve faster diffusion per round while retaining algorithmic simplicity ^[18]. It is a symmetric-key stream cipher that generates a 512-bit keystream block from a $4 \times 4$ matrix of 32-bit words, referred to as the state. The initial state consists of four components: a 128-bit constant, a 256-bit secret key, a 32-bit block counter, and a 96-bit nonce. These values are arranged in row-major order, with the constants in the first row, the key in the second and third rows, the counter in the first column of the fourth row, and the nonce in the remaining positions. All values are represented in little-endian format, consistent with the algorithm's internal word ordering. The ChaCha20 block function transforms the initialized 512-bit state through 20 rounds of computation, organized into 10 double rounds. Each double round consists of a ColumnRound() function, which operates on the vertical columns of the state, followed by a DiagonalRound() function, which mixes the diagonal elements. Once all rounds are completed, the transformed state is added element-wise to the preserved initial state to generate the final keystream block. This feed-forward addition enhances diffusion and ensures that the transformation is not trivially invertible.
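
As a minimal sketch of the state layout described above, the following Python snippet assembles the 16-word initial state in row-major order. The function name `initial_state` is a hypothetical name chosen for illustration, and the constant string "expand 32-byte k" is the standard ChaCha20 constant from RFC 8439; the text above only states that a 128-bit constant is used.

```python
import struct

# Standard ChaCha20 constant from RFC 8439 (assumed here; the text only
# mentions "a 128-bit constant").
CONSTANT = b"expand 32-byte k"


def initial_state(key, counter, nonce):
    """Return the 4x4 state as 16 little-endian 32-bit words, row-major."""
    if len(key) != 32 or len(nonce) != 12:
        raise ValueError("expected a 32-byte key and a 12-byte nonce")
    state = list(struct.unpack("<4I", CONSTANT))  # row 0: constants
    state += struct.unpack("<8I", key)            # rows 1-2: 256-bit key
    state.append(counter & 0xFFFFFFFF)            # row 3, column 0: counter
    state += struct.unpack("<3I", nonce)          # row 3, columns 1-3: nonce
    return state


if __name__ == "__main__":
    words = initial_state(bytes(range(32)), 1,
                          bytes.fromhex("000000090000004a00000000"))
    for row in range(4):
        print(" ".join(f"{w:08x}" for w in words[4 * row:4 * row + 4]))
```

Printing the state row by row makes the placement of the constants, key, counter, and nonce easy to check against the description.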

Algorithm 1 illustrates how the state $s$ is processed during two consecutive rounds. The function explicitly takes the 512-bit state $s = (s[0], s[1], \dots, s[15])$ as input and modifies it in place. During the ColumnRound(), four independent QuarterRound() operations are applied, one to each of the four columns of the $4 \times 4$ state matrix. During the DiagonalRound(), another four QuarterRound() operations are applied, this time to diagonal word groups, ensuring that every word interacts with different neighbors across rounds. By alternating these two transformations, ChaCha20 guarantees rapid and uniform diffusion of the key, counter, and nonce across the entire state. At the core of both ColumnRound() and DiagonalRound() is the QuarterRound() function, shown in Algorithm 2. This function operates on four 32-bit words of the state, $(a,b,c,d)$, and performs a fixed sequence of ARX operations. Specifically, $+$ denotes 32-bit integer addition modulo $2^{32}$, $\oplus$ is bitwise XOR, and $\lll n$ is a left rotation by $n$ bits. The choice of rotation distances (16, 12, 8, and 7) ensures that different bit positions are mixed uniformly across rounds. The QuarterRound() function provides the fundamental nonlinear transformation that propagates changes from one word across the others, and when applied repeatedly across the state matrix, it ensures strong diffusion. Over the course of 20 rounds, each word of the state is updated multiple times through different pairings, which greatly enhances the security margin of the cipher.

Algorithm 1: Two rounds of the ChaCha20 stream cipher.

Algorithm 2: Quarter round of ChaCha20.
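
The algorithm listings themselves are not reproduced in this text, so the following Python sketch restates them under the standard ChaCha20 definition (RFC 8439). The function names and the word-index choices for columns and diagonals follow that standard definition and are not copied from the paper's listings.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """Apply the ARX quarter round in place to words s[a], s[b], s[c], s[d]."""
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 7)


def column_round(s):
    """Four quarter rounds, one per column of the 4x4 state."""
    quarter_round(s, 0, 4, 8, 12)
    quarter_round(s, 1, 5, 9, 13)
    quarter_round(s, 2, 6, 10, 14)
    quarter_round(s, 3, 7, 11, 15)


def diagonal_round(s):
    """Four quarter rounds, one per (wrapped) diagonal of the 4x4 state."""
    quarter_round(s, 0, 5, 10, 15)
    quarter_round(s, 1, 6, 11, 12)
    quarter_round(s, 2, 7, 8, 13)
    quarter_round(s, 3, 4, 9, 14)


def double_round(s):
    """Two consecutive rounds: a column round followed by a diagonal round."""
    column_round(s)
    diagonal_round(s)
```

The four quarter rounds inside each round touch disjoint words, which is why they can be evaluated in parallel in hardware, whereas the diagonal round depends on the output of the preceding column round.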

ChaCha20's reliance solely on ARX operations makes it highly attractive for accelerator design. Unlike AES, it does not require substitution boxes (S-boxes), lookup tables, or Galois field multiplications, thus eliminating the need for complex memory structures or logic. Its regular matrix structure maps naturally to parallel datapaths and pipelining, allowing designers to scale throughput without introducing irregular control overhead. The predictable timing of ARX operations simplifies synthesis and timing closure, while its side-channel resistance against cache-based attacks is advantageous for secure embedded implementations. Furthermore, since ChaCha20 is stream-oriented and does not require padding, it introduces minimal per-packet overhead, which is especially beneficial in CAV communications where short, frequent messages dominate. In addition, the lightweight nature of ARX-based processing reduces switching activity and energy consumption, making it well suited for automotive platforms with stringent power budgets. These structural advantages also allow ChaCha20 accelerators to be seamlessly integrated into heterogeneous systems-on-chip (SoC), supporting both dedicated security modules and reconfigurable cryptographic engines in emerging vehicular architectures.

III. PROPOSED CHACHA20 DESIGN

The conventional ChaCha20 hardware accelerator processes one 512-bit keystream block at a time and requires multiple clock cycles to complete the 20-round block function ^[20]. In a single-state schedule, strict data dependencies between the column and diagonal rounds enforce sequential execution, which causes only one round module to be active at any given cycle. As a result, the other round logic remains idle, leading to underutilization of hardware resources and suboptimal throughput.

To mitigate this inefficiency, our prior work introduced an $N\times$ combinational-round (CR) design, in which $4 \times N$ QuarterRound units are instantiated as combinational logic to perform $N$ rounds per-cycle ^[19]. Fig. 1 illustrates the CR-1 baseline architecture. A single round is composed of four combinational QuarterRound instances, with the ColumnRound and DiagonalRound implemented as distinct combinational blocks. A round controller time-multiplexes these phases: it activates either the column or diagonal round block in each cycle and routes the 512-bit state through a multiplexer (MUX) back into the state register. This mechanism serializes the two phases over successive cycles, such that one block remains idle while the other is active. After completing the 20 rounds, the transformed state is word-wise added to the preserved initial state, and the 512-bit result is serialized in little-endian order to form the keystream block. Building on the CR-1, we also explored the CR-2 and CR-4 architectures, which reduce the total cycle count by chaining multiple rounds into a single cycle. Specifically, the CR-2 combines a column and diagonal round into one cycle, requiring 15 cycles to complete the 20-round block function, while the CR-4 chains four rounds into one cycle, reducing the latency to 10 cycles. Although this approach reduces nominal cycle count, it also lengthens the critical path substantially due to deeper combinational logic, thereby lowering the maximum achievable clock frequency. Consequently, CR-2 and CR-4 demonstrate higher theoretical throughput but suffer from degraded timing closure and reduced efficiency in practice. These observations demonstrate a fundamental trade-off: unrolling more logic per-cycle improves cycle count but worsens frequency scalability, while single-state scheduling leaves hardware resources underutilized. 
This motivates the need for an alternative approach that raises utilization without incurring a critical path penalty.

Fig. 1. CR-1 baseline single-state ChaCha20 hardware architecture with QuarterRound ARX dataflow and little-endian serializer mapping from 512-bit state to keystream.

To address this challenge, we propose a dual-state ChaCha20 architecture that interleaves two independent states in an alternating fashion. The key idea is a lightweight interleaving mechanism that processes two independent states alternately, eliminating idle cycles and keeping the round hardware fully utilized. Because the critical path remains the same as in CR-1, hardware utilization improves without additional round unrolling. Fig. 2(a) shows the overall DSCC20 hardware architecture. The inputs are provided as two consecutive tuples (Key$_1$, Nonce$_1$, Count$_1$) and (Key$_2$, Nonce$_2$, Count$_2$), representing successive blocks rather than separate physical ports. Compared with the CR-1 baseline in Fig. 1, the datapath adds only a second 512-bit state register file and a small amount of steering logic. The round logic itself is unchanged: a single ColumnRound module and a single DiagonalRound module are shared between the two states. After 20 rounds, the transformed state is combined with the preserved initial state in the Add-and-Serialize block, and the resulting keystream is XORed with the message (i.e., Message$_1$ and Message$_2$) to produce Cipher$_1$ and Cipher$_2$, respectively. The Round Control block simply selects which state feeds the round logic and where the results are written back, requiring only a small finite state machine (FSM) and a round counter.

Fig. 2. Proposed ChaCha20 hardware design; (a) dual-state architecture with shared round logic and duplicated state banks and (b) interleaved execution schedule showing that ColumnRound and DiagonalRound remain active every cycle.

This interleaved execution achieves several important advantages. By alternating two states, the design eliminates idle cycles and ensures that the Column and Diagonal modules are fully utilized in every clock cycle. At the same time, because only one round is evaluated per-cycle as in CR-1, the critical path length remains unchanged, which allows the maximum operating frequency to be preserved. The additional hardware overhead is also minimal, since the architecture requires only a second 512-bit state register, rather than duplicating the round logic. As a result, two keystream blocks are produced in nearly the same latency as a single block in CR-1, leading to a substantial improvement in effective throughput. It is worth noting that this design differs fundamentally from CR-2 and CR-4. Those architectures reduce cycle count by chaining multiple rounds in a single cycle, which improves nominal latency but significantly lengthens the critical path, making it difficult to sustain high clock frequencies. In contrast, the DSCC20 maintains the same per-cycle workload as CR-1, but achieves higher throughput by eliminating idle cycles through interleaving. This distinction allows DSCC20 to combine the advantages of high utilization and frequency scalability, resulting in a balanced trade-off between throughput and hardware cost.
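
The interleaving idea can be illustrated with a small behavioral model in Python. This is only a sketch under stated assumptions: the exact cycle-by-cycle schedule of Fig. 2(b) is not reproduced in the text, so the model assumes that the second state simply lags the first by one cycle, and all function names are hypothetical. It models only the 20 round cycles, not initialization or serialization.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 7)


def column_round(s):
    for i in range(4):
        quarter_round(s, i, 4 + i, 8 + i, 12 + i)


def diagonal_round(s):
    for i in range(4):
        quarter_round(s, i, 4 + (i + 1) % 4, 8 + (i + 2) % 4, 12 + (i + 3) % 4)


def finalize(state, init):
    """Feed-forward addition of the preserved initial state."""
    return [(x + y) & MASK32 for x, y in zip(state, init)]


def single_state(init):
    """One block: 20 round cycles, with one round module idle in each cycle."""
    s = list(init)
    for _ in range(10):
        column_round(s)
        diagonal_round(s)
    return finalize(s, init), 20


def dual_state(init_a, init_b):
    """Two blocks on shared round modules; state B lags state A by one cycle
    (assumed schedule). Returns both blocks, cycle count, and utilization."""
    a, b = list(init_a), list(init_b)
    cycles, busy = 21, 0
    for t in range(cycles):
        if t % 2 == 0:
            if t < 20:            # column module serves state A
                column_round(a); busy += 1
            if t > 0:             # diagonal module serves state B
                diagonal_round(b); busy += 1
        else:                     # modules swap states on odd cycles
            diagonal_round(a)
            column_round(b)
            busy += 2
    return finalize(a, init_a), finalize(b, init_b), cycles, busy / (2 * cycles)
```

Under this assumed schedule, two blocks finish in 21 round cycles instead of 20 for one block, which matches the "one additional cycle" stated in the text, and both round modules are busy in every cycle except the first and the last.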

Although the proposed architecture focuses on dual-state interleaving, the concept can be naturally extended to a multi-state design. By introducing additional state registers, multiple states could be interleaved in a round-robin fashion, keeping the Column and Diagonal modules continuously occupied across an even larger number of inputs. Such a multi-state scheme has the potential to further increase throughput and hardware utilization, while still avoiding the excessive logic replication of deeply unrolled designs. However, scaling beyond two states also introduces greater control complexity and additional register overhead, which must be carefully balanced against the expected performance gains.

IV. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed DSCC20 architecture, we implemented it along with three baseline ChaCha20 designs, CR-1, CR-2, and CR-4, using Verilog HDL and synthesized them with Synopsys Design Compiler and a 28-nm CMOS standard-cell library to obtain area, power, delay, and maximum operating frequency. Throughput was derived from the number of processed rounds per-cycle and the maximum achievable frequency, while area efficiency was calculated as throughput normalized by the total gate equivalent (GE) count, where one GE corresponds to the area of a two-input NAND gate. For a fair comparison, each design was evaluated at its own maximum frequency. The area was reported in kilo-gate equivalents (kGE), and power consumption was measured using gate-level switching activity estimated under typical conditions. The proposed DSCC20 architecture was further benchmarked against three categories of designs: 1) the CR-1/2/4 baselines to highlight improvements from interleaving, 2) prior ChaCha20 hardware implementations reported in the literature, and 3) representative AES hardware accelerators across various technology nodes. In addition, the software-based Libsodium ChaCha20 implementation executed on general-purpose CPUs (x86 and ARM architectures) was included as a baseline reference to demonstrate the clear performance gap between hardware and software realizations ^[38]. Finally, although AES and ChaCha20 employ different cryptographic structures (block cipher versus stream cipher), the AES hardware designs were considered as comparison points because both algorithms are widely deployed in CAV security frameworks, making the evaluation relevant to real-world automotive communication scenarios.

Table 1 summarizes the area, power, throughput, and area efficiency results for the baseline CR-1/2/4 designs and the proposed DSCC20 synthesized in 28-nm CMOS technology, as well as software ChaCha20 performance measured on modern CPUs. At 763 MHz, the proposed DSCC20 achieves 30.06 Gbps, which corresponds to $1.88\times$, $1.87\times$, and $2.32\times$ the throughput of CR-1, CR-2, and CR-4, respectively. In terms of area, DSCC20 is larger than CR-1 and CR-2 by 17.0% and 10.4%, respectively, but is 37.9% smaller than CR-4. The power consumption of the DSCC20 is $1.20\times$, $2.01\times$, and $2.62\times$ that of CR-1, CR-2, and CR-4, respectively, primarily due to the additional state register and control logic. However, the total consumption of 16.08 mW is roughly two to four orders of magnitude lower than the operating power of common automotive electronic control units (ECUs), which typically range from a few watts to hundreds of watts. As such, this overhead is insignificant and does not hinder real-time V2X operation. Despite the added area and power, DSCC20 delivers the best area efficiency at 904.92 Kbps/GE, reflecting the benefit of removing idle cycles while keeping a compact datapath. These results demonstrate that, while CR-2 and CR-4 reduce idle cycles through round unrolling, such chaining extends the critical path, prevents operation at high frequencies, and results in lower efficiency. In contrast, DSCC20 leverages dual-state interleaving to sustain higher operating frequency with balanced area overhead, thereby nearly doubling throughput compared to single-state architectures.

Table 1. Comparison of ChaCha20 hardware accelerators (CR-1, CR-2, CR-4, and DSCC20) and software baselines on Apple M4 and AMD Threadripper CPUs.

Design	Freq. (MHz)	Area (kGE)	Power (mW)	Thru. (Gbps)	Area Eff. (Kbps/GE)
CR-1	781	28.39	13.45	16.00	563.58
CR-2	472	30.10	8.01	16.10	534.97
CR-4	253	53.53	6.13	12.96	242.14
Proposed	763	33.22	16.08	30.06	904.92
Apple M4	4400	166 mm$^2$	40 W	2.06	-
AMD 5975WX	4500	324 mm$^2$	280 W	9.36	-
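
The throughput and area-efficiency columns of Table 1 can be approximately reproduced with the short Python sketch below. The 15-cycle and 10-cycle latencies of CR-2 and CR-4 are stated in Section III; the 25 cycles per block for CR-1 and 26 cycles per pair of blocks for DSCC20 are assumptions inferred from the reported frequencies and throughputs, not figures given in the text. Small deviations from the table are expected because the frequencies are rounded.

```python
BLOCK_BITS = 512  # one ChaCha20 keystream block

# freq_mhz and area_kge are taken from Table 1. Cycle counts for CR-1 (25)
# and DSCC20 (26 for two blocks) are ASSUMPTIONS inferred from the table.
DESIGNS = {
    "CR-1":   {"freq_mhz": 781, "area_kge": 28.39, "cycles": 25, "blocks": 1},
    "CR-2":   {"freq_mhz": 472, "area_kge": 30.10, "cycles": 15, "blocks": 1},
    "CR-4":   {"freq_mhz": 253, "area_kge": 53.53, "cycles": 10, "blocks": 1},
    "DSCC20": {"freq_mhz": 763, "area_kge": 33.22, "cycles": 26, "blocks": 2},
}


def throughput_gbps(d):
    """Bits produced per second: blocks * 512 * f / cycles, in Gbps."""
    return d["blocks"] * BLOCK_BITS * d["freq_mhz"] * 1e6 / d["cycles"] / 1e9


def area_efficiency_kbps_per_ge(d):
    """Throughput normalized by gate-equivalent count, in Kbps/GE."""
    return throughput_gbps(d) * 1e9 / (d["area_kge"] * 1e3) / 1e3


if __name__ == "__main__":
    for name, d in DESIGNS.items():
        print(f"{name:7s} {throughput_gbps(d):6.2f} Gbps "
              f"{area_efficiency_kbps_per_ge(d):7.2f} Kbps/GE")
```

Seen this way, the gain of DSCC20 comes from producing two blocks in roughly the cycle budget of one while keeping the clock frequency close to that of CR-1.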

The software performance of ChaCha20 was evaluated using the Libsodium library on two CPU platforms: an Apple M4 processor (ARM architecture) and an AMD Threadripper PRO 5975WX processor (x86 architecture). For a realistic comparison with vehicular communication traffic, throughput was measured with an 800-byte payload size representative of V2X messages ^[39]. To improve timing precision and amortize non-cryptographic overheads, each encryption was repeated 1,000 times per run, and the average execution time was used to compute throughput. The Apple M4 achieved 2.06 Gbps and the AMD Threadripper reached 9.36 Gbps, both with significantly higher power budgets (40 W and 280 W, respectively) compared with the 16.08 mW of DSCC20. By contrast, DSCC20 achieves 30.06 Gbps, which corresponds to roughly $15\times$ the throughput of the M4 and $3.2\times$ that of the AMD CPU. These results clearly highlight the decisive advantage of hardware acceleration, demonstrating that the proposed DSCC20 not only outperforms prior ChaCha20 baseline hardware but also vastly exceeds software implementations, making it highly suitable for real-time secure communication in CAV systems.
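
The measurement procedure described above (fixed payload, repeated runs, averaged time) can be sketched in Python as follows. This is not the benchmark code used in the paper: `placeholder_cipher` is a hypothetical stand-in that is neither ChaCha20 nor Libsodium, and the harness only illustrates how throughput follows from payload size and average execution time.

```python
import time


def throughput_gbps(payload_bytes, avg_seconds):
    """Throughput in Gbps for one payload processed in avg_seconds."""
    return payload_bytes * 8 / avg_seconds / 1e9


def measure(encrypt, payload, repeats=1000):
    """Average the time of `repeats` calls and convert it to throughput."""
    start = time.perf_counter()
    for _ in range(repeats):
        encrypt(payload)
    avg_seconds = (time.perf_counter() - start) / repeats
    return throughput_gbps(len(payload), avg_seconds)


def placeholder_cipher(data):
    # Hypothetical stand-in workload; NOT ChaCha20 and NOT Libsodium.
    return bytes(b ^ 0x5A for b in data)


if __name__ == "__main__":
    payload = bytes(800)  # 800-byte payload, as in the text
    print(f"{measure(placeholder_cipher, payload):.4f} Gbps (placeholder)")
```

For reference, a throughput of 2.06 Gbps on an 800-byte payload corresponds to an average of about 3.1 microseconds per encryption.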

Table 2 compares the proposed DSCC20 design with previously presented ChaCha20 hardware accelerators across different technology nodes. For clarity, we explicitly state that all ChaCha20 accelerators evaluated here, including the proposed DSCC20 and previously reported designs, use the standard 20-round ChaCha20 block function, ensuring that all throughput and efficiency comparisons are performed under the same cryptographic configuration. Even though the earlier designs were implemented in various process technologies, the proposed DSCC20 clearly demonstrates superior performance and efficiency. Operating at 763 MHz in a 28-nm CMOS node, the DSCC20 achieves a throughput of 30.06 Gbps, which is up to $8.2\times$ higher than the values of the earlier designs in ^[20-22]. Compared to the highest-throughput prior design in ^[23], DSCC20 still delivers higher throughput (30.06 versus 23.83 Gbps) with a smaller gate count (33.22 versus 35.75 kGE), confirming the benefit of sustaining utilization without aggressive pipelining or duplication of round logic. In terms of area, the proposed design requires 33.22 kGE, which is comparable to or smaller than many prior works, despite delivering much higher throughput. For example, the design in ^[21] consumes 56.5 kGE and the one in ^[22] requires 25.05 kGE, yet both offer significantly lower performance. Most notably, the proposed DSCC20 achieves an area efficiency of 904.92 Kbps/GE, which surpasses all prior ChaCha20 implementations by a wide margin. It improves efficiency by $4.6\times$, $5.3\times$, and $6.2\times$ compared to the designs in ^[20], ^[21], and ^[22], respectively. Even when compared with ^[23], which reports relatively high efficiency at 666.57 Kbps/GE, the proposed design still delivers 36% better efficiency, validating the advantage of dual-state interleaving over conventional single-state or partially unrolled designs. 
These comparisons confirm that the proposed DSCC20 architecture offers the best balance between throughput and hardware cost among existing ChaCha20 accelerators. By sustaining higher utilization without incurring deep unrolling or excessive logic replication, DSCC20 demonstrates a scalable architecture that outperforms prior works in both absolute throughput and normalized efficiency.

Table 2. Comparison of proposed DSCC20 with existing ChaCha20 hardware implementations.

	SCS'08 ^[20]	TECS'16 ^[21]	ISOCC'22 ^[22]	MCSoC'23 ^[23]	IJCTA'24 ^[24]	Proposed
Technology	180-nm	65-nm	180-nm	45-nm	180-nm	28-nm
Frequency (MHz)	215	307	150	510	352	763
Throughput (Gbps)	5.51	9.60	3.65	23.83	1.10	30.06
Gate Count (kGE)	28.11	56.5	25.05	35.75	6.17	33.22
Area efficiency (Kbps/GE)	196.02	169.91	145.71	666.57	178.28	904.92
Normalized frequency (MHz)	871	458	608	617	1426	763
Normalized throughput (Gbps)	22.31	14.31	14.78	28.85	4.45	30.06

Furthermore, to enable a fair comparison of throughput and frequency across different process technologies, all ChaCha20 implementations in Table 2 were normalized to an equivalent 28-nm reference node. We employed the DeepScaleTool framework, which provides voltage-frequency-delay scaling across 7-nm to 130-nm CMOS nodes ^[40]. For designs originally implemented in 180-nm technology, which DeepScaleTool does not directly support, we first applied a linear geometric scaling factor to map the results to an equivalent 130-nm node, consistent with prior literature ^[41], and subsequently normalized them to 28-nm using DeepScaleTool. This two-step process ensures consistent and defensible normalization for all evaluated designs and isolates architectural effects from technology-dependent frequency variation. The normalized results in Table 2 show that the performance advantage of DSCC20 remains substantial even after removing technology-node differences. When scaled to a common 28-nm reference, the throughput of prior ChaCha20 designs increases proportionally with technology scaling, with several implementations reaching normalized values in the range of roughly 14 to 29 Gbps. Despite this upward shift, the proposed DSCC20 design still achieves the highest normalized throughput at 30.06 Gbps, indicating that its performance benefit originates from architectural efficiency rather than fabrication node. Although our DSCC20 does not exhibit the highest normalized frequency among all designs, it still has the highest normalized throughput due to the elimination of idle cycles through dual-state interleaving. These normalized results confirm that the proposed architecture provides consistent advantages independent of process technology, reinforcing the DSCC20 design as one of the most scalable and resource-efficient ChaCha20 accelerator designs to date.
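
The relationship between the raw and normalized rows of Table 2 can be checked with a few lines of Python. The sketch assumes that throughput scales linearly with clock frequency, so that the normalized throughput equals the reported throughput multiplied by the ratio of normalized to original frequency; the DeepScaleTool scaling factors themselves are not reproduced here, and the dictionary below simply restates the table values.

```python
# (frequency MHz, throughput Gbps, normalized frequency MHz) from Table 2.
TABLE2 = {
    "[20]":     (215, 5.51, 871),
    "[21]":     (307, 9.60, 458),
    "[22]":     (150, 3.65, 608),
    "[23]":     (510, 23.83, 617),
    "[24]":     (352, 1.10, 1426),
    "Proposed": (763, 30.06, 763),
}


def normalized_throughput(freq_mhz, thr_gbps, norm_freq_mhz):
    """Assumes throughput is proportional to clock frequency."""
    return thr_gbps * norm_freq_mhz / freq_mhz


if __name__ == "__main__":
    for name, (f, thr, f_norm) in TABLE2.items():
        print(f"{name:9s} {normalized_throughput(f, thr, f_norm):6.2f} Gbps")
```

The values computed this way agree with the "Normalized throughput" row of Table 2 to within a few hundredths of a Gbps, which is consistent with rounding of the reported frequencies.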

To further assess the competitiveness of the proposed DSCC20 architecture, we compare it against several existing AES hardware accelerators. Although ChaCha20 and AES employ different cryptographic primitives (ChaCha20 is an ARX-based stream cipher, whereas AES is an S-box-based block cipher), both are widely adopted in secure automotive and vehicular communication frameworks. Therefore, evaluating the proposed DSCC20 alongside various AES designs provides meaningful insight into its viability as a hardware accelerator in practical CAV deployments.

As shown in Table 3, the proposed DSCC20 achieves 30.06 Gbps throughput with 33.22 kGE, yielding an area efficiency of 904.92 Kbps/GE. This value is considerably higher than that of most AES designs, which typically achieve roughly 100 to 234 Kbps/GE, and it also exceeds the efficiency of the most competitive prior AES implementations. Even against the optimized design in ^[30], which reaches 831.10 Kbps/GE, the DSCC20 still delivers 9% higher efficiency, albeit in a more advanced process node. These results highlight that ChaCha20, when mapped to hardware with interleaving, can rival or even surpass the efficiency of carefully optimized AES architectures. Throughput comparisons also emphasize this trend. Certain AES accelerators, such as the designs in ^[28] and ^[29], report higher peak throughput values of 35.5 to 42.7 Gbps. However, these designs incur heavy area costs (roughly 182 to 352 kGE), resulting in far lower efficiency. In contrast, the proposed DSCC20 provides a more balanced profile: it sustains over 30 Gbps throughput with a compact 33.22 kGE area footprint, demonstrating that interleaving achieves high utilization without excessive logic replication. This balance is particularly advantageous in embedded platforms where area and power budgets are strictly constrained. The structural distinction between ChaCha20 and AES also explains the observed efficiency gap. AES relies on S-box lookups and Galois field multiplications, which are expensive in hardware and often limit area efficiency unless aggressively optimized. ChaCha20, by contrast, consists solely of addition, rotation, and XOR (ARX) operations, which map naturally onto simple arithmetic and logic units. Our DSCC20 architecture further exploits this property by ensuring that the round hardware is continuously active, translating algorithmic simplicity into practical gains in both throughput and efficiency.

Table 3. Comparison of the proposed DSCC20 with existing AES hardware accelerators.

	TIFS'18 ^[26]	TVLSI'21 ^[27]	CECCT'24 ^[28]	JSTS'25 ^[29]	TC'19 ^[30]	Proposed
Technology	65-nm	7-nm	180-nm	180-nm	45-nm	28-nm
Frequency (MHz)	847	2550	277	333	787	763
Throughput (Gbps)	8.32	29.45	35.50	42.67	10.08	30.06
Gate count (kGE)	51.31	131.45	352.07	182.58	12.13	33.22
Area efficiency (Kbps/GE)	162.15	224.04	100.83	233.69	831.10	904.92
Normalized frequency (MHz)	1265	2017	1122	1349	951	763
Normalized throughput (Gbps)	12.40	23.23	143.72	172.75	12.20	30.06
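
As a quick sanity check, the area-efficiency row of Table 3 can be recomputed from the throughput and gate-count rows. The sketch below assumes that area efficiency is simply throughput divided by gate count (Kbps per GE); the function and variable names are illustrative and do not come from the paper. Because the published inputs are rounded, the recomputed values are expected to match the table only to within about 0.1 Kbps/GE.

```python
# Table 3 values: (throughput in Gbps, gate count in kGE, reported Kbps/GE).
designs = {
    "TIFS'18 [26]": (8.32, 51.31, 162.15),
    "TVLSI'21 [27]": (29.45, 131.45, 224.04),
    "CECCT'24 [28]": (35.50, 352.07, 100.83),
    "JSTS'25 [29]": (42.67, 182.58, 233.69),
    "TC'19 [30]": (10.08, 12.13, 831.10),
    "Proposed": (30.06, 33.22, 904.92),
}


def area_efficiency_kbps_per_ge(throughput_gbps, gates_kge):
    """Throughput per gate equivalent in Kbps/GE (assumed definition)."""
    kbps = throughput_gbps * 1e6  # 1 Gbps = 1e6 Kbps
    ge = gates_kge * 1e3          # 1 kGE = 1e3 GE
    return kbps / ge


# Recompute the efficiency of every design from its two raw columns.
efficiency = {
    name: area_efficiency_kbps_per_ge(thr, gates)
    for name, (thr, gates, _reported) in designs.items()
}

for name, (_thr, _gates, reported) in designs.items():
    print(f"{name:14s} recomputed {efficiency[name]:8.2f}  reported {reported:8.2f}")

# Relative advantage of the proposed design over the best prior AES design.
gain_over_ref30 = efficiency["Proposed"] / efficiency["TC'19 [30]"] - 1.0
print(f"Gain over [30]: {gain_over_ref30:.1%}")
```

The final ratio is the basis for the roughly 9% efficiency advantage over ^[30] quoted in the text.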

The normalized results in Table 3 provide a fairer comparison of the AES accelerators by removing technology-node differences. When scaled to an equivalent 28-nm reference, several AES designs exhibit significantly higher normalized frequencies and throughputs, with some exceeding 1.3 GHz in normalized frequency and two reaching normalized throughputs of roughly 144 ~ 173 Gbps. These high throughputs mainly stem from the aggressive round unrolling and deep pipelining strategies typically applied in high-performance AES cores. However, such throughput comes at the cost of substantial hardware resources: these designs require roughly 180 ~ 350 kGE, which places them far to the right on the throughput-area trade-off curve. In contrast, the proposed DSCC20 maintains a normalized throughput of 30.06 Gbps within a compact 33.22 kGE footprint, yielding a normalized efficiency that remains competitive despite its lower absolute throughput. The normalized comparison therefore indicates that the DSCC20 design achieves a favorable throughput-area trade-off, delivering high sustained performance without the heavy logic duplication characteristic of high-throughput AES accelerators.
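
The relationship between the normalized rows of Table 3 can be illustrated with a short sketch. It assumes that throughput scales linearly with clock frequency, so that the normalized throughput is the reported throughput multiplied by the ratio of normalized to reported frequency. The normalized frequencies are taken from the table as given; the scaling model that produced them is not reproduced here, and all names in the code are illustrative.

```python
# Table 3 values per design:
# (frequency MHz, throughput Gbps, normalized frequency MHz, reported normalized throughput Gbps)
table3 = {
    "TIFS'18 [26]": (847, 8.32, 1265, 12.40),
    "TVLSI'21 [27]": (2550, 29.45, 2017, 23.23),
    "CECCT'24 [28]": (277, 35.50, 1122, 143.72),
    "JSTS'25 [29]": (333, 42.67, 1349, 172.75),
    "TC'19 [30]": (787, 10.08, 951, 12.20),
    "Proposed": (763, 30.06, 763, 30.06),
}


def normalized_throughput(freq_mhz, throughput_gbps, norm_freq_mhz):
    """Scale throughput by the frequency ratio (assumes linear scaling with clock)."""
    return throughput_gbps * norm_freq_mhz / freq_mhz


# Recompute the normalized throughput of every design.
recomputed = {
    name: normalized_throughput(freq, thr, norm_freq)
    for name, (freq, thr, norm_freq, _reported) in table3.items()
}

for name, (_f, _t, _nf, reported) in table3.items():
    print(f"{name:14s} recomputed {recomputed[name]:7.2f} Gbps  reported {reported:7.2f} Gbps")
```

Under this assumption the recomputed values agree with the table to well under 1%, with the small residuals attributable to rounding of the published figures.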

In summary, these comparisons demonstrate that the proposed architecture not only advances ChaCha20 acceleration beyond prior works but also establishes ChaCha20 as a compelling alternative to AES in automotive security. By combining lightweight ARX-based operations with dual-state interleaving, the DSCC20 achieves a superior balance of throughput, area efficiency, and implementation cost, making it particularly well suited for latency-sensitive and resource-constrained CAV environments.

To provide a more comprehensive comparison, we perform a joint analysis of throughput and hardware area, as these two metrics together determine the practicality of a cryptographic accelerator in embedded systems. High throughput alone may be insufficient if achieved at excessive silicon cost, while minimal area is of limited value if throughput cannot meet real-time communication demands. Therefore, plotting throughput against area reveals the efficiency frontier and highlights which designs achieve the best balance.

Fig. 3 presents the joint throughput-area comparison of the proposed DSCC20, baseline ChaCha20 designs, earlier ChaCha20 accelerators, AES hardware implementations, and software baselines on the AMD 5975WX and Apple M4. The proposed DSCC20 is positioned on the Pareto front, achieving the best balance between throughput and area. With 30.06 Gbps throughput at only 33.22 kGE, the DSCC20 lies in the upper-left corner of the design space, which corresponds to the region of highest efficiency. This position reflects the effectiveness of dual-state interleaving in keeping the round logic continuously utilized without the excessive logic replication that characterizes deeply unrolled designs.

Compared with the single-state baselines (CR-1/2/4), the proposed DSCC20 shifts upward significantly in throughput while remaining within a similar area range. The CR-4, for instance, consumes more than 50 kGE but achieves only 12.96 Gbps, whereas the DSCC20 delivers over twice the throughput with a smaller area. This demonstrates that interleaving is superior to unrolling as a means of eliminating idle cycles, since unrolling quickly inflates gate count and extends the critical path.

Relative to prior ChaCha20 accelerators, our DSCC20 compares favorably on both axes. Reported implementations either occupy the lower-left corner with modest throughput (3 ~ 10 Gbps) or scale throughput upward at the expense of large area footprints. The proposed design, in contrast, demonstrates that high throughput can be achieved within a compact datapath, establishing a new benchmark for ChaCha20 hardware design.

When compared with AES accelerators, the proposed DSCC20 again demonstrates its balanced efficiency. Some AES designs, such as ^[28] and ^[29], reach raw throughput values exceeding 35 Gbps, but they do so with extremely high area costs ($\ge 180$ kGE), pushing them far to the right on the plot. Our DSCC20 instead provides nearly comparable throughput while using roughly 5 to 11 times fewer gates, which translates into substantially higher area efficiency.

Finally, the software baselines demonstrate the stark contrast between hardware and software execution. Despite operating at 4.5 GHz and consuming 280 W, the AMD 5975WX achieves only 9.36 Gbps, placing it far below and to the right of dedicated hardware designs. A similar observation holds for the Apple M4 processor, which delivers only 2.06 Gbps under a 40 W power budget. Both cases highlight the inefficiency of software-based ChaCha20: the DSCC20 delivers more than $3\times$ the throughput of the faster CPU at milliwatt-scale power and within a fraction of the area, underscoring the need for hardware acceleration to meet the throughput and energy-efficiency requirements of CAV communication workloads.
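
The hardware-versus-software gap can be made concrete with a small calculation based on the figures quoted in the text, including the 16.08 mW power of the DSCC20 reported in the conclusion. The sketch below computes throughput ratios and a rough energy-per-bit estimate (power divided by throughput). Energy per bit is a derived, illustrative metric that the paper does not report, and the CPU power values appear to be package-level budgets rather than measured cipher power, so the resulting numbers should be read as order-of-magnitude indications only. All names in the code are illustrative.

```python
# Figures quoted in the text: (throughput in Gbps, power in W).
platforms = {
    "DSCC20": (30.06, 16.08e-3),   # 16.08 mW
    "AMD 5975WX": (9.36, 280.0),
    "Apple M4": (2.06, 40.0),
}


def energy_per_bit_pj(throughput_gbps, power_w):
    """Rough energy per processed bit in picojoules (derived metric, an assumption)."""
    joules_per_bit = power_w / (throughput_gbps * 1e9)
    return joules_per_bit * 1e12


hw_throughput = platforms["DSCC20"][0]

# Throughput of the accelerator relative to each software baseline.
speedup = {
    name: hw_throughput / thr
    for name, (thr, _pwr) in platforms.items()
    if name != "DSCC20"
}

# Energy per bit for every platform.
energy_pj = {
    name: energy_per_bit_pj(thr, pwr)
    for name, (thr, pwr) in platforms.items()
}

for name, ratio in speedup.items():
    print(f"DSCC20 vs {name}: {ratio:.2f}x throughput")
for name, pj in energy_pj.items():
    print(f"{name:11s} ~{pj:,.2f} pJ/bit")
```

With the quoted figures, the throughput ratios come out to roughly $3.2\times$ (AMD 5975WX) and $14.6\times$ (Apple M4), while the estimated energy per bit differs by more than four orders of magnitude.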

Fig. 3. Joint throughput-area comparison of the proposed DSCC20, prior ChaCha20 and AES designs, and software-based ChaCha20 implementations.

Overall, the joint throughput-area analysis confirms that DSCC20 achieves the most favorable balance among ChaCha20 and AES designs, outperforming both prior hardware and software baselines. Its location at the Pareto frontier validates the dual-state interleaving approach as an efficient and scalable solution for high-performance, resource-constrained secure communication systems.
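
The notion of a Pareto front used above can be sketched in a few lines. A design is on the front if no other design offers at least as much throughput with no more area while being strictly better in one of the two. The example below applies this test only to the AES designs and the DSCC20 from Table 3, using raw (non-normalized) throughput; the ChaCha20 and CPU data points of Fig. 3 are not listed in this section and are therefore omitted, so this is not a reproduction of the figure. The function name is illustrative.

```python
# Table 3 values: (throughput in Gbps, gate count in kGE).
points = {
    "TIFS'18 [26]": (8.32, 51.31),
    "TVLSI'21 [27]": (29.45, 131.45),
    "CECCT'24 [28]": (35.50, 352.07),
    "JSTS'25 [29]": (42.67, 182.58),
    "TC'19 [30]": (10.08, 12.13),
    "Proposed": (30.06, 33.22),
}


def pareto_front(designs):
    """Return the names of designs that no other design dominates.

    Design A dominates design B if A has no more area and no less
    throughput than B, and is strictly better in at least one of them.
    """
    front = []
    for name, (thr, area) in designs.items():
        dominated = any(
            o_area <= area and o_thr >= thr and (o_area < area or o_thr > thr)
            for other, (o_thr, o_area) in designs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(points)
print(front)
```

On this subset, the front consists of the smallest design (^[30]), the highest-throughput design (^[29]), and the proposed DSCC20, which sits between the two extremes; the design in ^[27], for example, is dominated because the DSCC20 offers slightly higher throughput at about a quarter of the area.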

V. CONCLUSION

This paper presented a dual-state ChaCha20 hardware accelerator tailored for secure and real-time communication in CAVs. By interleaving two independent states on shared round hardware, the proposed DSCC20 architecture eliminates idle cycles while preserving the critical path length, thereby achieving a favorable balance of throughput, area, and power. When implemented in a 28-nm CMOS technology, the proposed design achieves 30.06 Gbps throughput with 904.92 Kbps/GE area efficiency, representing up to a $2.3\times$ improvement in throughput over baseline single-state architectures (CR-1/2/4). In addition, the DSCC20 surpasses earlier ChaCha20 designs and demonstrates competitive performance compared with AES accelerators.

Overall, our DSCC20 establishes ChaCha20 as a practical and efficient cryptographic primitive for vehicular security, delivering high throughput within a compact and energy-conscious design. When contrasted with software baselines, the DSCC20 provides nearly $15\times$ higher throughput than the Apple M4 (2.06 Gbps) and over $3\times$ higher throughput than the AMD Threadripper 5975WX (9.36 Gbps), while consuming only 16.08 mW of power. The proposed interleaving strategy can also be naturally extended beyond two states, offering further opportunities to scale throughput while retaining efficiency. These results demonstrate that the proposed DSCC20 provides a compact and efficient ChaCha20 accelerator well suited for future secure and real-time CAV communication systems.

ACKNOWLEDGMENT

We would like to acknowledge the financial support from the Platform R&D Program of KIAT (Korea Institute for Advancement of Technology) of the Republic of Korea (No. P0025186) and the Digital Innovation Hub project supervised by the Daegu Digital Innovation Promotion Agency (DIP) grant funded by the Korea government (MSIT and Daegu Metropolitan City) in 2025 (No. 25DIH-18).

REFERENCES

Shladover S. E. , 2017, Connected and automated vehicle systems: Introduction and overview, Journal of Intelligent Transportation Systems, Vol. 22, No. 3, pp. 190-200

Elliott D. , Keen W. , Miao L. , 2019, Recent advances in connected and automated vehicles, Journal of Traffic and Transportation Engineering (English Edition), Vol. 6, No. 2, pp. 109-131

Matin A. , Dia H. , 2022, Impacts of connected and automated vehicles on road safety and efficiency: A systematic literature review, IEEE Transactions on Intelligent Transportation Systems, Vol. 24, No. 3, pp. 2705-2736

Liu W. , Hua M. , Deng Z. , Meng Z. , Huang Y. , Hu C. , Song S. , Gao L. , Liu C. , Shuai B. , Khajepour A. , Xiong L. , Xia X. , 2023, A systematic survey of control techniques and applications in connected and automated vehicles, IEEE Internet of Things Journal, Vol. 10, No. 24, pp. 21892-21916

Vahidi A. , Sciarretta A. , 2018, Energy saving potentials of connected and automated vehicles, Transportation Research Part C: Emerging Technologies, Vol. 95, pp. 822-843

Li J. , Fotouhi A. , Liu Y. , Zhang Y. , Chen Z. , 2024, Review on eco-driving control for connected and automated vehicles, Renewable and Sustainable Energy Reviews, Vol. 189, pp. 114025

Tamang L. D. , Kim B. W. , 2021, Optical camera communication for vehicular applications: A survey, IEIE Transactions on Smart Processing and Computing, Vol. 10, No. 2, pp. 136-145

Kenney J. B. , 2011, Dedicated short-range communications (DSRC) standards in the United States, Proc. IEEE, Vol. 99, No. 7, pp. 1162-1182

Garcia M. H. C. , Molina-Galan A. , Boban M. , Gozalvez J. , Coll-Perales B. , Sahin T. , Kousaridas A. , 2021, A tutorial on 5G NR V2X communications, IEEE Communications Surveys & Tutorials, Vol. 23, No. 3, pp. 1972-2026

Jellid K. , Mazri T. , 2020, DSRC vs LTE V2X for autonomous vehicle connectivity, Proc. International Conference on Smart City Applications (SCA), pp. 381-394

Elbery A. , Sorour S. , Hassanein H. , Sediq A. B. , Abou-zeid H. , 2021, To DSRC or 5G? A safety analysis for connected and autonomous vehicles, Proc. Global Communications Conference (GLOBECOM), pp. 1-6

Clancy J. , Mullins D. , Deegan B. , Horgan J. , Ward E. , Eising C. , Denny P. , Jones E. , Glavin M. , 2024, Wireless access for V2X communications: Research, challenges and opportunities, IEEE Communications Surveys & Tutorials, Vol. 26, No. 3, pp. 2082-2119

Yelure B. , Patokar A. , Patil S. , Mawale R. , Nemade S. , Gaikwad V. , 2024, Impact and Analysis of Attacks on Routing Protocols in Vehicular Ad hoc Network (VANET): Assessing Security Threats, IEIE Transactions on Smart Processing & Computing, Vol. 13, No. 3, pp. 294-302

Gupta S. , Maple C. , Passerone R. , 2023, An investigation of cyber-attacks and security mechanisms for connected and autonomous vehicles, IEEE Access, Vol. 11, pp. 90641-90669

Wang Z. , Wei H. , Wang J. , Zeng X. , Chang Y. , 2022, Security issues and solutions for connected and autonomous vehicles in a sustainable city: A survey, Sustainability, Vol. 14, No. 19, pp. 12409

Sadaf M. , Iqbal Z. , Javed A. R. , Saba I. , Krichen M. , Majeed S. , Raza A. , 2023, Connected and automated vehicles: Infrastructure, applications, security, critical challenges, and future aspects, Technologies, Vol. 11, No. 5, pp. 117

Daemen J. , Rijmen V. , 1998, AES proposal: Rijndael, Proc. Advanced Encryption Standard Candidate Conference

Bernstein D. J. , 2008, ChaCha, a variant of Salsa20, Workshop record of SASC, Vol. 8, No. 1, pp. 3-5

Kwak M. , Lee T. H. , Lee D. H. , Kim T.-H. , Kim Y. , 2025, An Area-Efficient ChaCha20 Hardware Accelerator Design for Secure and Real-Time Communication in CAVs, Proc. International Conference on Ubiquitous and Future Networks (ICUFN)

Henzen L. , Carbognani F. , Felber N. , Fichtner W. , 2008, VLSI hardware evaluation of the stream ciphers Salsa20 and ChaCha, and the compression function Rumba, Proc. International Conference on Signals, Circuits and Systems (SCS), pp. 1-5

Mozaffari-Kermani M. , Azarderakhsh R. , Aghaie A. , 2016, Fault detection architectures for post-quantum cryptographic stateless hash-based secure signatures benchmarked on ASIC, ACM Transactions on Embedded Computing Systems, Vol. 16, No. 2, pp. 1-19

Serrano R. , Sarmiento M. , Duran C. , Hoang T. , Pham C. , 2022, A 3.65 Gb/s Area-Efficiency ChaCha20 Cryptocore, Proc. International SoC Design Conference (ISOCC), pp. 79-80

Le V. T. D. , Pham H. L. , Tran T. H. , Duong T. S. , Nakashima Y. , 2023, High-efficiency Reconfigurable Crypto Accelerator Utilizing Innovative Resource Sharing and Parallel Processing, Proc. International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp. 576-583

Rashidi B. , 2024, High-Performance Hardware Structure of ChaCha20 Stream Cipher Based on Sparse Parallel Prefix Adder, International Journal of Circuit Theory and Applications, Vol. 53, No. 5, pp. 2947-2957

Dani V. , 2023, Implementing ChaCha20: analysis on performance, resource utilization and side-channel protection

Pammu A. A. , Ho W. , Chong K. , Gwee B. , 2018, A high throughput and secure authentication-encryption AES-CCM algorithm on asynchronous multicore processor, IEEE Transactions on Information Forensics and Security, Vol. 14, No. 4, pp. 1023-1036

Nannipieri P. , Matteo S. D. , Baldanzi L. , Crocetti L. , Zulberti L. , Saponara S. , Fanucci L. , 2021, VLSI design of Advanced-Features AES CryptoProcessor in the framework of the European Processor Initiative, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 30, No. 2, pp. 177-186

Deng Q. , Li T. , Wang H. , Wang Y. , 2024, A pipelined hardware implementation of AES with S-box based on right-skewed ECA, Proc. International Conference on Electronics, Computers and Communication Technology (CECCT), pp. 32-36

Lee Y. , Kang J. , Lee J. , 2025, A Lightweight AES-256 Accelerator Design through Processing Order Optimization for Low-cost Hardware Security, Journal of Semiconductor Technology and Science, Vol. 25, No. 4, pp. 406-413

Ueno R. , Morioka S. , Miura N. , Matsuda K. , Nagata M. , Bhasin S. , Mathieu Y. , Graba T. , Danger J.-L. , Homma N. , 2019, High throughput/gate AES hardware architectures based on datapath compression, IEEE Transactions on Computers, Vol. 69, No. 4, pp. 534-548

Lee D. , Kwak M. , Lee J. , Kim B. , Kim Y. , 2022, A Light-Weight AES Design using LFSR-based S-Box for IoT Applications, IEIE Transactions on Smart Processing & Computing, Vol. 11, No. 2, pp. 140-148

Choi I. , Kim J.-H. , 2016, Area-Optimized Multi-Standard AES-CCM Security Engine for IEEE 802.15.4/802.15.6, Journal of Semiconductor Technology and Science, Vol. 16, No. 3, pp. 293-299

Baik J. , Kim Y. , 2022, A High-Throughput and Energy-Efficient SHA-256 Design using Approximate Arithmetic, IEIE Transactions on Smart Processing & Computing, Vol. 11, No. 5, pp. 385-391

Kong W. , Choi P. , Kim D. K. , 2020, Hardware Implementation of Lightweight Block Ciphers for IoT Sensors, Journal of Semiconductor Technology and Science, Vol. 20, No. 4, pp. 391-389

Jeong C. , Kim Y. , 2017, Efficient FPGA Implementation of AES-CCM for IEEE 1609.2 Vehicle Communications Security, IEIE Transactions on Smart Processing & Computing, Vol. 6, No. 2, pp. 133-139

Yu H. , Kim Y. , 2020, New RSA Encryption Mechanism Using One-Time Encryption Keys and Unpredictable Bio-Signal for Wireless Communication Devices, Electronics, Vol. 9, No. 2, pp. 1-12

Lee D. , Kim Y. , 2021, Design of a Light-Weight Key Scheduler for AES using LFSR for IoT Applications, Proc. IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), pp. 1-2

2025, Libsodium

Bauder M. , Festag A. , Kubjatko T. , Schweiger H. , 2024, Data accuracy in Vehicle-to-X cooperative awareness messages: An experimental study for the first commercial deployment of C-ITS in Europe, Vehicular Communications, Vol. 47, pp. 100744

Sarangi S. , Baas B. , 2021, DeepScaleTool: A tool for the accurate estimation of technology scaling in the deep-submicron era, Proc. International Symposium on Circuits and Systems (ISCAS), pp. 1-5

Chen Y. , Chang B. , Yang C. , Chiueh T. , 2021, A high-throughput FPGA accelerator for short-read mapping of the whole human genome, IEEE Transactions on Parallel and Distributed Systems, Vol. 32, No. 6

Myeongjin Kwak

Myeongjin Kwak received his B.S. and M.S. degrees in computer science and engineering from Kyungpook National University, Daegu, Republic of Korea, in 2021 and 2023, respectively, where he is currently pursuing a Ph.D. degree. His research interests include neuromorphic computing, quantum computing, and hardware accelerators.

Jaewoong Jeong

Jaewoong Jeong is currently pursuing a B.S. degree in the School of Computer Science and Engineering at Kyungpook National University, Daegu, Republic of Korea. His research interests include computer architecture, cryptographic hardware, and hardware design.

Tae Hee Lee

Tae Hee Lee received his B.S. degree in control and measurement engineering from Changwon National University, Gyeongsangnam-do, Republic of Korea, in 2003, and his M.S. and Ph.D. degrees in electrical engineering from Kyungpook National University, Daegu, Republic of Korea, in 2012 and 2019, respectively. He worked as a researcher at PHA, Daegu, Republic of Korea, from 2003 to 2008, and has been the Director of the Testing and Evaluation Division at the Korea Intelligent Automotive Parts Promotion Institute (KIAPI), Daegu, Republic of Korea, since 2008. His current research interests include autonomous driving and electric vehicle evaluation and certification.

Do Hoon Lee

Do Hoon Lee received his M.S. degree in mechanical engineering from Kyungpook National University, Daegu, Republic of Korea, in 2024. He has been with the Korea Intelligent Automotive Parts Promotion Institute (KIAPI), Daegu, Republic of Korea, since 2015, where he is engaged in real-vehicle testing and performance evaluation of autonomous driving and advanced driver-assistance systems (ADAS).

Tae-Hyoung Kim

Tae-Hyoung Kim received his B.S., M.S., and Ph.D. degrees in electrical engineering from Kyungsung University, Busan, Republic of Korea, in 2003, 2005, and 2009, respectively. He worked as a Principal Researcher in the Future Vehicle Research Team at the Daegu Mechatronics & Materials Institute from 2010 to 2022. He has been with the Korea Intelligent Automotive Parts Promotion Institute (KIAPI), Daegu, Republic of Korea, as a General Manager in the Autonomous Driving Evaluation Department since 2023. His current research interests are autonomous mobility system control and evaluation.

Yongtae Kim

Yongtae Kim received his B.S. and M.S. degrees in electrical engineering from Korea University, Seoul, Republic of Korea, in 2007 and 2009, respectively, and a Ph.D. degree from the Department of Electrical and Computer Engineering at Texas A&M University, College Station, TX, in 2013. From 2013 to 2018, he was a software engineer with Intel Corporation, Santa Clara, CA. Since 2018, he has been with the School of Computer Science and Engineering at Kyungpook National University, Daegu, Republic of Korea, where he is currently an Associate Professor. His research interests are in energy-efficient integrated circuits and systems, particularly neuromorphic computing, approximate computing, quantum computing, and new memory devices and architecture.