Hoon Hwi Lee1
Min Jae Kim1
Jun Woo You1
Hyung Jun Jang1
Won Woo Ro1
(Yonsei University, Seoul, Korea)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index Terms
CXL, SSD, memory, cache, memory hierarchy, analysis
INTRODUCTION
CXL enhances scalability across various memory devices, enabling memory expansion up to tens of terabytes [1]. CXL-enabled systems allow the CPU to access memory devices directly, so devices connected via CXL can function as system memory. Although the CXL interface exhibits slower latency than the traditional DIMM interface commonly used for DRAM, strategies to mitigate this issue involve promoting hot data to DRAM while relegating cold data to CXL devices [2,3]. While the CXL interface allows SSDs, commonly used as storage, to serve as system memory, this adaptation creates new challenges for the traditional memory hierarchy, as the SSD is a non-volatile device with slow latency. This paper explores the challenges that CXL-SSDs pose to the overall system, from the CPU to the SSD, identifying potential areas for improvement in CXL technologies.
We have found that when SSDs serve as system memory, their mismatches with the old memory hierarchy and the inherent latency of the SSD become a significant bottleneck, severely hindering practical applications. Moreover, the hierarchy mismatches of the CXL-SSD are not an issue for the CXL-SSD alone but also harm DRAM data. Our analysis reveals that memory instructions to CXL-SSDs, which bypass DRAM caching, lead to CPU resource contention, performance bottlenecks due to slow latency, and unnecessary NAND accesses; these issues have not been previously addressed but are critical for the practical deployment of CXL-SSDs. Based on our findings, we propose addressing the CPU resource contention, mitigating the slow latency of the CXL-SSD, and improving inefficient SSD access by introducing a cache memory for the CXL-SSD. We also present an approach to effectively managing memory-purpose data within the SSD by utilizing its persistent characteristics for non-temporal operation. The contributions of this paper are as follows:
• We conduct a pioneering study on the disparities between memory-like SSDs enabled by CXL and traditional memory systems, exploring these differences from the standpoint of inherent SSD characteristics.
• We investigate the performance issues of memory instructions on CXL-SSDs caused by their differences from the traditional memory hierarchy.
• We demonstrate the CPU resource contention caused by the CXL-SSD and its side effect on DRAM performance.
• We examine the issues that arise when memory instructions directly access SSDs, including exposure to NAND flash latency and challenges related to the SSD's internal cache.
• Based on our findings, we suggest improvements: a dedicated cache for CXL-SSDs and a proposal for an SSD capable of effectively managing memory-like data.
I. Related Work & Motivation
Emerging SSDs. Recent studies on SSDs have focused on transitioning SSDs from black-box to white-box configurations to facilitate data management. This includes research on Open-Channel SSDs [4,5], Zoned Namespace (ZNS) SSDs [6-10], and software-managed white-box SSDs [11,12]. Furthermore, efforts are underway to utilize SSDs as host memory extensions, exemplified by implementations of CXL-SSDs [13-15]. When SSDs are employed as host memory through the CXL interface, the CPU accesses the SSD directly through white-box addressing with the CPU cache, bypassing traditional DRAM caching. This direct access enables the CPU to utilize the SSD's vast capacity as system memory, just as it does traditional DRAM. Furthermore, CXL data can be cached in the low-latency CPU cache. In contrast, much PCIe data cannot be cached in the CPU due to security and consistency issues and must instead use the DRAM cache [16,17], which has slower latency. However, in our observations, this characteristic of the CXL-SSD is a double-edged sword: even though the CPU cache has low latency, its capacity is greatly limited, and it already suffers from capacity starvation with DRAM instructions alone. This paper discusses challenges in CXL-SSDs related to this CPU cache starvation and provides guidance on future memory hierarchies that can overcome these challenges, aiming to optimize their application in modern computing architectures.
CXL, memory, cache. Research on CXL memory has been diverse and extensive. Among these studies, significant efforts have focused on CXL-SSDs [13-15] and CXL-applied caches [18-20], with the goal of enhancing the performance of CXL-enabled systems. Building on this prior CXL research, we identify a crucial issue involving cache misses when integrating CXL-SSDs: a cache miss leads to access to a slower device, which is especially critical for CXL-SSDs, which have the worst latency among memory devices. This paper analyzes how these dynamics affect the overall system and proposes potential enhancements for CXL-SSDs from a memory management perspective, including improvements in the memory hierarchy that can effectively utilize the CXL-SSD.
II. ANALYSIS OVERVIEW
The primary objective of this study is to analyze challenges of the CXL-SSD related to the system cache and to propose recommendations for its improvement. For this, we used a full-system simulation with gem5 incorporating DRAMsim3 and MQSim [21], which supports CXL-SSD emulation. Our analysis follows three main dimensions:
1) Technical Perspective: How does the existing memory management scheme differ when
applied to CXL-SSDs?
2) CPU-side: What are the implications of CXL-SSDs from the CPU's perspective?
3) SSD-side: What problems arise within SSDs when used as memory through CXL?
Our analysis first presents a technical review to understand how CXL-SSDs differ from DRAM and conventional SSDs. Subsequently, we examine the impacts of the CXL-SSD on the CPU side that could affect the overall system. From these examinations on the CPU side, we observe the ramifications on the SSD side. Based on these insights, we aim to provide decisive directions for improvements in implementing the CXL-SSD. We evaluated a CXL-SSD environment with the RocksDB-based YCSB workload [22], an in-memory database system [23], and various neural network workloads, as can be seen in Section IV and Section V. The specific configurations of our experiment are listed in Table 1. As shown in Table 1, we scaled down the system compared to real-world settings to make the simulation feasible while preserving key behavioral characteristics. Since this evaluation aims to analyze cache misses and SSD access patterns in a CXL-based system, the scaled-down configuration does not affect the validity of the observed trends. In the experiment, we evaluated a maximum of 100,000,000 instructions with 1,000,000 warm-up instructions.
Table 1. Experiment configurations.

L1 Cache       | 0.5 KiB ICache / 0.5 KiB DCache
L2 Cache       | 2 KiB
L3 Cache       | 32 KiB
Host DRAM      | 4 GB
SSD Capacity   | 64 GB
SSD DRAM       | 64 MB
Channels       | 8
Chips/channel  | 8
Planes/chip    | 4
Blocks/plane   | 32
Pages/block    | 512
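For concreteness, the short Python sketch below derives the total page count and the implied page size from the NAND geometry in Table 1. The 16 KiB page size is our own inference, assuming the 64 GB capacity is spread evenly over the geometry; it is not a value stated explicitly in the configuration.

# Derive the simulated NAND geometry from Table 1. The implied page size is
# an inference, assuming the 64 GB capacity is spread evenly over
# channels x chips x planes x blocks x pages.
channels, chips_per_channel, planes_per_chip = 8, 8, 4
blocks_per_plane, pages_per_block = 32, 512
ssd_capacity_bytes = 64 * 1024**3  # 64 GB, treated as GiB for this sketch

total_pages = (channels * chips_per_channel * planes_per_chip
               * blocks_per_plane * pages_per_block)
implied_page_size = ssd_capacity_bytes // total_pages

print(f"total pages       : {total_pages:,}")                  # 4,194,304
print(f"implied page size : {implied_page_size // 1024} KiB")  # 16 KiB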
III. TECHNICAL ANALYSIS: MEMORY MANAGEMENT
CXL-SSDs are non-volatile storage devices capable of functioning as memory. Prior
to conducting experimental analyses, we will explore the structural challenges posed
by CXL-SSDs, their incompatibility with traditional memory hierarchies, and their
unique characteristics.
1. Mismatches in Old Memory Hierarchies
With the emerging CXL interface, mismatches in the traditional memory hierarchy between DRAM and SSD have surfaced due to their different usages. In the old memory hierarchy, memory usually refers to DRAM, which serves temporary volatile data, while storage typically refers to non-volatile devices serving permanent data. However, with the emergence of memory-intensive applications (e.g., neural networks and databases), the memory wall issue [24] has become prominent due to insufficient memory capacity. Even though DRAM, commonly used as memory, offers reasonable capacity and access latency at an affordable price, the DIMM interface implemented in DRAM is susceptible to signal interference issues [1]; increasing its capacity entails significant expense and presents scalability limitations. This has driven the adoption of the CXL interface as an alternative to DIMM to address these limitations. Meanwhile, the SSD, traditionally used as storage and offering vast capacity compared to DRAM, can now be used as memory with the emerging CXL interface. The CXL-SSD is a non-volatile device that can serve a role similar to traditional DRAM memory, with a latency-capacity tradeoff relative to DRAM. The CXL-SSD allows the CPU to access the SSD directly via the memory mapping table, unlike traditional filesystem-based SSDs [13]. Although traditional storage systems have also supported utilizing storage as memory via swap memory and memory-mapped files (MMF), the CXL-SSD uniquely allows the host to access storage directly through a physical address, enabling its use as standard system memory. Being treated as memory is a double-edged sword: although the CXL-SSD's latency is comparable to that of traditional storage, its effective latency as memory largely depends on the CPU cache [13], and its usage as memory prevents DRAM caching within the old memory hierarchy. This lack of DRAM caching is a critical issue that exposes the system to storage latency even for simple memory load/store instructions, undermining the rationale for using the CXL-SSD.
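To illustrate the distinction above, the Python sketch below touches a file through a memory-mapped file (MMF), the traditional path in which every access is mediated by the kernel page cache in DRAM; under the CXL-SSD model described here, an ordinary load/store to a mapped physical address would reach the device without that DRAM-backed layer. The file name is hypothetical and the snippet only demonstrates the MMF side of the comparison.

# Memory-mapped file (MMF) access: loads and stores are absorbed by the
# kernel page cache in DRAM, unlike the direct load/store path of a CXL-SSD.
import mmap

path = "example.dat"  # hypothetical file used only for this sketch
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)          # create a 4 KiB backing file

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 4096)  # map the file into the address space
    m[0:4] = b"CXL!"                 # store: lands in the page cache first
    data = m[0:4]                    # load: served from the page cache
    m.close()
print(data)                          # b'CXL!'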
2. Unique Characteristics of CXL-SSD
CXL-SSDs used as memory have unique characteristics because the CXL-SSD serves as memory while the SSD remains a storage device. In the old memory hierarchy, having processors access data directly from storage is mostly inefficient, so frequently used data are typically cached in memory. Modern computer systems use a directory-based filesystem to manage permanent storage data, where a page cache is used [25]. When the system accesses storage data, the data node in the filesystem points to the cache address, which resides in memory. If the data have previously been accessed and cached, file I/O can be served directly from memory without accessing storage. However, unlike memory-to-storage caching, memory-to-memory caching for the CXL-SSD requires a different mechanism; it was not a concern in the old memory hierarchy but has recently become one due to the emergence of diverse memory devices. One study presents a technique that enables caching for slow memory by utilizing data temperature to drive page promotion and demotion between slow and fast devices [2]. This technique demotes cold data, which is infrequently accessed, to lower devices and promotes hot data, which is frequently accessed, to upper devices like DRAM, thereby improving access latency for hot data. Nevertheless, such data movement between memory devices results in the loss of the non-volatile characteristics of non-volatile memory like CXL-SSDs, eliminating the possibility of exploiting non-volatile advantages, which is a significant limitation. Although Optane memory is another non-volatile memory device, this limitation is especially significant for CXL-SSDs because most of the data in SSDs are used for non-volatile permanent storage, which means preserving the non-volatile characteristics is essential.
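As a minimal sketch of the promotion/demotion approach described above (in the spirit of [2], not its actual implementation), the Python code below tracks per-page access counts and moves pages between a fast tier and a slow tier at the end of each epoch. The threshold, epoch model, and page count are illustrative assumptions.

# Temperature-based promotion/demotion between a fast tier (DRAM) and a slow
# tier (e.g., CXL memory). Threshold and tier sizes are assumed for illustration.
from collections import defaultdict

HOT_THRESHOLD = 4                      # accesses per epoch before promotion (assumed)
access_count = defaultdict(int)
fast_tier, slow_tier = set(), set(range(16))   # 16 pages start in the slow tier

def access(page: int) -> None:
    access_count[page] += 1

def end_of_epoch() -> None:
    """Promote hot pages, demote pages that went cold, then reset counters."""
    for page in list(slow_tier):
        if access_count[page] >= HOT_THRESHOLD:
            slow_tier.discard(page)
            fast_tier.add(page)        # promotion: data migrates up to DRAM
    for page in list(fast_tier):
        if access_count[page] == 0:
            fast_tier.discard(page)
            slow_tier.add(page)        # demotion: cold data migrates back down
    access_count.clear()

for _ in range(5):
    access(3)                          # page 3 becomes hot in this epoch
end_of_epoch()
print(sorted(fast_tier))               # [3]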
3. Challenges of Traditional Caching for Memory-Semantic CXL-SSDs
In this technical analysis, we explored the structural challenges associated with
CXL-SSDs when used as main memory from a theoretical perspective. Unlike traditional
SSDs operating under a file system abstraction, CXL-SSDs are directly mapped into
the system's physical memory address space and accessed via load/store instructions,
akin to DRAM. This architecture results in CPU memory instructions being directly exposed to the slower latency of the SSD because the DRAM-based caching used with conventional SSDs becomes impractical. Given that CPU memory instructions play a critical role
in computation, minimizing latency in the memory load/store operations is essential,
underscoring the need for a novel memory-to-memory caching architecture that bridges
DRAM and SSD. The design of this new caching architecture must account for the non-volatile
nature of SSD data, which is vital for exploiting the advantages of non-volatile memory.
These advantages include serving data stored in the storage layer directly as memory without requiring data migration. However, since DRAM is volatile, implementing
a promotion-demotion scheme between DRAM and SSD is unsuitable. Instead, a dedicated
temporary cache memory for CXL-SSD is necessary. A key point to consider is why CXL-SSDs
cannot utilize traditional caching mechanisms. In traditional SSDs, data is accessed
through the file system, which utilizes a page cache for temporal caching and address
indirection. Conversely, CXL-SSDs bypass the file system entirely, providing direct
memory-mapped access to SSD regions, thereby eliminating the abstraction layer where
caching typically resides. This memory-semantic access model, while improving flexibility
and uniformity, inherently lacks the infrastructure to support traditional temporal
caching, emphasizing the need for an alternative memory-to-memory caching design.
IV. CPU-SIDE: CXL CACHE WITHIN CPU
With CXL-SSDs, unlike the traditional DRAM caching used for file-based SSDs, CXL.mem accesses to the SSD are not cached in DRAM but in the CPU cache, whose capacity is small; this leads to cache starvation and a higher cache miss rate. In this section, we analyze the impact of CXL-SSDs on CPU cache performance and its underlying causes.
1. Analyzing CPU Cache Miss
Fig. 1 compares CPU cache miss rates in database applications between traditional DRAM-only memory and DRAM combined with a CXL-SSD used as memory. While the typical DRAM memory exhibits a miss rate of 29.77%, using DRAM with the CXL-SSD results in a miss rate of 34.37%, a relative increase of 15.44%. This increase can be attributed to the latency difference between DRAM and the CXL-SSD. During a CPU cache miss, cache evictions and fetches are triggered toward the memory devices. The slow latency of the CXL-SSD prolongs the eviction and fetch processes, consequently worsening CPU resource traffic, which increases the CPU cache miss rate even for DRAM operations, although DRAM is a latency-critical device.
Fig. 1. CPU cache miss rate for a DRAM-only system and a CXL-enabled system.
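The 15.44% figure above is the relative increase of the miss rate; the short Python check below reproduces it from the two absolute rates in Fig. 1 (the small deviation from 15.44% comes from rounding of the inputs).

# Relative increase of the CPU cache miss rate when the CXL-SSD is added.
dram_only_pct = 29.77       # miss rate (%) of the DRAM-only system (Fig. 1)
with_cxl_ssd_pct = 34.37    # miss rate (%) of the DRAM + CXL-SSD system (Fig. 1)

relative_increase = (with_cxl_ssd_pct - dram_only_pct) / dram_only_pct * 100
print(f"{relative_increase:.2f}%")  # 15.45% with rounded inputs; reported as 15.44%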
2. CPU Resource Contention due to CXL-SSD
To discuss the CPU cache, it is essential to mention the stages of the memory instruction pipeline. The processing sequence for a memory instruction includes fetch, decode, rename, issue, execute, write-back, and commit. During the execution stage of a memory instruction, the CPU attempts to use the DCache in the CPU cache rather than directly accessing memory. If a cache miss occurs, an MSHR is used to prevent stalling until the data is fetched into the cache. When the CPU cache is full, existing data in the CPU cache must be evicted to memory before new data can be fetched. The slow latency of the CXL-SSD significantly prolongs the time required to evict and fetch data into the cache, increasing the occupancy of the cache, the MSHRs, and the load/store instruction queues at the execution stage. This delay during the execution stage can result in resource contention within the reorder buffer at the issue stage. As a result, the pipeline bottleneck at the preceding rename stage leads to underutilized memory bandwidth and slower data fetching.
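A back-of-the-envelope way to see why slower fills congest these structures is Little's law: average occupancy equals the outstanding-miss rate multiplied by the fill latency. In the Python sketch below, both latencies and the miss rate are illustrative assumptions rather than values measured in our simulation.

# Little's law sketch: average MSHR occupancy = miss rate x fill latency.
# The miss rate and latencies below are illustrative assumptions only.
def avg_mshr_occupancy(misses_per_ns: float, fill_latency_ns: float) -> float:
    return misses_per_ns * fill_latency_ns

miss_rate = 0.02          # assumed misses per ns reaching the memory devices
dram_fill_ns = 100        # assumed DRAM fill latency
cxl_ssd_fill_ns = 20_000  # assumed CXL-SSD fill latency

print(avg_mshr_occupancy(miss_rate, dram_fill_ns))     # 2.0 entries on average
print(avg_mshr_occupancy(miss_rate, cxl_ssd_fill_ns))  # 400.0 entries: far beyond
# typical MSHR counts, so misses back up into the load/store queue and ROB.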
3. Challenges in CPU Cache Underutilization in CXL-SSD
When DRAM and CXL-SSDs coexist, latency-critical memory instructions issued by the CPU can be adversely affected by the slower latency of CXL-SSDs. The CXL-SSD increases cache occupancy due to eviction-fetch delays, leading to a higher cache miss rate. Since the primary performance bottleneck of memory instructions is data movement, the inability to efficiently utilize the CPU cache and the resulting data transfers to memory are a critical issue. Additionally, the CPU resource contention introduced by CXL-SSDs causes substantial deviations from the standard DRAM access times anticipated by the CPU, which may lead to unforeseen challenges. Moreover, employing a large CXL-SSD as memory deepens cache capacity starvation compared to a DRAM-only memory system. To mitigate these issues, it is crucial to provide sufficient cache capacity.
V. SSD-SIDE: CXL DATA WITHIN SSD
In the previous Section IV, we analyzed the impact of CXL-SSDs on the CPU side. In this section, we investigate how using CXL-SSDs as memory influences the internals of the SSD.
1. Analyzing NAND Access Count
Fig. 2 shows the NAND access count of a conventional SSD and a CXL-SSD. This experiment was conducted in an environment where the conventional SSD was configured with 4 GB of host DRAM memory, while the CXL-SSD was configured with 4 GB of CXL memory. Because the SSD is used as memory, the CXL-SSD's NAND access count is remarkably high, on average 101.56% higher than that of conventional SSDs using a DRAM cache. Even though the performance advantage of the CXL-SSD mainly comes from CPU cache latency [13], the CPU cache size is limited while the SSD is enormous, which means most data cannot be cached, leading to increased NAND accesses. Although the CPU cache is faster than the DRAM cache, its capacity starvation causes more accesses to the SSD, which leads to frequent NAND accesses. Even though using the CPU cache is a key advantage of CXL-SSDs, just a few cache misses can have severe consequences because NAND flash latency is approximately 1,000,000 times slower than CPU cache latency [26]. This represents a significant challenge to the practical adoption of CXL-SSDs.
Fig. 2. NAND access count for CXL-SSD normalized to conventional SSD.
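To quantify why even a few misses are so costly, a standard average memory access time (AMAT) estimate suffices: AMAT = hit time + miss rate x miss penalty. The latencies in the Python sketch below are illustrative assumptions chosen only to reflect the roughly 1,000,000-fold cache-to-NAND gap cited from [26].

# AMAT = hit_time + miss_rate * miss_penalty. Latencies are assumed values
# reflecting the ~1,000,000x CPU-cache-to-NAND gap cited in the text.
def amat_ns(hit_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    return hit_ns + miss_rate * miss_penalty_ns

cache_hit_ns = 0.1         # assumed CPU cache hit time
nand_penalty_ns = 100_000  # assumed NAND penalty (1,000,000x the hit time)

for rate in (0.001, 0.01, 0.05):
    print(f"miss rate {rate:>5.1%}: AMAT = {amat_ns(cache_hit_ns, rate, nand_penalty_ns):,.1f} ns")
# Even a 0.1% miss rate to NAND pushes AMAT to ~100 ns; at 5% it exceeds 5,000 ns.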
2. Avoidable NAND Accesses
Fig. 3 shows the fraction of cache misses that could have been avoided with adequate cache capacity. On average, 87.87% of cache-missing memory instructions had previously been cached but were later evicted due to the constrained cache size, forcing NAND accesses with roughly 1,000,000 times slower latency [26]. These memory instructions would not require direct NAND access if the cache evictions were prevented. The detailed reasons for this eviction and re-fetch behavior were explained in Section IV-2. To discuss data processing at the SSD level, it is crucial to understand how data is managed within the SSD. When a memory instruction accesses the SSD, it retrieves the address from the physical memory mapping table and issues a write or read command to the SSD for the corresponding physical address. At this stage, the CPU operates with a physical address; however, this address is a virtual physical address defined by the memory mapping table, not the actual physical location of the data. Within the SSD, the physical address provided by the host functions as a logical address. The SSD translates this logical address to the actual physical location of the data through an internal address translation process. In some scenarios, previously accessed data may be retrieved directly from the SSD's internal cache, eliminating the need to access the NAND directly. In this experiment, a 64 MB DRAM cache was allocated, consistent with the typical capacity of conventional SSDs [27,28]. Despite the presence of this internal DRAM cache, Fig. 2 shows that NAND accesses still occur because the volume of data accessed on the SSD far exceeds the internal cache size.
Fig. 3. Ratio of missed memory instructions that were previously cached.
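The Python sketch below retraces the access path described above: the host-visible physical address acts as a logical address inside the SSD, an FTL-style table maps it to a NAND location, and a small internal cache absorbs re-accesses. The structures and the tiny cache size are simplified assumptions; the final access reproduces the Fig. 3 pattern of a previously cached page being re-read from NAND after eviction.

# Simplified SSD-side access path: host "physical" address -> SSD logical
# address -> FTL mapping -> NAND, with a small LRU internal cache in front.
from collections import OrderedDict

PAGE = 4096                     # assumed page granularity for this sketch
CACHE_PAGES = 4                 # tiny stand-in for the 64 MB internal cache

ftl = {}                        # logical page number -> NAND location
internal_cache = OrderedDict()  # LRU over logical page numbers
nand_reads = 0

def ssd_read(host_phys_addr: int) -> str:
    """Read one page and report where it was served from."""
    global nand_reads
    lpn = host_phys_addr // PAGE            # host physical addr = SSD logical addr
    if lpn in internal_cache:               # hit in the SSD's internal cache
        internal_cache.move_to_end(lpn)
        return "internal-cache"
    ppn = ftl.setdefault(lpn, len(ftl))     # internal address translation
    nand_reads += 1                         # miss: NAND flash must be accessed
    internal_cache[lpn] = ppn
    if len(internal_cache) > CACHE_PAGES:
        internal_cache.popitem(last=False)  # evict the least recently used page
    return "nand"

print([ssd_read(a * PAGE) for a in (0, 1, 0, 2, 3, 4, 5, 0)])
# ['nand', 'nand', 'internal-cache', 'nand', 'nand', 'nand', 'nand', 'nand']
print("NAND reads:", nand_reads)  # 7: page 0 was evicted and re-read from NAND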
3. Challenges in NAND Flash Access for Memory-Semantic CXL-SSDs
This section analyzes the challenges of using CXL-SSDs as memory on the SSD side. CXL-SSDs and DRAM are integrated under a memory pooling architecture, distinguishing them from traditional swap memory systems by managing them as a unified memory system. As a result, CPU instructions accessing CXL-SSDs are directly exposed to the high latency of SSDs. This issue becomes particularly pronounced when cache misses occur in both the CPU cache and the SSD cache, necessitating access to NAND, which is approximately 1,000,000 times slower than the CPU cache [26]. Despite this significant issue, CXL-SSDs lack an intermediate cache, leading to a 101.56% increase in NAND accesses. Frequent accesses to NAND flash are a concern not only for latency but also because they directly impact SSD lifespan through repeated erase operations [29]. These frequent erases exacerbate garbage collection challenges [30], further degrading lifespan and performance. Additionally, frequent write operations and garbage collection can induce temperature throttling in NAND flash [31], worsening latency to critical levels. Moreover, the internal DRAM cache of SSDs is typically configured with a capacity of 64 MB or 128 MB [32,33], which is generally sufficient because most data accesses occur in the host DRAM, with relatively few requiring interaction with the SSD. When CXL-SSDs are used as memory, however, frequent I/O operations significantly increase the potential for DRAM cache underutilization. Nevertheless, 87.87% of the data accessed from NAND had been previously cached, suggesting that sufficient cache capacity could have prevented many of these ineffective accesses. This underscores the critical importance of adequate cache capacity, not only to mitigate the performance degradation of DRAM, as discussed in Section IV, but also to address the impact of NAND latency effectively.
VI. DISCUSSIONS FOR IMPROVEMENTS
In Section III, we examined the theoretical aspects of how CXL-SSDs pose problems when used as a memory device. Section IV explored the practical issues these problems cause for the CPU, while Section V investigated the implications at the SSD level. Below is a summary of the previously discussed challenges, which must be considered before proposing improvements to the implementation of CXL-SSDs.
Challenges Summary.
When a CXL-SSD is used as memory, it does not employ the DRAM caching mechanisms typically utilized by conventional SSDs for storage. Although research exists on intermediate cache layers for tiered memory, such approaches are unsuitable as they undermine the non-volatile benefits of SSDs. As CXL-SSDs are designed to function as both memory and storage, they must preserve the characteristics of persistent memory, allowing seamless transitions between memory and storage purposes.
Using CXL-SSDs as a direct substitute for DRAM presents a significant challenge, as the slower latency of CXL-SSDs creates a performance bottleneck for memory operations that typically demand high-speed performance.
At the CPU level, CXL memory poses a challenge not only because of the inherently
slower performance of the CXL device but also due to CPU resource contention, which
further impacts the performance of traditional DRAM.
At the SSD level, even simple memory instructions often lead to frequent access to
NAND flash, which is approximately 1,000,000 times slower than the CPU cache [26]. This affects performance and raises critical issues, including reduced NAND flash
lifespan due to frequent erases, inefficiencies in garbage collection, and diminished
effectiveness of the SSD's internal cache.
Based on these challenges, we now discuss implications for effectively implementing the CXL-SSD.
Requirements in CXL-Dedicated Cache.
Our analysis reveals that the integration of a dedicated CXL cache within DRAM is necessary. Although CXL-SSDs function as memory devices similar to DRAM, employing the traditional memory hierarchy results in heightened CPU resource contention. CXL-SSDs, due to their slower latency, lead to CPU cache underutilization that can impact DRAM instructions. Furthermore, even for memory instructions targeting the CXL-SSD, a cache capable of bypassing the SSD is essential. While utilizing the CPU cache is fast and efficient, expanding its capacity is challenging. Therefore, we propose using a portion of DRAM as a cache for the CXL-SSD. To preserve the persistent memory characteristics, there should be no data eviction or promotion between the CXL-SSD and the DRAM-based CXL cache, which should function solely as a temporary cache. To implement a DRAM-based CXL cache, a dedicated area for storing cache addresses must be incorporated within the memory mapping table during the CPU's address translation process. This design is expected to enhance not only the performance of CXL memory instructions but also the performance of DRAM and SSD operations.
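The Python sketch below outlines the read path of such a DRAM-based CXL cache using hypothetical structures: a slice of DRAM holds temporary, discardable copies of CXL-SSD lines, indexed through a cache-address field conceptually attached to the memory mapping table, and nothing is ever promoted or demoted, so the persistent copy always remains on the CXL-SSD.

# Read-path sketch of the proposed DRAM-based dedicated CXL cache (hypothetical):
# temporary copies only, no promotion/demotion, persistent data stays on the SSD.
from collections import OrderedDict

LINE = 64                      # assumed caching granularity in bytes
CXL_CACHE_LINES = 1024         # assumed size of the DRAM region used as the cache

cxl_cache = OrderedDict()      # line address -> temporary copy of the data

def read_line_from_ssd(line_addr: int) -> bytes:
    # Stand-in for a CXL.mem read that actually reaches the SSD.
    return line_addr.to_bytes(8, "little") * (LINE // 8)

def cxl_load(phys_addr: int) -> bytes:
    """Load through the DRAM-based CXL cache; the SSD keeps the persistent copy."""
    line_addr = phys_addr - (phys_addr % LINE)
    if line_addr in cxl_cache:              # temporary copy already in DRAM
        cxl_cache.move_to_end(line_addr)
        return cxl_cache[line_addr]
    data = read_line_from_ssd(line_addr)    # miss: fetch from the CXL-SSD
    cxl_cache[line_addr] = data             # keep only a discardable copy
    if len(cxl_cache) > CXL_CACHE_LINES:
        cxl_cache.popitem(last=False)       # drop the oldest copy, never demote
    return data

first = cxl_load(0x1000)   # served from the CXL-SSD
again = cxl_load(0x1000)   # served from the DRAM region
print(first == again)      # True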
Improvements in Accessing CXL-SSD.
As discussed in Section V, insufficient cache capacity leads to unnecessary repetitive accesses to data that were previously cached but evicted. While securing sufficient CXL cache capacity can mitigate unnecessary NAND accesses, the issue of direct SSD access persists when the cache becomes full. Even when accessing the SSD, it is crucial to secure sufficient internal SSD cache capacity to bypass NAND access within the SSD. To achieve this, an internal cache must be implemented that provides adequate performance while maintaining system sustainability. This involves weighing the performance improvement from cache expansion against the proportional increase in power consumption and the added cost of the larger capacity.
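The sizing question can be framed with a simple model that weighs the access-time gain of a larger internal cache against its added power. The hit-rate curve, latencies, and power figure in the Python sketch below are assumed placeholders meant only to show the shape of the tradeoff, not measured values.

# Toy tradeoff model for sizing the SSD's internal cache. The hit-rate curve,
# latencies, and power figure are assumptions used only for illustration.
import math

def hit_rate(cache_mb: float) -> float:
    # Assumed diminishing-returns curve, anchored at 64 MB; not measured.
    return min(0.95, 0.3 + 0.1 * math.log2(cache_mb / 64))

CACHE_HIT_NS, NAND_NS = 200, 100_000  # assumed internal-cache and NAND latencies
POWER_MW_PER_MB = 0.5                 # assumed DRAM power cost per MB

for size_mb in (64, 128, 256, 512, 1024):
    h = hit_rate(size_mb)
    avg_ns = h * CACHE_HIT_NS + (1 - h) * NAND_NS
    power_mw = size_mb * POWER_MW_PER_MB
    print(f"{size_mb:5d} MB  hit {h:4.0%}  avg access {avg_ns:9,.0f} ns  power {power_mw:6.1f} mW")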
Other Proposals for CXL-SSD.
When the CPU accesses a CXL-SSD, it does so using a physical address via the physical mapping table. This means the CPU, which is aware of the data's locality characteristics, can convey this locality information to the SSD through commands. Consequently, the SSD inherently has access to data locality information, which can be leveraged to place similar types of data in proximity within the same space. This approach enhances NAND parallelism and improves garbage collection efficiency. Research related to SSD data locality includes studies on Open-Channel SSDs [4], ZNS-SSDs [6], and software-based data classification for SSDs [12]. Instead of using a CXL-only device that merely replaces the PCIe interconnect of traditional SSDs with CXL, integrating SSDs that take such locality considerations into account can effectively address the challenges posed by memory-like CXL-SSDs.
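To make the locality idea concrete, the Python sketch below assumes a hypothetical per-write locality hint (in the spirit of multi-stream or ZNS-style placement) and groups writes carrying the same hint into the same open block, so related data stays physically close and can be erased together.

# Locality-hinted placement inside the SSD (hypothetical interface): writes
# with the same hint are appended to the same open block, aiding GC efficiency.
from collections import defaultdict

PAGES_PER_BLOCK = 512                  # matches the geometry in Table 1
open_blocks = defaultdict(list)        # hint -> pages in the currently open block
sealed_blocks = defaultdict(list)      # hint -> list of completely written blocks

def write_page(lpn: int, hint: str) -> None:
    """Append a logical page to the open block associated with its locality hint."""
    open_blocks[hint].append(lpn)
    if len(open_blocks[hint]) == PAGES_PER_BLOCK:
        sealed_blocks[hint].append(open_blocks.pop(hint))  # seal the full block

# Hypothetical usage: hints reflect what the host knows about its own data.
for lpn in range(600):
    write_page(lpn, "tensor-weights")
for lpn in range(600, 700):
    write_page(lpn, "db-index")

print(len(sealed_blocks["tensor-weights"]))   # 1 sealed block of 512 pages
print(len(open_blocks["tensor-weights"]))     # 88 pages still in an open block
print(len(open_blocks["db-index"]))           # 100 pages, kept in a separate block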
VII. CONCLUSION
This paper explores the challenges of the CXL-SSD related to mismatches in the memory hierarchy, CPU resources, and the SSD, along with their potential issues. As our analysis indicates, the CXL-SSD not only has slow latency but also causes CPU resource contention that affects DRAM data. The CXL-SSD also suffers from inefficiencies in SSD access due to internal cache underutilization and unnecessary NAND accesses. Although existing studies on tiered memory systems address some of these challenges, as reviewed in Section I and Section III, the mismatches in the old memory hierarchy reveal the need for improvement. Based on our findings, we propose a dedicated cache for the CXL-SSD, the expansion of the internal SSD cache, and effective methods for managing memory-like data within the SSD. Research into a new cache hierarchy that retains non-volatile benefits while providing additional data capabilities will be crucial for enhancing the feasibility of CXL-SSDs.
ACKNOWLEDGMENTS
This work was supported by Samsung Electronics Co., Ltd, under Grant (IO201210-07936-01,
IO250214-11969-01) and Institute of Information & communications Technology Planning
& Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2024-00402898,
Simulation-based High-speed/High-Accuracy Data Center Workload/System Analysis Platform).
Won Woo Ro is the corresponding author.
References
D. Das Sharma, R. Blankenship, and D. Berger, ``An introduction to the compute express
link (cxl) interconnect,'' ACM Computing Surveys, vol. 56, no. 11, pp. 1–37, 2024.

H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen,
M. Chowdhury, S. Kanaujia, and P. Chauhan, ``TPP: Transparent page placement for CXL-enabled
tieredmemory,'' Proc. of the 28th ACM International Conference on Architectural Support
for Programming Languages and Operating Systems, vol. 3, ser. ASPLOS '23. ACM, Mar.
2023.

R. Abdullah, H. Lee, H. Zhou, and A. Awad, ``Salus: Efficient security support for
CXL-expanded gpu memory,'' Proc. of 2024 IEEE International Symposium on High-Performance
Computer Architecture (HPCA), IEEE, pp. 1–15, 2024.

I. L. Picoli, N. Hedam, P. Bonnet, and P. Tözün, ``Open-channel SSD (what is it good
for),'' Proc. of Conference on Innovative Data Systems Research, 2020.

M. Bjørling, ``From open-channel SSDs to zoned namespaces,'' Proc. of Linux Storage
& Filesystems Conference (Vault), vol. 1, p. 20, 2019.

M. Bjørling, A. Aghayev, H. Holmberg, A. Ramesh, D. L. Moal, G. R. Ganger, and G.
Amvrosiadis, ``ZNS: Avoiding the block interface tax for flash-based SSDs,'' Proc.
of USENIX Annual Technical Conference (USENIX ATC 21), USENIX Association, pp. 689–703,
Jul. 2021. [Online]. Available: https://www.usenix.org/conference/atc21/presentation/bjorling

N. Tehrany and A. Trivedi, ``Understanding nvme zoned namespace (ZNS) flash ssd storage
devices,'' arXiv preprint arXiv:2206.01547, 2022.

T. Stavrinos, D. S. Berger, E. Katz-Bassett, and W. Lloyd, ``Don't be a blockhead:
zoned namespaces make work on conventional ssds obsolete,'' Proc. of the Workshop
on Hot Topics in Operating Systems, pp. 144–151, 2021.

N. Tehrany, K. Doekemeijer, and A. Trivedi, ``Understanding (un)written contracts
of nvme zns devices with zns-tools,'' arXiv preprint arXiv:2307.11860, 2023.

H. Bae, J. Kim, M. Kwon, and M. Jung, ``What you can't forget: Exploiting parallelism
for zoned namespaces,'' Proc. of the 14th ACM Workshop on Hot Topics in Storage and
File Systems, pp. 79–85, 2022.

H. Park, E. Lee, J. Kim, and S. H. Noh, ``Lightweight data lifetime classification
using migration counts to improve performance and lifetime of flash-based ssds,''
Proc. of the 12th ACM SIGOPS Asia-Pacific Workshop on Systems, pp. 25–33, 2021.

S. Oh, J. Kim, S. Han, J. Kim, S. Lee, and S. H. Noh, ``MIDAS: Minimizing write amplification in log-structured systems through adaptive group number and size configuration,'' Proc. of 22nd USENIX Conference on File and Storage Technologies (FAST 24), pp. 259–275, 2024.

M. Jung, ``Hello bytes, bye blocks: Pcie storage meets compute express link for memory
expansion (CXL-SSD),'' Proc. of the 14th ACM Workshop on Hot Topics in Storage and
File Systems, pp. 45–51, 2022.

M. Kwon, S. Lee, and M. Jung, ``Cache in hand: Expander-driven CXL prefetcher for
next generation CXL-SSD,'' Proc. of the 15th ACM Workshop on Hot Topics in Storage
and File Systems, pp. 24–30, 2023.

S.-P. Yang, M. Kim, S. Nam, J. Park, J.-Y. Choi, E. H. Nam, E. Lee, S. Lee, and B. S. Kim, ``Overcoming the memory wall with CXL enabled SSDs,'' Proc. of USENIX Annual Technical Conference (USENIX ATC 23), pp. 601–617, 2023.

R. Branco and B. Lee, ``Cache-related hardware capabilities and their impact on information
security,'' ACM Computing Surveys, vol. 55, no. 6, pp. 1–35, 2022.

M. A. Khelif, J. Lorandel, O. Romain, M. Regnery, D. Baheux, and G. Barbu, ``Toward
a hardware man-in-the-middle attack on pcie bus,'' Microprocessors and Microsystems,
vol. 77, 103198, 2020.

K. Lee, S. Kim, J. Lee, D. Moon, R. Kim, H. Kim, H. Ji, Y. Mun, and Y. Joo, ``Improving
key-value cache performance with heterogeneous memory tiering: A case study of cxl-based
memory expansion,'' IEEE Micro, 2024.

M. Arif, K. Assogba, M. M. Rafique, and S. Vazhkudai, ``Exploiting CXL-based memory
for distributed deep learning,'' Proc. of the 51st International Conference on Parallel
Processing, pp. 1–11, 2022.

C. Tan, A. F. Donaldson, and J. Wickerson, ``Formalising CXL cache coherence,'' arXiv
preprint arXiv:2410.15908, 2024.

A. Tavakkol, J. Gomez-Luna, M. Sadrosadati, S. Ghose, and O. Mutlu, ``MQSim: A framework
for enabling realistic studies of modern multi-queue SSD devices,'' Proc. of USENIX
Conference on File and Storage Technologies (FAST '18), 2018.

B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, ``Benchmarking
cloud serving systems with YCSB,'' Proc. of the 1st ACM Symposium on Cloud Computing,
pp. 143–154, 2010.

S. Chen, X. Tang, H. Wang, H. Zhao, and M. Guo, ``Towards scalable and reliable in-memory
storage system: A case study with redis,'' Proc. of IEEE Trustcom/BigDataSE/ISPA,
IEEE, pp. 1660–1667, 2016.

A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer, ``AI and memory
wall,'' IEEE Micro, vol. 44, no. 3, pp. 33-39, 2024.

W. K. Josephson, L. A. Bongo, K. Li, and D. Flynn, ``DFS: A file system for virtualized
flash storage,'' ACM Transactions on Storage (TOS), vol. 6, no. 3, pp. 1–25, 2010.

C. Mellor, ``Storage with the speed of memory? Xpoint, Xpoint, that's our plan,''
Accessed: 2025-03-31. [Online]. Available: https://www.theregister.com/2016/04/21/storage
approaches memory speed with xpoint and storageclass memory/

Samsung Semiconductor Tech Blog, ``Evolving storage solutions: Breakthroughs, innovations, and a sustainable future,'' Accessed: 2023-07-21. [Online]. Available: https://semiconductor.samsung.com/newsevents/tech-blog/evolving-storage-solutions-breakthroughs-innovationsand-a-sustainable-future/

Transcend, ``MTE662T / MTE662T-I,'' Accessed: 2023-07-21. [Online]. Available: https://www.transcendinfo.com/embedded/product/embedded-ssd-solutions/mte662t-mte662ti

S. Im and D. Shin, ``ComboFTL: Improving performance and lifespan of mlc flash memory
using SLC flash buffer,'' Journal of Systems Architecture, vol. 56, no. 12, pp. 641–653,
2010.

F. Wu, J. Zhou, S. Wang, Y. Du, C. Yang, and C. Xie, ``FastGC: Accelerate garbage
collection via an efficient copyback-based data migration in ssds,'' Proc. of the
55th Annual Design Automation Conference, pp. 1–6, 2018.

C. Jeon, Y. Choi, K. Rhew, J. Bae, Y. Cho, and S. Pae, ``A systematic study and lifetime
modeling on the board level reliability of ssd after temperature cycling test,'' Proc.
of IEEE 71st Electronic Components and Technology Conference (ECTC), IEEE, pp. 1007–1013,
2021.

Same as [27]: Samsung Semiconductor Tech Blog, ``Evolving storage solutions: Breakthroughs, innovations, and a sustainable future.''

Same as [28]: Transcend, ``MTE662T / MTE662T-I.''

Hoon Hwi Lee received his B.S. degree in electronics engineering from Dongguk University,
Seoul, South Korea, in 2020. He is currently pursuing a Ph.D. degree with the Embedded
Systems and Computer Architecture Laboratory, the School of Electrical and Electronic
Engineering, Yonsei University, Seoul, South Korea. His current research interests
include memory systems, storage, and databases.
Min Jae Kim received his B.S. degree in electronics engineering from Kyung Hee University,
Yongin, South Korea, in 2022. He is currently working toward a Ph.D. degree with the
Embedded Systems and Computer Architecture Laboratory, the School of Electrical and
Electronic Engineering, Yonsei University, Seoul. His current research interests include
memory systems.
Jun Woo You received his B.S. degree in electrical and electronic engineering from Yonsei University, South Korea, in 2024. He is currently pursuing a Ph.D. degree in
Embedded Systems and Computer Architecture Laboratory at the School of Electrical
and Electronic Engineering, Yonsei University, Seoul, South Korea, under the supervision
of Professor Won Woo Ro. His research interests include memory systems and GPU systems.
Hyung Jun Jang received his B.S. degree in computer engineering from Yonsei University,
Wonju, South Korea, in 2019. He is currently pursuing a Ph.D. degree with the Embedded
Systems and Computer Architecture Laboratory, School of Electrical and Electronic
Engineering, Yonsei University, Seoul, South Korea. His current research interests
include DNN accelerators, multicore accelerator systems, and accelerator resource
management.
Won Woo Ro received his B.S. degree in electrical engineering from Yonsei University,
Seoul, South Korea, in 1996, and his M.S. and Ph.D. degrees in electrical engineering
from the University of Southern California, in 1999 and 2004, respectively. He worked
as a Research Scientist with the Electrical Engineering and Computer Science Department,
University of California, Irvine. He currently works as a Professor with the School
of Electrical and Electronic Engineering, Yonsei University. Prior to joining Yonsei
University, he worked as an Assistant Professor with the Department of Electrical
and Computer Engineering, California State University, Northridge. His industry experience
includes a college internship with Apple Computer, Inc., and a contract software engineer
with ARM, Inc. His current research interests include high-performance microprocessor
design, GPU microarchitectures, neural network accelerators, and memory hierarchy
design.