Hoon Hwi Lee1
Min Jae Kim1
Jun Woo You1
Hyung Jun Jang1
Won Woo Ro1
(Yonsei University, Seoul, Korea)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index Terms
CXL, SSD, memory, cache, memory hierarchy, analysis
INTRODUCTION
CXL enhances scalability across various memory devices, enabling memory expansion up to tens of terabytes [1]. CXL-enabled systems allow the CPU to access memory devices directly, so devices connected via CXL can function as system memory. Although the CXL interface exhibits slower latency than the traditional DIMM interface commonly used for DRAM, strategies to mitigate this issue involve promoting hot data to DRAM while relegating cold data to CXL devices [2,3]. While the CXL interface allows SSDs, commonly used as storage, to serve as system memory, this adaptation creates new challenges for the traditional memory hierarchy, as the SSD is a non-volatile device with slow latency. This paper explores the challenges that CXL-SSDs pose to the overall system, from the CPU to the SSD, identifying potential areas for improvement in CXL technologies.
We have found that when SSDs serve as system memory, their mismatches with the old memory hierarchy and the inherent latency of the SSD become a significant bottleneck, severely hindering practical applications. Moreover, the hierarchy mismatches of the CXL-SSD are not an issue for the CXL-SSD alone but also harm DRAM data. Our analysis reveals that memory instructions to CXL-SSDs, which bypass DRAM caching, lead to CPU resource contention, performance bottlenecks due to slow latency, and unnecessary NAND accesses; these issues have not been previously addressed but are critical for the practical deployment of CXL-SSDs. Based on our findings, we propose addressing the CPU resource contention, mitigating the slow latency of the CXL-SSD, and improving inefficient SSD access by introducing a cache memory for the CXL-SSD. We also present an approach to effectively managing memory-purpose data within the SSD by utilizing its persistent characteristics for non-temporal operation. The contributions of this paper are as follows:
• We conduct a pioneering study on the disparities between memory-like SSDs enabled by CXL and traditional memory systems, exploring these differences from the standpoint of inherent SSD characteristics.
• We investigate the performance issues of memory instructions on CXL-SSDs caused by their differences from the traditional memory hierarchy.
• We demonstrate the CPU resource contention caused by the CXL-SSD and its side effect on DRAM performance.
• We examine the issues that arise when memory instructions directly access SSDs, including exposure to NAND flash latency and challenges related to the SSD's internal cache.
• Based on our findings, we suggest improvements: a dedicated cache for CXL-SSDs and a proposal for an SSD capable of effectively managing memory-like data.
I. Related Work & Motivation
Emerging SSDs. Recent studies on SSDs have focused on transitioning SSDs from black-box to white-box configurations to facilitate data management. This includes research on Open-Channel SSDs [4,5], Zoned Namespace (ZNS) SSDs [6-10], and software-managed white-box SSDs [11,12]. Furthermore, efforts are underway to utilize SSDs as host memory extensions, exemplified by implementations of CXL-SSDs [13-15]. When SSDs are employed as host memory through the CXL interface, the CPU accesses the SSD directly through white-box addressing with the CPU cache, bypassing traditional DRAM caching. This direct access enables the CPU to utilize the SSD's vast capacity as system memory, just as it does traditional DRAM. Furthermore, CXL data can be cached in the low-latency CPU cache. In contrast, much PCIe data cannot be cached in the CPU due to security and consistency issues and must instead use the DRAM cache [16,17], which has slower latency. However, in our observations, this characteristic of the CXL-SSD is a double-edged sword: even though the CPU cache has low latency, its capacity is greatly limited, and it already suffers from capacity starvation with DRAM instructions alone. This paper discusses challenges in CXL-SSDs related to this CPU cache starvation and provides guidance on future memory hierarchies that can overcome these challenges, aiming to optimize their application in modern computing architectures.
CXL, memory, cache. Research on CXL memory has been diverse and extensive. Among these studies, significant efforts have focused on CXL-SSDs [13-15] and CXL-applied caches [18-20], with the goal of enhancing the performance of CXL-enabled systems. Building on this prior CXL research, we identify a crucial issue involving cache misses when integrating CXL-SSDs: a cache miss leads to access to a slower device, which is especially critical for CXL-SSDs, which have the worst latency among memory devices. This paper analyzes how these dynamics affect the overall system and proposes potential enhancements for CXL-SSDs from a memory management perspective, including improvements in the memory hierarchy that can effectively utilize the CXL-SSD.
II. ANALYSIS OVERVIEW
The primary objective of this study is to analyze challenges of the CXL-SSD related to the system cache and to propose recommendations for its improvement. For this, we used a full-system simulation with gem5 incorporating DRAMsim3 and MQSim [21], which supports CXL-SSD emulation. Our analysis follows three main dimensions:
1) Technical Perspective: How does the existing memory management scheme differ when
applied to CXL-SSDs?
2) CPU-side: What are the implications of CXL-SSDs from the CPU's perspective?
3) SSD-side: What problems arise within SSDs when used as memory through CXL?
Our analysis first presents a technical review to understand how CXL-SSDs differ from DRAM and conventional SSDs. Subsequently, we examine the impacts of the CXL-SSD on the CPU side that could affect the overall system. From these examinations on the CPU side, we observe the ramifications on the SSD side. Based on these insights, we aim to provide decisive directions for improvements in implementing the CXL-SSD. We evaluated a CXL-SSD environment with the RocksDB-based YCSB workload [22], an in-memory database system [23], and various neural network workloads, as can be seen in Section IV and Section V. The specific configurations of our experiment are listed in Table 1. As shown in Table 1, we scaled down the system compared to real-world settings to make the simulation feasible while preserving key behavioral characteristics. Since this evaluation aims to analyze cache misses and SSD access patterns in a CXL-based system, the scaled-down configuration does not affect the validity of the observed trends. In the experiment, we evaluated a maximum of 100,000,000 instructions with 1,000,000 warm-up instructions.
Table 1. Experiment configurations.

L1 Cache       | 0.5 KiB ICache / 0.5 KiB DCache
L2 Cache       | 2 KiB
L3 Cache       | 32 KiB
Host DRAM      | 4 GB
SSD Capacity   | 64 GB
SSD DRAM       | 64 MB
Channels       | 8
Chips/channel  | 8
Planes/chip    | 4
Blocks/plane   | 32
Pages/block    | 512
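For concreteness, the short Python sketch below derives the total page count and the implied page size from the NAND geometry in Table 1. The 16 KiB page size is our own inference, assuming the 64 GB capacity is spread evenly over the geometry; it is not a value stated explicitly in the configuration.

# Derive the simulated NAND geometry from Table 1. The implied page size is
# an inference, assuming the 64 GB capacity is spread evenly over
# channels x chips x planes x blocks x pages.
channels, chips_per_channel, planes_per_chip = 8, 8, 4
blocks_per_plane, pages_per_block = 32, 512
ssd_capacity_bytes = 64 * 1024**3  # 64 GB, treated as GiB for this sketch

total_pages = (channels * chips_per_channel * planes_per_chip
               * blocks_per_plane * pages_per_block)
implied_page_size = ssd_capacity_bytes // total_pages

print(f"total pages       : {total_pages:,}")                  # 4,194,304
print(f"implied page size : {implied_page_size // 1024} KiB")  # 16 KiB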
III. TECHNICAL ANALYSIS: MEMORY MANAGEMENT
CXL-SSDs are non-volatile storage devices capable of functioning as memory. Prior
to conducting experimental analyses, we will explore the structural challenges posed
by CXL-SSDs, their incompatibility with traditional memory hierarchies, and their
unique characteristics.
1. Mismatches in Old Memory Hierarchies
With the emerging CXL interface, mismatches in the traditional memory hierarchy between DRAM and SSD have surfaced due to their different usages. In the old memory hierarchy, memory usually refers to DRAM, which serves temporary volatile data, while storage typically refers to non-volatile devices serving permanent data. However, with the emergence of memory-intensive applications (e.g., neural networks and databases), the memory wall issue [24] has become prominent due to insufficient memory capacity. Even though DRAM, commonly used as memory, offers reasonable capacity and access latency at an affordable price, the DIMM interface implemented in DRAM is susceptible to signal interference issues [1]; increasing its capacity entails significant expense and presents scalability limitations. This has driven the adoption of the CXL interface as an alternative to DIMM to address these limitations. Meanwhile, the SSD, traditionally used as storage and offering vast capacity compared to DRAM, can now be used as memory with the emerging CXL interface. The CXL-SSD is a non-volatile device that can serve a role similar to traditional DRAM memory, with a latency-capacity tradeoff relative to DRAM. The CXL-SSD allows the CPU to access the SSD directly via the memory mapping table, unlike traditional filesystem-based SSDs [13]. Although traditional storage systems have also supported utilizing storage as memory via swap memory and memory-mapped files (MMF), the CXL-SSD uniquely allows the host to access storage directly through a physical address, enabling its use as standard system memory. Being treated as memory is a double-edged sword: although the CXL-SSD's latency is comparable to that of traditional storage, its effective latency as memory largely depends on the CPU cache [13], and its usage as memory prevents DRAM caching within the old memory hierarchy. This lack of DRAM caching is a critical issue that exposes the system to storage latency even for simple memory load/store instructions, undermining the rationale for using the CXL-SSD.
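To illustrate the distinction above, the Python sketch below touches a file through a memory-mapped file (MMF), the traditional path in which every access is mediated by the kernel page cache in DRAM; under the CXL-SSD model described here, an ordinary load/store to a mapped physical address would reach the device without that DRAM-backed layer. The file name is hypothetical and the snippet only demonstrates the MMF side of the comparison.

# Memory-mapped file (MMF) access: loads and stores are absorbed by the
# kernel page cache in DRAM, unlike the direct load/store path of a CXL-SSD.
import mmap

path = "example.dat"  # hypothetical file used only for this sketch
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)          # create a 4 KiB backing file

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 4096)  # map the file into the address space
    m[0:4] = b"CXL!"                 # store: lands in the page cache first
    data = m[0:4]                    # load: served from the page cache
    m.close()
print(data)                          # b'CXL!'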
2. Unique Characteristics of CXL-SSD
CXL-SSDs used as memory have unique characteristics because the CXL-SSD serves as memory while the SSD remains a storage device. In the old memory hierarchy, having processors access data directly from storage is mostly inefficient, so frequently used data are typically cached in memory. Modern computer systems use a directory-based filesystem to manage permanent storage data, where a page cache is used [25]. When the system accesses storage data, the data node in the filesystem points to the cache address, which resides in memory. If the data have previously been accessed and cached, file I/O can be served directly from memory without accessing storage. However, unlike memory-to-storage caching, memory-to-memory caching for the CXL-SSD requires a different mechanism; it was not a concern in the old memory hierarchy but has recently become one due to the emergence of diverse memory devices. One study presents a technique that enables caching for slow memory by utilizing data temperature to drive page promotion and demotion between slow and fast devices [2]. This technique demotes cold data, which is infrequently accessed, to lower devices and promotes hot data, which is frequently accessed, to upper devices like DRAM, thereby improving access latency for hot data. Nevertheless, such data movement between memory devices results in the loss of the non-volatile characteristics of non-volatile memory like CXL-SSDs, eliminating the possibility of exploiting non-volatile advantages, which is a significant limitation. Although Optane memory is another non-volatile memory device, this limitation is especially significant for CXL-SSDs because most of the data in SSDs are used for non-volatile permanent storage, which means preserving the non-volatile characteristics is essential.
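As a minimal sketch of the promotion/demotion approach described above (in the spirit of [2], not its actual implementation), the Python code below tracks per-page access counts and moves pages between a fast tier and a slow tier at the end of each epoch. The threshold, epoch model, and page count are illustrative assumptions.

# Temperature-based promotion/demotion between a fast tier (DRAM) and a slow
# tier (e.g., CXL memory). Threshold and tier sizes are assumed for illustration.
from collections import defaultdict

HOT_THRESHOLD = 4                      # accesses per epoch before promotion (assumed)
access_count = defaultdict(int)
fast_tier, slow_tier = set(), set(range(16))   # 16 pages start in the slow tier

def access(page: int) -> None:
    access_count[page] += 1

def end_of_epoch() -> None:
    """Promote hot pages, demote pages that went cold, then reset counters."""
    for page in list(slow_tier):
        if access_count[page] >= HOT_THRESHOLD:
            slow_tier.discard(page)
            fast_tier.add(page)        # promotion: data migrates up to DRAM
    for page in list(fast_tier):
        if access_count[page] == 0:
            fast_tier.discard(page)
            slow_tier.add(page)        # demotion: cold data migrates back down
    access_count.clear()

for _ in range(5):
    access(3)                          # page 3 becomes hot in this epoch
end_of_epoch()
print(sorted(fast_tier))               # [3]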
3. Challenges of Traditional Caching for Memory-Semantic CXL-SSDs
In this technical analysis, we explored the structural challenges associated with
CXL-SSDs when used as main memory from a theoretical perspective. Unlike traditional
SSDs operating under a file system abstraction, CXL-SSDs are directly mapped into
the system's physical memory address space and accessed via load/store instructions,
akin to DRAM. This architecture results in CPU memory instructions being directly exposed to the slower latency of the SSD because the DRAM-based caching used with conventional SSDs becomes impractical. Given that CPU memory instructions play a critical role
in computation, minimizing latency in the memory load/store operations is essential,
underscoring the need for a novel memory-to-memory caching architecture that bridges
DRAM and SSD. The design of this new caching architecture must account for the non-volatile
nature of SSD data, which is vital for exploiting the advantages of non-volatile memory.
These advantages include serving data stored in the storage layer directly as memory without requiring data migration. However, since DRAM is volatile, implementing
a promotion-demotion scheme between DRAM and SSD is unsuitable. Instead, a dedicated
temporary cache memory for CXL-SSD is necessary. A key point to consider is why CXL-SSDs
cannot utilize traditional caching mechanisms. In traditional SSDs, data is accessed
through the file system, which utilizes a page cache for temporal caching and address
indirection. Conversely, CXL-SSDs bypass the file system entirely, providing direct
memory-mapped access to SSD regions, thereby eliminating the abstraction layer where
caching typically resides. This memory-semantic access model, while improving flexibility
and uniformity, inherently lacks the infrastructure to support traditional temporal
caching, emphasizing the need for an alternative memory-to-memory caching design.
IV. CPU-SIDE: CXL CACHE WITHIN CPU
With CXL-SSDs, unlike the traditional DRAM caching used for file-based SSDs, CXL.mem accesses to the SSD are not cached in DRAM but in the CPU cache, whose capacity is small; this leads to cache starvation and a higher cache miss rate. In this section, we analyze the impact of CXL-SSDs on CPU cache performance and its underlying causes.
1. Analyzing CPU Cache Miss
Fig. 1 compares CPU cache miss rates in database applications between traditional DRAM-only memory and DRAM combined with a CXL-SSD used as memory. While the typical DRAM memory exhibits a miss rate of 29.77%, using DRAM with the CXL-SSD results in a miss rate of 34.37%, a relative increase of 15.44%. This increase can be attributed to the latency difference between DRAM and the CXL-SSD. During a CPU cache miss, cache evictions and fetches are triggered toward the memory devices. The slow latency of the CXL-SSD prolongs the eviction and fetch processes, consequently worsening CPU resource traffic, which increases the CPU cache miss rate even for DRAM operations, although DRAM is a latency-critical device.
Fig. 1. CPU cache miss rate for a DRAM-only system and a CXL-enabled system.
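The 15.44% figure above is the relative increase of the miss rate; the short Python check below reproduces it from the two absolute rates in Fig. 1 (the small deviation from 15.44% comes from rounding of the inputs).

# Relative increase of the CPU cache miss rate when the CXL-SSD is added.
dram_only_pct = 29.77       # miss rate (%) of the DRAM-only system (Fig. 1)
with_cxl_ssd_pct = 34.37    # miss rate (%) of the DRAM + CXL-SSD system (Fig. 1)

relative_increase = (with_cxl_ssd_pct - dram_only_pct) / dram_only_pct * 100
print(f"{relative_increase:.2f}%")  # 15.45% with rounded inputs; reported as 15.44%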
2. CPU Resource Contention due to CXL-SSD
To discuss the CPU cache, it is essential to mention the stages of the memory instruction pipeline. The processing sequence for a memory instruction includes fetch, decode, rename, issue, execute, write-back, and commit. During the execution stage of a memory instruction, the CPU attempts to use the DCache in the CPU cache rather than directly accessing memory. If a cache miss occurs, an MSHR is used to prevent stalling until the data is fetched into the cache. When the CPU cache is full, existing data in the CPU cache must be evicted to memory before new data can be fetched. The slow latency of the CXL-SSD significantly prolongs the time required to evict and fetch data into the cache, increasing the occupancy of the cache, the MSHRs, and the load/store instruction queues at the execution stage. This delay during the execution stage can result in resource contention within the reorder buffer at the issue stage. As a result, the pipeline bottleneck at the preceding rename stage leads to underutilized memory bandwidth and slower data fetching.
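A back-of-the-envelope way to see why slower fills congest these structures is Little's law: average occupancy equals the outstanding-miss rate multiplied by the fill latency. In the Python sketch below, both latencies and the miss rate are illustrative assumptions rather than values measured in our simulation.

# Little's law sketch: average MSHR occupancy = miss rate x fill latency.
# The miss rate and latencies below are illustrative assumptions only.
def avg_mshr_occupancy(misses_per_ns: float, fill_latency_ns: float) -> float:
    return misses_per_ns * fill_latency_ns

miss_rate = 0.02          # assumed misses per ns reaching the memory devices
dram_fill_ns = 100        # assumed DRAM fill latency
cxl_ssd_fill_ns = 20_000  # assumed CXL-SSD fill latency

print(avg_mshr_occupancy(miss_rate, dram_fill_ns))     # 2.0 entries on average
print(avg_mshr_occupancy(miss_rate, cxl_ssd_fill_ns))  # 400.0 entries: far beyond
# typical MSHR counts, so misses back up into the load/store queue and ROB.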
3. Challenges in CPU Cache Underutilization in CXL-SSD
When DRAM and CXL-SSDs coexist, latency-critical memory instructions issued by the CPU can be adversely affected by the slower latency of CXL-SSDs. The CXL-SSD increases cache occupancy due to eviction-fetch delays, leading to a higher cache miss rate. Since the primary performance bottleneck of memory instructions is data movement, the inability to efficiently utilize the CPU cache and the resulting data transfers to memory are a critical issue. Additionally, the CPU resource contention introduced by CXL-SSDs causes substantial deviations from the standard DRAM access times anticipated by the CPU, which may lead to unforeseen challenges. Moreover, employing a large CXL-SSD as memory deepens cache capacity starvation compared to a DRAM-only memory system. To mitigate these issues, it is crucial to provide sufficient cache capacity.
V. SSD-SIDE: CXL DATA WITHIN SSD
In the previous Section IV, we analyzed the impact of CXL-SSDs on the CPU side. In this section, we investigate how using CXL-SSDs as memory influences the internals of the SSD.
1. Analyzing NAND Access Count
Fig. 2 shows the NAND access count of a conventional SSD and a CXL-SSD. This experiment was conducted in an environment where the conventional SSD was configured with 4 GB of host DRAM memory, while the CXL-SSD was configured with 4 GB of CXL memory. Because the SSD is used as memory, the CXL-SSD's NAND access count is remarkably high, on average 101.56% higher than that of conventional SSDs using a DRAM cache. Even though the performance advantage of the CXL-SSD mainly comes from CPU cache latency [13], the CPU cache size is limited while the SSD is enormous, which means most data cannot be cached, leading to increased NAND accesses. Although the CPU cache is faster than the DRAM cache, its capacity starvation causes more accesses to the SSD, which leads to frequent NAND accesses. Even though using the CPU cache is a key advantage of CXL-SSDs, just a few cache misses can have severe consequences because NAND flash latency is approximately 1,000,000 times slower than CPU cache latency [26]. This represents a significant challenge to the practical adoption of CXL-SSDs.
Fig. 2. NAND access count for CXL-SSD normalized to conventional SSD.
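To quantify why even a few misses are so costly, a standard average memory access time (AMAT) estimate suffices: AMAT = hit time + miss rate x miss penalty. The latencies in the Python sketch below are illustrative assumptions chosen only to reflect the roughly 1,000,000-fold cache-to-NAND gap cited from [26].

# AMAT = hit_time + miss_rate * miss_penalty. Latencies are assumed values
# reflecting the ~1,000,000x CPU-cache-to-NAND gap cited in the text.
def amat_ns(hit_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    return hit_ns + miss_rate * miss_penalty_ns

cache_hit_ns = 0.1         # assumed CPU cache hit time
nand_penalty_ns = 100_000  # assumed NAND penalty (1,000,000x the hit time)

for rate in (0.001, 0.01, 0.05):
    print(f"miss rate {rate:>5.1%}: AMAT = {amat_ns(cache_hit_ns, rate, nand_penalty_ns):,.1f} ns")
# Even a 0.1% miss rate to NAND pushes AMAT to ~100 ns; at 5% it exceeds 5,000 ns.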
2. Avoidable NAND Accesses
Fig. 3 shows the fraction of cache misses that could have been avoided with adequate cache capacity. On average, 87.87% of cache-missing memory instructions had previously been cached but were later evicted due to the constrained cache size, forcing NAND accesses with roughly 1,000,000 times slower latency [26]. These memory instructions would not require direct NAND access if the cache evictions were prevented. The detailed reasons for this eviction and re-fetch behavior were explained in Section IV-2. To discuss data processing at the SSD level, it is crucial to understand how data is managed within the SSD. When a memory instruction accesses the SSD, it retrieves the address from the physical memory mapping table and issues a write or read command to the SSD for the corresponding physical address. At this stage, the CPU operates with a physical address; however, this address is a virtual physical address defined by the memory mapping table, not the actual physical location of the data. Within the SSD, the physical address provided by the host functions as a logical address. The SSD translates this logical address to the actual physical location of the data through an internal address translation process. In some scenarios, previously accessed data may be retrieved directly from the SSD's internal cache, eliminating the need to access the NAND directly. In this experiment, a 64 MB DRAM cache was allocated, consistent with the typical capacity of conventional SSDs [27,28]. Despite the presence of this internal DRAM cache, Fig. 2 shows that NAND accesses still occur because the volume of data accessed on the SSD far exceeds the internal cache size.
Fig. 3. Ratio of missed memory instructions that were previously cached.
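The Python sketch below retraces the access path described above: the host-visible physical address acts as a logical address inside the SSD, an FTL-style table maps it to a NAND location, and a small internal cache absorbs re-accesses. The structures and the tiny cache size are simplified assumptions; the final access reproduces the Fig. 3 pattern of a previously cached page being re-read from NAND after eviction.

# Simplified SSD-side access path: host "physical" address -> SSD logical
# address -> FTL mapping -> NAND, with a small LRU internal cache in front.
from collections import OrderedDict

PAGE = 4096                     # assumed page granularity for this sketch
CACHE_PAGES = 4                 # tiny stand-in for the 64 MB internal cache

ftl = {}                        # logical page number -> NAND location
internal_cache = OrderedDict()  # LRU over logical page numbers
nand_reads = 0

def ssd_read(host_phys_addr: int) -> str:
    """Read one page and report where it was served from."""
    global nand_reads
    lpn = host_phys_addr // PAGE            # host physical addr = SSD logical addr
    if lpn in internal_cache:               # hit in the SSD's internal cache
        internal_cache.move_to_end(lpn)
        return "internal-cache"
    ppn = ftl.setdefault(lpn, len(ftl))     # internal address translation
    nand_reads += 1                         # miss: NAND flash must be accessed
    internal_cache[lpn] = ppn
    if len(internal_cache) > CACHE_PAGES:
        internal_cache.popitem(last=False)  # evict the least recently used page
    return "nand"

print([ssd_read(a * PAGE) for a in (0, 1, 0, 2, 3, 4, 5, 0)])
# ['nand', 'nand', 'internal-cache', 'nand', 'nand', 'nand', 'nand', 'nand']
print("NAND reads:", nand_reads)  # 7: page 0 was evicted and re-read from NAND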
3. Challenges in NAND Flash Access for Memory-Semantic CXL-SSDs
This section analyzes the challenges of using CXL-SSDs as memory on the SSD side. CXL-SSDs and DRAM are integrated under a memory pooling architecture, distinguishing them from traditional swap memory systems by managing them as a unified memory system. As a result, CPU instructions accessing CXL-SSDs are directly exposed to the high latency of SSDs. This issue becomes particularly pronounced when cache misses occur in both the CPU cache and the SSD cache, necessitating access to NAND, which is approximately 1,000,000 times slower than the CPU cache [26]. Despite this significant issue, CXL-SSDs lack an intermediate cache, leading to a 101.56% increase in NAND accesses. Frequent accesses to NAND flash are a concern not only for latency but also because they directly impact SSD lifespan through repeated erase operations [29]. These frequent erases exacerbate garbage collection challenges [30], further degrading lifespan and performance. Additionally, frequent write operations and garbage collection can induce temperature throttling in NAND flash [31], worsening latency to critical levels. Moreover, the internal DRAM cache of SSDs is typically configured with a capacity of 64 MB or 128 MB [32,33], which is generally sufficient because most data accesses occur in the host DRAM, with relatively few requiring interaction with the SSD. When CXL-SSDs are used as memory, however, frequent I/O operations significantly increase the potential for DRAM cache underutilization. Nevertheless, 87.87% of the data accessed from NAND had been previously cached, suggesting that sufficient cache capacity could have prevented many of these ineffective accesses. This underscores the critical importance of adequate cache capacity, not only to mitigate the performance degradation of DRAM, as discussed in Section IV, but also to address the impact of NAND latency effectively.
VI. DISCUSSIONS FOR IMPROVEMENTS
In Section III, we examined the theoretical aspects of how CXL-SSDs pose problems when used as a memory device. Section IV explored the practical issues these problems cause for the CPU, while Section V investigated the implications at the SSD level. Below is a summary of the previously discussed challenges, which must be considered before proposing improvements to the implementation of CXL-SSDs.
Challenges Summary.
When a CXL-SSD is used as memory, it does not employ the DRAM caching mechanisms typically utilized by conventional SSDs for storage. Although research exists on intermediate cache layers for tiered memory, such approaches are unsuitable as they undermine the non-volatile benefits of SSDs. As CXL-SSDs are designed to function as both memory and storage, they must preserve the characteristics of persistent memory, allowing seamless transitions between memory and storage purposes.
Using CXL-SSDs as a direct substitute for DRAM presents a significant challenge, as the slower latency of CXL-SSDs creates a performance bottleneck for memory operations that typically demand high-speed performance.
At the CPU level, CXL memory poses a challenge not only because of the inherently
slower performance of the CXL device but also due to CPU resource contention, which
further impacts the performance of traditional DRAM.
At the SSD level, even simple memory instructions often lead to frequent access to
NAND flash, which is approximately 1,000,000 times slower than the CPU cache [26]. This affects performance and raises critical issues, including reduced NAND flash
lifespan due to frequent erases, inefficiencies in garbage collection, and diminished
effectiveness of the SSD's internal cache.
Based on these challenges, we now discuss implications for effectively implementing the CXL-SSD.
Requirements in CXL-Dedicated Cache.
Our analysis reveals that the integration of a dedicated CXL cache within DRAM is necessary. Although CXL-SSDs function as memory devices similar to DRAM, employing the traditional memory hierarchy results in heightened CPU resource contention. CXL-SSDs, due to their slower latency, lead to CPU cache underutilization that can impact DRAM instructions. Furthermore, even for memory instructions targeting the CXL-SSD, a cache capable of bypassing the SSD is essential. While utilizing the CPU cache is fast and efficient, expanding its capacity is challenging. Therefore, we propose using a portion of DRAM as a cache for the CXL-SSD. To preserve the persistent memory characteristics, there should be no data eviction or promotion between the CXL-SSD and the DRAM-based CXL cache, which should function solely as a temporary cache. To implement a DRAM-based CXL cache, a dedicated area for storing cache addresses must be incorporated within the memory mapping table during the CPU's address translation process. This design is expected to enhance not only the performance of CXL memory instructions but also the performance of DRAM and SSD operations.
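The Python sketch below outlines the read path of such a DRAM-based CXL cache using hypothetical structures: a slice of DRAM holds temporary, discardable copies of CXL-SSD lines, indexed through a cache-address field conceptually attached to the memory mapping table, and nothing is ever promoted or demoted, so the persistent copy always remains on the CXL-SSD.

# Read-path sketch of the proposed DRAM-based dedicated CXL cache (hypothetical):
# temporary copies only, no promotion/demotion, persistent data stays on the SSD.
from collections import OrderedDict

LINE = 64                      # assumed caching granularity in bytes
CXL_CACHE_LINES = 1024         # assumed size of the DRAM region used as the cache

cxl_cache = OrderedDict()      # line address -> temporary copy of the data

def read_line_from_ssd(line_addr: int) -> bytes:
    # Stand-in for a CXL.mem read that actually reaches the SSD.
    return line_addr.to_bytes(8, "little") * (LINE // 8)

def cxl_load(phys_addr: int) -> bytes:
    """Load through the DRAM-based CXL cache; the SSD keeps the persistent copy."""
    line_addr = phys_addr - (phys_addr % LINE)
    if line_addr in cxl_cache:              # temporary copy already in DRAM
        cxl_cache.move_to_end(line_addr)
        return cxl_cache[line_addr]
    data = read_line_from_ssd(line_addr)    # miss: fetch from the CXL-SSD
    cxl_cache[line_addr] = data             # keep only a discardable copy
    if len(cxl_cache) > CXL_CACHE_LINES:
        cxl_cache.popitem(last=False)       # drop the oldest copy, never demote
    return data

first = cxl_load(0x1000)   # served from the CXL-SSD
again = cxl_load(0x1000)   # served from the DRAM region
print(first == again)      # True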
Improvements in Accessing CXL-SSD.
As discussed in Section V, insufficient cache capacity leads to unnecessary repetitive accesses to data that were previously cached but evicted. While securing sufficient CXL cache capacity can mitigate unnecessary NAND accesses, the issue of direct SSD access persists when the cache becomes full. Even when accessing the SSD, it is crucial to secure sufficient internal SSD cache capacity to bypass NAND access within the SSD. To achieve this, an internal cache must be implemented that provides adequate performance while maintaining system sustainability. This involves weighing the performance improvement from cache expansion against the proportional increase in power consumption and the added cost of the larger capacity.
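The sizing question can be framed with a simple model that weighs the access-time gain of a larger internal cache against its added power. The hit-rate curve, latencies, and power figure in the Python sketch below are assumed placeholders meant only to show the shape of the tradeoff, not measured values.

# Toy tradeoff model for sizing the SSD's internal cache. The hit-rate curve,
# latencies, and power figure are assumptions used only for illustration.
import math

def hit_rate(cache_mb: float) -> float:
    # Assumed diminishing-returns curve, anchored at 64 MB; not measured.
    return min(0.95, 0.3 + 0.1 * math.log2(cache_mb / 64))

CACHE_HIT_NS, NAND_NS = 200, 100_000  # assumed internal-cache and NAND latencies
POWER_MW_PER_MB = 0.5                 # assumed DRAM power cost per MB

for size_mb in (64, 128, 256, 512, 1024):
    h = hit_rate(size_mb)
    avg_ns = h * CACHE_HIT_NS + (1 - h) * NAND_NS
    power_mw = size_mb * POWER_MW_PER_MB
    print(f"{size_mb:5d} MB  hit {h:4.0%}  avg access {avg_ns:9,.0f} ns  power {power_mw:6.1f} mW")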
Other Proposals for CXL-SSD.
When the CPU accesses a CXL-SSD, it does so using a physical address via the physical mapping table. This means the CPU, which is aware of the data's locality characteristics, can convey this locality information to the SSD through commands. Consequently, the SSD inherently has access to data locality information, which can be leveraged to place similar types of data in proximity within the same space. This approach enhances NAND parallelism and improves garbage collection efficiency. Research related to SSD data locality includes studies on Open-Channel SSDs [4], ZNS-SSDs [6], and software-based data classification for SSDs [12]. Instead of using a CXL-only device that merely replaces the PCIe interconnect of traditional SSDs with CXL, integrating SSDs that take such locality considerations into account can effectively address the challenges posed by memory-like CXL-SSDs.
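To make the locality idea concrete, the Python sketch below assumes a hypothetical per-write locality hint (in the spirit of multi-stream or ZNS-style placement) and groups writes carrying the same hint into the same open block, so related data stays physically close and can be erased together.

# Locality-hinted placement inside the SSD (hypothetical interface): writes
# with the same hint are appended to the same open block, aiding GC efficiency.
from collections import defaultdict

PAGES_PER_BLOCK = 512                  # matches the geometry in Table 1
open_blocks = defaultdict(list)        # hint -> pages in the currently open block
sealed_blocks = defaultdict(list)      # hint -> list of completely written blocks

def write_page(lpn: int, hint: str) -> None:
    """Append a logical page to the open block associated with its locality hint."""
    open_blocks[hint].append(lpn)
    if len(open_blocks[hint]) == PAGES_PER_BLOCK:
        sealed_blocks[hint].append(open_blocks.pop(hint))  # seal the full block

# Hypothetical usage: hints reflect what the host knows about its own data.
for lpn in range(600):
    write_page(lpn, "tensor-weights")
for lpn in range(600, 700):
    write_page(lpn, "db-index")

print(len(sealed_blocks["tensor-weights"]))   # 1 sealed block of 512 pages
print(len(open_blocks["tensor-weights"]))     # 88 pages still in an open block
print(len(open_blocks["db-index"]))           # 100 pages, kept in a separate block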
VII. CONCLUSION
This paper explores the challenges of the CXL-SSD related to mismatches in the memory hierarchy, CPU resources, and the SSD, along with their potential issues. As our analysis indicates, the CXL-SSD not only has slow latency but also causes CPU resource contention that affects DRAM data. The CXL-SSD also suffers from inefficiencies in SSD access due to internal cache underutilization and unnecessary NAND accesses. Although existing studies on tiered memory systems address some of these challenges, as reviewed in Section I and Section III, the mismatches in the old memory hierarchy reveal the need for improvement. Based on our findings, we propose a dedicated cache for the CXL-SSD, the expansion of the internal SSD cache, and effective methods for managing memory-like data within the SSD. Research into a new cache hierarchy that retains non-volatile benefits while providing additional data capabilities will be crucial for enhancing the feasibility of CXL-SSDs.
ACKNOWLEDGMENTS
This work was supported by Samsung Electronics Co., Ltd, under Grant (IO201210-07936-01,
IO250214-11969-01) and Institute of Information & communications Technology Planning
& Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2024-00402898,
Simulation-based High-speed/High-Accuracy Data Center Workload/System Analysis Platform).
Won Woo Ro is the corresponding author.
References
D. Das Sharma, R. Blankenship, and D. Berger, ``An introduction to the compute express
link (cxl) interconnect,'' ACM Computing Surveys, vol. 56, no. 11, pp. 1–37, 2024.

H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen,
M. Chowdhury, S. Kanaujia, and P. Chauhan, ``TPP: Transparent page placement for CXL-enabled
tieredmemory,'' Proc. of the 28th ACM International Conference on Architectural Support
for Programming Languages and Operating Systems, vol. 3, ser. ASPLOS '23. ACM, Mar.
2023.

R. Abdullah, H. Lee, H. Zhou, and A. Awad, ``Salus: Efficient security support for
CXL-expanded gpu memory,'' Proc. of 2024 IEEE International Symposium on High-Performance
Computer Architecture (HPCA), IEEE, pp. 1–15, 2024.

I. L. Picoli, N. Hedam, P. Bonnet, and P. Tözün, ``Open-channel SSD (what is it good
for),'' Proc. of Conference on Innovative Data Systems Research, 2020.

M. Bjørling, ``From open-channel SSDs to zoned namespaces,'' Proc. of Linux Storage
& Filesystems Conference (Vault), vol. 1, p. 20, 2019.

M. Bjørling, A. Aghayev, H. Holmberg, A. Ramesh, D. L. Moal, G. R. Ganger, and G.
Amvrosiadis, ``ZNS: Avoiding the block interface tax for flash-based SSDs,'' Proc.
of USENIX Annual Technical Conference (USENIX ATC 21), USENIX Association, pp. 689–703,
Jul. 2021. [Online]. Available: https://www.usenix.org/conference/atc21/presentation/bjorling

N. Tehrany and A. Trivedi, ``Understanding nvme zoned namespace (ZNS) flash ssd storage
devices,'' arXiv preprint arXiv:2206.01547, 2022.

T. Stavrinos, D. S. Berger, E. Katz-Bassett, and W. Lloyd, ``Don't be a blockhead:
zoned namespaces make work on conventional ssds obsolete,'' Proc. of the Workshop
on Hot Topics in Operating Systems, pp. 144–151, 2021.

N. Tehrany, K. Doekemeijer, and A. Trivedi, ``Understanding (un)written contracts
of nvme zns devices with zns-tools,'' arXiv preprint arXiv:2307.11860, 2023.

H. Bae, J. Kim, M. Kwon, and M. Jung, ``What you can't forget: Exploiting parallelism
for zoned namespaces,'' Proc. of the 14th ACM Workshop on Hot Topics in Storage and
File Systems, pp. 79–85, 2022.

H. Park, E. Lee, J. Kim, and S. H. Noh, ``Lightweight data lifetime classification
using migration counts to improve performance and lifetime of flash-based ssds,''
Proc. of the 12th ACM SIGOPS Asia-Pacific Workshop on Systems, pp. 25–33, 2021.

S. Oh, J. Kim, S. Han, J. Kim, S. Lee, and S. H. Noh, ``MIDAS: Minimizing write amplification in log-structured systems through adaptive group number and size configuration,'' Proc. of 22nd USENIX Conference on File and Storage Technologies (FAST 24), pp. 259–275, 2024.

M. Jung, ``Hello bytes, bye blocks: Pcie storage meets compute express link for memory
expansion (CXL-SSD),'' Proc. of the 14th ACM Workshop on Hot Topics in Storage and
File Systems, pp. 45–51, 2022.

M. Kwon, S. Lee, and M. Jung, ``Cache in hand: Expander-driven CXL prefetcher for
next generation CXL-SSD,'' Proc. of the 15th ACM Workshop on Hot Topics in Storage
and File Systems, pp. 24–30, 2023.

S.-P. Yang, M. Kim, S. Nam, J. Park, J.-Y. Choi, E. H. Nam, E. Lee, S. Lee, and B. S. Kim, ``Overcoming the memory wall with CXL enabled SSDs,'' Proc. of USENIX Annual Technical Conference (USENIX ATC 23), pp. 601–617, 2023.

R. Branco and B. Lee, ``Cache-related hardware capabilities and their impact on information
security,'' ACM Computing Surveys, vol. 55, no. 6, pp. 1–35, 2022.

M. A. Khelif, J. Lorandel, O. Romain, M. Regnery, D. Baheux, and G. Barbu, ``Toward
a hardware man-in-the-middle attack on pcie bus,'' Microprocessors and Microsystems,
vol. 77, 103198, 2020.

K. Lee, S. Kim, J. Lee, D. Moon, R. Kim, H. Kim, H. Ji, Y. Mun, and Y. Joo, ``Improving
key-value cache performance with heterogeneous memory tiering: A case study of cxl-based
memory expansion,'' IEEE Micro, 2024.

M. Arif, K. Assogba, M. M. Rafique, and S. Vazhkudai, ``Exploiting CXL-based memory
for distributed deep learning,'' Proc. of the 51st International Conference on Parallel
Processing, pp. 1–11, 2022.

C. Tan, A. F. Donaldson, and J. Wickerson, ``Formalising CXL cache coherence,'' arXiv
preprint arXiv:2410.15908, 2024.

A. Tavakkol, J. Gomez-Luna, M. Sadrosadati, S. Ghose, and O. Mutlu, ``MQSim: A framework
for enabling realistic studies of modern multi-queue SSD devices,'' Proc. of USENIX
Conference on File and Storage Technologies (FAST '18), 2018.

B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, ``Benchmarking
cloud serving systems with YCSB,'' Proc. of the 1st ACM Symposium on Cloud Computing,
pp. 143–154, 2010.

S. Chen, X. Tang, H. Wang, H. Zhao, and M. Guo, ``Towards scalable and reliable in-memory
storage system: A case study with redis,'' Proc. of IEEE Trustcom/BigDataSE/ISPA,
IEEE, pp. 1660–1667, 2016.

A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer, ``AI and memory
wall,'' IEEE Micro, vol. 44, no. 3, pp. 33-39, 2024.

W. K. Josephson, L. A. Bongo, K. Li, and D. Flynn, ``DFS: A file system for virtualized
flash storage,'' ACM Transactions on Storage (TOS), vol. 6, no. 3, pp. 1–25, 2010.

C. Mellor, ``Storage with the speed of memory? Xpoint, Xpoint, that's our plan,''
Accessed: 2025-03-31. [Online]. Available: https://www.theregister.com/2016/04/21/storage
approaches memory speed with xpoint and storageclass memory/

Samsung Semiconductor Tech Blog, ``Evolving storage solutions: Breakthroughs, innovations, and a sustainable future,'' Accessed: 2023-07-21. [Online]. Available: https://semiconductor.samsung.com/newsevents/tech-blog/evolving-storage-solutions-breakthroughs-innovationsand-a-sustainable-future/

Transcend, ``MTE662T / MTE662T-I,'' Accessed: 2023-07-21. [Online]. Available: https://www.transcendinfo.com/embedded/product/embedded-ssd-solutions/mte662t-mte662ti

S. Im and D. Shin, ``ComboFTL: Improving performance and lifespan of mlc flash memory
using SLC flash buffer,'' Journal of Systems Architecture, vol. 56, no. 12, pp. 641–653,
2010.

F. Wu, J. Zhou, S. Wang, Y. Du, C. Yang, and C. Xie, ``FastGC: Accelerate garbage
collection via an efficient copyback-based data migration in ssds,'' Proc. of the
55th Annual Design Automation Conference, pp. 1–6, 2018.

C. Jeon, Y. Choi, K. Rhew, J. Bae, Y. Cho, and S. Pae, ``A systematic study and lifetime
modeling on the board level reliability of ssd after temperature cycling test,'' Proc.
of IEEE 71st Electronic Components and Technology Conference (ECTC), IEEE, pp. 1007–1013,
2021.

Same as [27]: Samsung Semiconductor Tech Blog, ``Evolving storage solutions: Breakthroughs, innovations, and a sustainable future.''

Same as [28]: Transcend, ``MTE662T / MTE662T-I.''

Hoon Hwi Lee received his B.S. degree in electronics engineering from Dongguk University,
Seoul, South Korea, in 2020. He is currently pursuing a Ph.D. degree with the Embedded
Systems and Computer Architecture Laboratory, the School of Electrical and Electronic
Engineering, Yonsei University, Seoul, South Korea. His current research interests
include memory systems, storage, and databases.
Min Jae Kim received his B.S. degree in electronics engineering from Kyung Hee University,
Yongin, South Korea, in 2022. He is currently working toward a Ph.D. degree with the
Embedded Systems and Computer Architecture Laboratory, the School of Electrical and
Electronic Engineering, Yonsei University, Seoul. His current research interests include
memory systems.
Jun Woo You received his B.S. degree in electrical and electronic engineering from Yonsei University, South Korea, in 2024. He is currently pursuing a Ph.D. degree in
Embedded Systems and Computer Architecture Laboratory at the School of Electrical
and Electronic Engineering, Yonsei University, Seoul, South Korea, under the supervision
of Professor Won Woo Ro. His research interests include memory systems and GPU systems.
Hyung Jun Jang received his B.S. degree in computer engineering from Yonsei University,
Wonju, South Korea, in 2019. He is currently pursuing a Ph.D. degree with the Embedded
Systems and Computer Architecture Laboratory, School of Electrical and Electronic
Engineering, Yonsei University, Seoul, South Korea. His current research interests
include DNN accelerators, multicore accelerator systems, and accelerator resource
management.
Won Woo Ro received his B.S. degree in electrical engineering from Yonsei University,
Seoul, South Korea, in 1996, and his M.S. and Ph.D. degrees in electrical engineering
from the University of Southern California, in 1999 and 2004, respectively. He worked
as a Research Scientist with the Electrical Engineering and Computer Science Department,
University of California, Irvine. He currently works as a Professor with the School
of Electrical and Electronic Engineering, Yonsei University. Prior to joining Yonsei
University, he worked as an Assistant Professor with the Department of Electrical
and Computer Engineering, California State University, Northridge. His industry experience
includes a college internship with Apple Computer, Inc., and a contract software engineer
with ARM, Inc. His current research interests include high-performance microprocessor
design, GPU microarchitectures, neural network accelerators, and memory hierarchy
design.