
  1. (School of Computer Science and Engineering, Pusan National University, Busan 26241, Korea)

Keywords: NVMe SSDs, proportional-share I/O scheduler, multi-queue block layer, Linux, Cgroups


In cloud computing services, it is a critical problem that system resources must be isolated and allocated among multiple tenants according to their different service requirements (3). In particular, as high-performance storage such as SSDs is introduced into the cloud computing area, sharing the I/O resources of SSDs among multiple tenants with low overhead has become an important problem. Linux Cgroups (5,6) is a resource control framework of Linux that allocates system resources such as CPU, memory, and I/O bandwidth to each process group (called a cgroup). Specifically, it provides a proportional I/O resource sharing scheme that distributes I/O bandwidth to each cgroup in proportion to its I/O weight through an I/O scheduler.
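For example, with the cgroup-v2 interface, per-cgroup I/O weights are set by writing to the io.weight file; the cgroup names and the 1000:100 weight ratio below are illustrative, not taken from the paper:

```shell
# Create two cgroups and assign different I/O weights (cgroup-v2 interface).
# The cgroup names and the 1000:100 weight ratio are illustrative examples.
mkdir /sys/fs/cgroup/tenantA /sys/fs/cgroup/tenantB
echo "default 1000" > /sys/fs/cgroup/tenantA/io.weight   # high-priority tenant
echo "default 100"  > /sys/fs/cgroup/tenantB/io.weight   # low-priority tenant
echo $$ > /sys/fs/cgroup/tenantA/cgroup.procs            # move this shell into tenantA
```

Whether a weight written this way is actually enforced depends on the active I/O scheduler, which is exactly the problem this paper addresses for NVMe SSDs.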

NAND flash memory-based SSDs (Solid-State Drives) have recently become very popular as secondary storage devices because of their outstanding I/O performance compared to HDDs (Hard Disk Drives). However, the legacy storage interface, Serial ATA (SATA), has insufficient maximum bandwidth because it supports only a single I/O queue, so it cannot take full advantage of the internal parallelism of SSDs. NVM Express (NVMe) (4) is the state-of-the-art storage interface providing multiple submission queues for SSDs, which allows it to achieve more than one million IOPS by maximizing the internal parallelism of SSDs. However, I/O bandwidth sharing is not efficiently supported for NVMe SSDs with Linux Cgroups.

Several studies have proposed proportional I/O schedulers for multi-queue storage (13,14). However, they cannot fully utilize the performance of NVMe SSDs because of scalability problems. In particular, the BFQ scheduler (13), the only proportional-share scheduler in Linux, suffers from performance overhead due to its single dispatch queue. According to our preliminary performance evaluation, it degrades the I/O throughput of an NVMe SSD by up to 63% compared to the None scheduler.

In this paper, we present a scalable multi-queue I/O scheduler that supports proportional I/O bandwidth sharing for NVMe SSDs without performance degradation. The proposed scheduler employs a set of cgroup-dedicated request queues for each CPU to dispatch I/O requests in parallel. I/O bandwidth is shared by periodically allocating I/O tokens to each cgroup in proportion to its I/O weight. Consequently, thanks to this scalable architecture, our scheduler can dispatch I/O requests in parallel and share I/O bandwidth with no global synchronization between multiple cgroups. Moreover, the total amount of I/O tokens is updated periodically to adapt to the fluctuating performance of NVMe SSDs caused by internal operations such as garbage collection. The proposed I/O scheduler is implemented on Linux kernel 5.2.0 and evaluated with both synthetic and real workloads. The evaluation results show that our I/O scheduler improves I/O bandwidth by up to 3.61 times over the BFQ scheduler while allocating I/O bandwidth exactly in proportion to I/O weights. The scalability evaluation also shows that, unlike other multi-queue proportional-share schedulers, our scheduler has no scalability problem even with a high degree of multi-threading.

The remainder of this paper is organized as follows. Section II reviews previous studies on I/O resource sharing with SSDs. Section III presents the preliminary evaluation results for the BFQ scheduler. Section IV describes the proposed proportional-share I/O scheduler. Experimental results are presented in Section V, and the conclusion is given in Section VI.


There have been several studies on proportional I/O sharing schemes for multi-queue SSDs. Ahn et al. (12) implemented the weight-based dynamic throttling (WDT) technique in the Linux Cgroups layer to throttle bandwidth to match the I/O weight of each cgroup. Although it guarantees fairness at the Linux Cgroups layer, fairness can still be degraded by unexpected scheduling because WDT is implemented above the multi-queue block layer (7). Moreover, the responsibility for guaranteeing fairness has traditionally belonged to the scheduler. Therefore, it is hierarchically appropriate that I/O schedulers in the multi-queue block layer carry out proportional I/O sharing.

The Budget Fair Queueing (BFQ) scheduler proposed by Valente et al. (13) is currently the default proportional-share scheduler for the Linux multi-queue block layer. The BFQ scheduler allocates a budget, the number of sectors available for dispatching I/O requests, to each cgroup according to its I/O weight to achieve proportional I/O sharing. However, in the BFQ scheduler, the internal parallelism of multi-queue NVMe SSDs cannot be fully utilized because only one request queue can dispatch I/O requests to the storage device at a time.

Hedayati et al. (14) proposed the MQFQ scheduler, which adapts a classical fair queueing algorithm to multi-queue designs and employs scalable data structures to minimize synchronization overhead. Nevertheless, MQFQ shows a scalability problem under workloads with an extremely high degree of concurrency because a global data structure must be consulted whenever an I/O request is dispatched. In addition, since MQFQ provides one request queue per CPU, the scalability problem appears when the number of threads exceeds the number of CPUs.

Kim et al. (15) proposed an I/O differentiation scheme based on Linux process priority. However, it focuses on improving the performance of I/O-intensive processes by considering process priority in I/O scheduling, not on sharing I/O bandwidth proportionally.

Proportional I/O sharing schemes for multi-queue SSDs should be implemented somewhat differently from the traditional schemes for single-queue SSDs (9-11). First, the parallelism of the multi-queue block layer should not be harmed: global synchronization should be minimized, and each I/O queue should perform proportional I/O sharing without interacting with other cgroups. Unlike in the single-queue block layer, attempting to maintain bandwidth fairness between multiple cgroups through global synchronization would break the parallel processing of multiple queues and cause severe I/O performance degradation.

Second, work conservation should also be considered when implementing proportional I/O sharing schemes for multi-queue NVMe SSDs. Work conservation in a proportional I/O bandwidth sharing scheme means that the storage device should be kept busy rather than left idle, even after all cgroups have consumed their allocated I/O bandwidth.

Fig. 1. Evaluation results of proportional I/O bandwidth sharing for four cgroups with FIO benchmark.



In this section, we show the performance degradation of the BFQ scheduler through experimental results. To evaluate the fairness and performance overhead of the BFQ scheduler, we measured the I/O throughput of four cgroups with different I/O weights (1000:500:250:100) under the None and BFQ schedulers. Note that the None scheduler is a FIFO-style I/O scheduler with little scheduling overhead, and it provides a baseline for the maximum performance of the NVMe SSD (17). Fig. 1 shows the I/O throughput of both the None and BFQ schedulers measured using a synthetic mixed 4 KB random workload with a 0.5 read ratio, generated by the FIO benchmark (16). As shown in the figure, the None scheduler does not provide any proportional bandwidth sharing for cgroups, while the BFQ scheduler accurately shares the bandwidth among the four cgroups according to their I/O weights. However, with the BFQ scheduler, the aggregate throughput was seriously degraded, to about 0.38 times that of the None scheduler. Moreover, the throughput of the cgroup with the highest weight under the BFQ scheduler is lower than that of any cgroup under the None scheduler. This means that the BFQ scheduler cannot fully utilize the performance of NVMe SSDs.

The reason for the performance degradation in the BFQ scheduler is its serialized request dispatching process, which is used to guarantee fairness between multiple cgroups (14). Therefore, in this paper, we propose a new proportional-share I/O scheduler that shares I/O bandwidth accurately and proportionally without wasting the high performance of multi-queue SSDs.

Fig. 2. (a) No cgroup dedicated request queue, (b) serialized request dispatching process, (c) per-cgroup request queue and no global synchronization.



In this section, we describe the Kyber-fairness scheduler, the novel proportional-share I/O scheduler for the multi-queue block layer proposed in this paper.

1. Overview of Kyber-fairness Scheduler

The Kyber-fairness scheduler is implemented by extending an existing multi-queue I/O scheduler of Linux, the Kyber scheduler (19). Kyber tries to achieve a target I/O latency by tuning the number of requests in the dispatch queue. However, it has no proportional bandwidth sharing mechanism because it does not employ a separate request queue for each cgroup, as shown in Fig. 2(a). We decided to modify the Kyber scheduler because of (1) its low scheduling overhead and (2) its simple architecture. The proposed scheduler is implemented according to the following design principles, illustrated in Fig. 2(c).

First, Kyber-fairness employs a set of cgroup-dedicated request queues for each submission queue of the NVMe SSD to manage fairness among multiple cgroups. Previous multi-queue schedulers suffer from a lack of fairness or from scalability problems due to, respectively, the absence of separate request queues and a serialized dispatching mechanism, as shown in Fig. 2(a) and (b). In contrast, the proposed scheduler provides proportional bandwidth sharing for cgroups while maintaining scalability on multi-core systems.

Second, Kyber-fairness distributes the bandwidth of the NVMe SSD by allocating tokens to each cgroup proportionally. Note that a token represents I/O bandwidth allocated to a cgroup for a certain amount of time. Therefore, each cgroup needs no synchronization with other cgroups to dispatch requests. As a result, scalable proportional I/O sharing is guaranteed even under a high degree of multi-threading.

Fig. 3. Overview of Kyber-fairness scheduler architecture.


2. Proportional I/O Sharing in Kyber-fairness

Fig. 3 describes the proportional bandwidth sharing scheme of Kyber-fairness, which is composed of three steps: (1) determining the total amount of tokens, (2) allocating tokens to cgroups in proportion to their I/O weights, and (3) consuming tokens to dispatch I/O requests.

First, the total amount of tokens represents the maximum I/O bandwidth that can be served by the underlying NVMe SSD. It can be predicted based on the average number of tokens used in previous periods, as described in the next section. In the second step, the token manager periodically assigns tokens to each cgroup in proportion to its I/O weight. The last step is to dispatch I/O requests from the request queue dedicated to a certain cgroup to the NVMe SSD by consuming tokens. If there are not enough tokens, the request cannot be dispatched and is throttled in the request queue. Because each cgroup does not need to be concerned with other cgroups, the proposed scheduler is guaranteed to have no scalability problem even with a large number of cgroups.
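Step (2) amounts to a weighted split of the predicted token budget. A minimal sketch, with illustrative names and numbers rather than the actual kernel implementation:

```python
# Sketch of step (2): distributing a predicted token budget to cgroups
# in proportion to their I/O weights. Function and cgroup names are
# illustrative, not from the kernel code.

def allocate_tokens(total_tokens, weights):
    """Split total_tokens among cgroups proportionally to their weights."""
    total_weight = sum(weights.values())
    return {cg: total_tokens * w // total_weight for cg, w in weights.items()}

# Four cgroups with the weights used in the paper's experiments.
weights = {"cg1": 1000, "cg2": 500, "cg3": 250, "cg4": 100}
tokens = allocate_tokens(100_000, weights)
```

Because each cgroup only reads its own token count afterward, this allocation is the only point that touches shared state.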

When consuming tokens, writing is a more costly operation than reading, so if the same number of tokens were consumed without distinguishing reads from writes, read-dominant cgroups could be disadvantaged by write-dominant cgroups. To address this inequality, we multiply the token consumption of a write request by the ratio of write latency to read latency, recorded in the histogram buckets of the Kyber scheduler. The token manager refills the tokens of each cgroup if one of the following two conditions is met: (1) a certain period (e.g., 100 ms) has elapsed, or (2) the tokens of all cgroups in the active state have been exhausted. A cgroup becomes inactive if no I/O requests were inserted into its queues during the previous period.
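Our reading of this consumption rule can be sketched as follows; the 4 KiB token unit and the helper names are assumptions for illustration, not the paper's exact accounting:

```python
# Sketch of token consumption with read/write cost differentiation: a write
# consumes more tokens than a read of the same size, scaled by the observed
# write/read latency ratio. The 4 KiB-per-token unit is an assumption.

def token_cost(nbytes, is_write, write_read_latency_ratio):
    """Tokens consumed by one request (assumed unit: 1 token per 4 KiB read)."""
    base = max(1, nbytes // 4096)
    return base * write_read_latency_ratio if is_write else base

def try_dispatch(cg_tokens, nbytes, is_write, ratio):
    """Dispatch if the cgroup has enough tokens; otherwise throttle."""
    cost = token_cost(nbytes, is_write, ratio)
    if cg_tokens >= cost:
        return True, cg_tokens - cost   # dispatched; tokens deducted
    return False, cg_tokens             # throttled until the next refill
```

With a latency ratio of 3, an 8 KiB write costs three times as many tokens as an 8 KiB read, which is how write-dominant cgroups are prevented from crowding out read-dominant ones.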

3. Prediction of NVMe SSD Bandwidth

As mentioned above, the total amount of tokens represents the I/O throughput that can be served by the underlying NVMe SSD. However, it is difficult to accurately predict the I/O performance of an NVMe SSD because it is greatly affected by the characteristics of the I/O workload and by internal operations such as garbage collection and wear leveling. If the predicted token amount exceeds the available bandwidth of the NVMe SSD, fairness between cgroups deteriorates. Conversely, if the required token amount is underestimated, the overhead of Kyber-fairness increases due to frequent token reassignment. In this section, we explain how the number of tokens can be accurately predicted.

Algorithm 1 describes the bandwidth prediction and token allocation process. First, the Kyber-fairness scheduler records the number of used tokens ($u_{i}^{k}$) and the throttled time ($t_{i}^{k}$) over the periods, and calculates the average number of tokens used per allocation period by each cgroup ($u_{i}^{avg}$). This causes only negligible overhead because no global synchronization is required between multiple cgroups. Then, the number of tokens required for the next period is predicted based on the sum of the average used tokens ($U_{k}$). If the throttled time of a cgroup ($t_{i}^{k}$) is shorter than the allocation period, the predicted number increases; otherwise, it decreases. The predicted tokens are distributed over the active cgroups in proportion to their weights.
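The feedback loop can be illustrated with the following sketch. This is our reading of the prediction step from the surrounding text, not the authors' exact code, and ADJUST is an assumed tuning factor rather than a value from the paper:

```python
# Sketch of the bandwidth prediction feedback loop: the next period's token
# budget starts from the sum of each cgroup's average token usage, and is
# nudged up when cgroups were throttled for less than the full period (the
# SSD had spare bandwidth), and down otherwise. ADJUST is an assumption.

ADJUST = 0.1  # assumed step size for the feedback adjustment

def predict_total_tokens(avg_used, throttled, period):
    """avg_used/throttled: per-cgroup average token usage and throttled time."""
    base = sum(avg_used.values())
    # Cgroups throttled for less than the whole period exhausted their tokens
    # early, so the device can likely absorb more; otherwise back off.
    if all(t < period for t in throttled.values()):
        return int(base * (1 + ADJUST))
    return int(base * (1 - ADJUST))
```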

Algorithm 1. Token Prediction and Reallocation for The Next Period


4. Conformance to Multi-queue Scheme

The Kyber-fairness scheduler assigns individual tokens to each cgroup, which can be consumed without any global synchronization of proportional I/O sharing information. In the proposed scheduler, global synchronization is required only when tokens are allocated to each cgroup. In other words, the Kyber-fairness scheduler does not refer to any global structure when dispatching I/O requests. As a result, scalable I/O performance can be delivered by fully utilizing the internal parallelism of NVMe SSDs.

Table 1. Experimental environments

CPU: Intel E5-2620 (8-core) x 2
SSD: Intel DC P4500 2TB
OS: Ubuntu 18.04 (Kernel 5.2.0)

Table 2. Parameters for FIO benchmark

IO engine: Libaio (32 I/O depth)
Block size: Random: 4KB / Sequential: 1MB
Cgroups: 4 cgroups (4 threads per cgroup)
I/O mode: Direct I/O
File size: 10GB (60 sec runtime)


To evaluate the performance and fairness of the Kyber-fairness scheduler, we implemented and evaluated it on Linux kernel 5.2.0 in the experimental environment described in Table 1 (8). FIO and trace-replay (20) were used for measuring the I/O performance of synthetic and real workloads, respectively.

The FIO benchmark performs four different synthetic I/O workloads: random read, random write, sequential read, and sequential write. Four cgroups are generated in total, and each one has four threads. The I/O weights of the cgroups are 100, 250, 500, and 1000, respectively. Table 2 shows the parameters used for FIO.
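A FIO job file matching these parameters could look like the following; this is a reconstruction from Table 2 with illustrative job and cgroup names, not the authors' exact configuration:

```ini
; Reconstructed from Table 2: libaio, queue depth 32, direct I/O,
; 4 KB random reads, 10 GB file, 60 s runtime. Job and cgroup names
; are illustrative.
[global]
ioengine=libaio
iodepth=32
direct=1
bs=4k
size=10g
runtime=60
time_based

[cgroup-w1000]
rw=randread
numjobs=4
cgroup=cg_w1000        ; fio attaches these jobs to the named blkio cgroup
```

One such job section would be defined per cgroup, with the corresponding I/O weight configured through Cgroups as described earlier.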

Fig. 4 shows the results of the FIO benchmark. Fig. 4(a) shows that the Kyber-fairness scheduler delivers I/O throughput close to that of the None scheduler for all types of workloads. It improves I/O performance by 2.68x and 3.61x for random read and random write workloads, respectively, compared to the BFQ scheduler. However, the performance improvement is less pronounced for sequential workloads. The reason is that the scalability problem of BFQ remains hidden for sequential workloads, which do not require high scalability. Moreover, as shown in Fig. 4(b), the proposed Kyber-fairness scheduler ensures fairness among multiple cgroups with different I/O weights. Note that the Fairness Index (FI) (18) indicates how accurately the I/O bandwidth is shared in proportion to I/O weights; the closer it is to 1, the more accurate the sharing.
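The FI can be computed as Jain's index (18) over weight-normalized throughputs; the sketch below reflects our understanding of that definition:

```python
# Jain's fairness index over weight-normalized throughputs x_i = B_i / w_i:
# FI = (sum x_i)^2 / (n * sum x_i^2). FI == 1 means the bandwidth is shared
# exactly in proportion to the weights.

def fairness_index(bandwidths, weights):
    xs = [b / w for b, w in zip(bandwidths, weights)]
    n = len(xs)
    return sum(xs) ** 2 / (n * sum(x * x for x in xs))

# Perfectly proportional sharing (throughput equals weight) yields FI = 1.0.
fi = fairness_index([1000, 500, 250, 100], [1000, 500, 250, 100])
```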

Fig. 4. Evaluation results of proportional I/O bandwidth sharing for four cgroups with synthetic workloads (a) Total I/O Throughput(MB/s), (b) Fairness Index(FI).


To generate real workloads, the block I/O trace replay tool trace-replay (20) is used with block I/O traces from SNIA (21). Tables 3 and 4 show the parameters used for trace-replay and the characteristics of the I/O workloads, respectively. In our experiments, four cgroups are created, each replaying one of the block I/O traces described in Table 4, with different I/O weights assigned to them. Note that all traces are replayed by four concurrent threads for 60 seconds without considering the timing of the I/O requests.

Table 3. Parameters for trace-replay

IO depth
Cgroups: 4 cgroups
Runtime: 60 sec
Num. of threads: 4 threads

Table 4. Characteristics of I/O workloads

Workload | Request size (total / average) | Required Bandwidth | R : W
LUN0 | 1.12 GB / 31.1 KB | | 3.44 : 1
LUN1 | 505 MB / 24.8 KB | | 3.19 : 1
LUN2 | 2.88 GB / 50.4 KB | | 2.78 : 1
LUN3 | 1.87 GB / 63.7 KB | | 1.60 : 1

Fig. 5 shows the I/O performance and fairness of the I/O schedulers for real workloads. Fig. 5(a) shows that the Kyber-fairness scheduler improves I/O performance by up to 1.33 times over the BFQ scheduler. Moreover, the Kyber-fairness scheduler shows a slightly better fairness index than the BFQ scheduler, as shown in Fig. 5(b). However, in Fig. 5(a), the performance gap between BFQ and Kyber-fairness is narrower when LUN3 has the highest weight. The reason is that LUN3 does not require scalable performance compared to the other LUNs, so the scalability problem of the BFQ scheduler is not significantly exposed.

Fig. 5. Evaluation results of proportional I/O bandwidth sharing for four cgroups with real-world workloads (a) Total I/O Throughput(MB/s), (b) Fairness Index(FI).


Fig. 6 shows the scalability evaluation results of the I/O schedulers. To evaluate scalability, we use small (1 KB) random read requests with an increasing number of cgroups. As shown in the figure, Kyber-fairness achieves performance close to that of the None scheduler even as the number of cgroups increases, whereas the throughput of the BFQ scheduler decreases sharply. This result shows that our proposed scheduler does not harm the parallelism of the multi-queue block design even with a high degree of multi-threading.

Fig. 6. Scalability evaluation results with increasing number of cgroups.



We proposed a scalable, low-overhead proportional-share I/O scheduler for the multi-queue block layer. The proposed I/O scheduler efficiently supports proportional I/O bandwidth sharing for NVMe SSDs by removing global synchronization and by employing a token amount prediction technique. According to the experimental results, our I/O scheduler shows up to 3.61 times better I/O performance than the existing proportional-share I/O scheduler while also accurately guaranteeing fairness. Moreover, the scalability evaluation shows that the proposed Kyber-fairness scheduler can deliver scalable performance even with an extremely high degree of multi-threading.


This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2018R1D1A3B07050034).


Popek G. J., Goldberg R. P., July 1974, Formal Requirements for Virtualizable Third Generation Architectures, Communications of the ACM, Vol. 17, No. 7, pp. 412-421
Li Z., Kihl M., Lu Q., Andersson J. A., Mar 2017, Performance Overhead Comparison between Hypervisor and Container Based Virtualization, 31st IEEE International Conference on Advanced Information Networking and Applications (AINA 2017), pp. 955-962
Xavier M. G., Mar 2015, A Performance Isolation Analysis of Disk-Intensive Workloads on Container-Based Clouds, 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2015), pp. 253-260
NVMe overview, 2020
Cgroups, 2016, cgroup-v1/cgroups.txt
Linux Containers (LXC), 2016, https://linuxcontainers.org/lxc/introduction/
Bjørling M., June 2013, Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems, 6th ACM International Systems and Storage Conference (SYSTOR 2013), pp. 1-10
Intel DC P4500 2TB NVMe SSD, 2017, https://ark.
Demers A., Keshav S., Shenker S., Aug 1989, Analysis and Simulation of a Fair Queueing Algorithm, ACM SIGCOMM Computer Communication Review, Vol. 19, No. 4, pp. 1-12
Axboe J., July 2004, Linux Block IO - Present and Future, Ottawa Linux Symposium, pp. 51-61
Jin W., Chase J. S., Kaur J., June 2004, Interposed Proportional Sharing for a Storage Service Utility, ACM SIGMETRICS Performance Evaluation Review, Vol. 32, No. 1, pp. 37-48
Ahn S., La K., Kim J., June 2016, Improving I/O Resource Sharing of Linux Cgroup for NVMe SSDs on Multi-core Systems, 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 2016)
Valente P., Checconi F., Sep 2010, High Throughput Disk Scheduling with Fair Bandwidth Distribution, IEEE Transactions on Computers, Vol. 59, No. 9, pp. 1172-1186
Hedayati M., July 2019, Multi-Queue Fair Queueing, USENIX Annual Technical Conference (ATC 2019), pp. 301-314
Kim K., Hong S., Kim T., Feb 2020, Supporting the Priorities in the Multi-queue Block I/O Layer for NVMe SSDs, Journal of Semiconductor Technology and Science, Vol. 20, No. 1, pp. 55-62
Axboe J., Flexible I/O Tester, https://github.com/axboe/fio
Open Benchmarking, 2020, Linux 5.6 IO Scheduler Benchmarks
Jain R., Chiu D., Hawe W., Sep 1984, A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer System, DEC Eastern Research Lab, Tech. Rep. TR-301
Sandoval O., 2017, Kyber Multi-queue I/O Scheduler
Oh Y., trace-replay, yongseokoh/trace-replay
Lee C., 2017, Understanding Storage Traffic Characteristics on Enterprise Virtual Desktop Infrastructure, 10th ACM International Systems and Storage Conference (SYSTOR 2017), pp. 1-11


Suho Son

Suho Son was born in Suncheon, Korea, in 1995.

He received the B.S. degree in the School of Computer Science and Engineering from Pusan National University, Korea, in 2020.

He is currently pursuing the M.S. degree in the School of Computer Science and Engineering from Pusan National University, Korea.

His interests include operating systems, cloud platforms and non-volatile memory storages.

Gijun Oh

Gijun Oh was born in Ulsan, Korea, in 1995.

He received the B.S. degree in the School of Computer Science and Engineering from Pusan National University, Korea, in 2020.

He is currently pursuing the M.S. degree in the School of Computer Science and Engineering from Pusan National University, Korea.

His interests include NVMe device driver, Flash-Translation Layer and Zoned Namespace SSD.

Sungyong Ahn

Sungyong Ahn received the B.S. and Ph.D. degrees in the Department of Computer Science and Engineering from Seoul National University, Korea, in 2003 and 2012, respectively.

He worked at Samsung Electronics as a Senior Engineer from 2012 to 2017.

In 2017, he joined the School of Computer Science and Engineering, Pusan National University, Busan, South Korea, where he is currently a Professor.

His interests include operating systems, solid-state disk, and cloud platform.