RSD Part 2: Performance

While it was not the most interesting question in RSD’s development, performance is, in the end, the sword by which any real-world project lives or dies. First, let me emphasize: never trust a benchmark you didn’t fake yourself. Hardware behaves viciously, and benchmarking is often more art than science. These are my results, and I trust them, but you should take them with a handful of salt. Second, RSD being a proof of concept, I haven’t had the time to really optimize it, so there may be room for improvement. Conversely, the current lack of features may flatter these numbers: if new features were introduced into the common I/O path, performance could suffer.
With that out of the way: overall, the performance looks good. As far as I can tell, RSD consistently outperforms QSD, and often even single-process (I/O-thread-using) qemu; where it does not, it does not lose by much. In contrast to QSD and qemu, RSD can benefit greatly from more I/O-submitting threads in the guest, often achieving several times the IOPS of qemu or QSD.
I’ve obtained my results with FIO in a qemu guest (30-second runs) on a machine with 24 CPU cores; the guest has just as many cores. The SSD is a Samsung 970 EVO Plus NVMe SSD (whose data sheet lists up to 620k IOPS for random reads and up to 560k IOPS for random writes), and the image is on a 16 GB ext4 filesystem on that SSD.
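The exact FIO invocations are not reproduced in this post; as a rough sketch, an in-guest run might look like the following, where the device path, engine, queue depth, and job count are assumptions:

    fio --name=bench --filename=/dev/vdb --direct=1 \
        --rw=randrw --bs=4k --ioengine=io_uring \
        --iodepth=16 --numjobs=2 --runtime=30 --time_based \
        --group_reporting

Here, --numjobs corresponds to the Y in the mqX-Yjobs labels used below.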
CPU usage has been measured only for QSD and RSD, because for qemu it is difficult to separate the I/O thread’s CPU time from that of the whole process.
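The post does not state how CPU usage was sampled; one straightforward way to get whole-process usage for a daemon over a run (assuming the sysstat tools are available, and qemu-storage-daemon as an example target) is:

    pidstat -p "$(pidof qemu-storage-daemon)" 1 30

pidstat -t would additionally break usage down per thread, but cleanly attributing just qemu’s I/O thread that way remains awkward.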

Test runs

I have tested the following program configurations (always with O_DIRECT where possible, i.e. not on tmpfs):
  • qemu-*: Single-process qemu (v8.0.0), with the virtio-blk device in a separate I/O thread.
    • qemu-blkio: Using the blkio library (driver=io_uring), which in turn uses the io_uring kernel interface.
    • qemu-io_uring: Using the built-in io_uring driver (driver=file,aio=io_uring).
    • qemu-native: Using Linux’s native AIO interface (driver=file,aio=native).
    • qemu-threads: Using no AIO interface, but a thread pool submitting synchronous requests instead (driver=file,aio=threads).
  • qsd-*: QSD (v8.0.0) run alongside qemu, using the vhost-user-blk interface. The variants (qsd-blkio, qsd-io_uring, qsd-native, qsd-threads) are the same as for qemu-*.
  • rsd-*: RSD (commit c42481f12933f4b5f6567f4ec192e8430eec153a) run alongside qemu, using the vhost-user-blk interface.
    • rsd-blkio: Using the blkio library (driver=file,aio=blkio, which is the default), which, like in qemu, internally uses the io_uring kernel interface.
    • rsd-blkio_batching: Same as rsd-blkio, but with the vhost-user-blk export configured to reduce its dependence on the kick and call file descriptors: First, after processing each request, it polls the virt queues for a while before yielding and waiting for a new kick through the kickfd (poll-duration). Second, after a request has been completed, it does not report this back to the front-end immediately, but waits a bit for more completions and batches them, to reduce the number of callfd activations (call-batching). On tmpfs, this can give a throughput boost, theoretically at the cost of latency and CPU usage (in practice, FIO seems to find the average latency still better). Without polling and call batching, the guest seems to submit requests one by one, and tmpfs finishes them basically instantly, resulting in an effective I/O queue depth of just 1 and thus maximum overhead from the vhost-user interface.
    • rsd-threads: Using no AIO interface, but a thread pool submitting synchronous requests instead (driver=file,aio=threads).
    • rsd-threads_batching: Same as rsd-threads, but using polling and call batching in vhost-user-blk.
Not all of these configurations are shown in the graphs below: e.g. qemu/qsd-native requires O_DIRECT, so it cannot be used on tmpfs; and since a thread pool generally does not give better performance than an AIO interface, the -threads variants are used only on tmpfs (where aio=native is not possible). Call batching is shown only where it clearly improves performance. Representative command lines for the qemu and QSD setups are sketched below.
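The full command lines are not part of this post; as a hedged sketch (image path, socket path, memory size, and queue count are all assumptions), the qemu-io_uring and qsd-io_uring setups might look roughly like this:

    # qemu-io_uring: single process, virtio-blk served by a separate I/O thread
    qemu-system-x86_64 ... \
        -object iothread,id=iot0 \
        -blockdev driver=file,node-name=proto,filename=test.img,aio=io_uring,cache.direct=on \
        -blockdev driver=raw,node-name=disk,file=proto \
        -device virtio-blk-pci,drive=disk,iothread=iot0,num-queues=2

    # qsd-io_uring: QSD exports the image via vhost-user-blk...
    qemu-storage-daemon \
        --blockdev driver=file,node-name=proto,filename=test.img,aio=io_uring,cache.direct=on \
        --blockdev driver=raw,node-name=disk,file=proto \
        --export vhost-user-blk,id=exp0,node-name=disk,addr.type=unix,addr.path=/tmp/vu.sock,num-queues=2

    # ...and qemu connects to it (vhost-user needs shareable guest memory)
    qemu-system-x86_64 ... \
        -object memory-backend-memfd,id=mem,size=8G,share=on \
        -numa node,memdev=mem \
        -chardev socket,id=vublk,path=/tmp/vu.sock \
        -device vhost-user-blk-pci,chardev=vublk,num-queues=2

The rsd-* variants replace the daemon side with RSD, whose invocation is not reproduced here; its poll-duration and call-batching export options were described above.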

SSD 4k

Figure 1: fio SSD random R/W (4k) IOPS – mqX-Yjobs means the virtio-blk device has X queues, and fio runs with Y jobs
Figure 2: fio SSD random R/W (4k) CPU usage
Figure 1 shows random reads and writes, i.e. FIO’s randrw mode. Read and write IOPS are summed to get the result, and the shaded area indicates the standard deviation (not statistically accurate: the read and write standard deviations are simply added, whereas for independent quantities the combined deviation would be √(σ_read² + σ_write²), so the shaded area overstates the spread).
RSD outperforms QSD and qemu in every configuration, though only slightly in the 1-queue, 1-job case. With a second queue, however, RSD already achieves 38 % higher IOPS, and with more jobs running simultaneously in the guest, the gap grows beyond +100 %. qemu and QSD, in contrast, show basically the same performance regardless of the number of queues or jobs, except that the *-blkio configurations get worse the more jobs the guest runs.
As figure 2 shows, RSD’s CPU usage is meanwhile no greater than you would expect. Where QSD always sits at around 80 % (i.e. 0.8 cores), RSD (in its default blkio configuration) uses slightly less CPU for one guest job (70 %), and its usage then roughly scales with the number of jobs (140 % for 2 jobs, 1000 % for 12 jobs).
Figure 3: fio SSD random read (4k) IOPS
Figure 4: fio SSD random write (4k) IOPS
When we look at pure reads and writes, things look a bit different. For reads (figure 3), qemu, QSD, and RSD all take a common performance hit in the single-queue configuration, with RSD between qemu and QSD. With more queues, RSD still considerably outperforms them, though. Interestingly, running simultaneous jobs in the guest does not improve RSD’s performance in this case, probably because it hits the hardware’s randread IOPS limit (the data sheet states 620k, but running fio on bare hardware yields just 424k).
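The bare-hardware figure comes from running fio directly against the raw device, outside the guest; such a run might look like the following (device path and parallelism are assumptions):

    fio --name=raw --filename=/dev/nvme0n1 --direct=1 \
        --rw=randread --bs=4k --ioengine=io_uring \
        --iodepth=64 --numjobs=4 --runtime=30 --time_based \
        --group_reporting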
For random writes (figure 4), RSD always outperforms qemu, even in the single-queue case (+34 %). The rest of the graph looks very similar to the randrw graph.

SSD 64k

Figure 5: fio SSD random R/W (64k) IOPS (maximum, i.e., shown is only the maximum each application achieves between the configurations it offers)
Figure 6: fio SSD random read (64k) IOPS (maximum)
Figure 7: fio SSD random write (64k) IOPS (maximum)
For a block size of 64k, results are clearly limited by bandwidth (figure 5, figure 6, figure 7), so all applications perform about the same. There is little difference between the configurations (i.e. AIO back-ends) either, so for legibility the figures show only the overall maximum per application.

tmpfs 4k

Figure 8: fio tmpfs random R/W (4k) IOPS, both all configurations and the per-application maximum for legibility
tmpfs behaves rather differently from normal devices, but it is nonetheless one way to get very high IOPS. Figure 8 shows the result for random reads and writes: RSD (using blkio) outperforms QSD in every configuration, and is about on par with single-process qemu until fio runs more than a single job. The more jobs fio runs, the higher the IOPS RSD achieves, pulling ahead by a large margin.
Note that in the single-job cases, RSD can only match qemu with the *_batching variants (i.e. with its vhost-user-blk back-end using polling and call batching); otherwise, its performance is “only” on par with QSD’s. The difference can be considerable, reaching +78 % in the mq24-1jobs case (between rsd-blkio and rsd-blkio_batching). These optimizations could probably improve QSD’s performance as well, but it is unclear whether qemu could benefit from them too, given that they are intended to improve vhost-user’s performance (which qemu does not use) by reducing the number of call/kick FD uses.
Figure 9: fio tmpfs random R/W (4k) CPU usage
Figure 10: fio tmpfs random R/W (4k) latency (only for the lowest-latency runs)
Figure 11: fio tmpfs random R/W (4k) minimum latency (only for the lowest-latency runs)
We expect polling to increase CPU usage, and call batching to increase latency. Figure 9 shows that CPU usage is indeed increased. Average latency does not increase (figure 10), but minimum latency does (figure 11).
Figure 12: fio tmpfs random read (4k) IOPS
Figure 13: fio tmpfs random write (4k) IOPS
Running pure reads and writes does not yield fundamentally different results from the randrw case. It is nice to see 3.5 million IOPS in the read case, though.

tmpfs 64k

Figure 14: fio tmpfs random R/W (64k) IOPS (maximum)
Figure 15: fio tmpfs random write (64k) IOPS (maximum)
As on the SSD, a block size of 64k seems to be limited by bandwidth, at least when writes are involved. As in the SSD tests, the graphs therefore show only the maximum number of IOPS per application, because the configurations are so close to one another.
Figure 16: fio tmpfs random read (64k) IOPS
Figure 17: fio tmpfs random read (64k) CPU usage
Interestingly, the picture is much different for pure reads (figure 16). This is the only case where each application’s respective thread pool back-end outperforms the native AIO back-ends (blkio/io_uring). Still, when looking at only one kind of back-end at a time, RSD generally outperforms QSD and qemu, except in mq1-1jobs and mq2-1jobs with a thread pool back-end, where call batching is needed.
This is also the only case where RSD’s performance decreases the more jobs are run in the guest. I haven’t investigated why, but the CPU usage graph (figure 17) may indicate that usage simply grows too high past a sweet spot, making CPU time (and, for tmpfs, perhaps RAM bandwidth) the constraint. Notably, this dip is also visible in the randrw and randwrite cases (figure 14, figure 15), though it is much less pronounced there.

Performance Discussion

Now that we’ve seen the benchmarking results, we can discuss some interesting findings: First, multi-threaded multi-queue support works as intended. RSD benefits directly from the guest using more virt queues. On hardware that can cope well with requests from multiple threads (like an SSD), this can greatly boost performance.
Second, it is actually not clear why RSD outperforms QSD even in the mq1-1jobs cases, or sometimes even qemu. One reason might be that RSD supports request batching: requests can be created separately from being awaited, so that when they are awaited, all of them can be submitted collectively. At least in the SSD 4k randwrite case, this is definitely the relevant reason: request batching did not work for write requests for a while, and before that was fixed, RSD’s single-job performance was slightly lower than qemu’s.
Third, of all of qemu’s and QSD’s I/O back-ends, the built-in io_uring back-end has generally performed best. This is interesting given that the libblkio back-end also uses io_uring, but almost always performs worse, especially with more than one I/O-submitting thread in the guest, where its performance drops for some reason. This may be because qemu’s blkio driver creates only a single submission queue, potentially leading to contention when I/O comes in through different virt queues; RSD, in contrast, creates one queue per I/O thread, i.e. one queue per virt queue. Still, maybe using libblkio does impose an inherent overhead, and a native io_uring interface could improve RSD’s performance further.
Finally, it is unclear why the thread pool back-ends universally performed so well in the tmpfs 64k randread case, not least because RSD’s thread pool back-end shares no code or concepts (besides using a thread pool) with qemu’s. Notably, this peculiarity is absent from the respective 4k case.