RSD Part 2: Performance
While it was not the most interesting question in RSD’s development, performance in the
end is the sword by which any real-world project lives or dies. First, let me
emphasize:
Never trust a benchmark you didn’t fake yourself. Hardware behaves
viciously, and benchmarking often is more art than science. These are my results, and I
trust them, but you should take them with a handful of salt. Second, with RSD being a
proof of concept, I haven’t had the time to really optimize it, so there may be room for
improvement. Conversely, the current lack of features might flatter the results: performance could suffer if new features were introduced into the common I/O path.
With that out of the way, overall, the performance looks good. As far as I can tell,
RSD consistently outperforms QSD, and often even a single-process (I/O-thread-using)
qemu. If it does not, it does not lose by much. In contrast to QSD and qemu, RSD can benefit greatly from more I/O-submitting threads in the guest, often achieving IOPS several times those of qemu or QSD.
I’ve obtained my results with FIO in a qemu guest (30-second runs), on a machine with 24
CPU cores, and the guest has just as many cores. The SSD is an NVMe SSD, specifically a
Samsung 970 EVO Plus (whose data sheet lists up to 620k IOPS for random reads, and up to
560k IOPS for random writes), and the image is on a 16 GB ext4 filesystem on that SSD.
CPU usage has been measured only for QSD and RSD, because for qemu, it is difficult to
separate the I/O thread’s CPU time from the whole process’s.
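For reference, a guest-side run could look roughly like the following sketch. The target device, I/O engine, queue depth, and job count are my placeholders, not the exact parameters behind these results:

```shell
# Hypothetical fio invocation matching the setup described above
# (4k random R/W, 30-second runs, O_DIRECT); device path, ioengine,
# iodepth, and numjobs are assumptions.
fio --name=randrw --rw=randrw --bs=4k --direct=1 \
    --ioengine=libaio --runtime=30 --time_based \
    --numjobs=4 --iodepth=16 \
    --filename=/dev/vdb
```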
Test runs
I have tested the following program configurations (always with O_DIRECT where possible, i.e. not on tmpfs):
- qemu-*: Single-process qemu (v8.0.0), with the virtio-blk device in a separate I/O thread.
  - qemu-blkio: Using the blkio library (driver=io_uring), which in turn uses the io_uring kernel interface.
  - qemu-io_uring: Using the built-in io_uring driver (driver=file,aio=io_uring).
  - qemu-native: Using Linux’s native AIO interface (driver=file,aio=native).
  - qemu-threads: Using no AIO interface, but a thread pool submitting synchronous requests instead (driver=file,aio=threads).
- qsd-*: QSD (v8.0.0) run alongside qemu, using the vhost-user-blk interface. The variants (qsd-blkio, qsd-io_uring, qsd-native, qsd-threads) are the same as for qemu-*.
- rsd-*: RSD (commit c42481f12933f4b5f6567f4ec192e8430eec153a) run alongside qemu, using the vhost-user-blk interface.
  - rsd-blkio: Using the blkio library (driver=file,aio=blkio, which is the default), which, like in qemu, internally uses the io_uring kernel interface.
  - rsd-blkio_batching: Same as rsd-blkio, but with the vhost-user-blk export configured to reduce its dependence on the kick and call file descriptors: first, after processing each request, it polls the virt queues for a while before yielding and waiting for a new kick through the kickfd (poll-duration); second, after a request has been completed, it does not report the completion to the front-end immediately, but waits briefly for more completions and batches them to reduce the number of callfd activations (call-batching). Theoretically, this trades latency and CPU usage for throughput on tmpfs (in practice, FIO seems to find the average latency still better). Without polling and call batching, the guest seems to submit requests one by one, and tmpfs finishes them basically instantly, resulting in an effective I/O queue depth of just 1 and thus maximum overhead from the vhost-user interface.
  - rsd-threads: Using no AIO interface, but a thread pool submitting synchronous requests instead (driver=file,aio=threads).
  - rsd-threads_batching: Same as rsd-threads, but using polling and call batching in vhost-user-blk.
Not all of these configurations are shown in the graphs below, e.g. qemu/qsd-native requires O_DIRECT, so cannot be used on tmpfs; and using a thread pool generally does not give better performance than using an AIO interface, so the -threads variants are used only on tmpfs (where aio=native is not possible). Call batching is shown only where it clearly improves performance.
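To make these configurations concrete, here is a sketch of what the invocations could look like. These are my reconstructions from the driver/aio options listed above, not the exact command lines used; paths, IDs, and sizes are placeholders, and the syntax reflects qemu 8.0:

```shell
# qemu-io_uring: single-process qemu, virtio-blk in a separate I/O thread
qemu-system-x86_64 \
    -object iothread,id=iot0 \
    -blockdev file,node-name=img,filename=test.img,aio=io_uring,cache.direct=on \
    -device virtio-blk-pci,drive=img,iothread=iot0,num-queues=24 \
    ...

# qsd-io_uring: QSD exporting the image via vhost-user-blk
qemu-storage-daemon \
    --blockdev file,node-name=img,filename=test.img,aio=io_uring,cache.direct=on \
    --export vhost-user-blk,id=exp,node-name=img,addr.type=unix,addr.path=/tmp/vhost.sock,num-queues=24

# qemu front-end connecting to the vhost-user-blk export (RSD is attached
# the same way); vhost-user requires the guest RAM to be shared memory
qemu-system-x86_64 \
    -object memory-backend-memfd,id=mem,size=8G,share=on \
    -numa node,memdev=mem \
    -chardev socket,id=vu,path=/tmp/vhost.sock \
    -device vhost-user-blk-pci,chardev=vu,num-queues=24 \
    ...
```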
SSD 4k
Figure 1: fio SSD random R/W (4k) IOPS – mqX-Yjobs means the virtio-blk device has X queues, and fio runs with Y jobs
Figure 2: fio SSD random R/W (4k) CPU usage
Figure 1 shows random reads and writes, i.e. FIO’s randrw mode. Read and write IOPS are summed to get the result, and the shaded area indicates the standard deviation (not statistically accurate: the read and write standard deviations are simply added, whereas independent samples would combine as √(σ_read² + σ_write²)).
RSD outperforms QSD and qemu in every configuration, though only slightly in the 1-queue
1-job case. With a second queue, however, RSD already achieves 38 % higher IOPS, and
with more jobs run simultaneously in the guest, the gap grows beyond +100 %.
Conversely, qemu and QSD have basically the same performance regardless of the number of
queues or jobs, but the *-blkio configurations get worse the more jobs the guest has.
As figure 2 shows, RSD’s CPU usage meanwhile is not greater than you’d expect. Where QSD always sits at around 80 % (i.e. 0.8 cores), RSD (in its default blkio configuration) uses slightly less CPU for one guest job (70 %), and then grows roughly in line with the number of jobs (140 % for 2 jobs, 1000 % for 12 jobs).
Figure 3: fio SSD random read (4k) IOPS
Figure 4: fio SSD random write (4k) IOPS
When we look at pure reads and writes, things look a bit different. For reads (figure 3), there is a common performance hit in a single-queue configuration across qemu, QSD, and RSD; RSD lands between qemu and QSD here. With more queues, it still considerably outperforms them, though. Interestingly, running simultaneous jobs in the guest does not improve RSD’s performance in this case, probably because it hits the hardware’s randread IOPS limit (stated as 620k in the data sheet, but running fio on bare hardware yields just 424k).
For random writes (figure 4), RSD always outperforms qemu, even in the single-queue case (+34 %). The rest of the graph looks very similar to the randrw graph.
SSD 64k
Figure 5: fio SSD random R/W (64k) IOPS (maximum, i.e. only the maximum each application achieves across its configurations is shown)
Figure 6: fio SSD random read (64k) IOPS (maximum)
Figure 7: fio SSD random write (64k) IOPS (maximum)
For a block size of 64k, results are clearly limited by bandwidth (figure 5, figure 6, figure 7), so all applications perform about the same. There is little difference between the different configurations (i.e. AIO back-ends), so the figures show only the overall maximum for legibility.
tmpfs 4k
tmpfs behaves rather differently than normal devices, but nonetheless, it is one way to
get very high IOPS.
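Setting this up is straightforward; as a sketch (mount point and sizes are my assumptions):

```shell
# Hypothetical tmpfs setup; mount point and sizes are placeholders.
mkdir -p /mnt/tmpfs
mount -t tmpfs -o size=20G tmpfs /mnt/tmpfs
# Backing image for the guest. Note that tmpfs does not support
# O_DIRECT, which is why aio=native is unavailable in these runs.
truncate -s 16G /mnt/tmpfs/test.img
```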
Figure 8 shows the result for random reads and writes: RSD (using blkio) outperforms QSD in every configuration, and is about on par with single-process qemu until fio runs more than a single job. The more jobs fio runs, the higher the IOPS RSD achieves, pulling ahead by a large margin.
Note that for the single-job cases, RSD can only match qemu in the *_batching cases
(i.e. having its vhost-user-blk back-end use polling and call batching). Otherwise, its
performance is “only” on par with QSD. The difference can be considerable, reaching
+78 % in the
mq24-1jobs case (between rsd-blkio and rsd-blkio_batching). These
optimizations could probably also improve QSD’s performance, but it is unclear whether
qemu could benefit from them as well, given that they are intended to improve on
vhost-user’s performance (which is not used in the qemu case) by reducing the number of
call/kick FD uses.
Figure 9: fio tmpfs random R/W (4k) CPU usage
Figure 10: fio tmpfs random R/W (4k) latency (only for the lowest-latency runs)
Figure 11: fio tmpfs random R/W (4k) minimum latency (only for the lowest-latency runs)
We expect polling to increase CPU usage, and call batching to increase latency. Figure 9 shows that CPU usage is indeed increased. There is no increase in average latency (figure 10), but there is an increase in minimum latency (figure 11).
Running pure reads and writes does not really yield fundamentally different results from
the randrw case. It is nice to see 3.5 million IOPS in the read case, though.
tmpfs 64k
Figure 14: fio tmpfs random R/W (64k) IOPS (maximum)
Figure 15: fio tmpfs random write (64k) IOPS (maximum)
As on the SSD, a block size of 64k seems to be limited by bandwidth, at least when writes are involved. As in the SSD tests, the graphs therefore show only the maximum number of IOPS per application, because the configurations are so close to one another.
Figure 16: fio tmpfs random read (64k) IOPS (maximum)
Figure 17: fio tmpfs random read (64k) CPU usage
Interestingly, the picture is much different for pure reads (figure 16). This is the only case where each application’s respective thread pool back-end outperforms the native AIO back-ends (blkio/io_uring). Still, when looking at only one kind of back-end at a time, RSD generally outperforms QSD and qemu, except for mq1-1jobs and mq2-1jobs with a thread pool back-end, where call batching is needed.
Here we see the only case where RSD’s performance decreases the more jobs are run in the guest. I haven’t investigated why that is, but the CPU usage graph (figure 17) may indicate that usage simply grows too high past a sweet spot, so that CPU (and, for tmpfs, perhaps RAM bandwidth) becomes the constraint. Notably, this dip is also visible in the randrw and randwrite cases (figure 14, figure 15), though it is much less pronounced there.
Performance Discussion
Now that we’ve seen the benchmarking results, we can discuss some interesting findings:
First, multi-threaded multi-queue support works as intended. RSD benefits directly from
the guest using more virt queues. On hardware that can cope well with requests from
multiple threads (like an SSD), this can greatly boost performance.
Second, it is actually not clear why RSD outperforms QSD even in the
mq1-1jobs
cases, or sometimes even qemu. One reason might be that RSD supports request batching,
i.e. that requests can be created separately from awaiting them, so that when they are
awaited, all of them can be collectively submitted. At least in the SSD 4k randwrite
case, this is definitely the relevant reason, because request batching did not work for
write requests for a while, and before it was fixed, RSD’s single-job performance was
slightly lower than qemu’s.
Third, of all of qemu’s and QSD’s I/O back-ends, the built-in io_uring back-end has
generally performed best. This is interesting given that the libblkio back-end also
uses io_uring, but almost always performs worse, especially with more than one
I/O-submitting thread in the guest, where for some reason its performance always drops.
This may be because qemu’s blkio driver only creates a single submission queue,
potentially leading to contention when I/O comes in through different virt queues. RSD
in contrast creates one queue per I/O thread, i.e. one queue per virt queue. Still,
maybe using libblkio does impose an inherent overhead, and a native io_uring interface
could improve RSD’s performance further.
Finally, it is unclear why the thread pool back-ends universally performed so well in
the tmpfs 64k randread case, not least because RSD’s thread pool back-end shares no code
or concept (besides using a thread pool) with qemu’s. Notably, this peculiarity is not
present in the respective 4k case.