RSD Part 1: Overview
RSD wants to show what a Rust block layer for qemu could look like, acting as a
potential replacement for the qemu-storage-daemon. Its name is a rather free-form
abbreviation built from QSD (qemu-storage-daemon) and RS (for Rust).
This blog post accompanies the first tentative release 0.1 to present RSD’s current
state and capabilities, and what may come in the future. It is split into two parts: Part
1 here provides an overview of what RSD is, insights to be gained from it so far, and
what we can expect from it.
Part 2 compares its
performance to that of qemu’s C block layer (single-process qemu and QSD).
Motivation
Rewriting existing code in Rust yields tangible benefits besides abstract concepts like
memory safety: Because of Rust’s strict concept of ownership, multi-threading is cheap
to implement, and many of the potential hard-to-debug pitfalls due to concurrent access
are simply non-existent. Also, it has built-in primitives for writing asynchronous
code, which eliminates some of the problems we face when using custom coroutine code in
qemu’s C code.
From discussions at KVM Forums, it has long been clear that there is interest in
bringing Rust code into qemu. The benefits are obvious; and so is the main drawback,
namely an existing large code base written in C, which “works just fine”. Year after
year, the result was the same: If we want to see Rust in qemu, someone needs to start.
Meanwhile, projects like
virtiofsd
(the virtio-fs vhost-user back-end) have already shown a perfectly feasible way to
integrate Rust with qemu, namely by putting the Rust code into a separate process and
connecting it to qemu via an interface like vhost-user.
In the block layer, we not only have such an interface (vhost-user-blk), but with the
qemu-storage-daemon (QSD), we also have an existing back-end for it. Thus came the idea
to write another such back-end, with as compatible an interface as possible and
sensible. This allows us to directly compare and see how the theoretical benefits of a
switch to Rust can hold up in practice, and what previously unknown obstacles may
emerge.
Overview
RSD has effectively been an experiment so far, and its feature set is in a
proof-of-concept state. This is what it offers in 0.1:
- Arbitrary multi-threading, i.e. any export can do I/O to any node through any thread(s), which allows the vhost-user-blk export to assign any queue to an arbitrary thread, by default putting each queue into its own thread
- QMP over network sockets (unix/TCP) or stdio
- Using libblkio for file access
- Limited qcow2 support (notably no external data files, no sub-clusters, no zstd compression, no persistent dirty bitmaps, and basically no performance optimizations)
- Limited vhost-user-blk exporting capabilities (notably no vectored writes, no zero writes, no discards)
- Copying through a copy block job (unifies backup and mirror, no commit/stream yet; missing features like sync=top or throttling)
- Limited NBD exporting, mostly for testing
- Block graph reconfiguration via blockdev-reopen
Some (of the many) notable things that are missing:
- Resizing volumes
- Creating images
- Zero writes, discards
Note that the interface also sometimes differs slightly from qemu's. For example, RSD does not (yet) provide a key-value syntax for the command line, so all values must be specified in JSON (e.g. `--blockdev driver=null,node-name=node0` does not work; it has to be `--blockdev '{"driver": "null", "node-name": "node0"}'`). Also, node-names are not auto-generated, so the user has to specify one for every node. Finally, nested nodes are not auto-deleted when their root node disappears; blockdev-del has to be issued for every single node.
These differences are not necessarily intentional, but are rather things that were just
implemented this way for simplicity. They should probably be changed in the future for
compatibility, but so far doing so wasn’t sufficiently high on the priority list.
Results
The main purpose of RSD as an experiment has been to evaluate the use of Rust for an
implementation of the qemu block layer, to identify benefits, drawbacks, and potential
peculiarities.
To summarize some of the results (which are laid out in more detail in the sections
below):
-
It is certainly no surprise that it is fundamentally possible to rewrite the
qemu block layer in a different language like Rust. On the flip side, rewriting
such a large existing code base is not only hard work, but also prone to
introducing new bugs. Mitigating this by partially reusing existing code does
not seem feasible.
-
Writing multi-threaded code comes naturally in Rust. Compared to C, it seems nearly effortless and the result is much easier to maintain. The language also helps guide your design, e.g. by having types be designated as sharable or not (via the `Send`/`Sync` traits).
-
Rust has built-in support for writing asynchronous code, which is good. Its
peculiarities, however, sometimes make it difficult to write elegant code. It
uses a different model to retain async state than we do in qemu (stored in an
often heap-allocated object rather than implicitly on the stack), but this does
not seem to be detrimental to performance.
-
Being an experiment, not much time could be spent on optimizing performance.
Still, comparing its performance to that of QSD/qemu shows very good results.
Reusing/Sharing C Code
One of the first questions was whether code sharing between RSD and the qemu block layer
would be possible and feasible. The short answer is “No”.
The long answer is: Of course you can share code between Rust and C, that is what Rust’s
FFI is for, but it does not seem to be worth it here. First, what existing C code might
we want to reuse? These things come to mind:
-
The block layer core: If we can gain any benefits from rewriting parts of the
block layer in Rust, it is probably here. The block layer core is complex and
difficult to grasp in its entirety, which makes it error prone. Bugs that are
reported here are often difficult to debug, and with true multi-threading, it is
not going to be simpler (data plane has already led to bugs that were difficult
to reproduce and fix). Rust might help with these problems, so we probably want
to rewrite this core, and not reuse existing code.
-
Some or all block drivers (including block jobs): In contrast, block driver code
is rather stable and bugs seem more rare. We could try to reuse existing code
here, and we did try, but it turns out that providing a usable FFI interface
into which the C driver could be plugged, and adjusting the drivers for it,
seems to be as much work as just rewriting them. This is especially so with
qcow2 being the only driver for which this is really interesting, as the others
have such low complexity that a rewrite should not be problematic.
Therefore, we have decided not to reuse C code in RSD. As a result, RSD's feature set is only a small subset of qemu's at this point in time.
Asynchronous Programming: State Objects vs. Coroutines
Performance
In qemu’s block layer, we use coroutines to write asynchronous code. Here, each
coroutine has its own stack, so when it wants to yield, it can basically do so by simply
switching to the caller’s stack. The coroutine’s state is stored on the stack, so this
costs nearly nothing. One problem that qemu has faced is that there is no native
coroutine support in the C language, so this has been implemented by hand, relying on
effectively undefined behavior, which has already caused conflicts with compiler
optimizations (for TLS variables).
Rust has built-in language features to facilitate writing asynchronous code (`async` and `await`), which instead have asynchronous functions return custom objects that describe their state. These objects must implement the `Future` trait, i.e. a non-blocking `poll()` function that tries to make progress until the request completes and a result is returned. When an async function is first called, such an object must be constructed, and `poll()` will need to modify and finally destroy it. This can cause overhead compared to the coroutine model, which simply uses the stack, where the state innately lives.
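To make the model concrete, here is a minimal, self-contained sketch (not RSD code; the `AddTwo` future and the no-op waker are invented for illustration) of a hand-written `Future` and what polling it looks like:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Minimal no-op waker so we can poll a future without a runtime.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

/// Hand-written equivalent of `async { 40 + 2 }`: the object only stores
/// the state, and all actual work happens in `poll()`.
struct AddTwo {
    base: u32,
}

impl Future for AddTwo {
    type Output = u32;

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u32> {
        // A real I/O future would return `Poll::Pending` here until the
        // request completes; this toy future is ready immediately.
        Poll::Ready(self.base + 2)
    }
}

fn run() -> u32 {
    // Constructing the future does no work yet ...
    let mut fut = AddTwo { base: 40 };
    // ... progress is only made when it is polled (normally by a runtime).
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    match Pin::new(&mut fut).poll(&mut cx) {
        Poll::Ready(v) => v,
        Poll::Pending => unreachable!(),
    }
}

fn main() {
    println!("{}", run()); // prints 42
}
```

In real code, a runtime such as tokio takes the place of the no-op waker and the manual `poll()` call.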
This becomes especially apparent when recursively nesting asynchronous functions: For
qemu’s coroutines, the stack simply grows, and unwinding a function when it completes is
a simple arithmetic operation on the stack pointer. In contrast, Rust's `Future` objects must encapsulate nested asynchronous functions' states. Because those objects
must also have a size known at compile time, recursion is only possible by placing them
on the heap, so that nesting can be done via pointers. Thus, setting up these objects
and unwinding them requires a heap allocation and deallocation.
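A small sketch of what this looks like in practice (the `sum_to` function is invented for illustration): a recursive async function must box each nested future, paying one heap allocation per recursion level:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A recursive `async fn` does not compile as-is, because the returned state
// object would have to contain itself. The usual workaround is to box the
// nested future, so each recursion level costs one heap allocation:
fn sum_to(n: u64) -> Pin<Box<dyn Future<Output = u64>>> {
    Box::pin(async move {
        if n == 0 {
            0
        } else {
            n + sum_to(n - 1).await
        }
    })
}

// Minimal no-op waker so we can poll the future without a runtime.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn run() -> u64 {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = sum_to(4);
    // Nothing in this future ever returns `Poll::Pending`, so the first
    // poll drives the whole recursion to completion.
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(v) => v, // 4 + 3 + 2 + 1 + 0 = 10
        Poll::Pending => unreachable!(),
    }
}

fn main() {
    println!("{}", run()); // prints 10
}
```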
To circumvent this, I have experimented with a stack-like custom allocator so that allocation and deallocation for nested functions could be done with no overhead. It turned out, however, that in practice heap allocations are so quick that this brought no benefit. Instead, the stack allocator actually reduced performance (issue 1), perhaps because of having to always pass a reference to the allocator object.
Therefore, overall, I didn’t find that Rust’s asynchronous programming style had any
negative impact on real-world performance when compared to qemu’s coroutines.
Convenience
One thing to note is that Rust's model can be slightly inconvenient to use. The problem is that for qemu's coroutines, code is actually run as written. But in Rust, any async function (and block) is actually transformed into an abstract object that implements this code in the `Future::poll()` function. The original function only constructs this object and returns it. Because of this, manually defining these objects and implementing `Future::poll()` is sometimes tempting, simply because it is more powerful and allows defining exactly what will happen at runtime.
Doing this makes code harder to read, though. Therefore, you have to put in extra effort to write native (non-`poll()`) code so it is actually readable. That isn't ideal.
Even worse, writing code in this more obtuse way can indeed bring significant performance benefits: Doing so has allowed for easy read and write request submission batching, which can bring +50 % IOPS in the write case. As described in commit e52d1061, the batching's implementation heavily relies on doing actual work in the synchronous path that creates the `Future` object, which is not possible when writing native async code; there, all work is done in the `Future::poll()` function.
Initial Polling
That last point is so important that we should repeat it: In Rust, async functions and
blocks are transformed such that when they are run (but not awaited yet), they will
construct and return a
Future object, but nothing more. They will not yet
actually do anything to make progress, because that is supposed to only be done in
poll(). This is in contrast to qemu’s coroutines, which will generally do
something before yielding for the first time.
Notably, when you run an async I/O function, this means that it will not even submit any
I/O before you await it. It is therefore not an option to create an asynchronous I/O
request, do some other stuff, and then await the request, hoping it already made
progress in the background; if you want both to run concurrently, you must put the other
stuff into a dedicated async block, and then use
futures::join!(). If you want
to run multiple requests simultaneously, you cannot just create them sequentially and
then await them sequentially, but again, you must use
futures::join!().
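This laziness is easy to demonstrate with nothing but the standard library (the no-op waker and the `run()` harness are invented for illustration; they stand in for a real runtime):

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Minimal no-op waker so we can poll a future without a runtime.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

/// Returns whether the async block's body had run (a) right after
/// creation and (b) after the first poll.
fn run() -> (bool, bool) {
    let started = Cell::new(false);

    // "Running" the async block only constructs the state object; the
    // body -- standing in here for an I/O submission -- does not run yet.
    let fut = async {
        started.set(true);
    };
    let before_poll = started.get(); // still false

    // Only polling (which is what `.await` does under the hood) actually
    // executes the body.
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    let _ = fut.as_mut().poll(&mut cx);
    let after_poll = started.get(); // now true

    (before_poll, after_poll)
}

fn main() {
    println!("{:?}", run()); // prints (false, true)
}
```

With a runtime, `futures::join!()` polls several such futures in turn, which is what makes them run concurrently.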
As described in the last section, this can be different for manually designed `Future` objects. libblkio-async, which RSD uses to get an async interface to libblkio, enqueues every request as soon as it is created, and this allows for read and write request submission batching.
Runtime Framework
Rust’s model effectively requires using a runtime framework, for example so that
poll() can register event FDs on which to wait. I have decided to use tokio,
for no particular reason, and have not yet compared it to other frameworks.
Tokio offers interesting features like automatically distributing `Future` objects to a thread pool for load distribution, but this is of no relevance to RSD, because I/O requests are generally so fast that the overhead of sending them to other threads is prohibitively high. Essentially, all such objects in RSD are `!Send`, i.e. bound to the creating thread. Doing this also allows optimizations like having them use `Rc` or `Cell` instead of the more costly `Arc` or `Mutex` internally (MR 13 to libblkio-async).
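A sketch of why this works (illustrative code, not taken from RSD; the no-op waker again stands in for a runtime): a future that captures an `Rc` is automatically `!Send`, so the compiler guarantees it stays on its thread, which makes the cheap non-atomic types safe to use:

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::pin;
use std::rc::Rc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Minimal no-op waker so we can poll a future without a runtime.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn run() -> u64 {
    // Because the future never leaves its thread, cheap non-atomic types
    // suffice; no `Arc` or `Mutex` needed.
    let counter = Rc::new(Cell::new(0u64));

    let fut = {
        let counter = Rc::clone(&counter);
        async move {
            counter.set(counter.get() + 1);
        }
    };
    // Capturing an `Rc` makes this future `!Send`: the compiler refuses to
    // move it to another thread (with tokio, it could only be run via
    // `spawn_local()`, not `spawn()`).

    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    let _ = fut.as_mut().poll(&mut cx);

    counter.get()
}

fn main() {
    println!("{}", run()); // prints 1
}
```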
Multi-Threading
In general, while I find multi-threading in C to be a dangerous and treacherous area,
and also something that has to be thought about and implemented explicitly and with
intent, in Rust it comes natural. Thanks to its ownership rules, things can often
easily be run in a multi-threaded environment once you’ve implemented them, and when
there are exceptions, the compiler will error out at compile time (
X doesn’t
implement
Send or
Sync).
This has indeed been my experience with Rust. In fact, the code I originally wrote was automatically multi-threaded (tokio wanting to send `Future` objects to various threads in a pool), and I had to manually make it single-threaded to improve performance (because sending these objects around caused too much overhead).
It is hard to convey how much of a relief it is to know that your compiler keeps track of what can be shared between threads and what cannot be, so you don't have to find out at runtime by hitting a bug that's extremely hard to reproduce. Conversely, this is also a great design tool: you know what you can optimize for single-threaded use, just by virtue of knowing whether it needs to implement `Send` or `Sync`.
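A minimal illustration (invented for this post, not RSD code): sharing an `Arc` across threads compiles because `Arc` is `Send` and `Sync`, while the same code with `Rc` is rejected at compile time rather than failing at runtime:

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

fn run() -> u32 {
    // `Arc<u32>` is Send + Sync, so sharing it with another thread compiles:
    let shared = Arc::new(42u32);
    let handle = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || *shared)
    };

    // The equivalent with `Rc` is caught by the compiler, not at runtime:
    //
    //     let shared = Rc::new(42u32);
    //     thread::spawn(move || *shared);
    //     // error[E0277]: `Rc<u32>` cannot be sent between threads safely
    //
    let _single_threaded = Rc::new(42u32);

    handle.join().unwrap()
}

fn main() {
    println!("{}", run()); // prints 42
}
```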
Potential Use Cases
Naturally, RSD could be used (with feature additions as required for the specific use
case) as a drop-in replacement wherever QSD can be/is used, for example in a Kubernetes
CSI plugin that would use QSD to provide qemu block layer functionality. If such cases
require new functionality not currently present in QSD or RSD, we are free to decide
whether to implement it in QSD or RSD first, perhaps giving preference to RSD.
A more extensive (and much more hypothetical) case would be to integrate RSD directly
into qemu, as an optional replacement of the existing block layer. This would involve
looking into a tight integration between qemu and RSD on a linking level, i.e. much
closer than vhost-user-blk.