RSD Part 1: Overview

RSD wants to show what a Rust block layer for qemu could look like, acting as a potential replacement for the qemu-storage-daemon. Its name is a rather free-form abbreviation built from QSD (qemu-storage-daemon) and RS (for Rust).
This blog post accompanies the first tentative release, 0.1, presenting RSD’s current state and capabilities, and what may come in the future. It is split into two parts: Part 1 here provides an overview of what RSD is, the insights gained from it so far, and what we can expect from it. Part 2 compares its performance to that of qemu’s C block layer (single-process qemu and QSD).

Motivation

Rewriting existing code in Rust yields tangible benefits besides abstract concepts like memory safety: Because of Rust’s strict concept of ownership, multi-threading is cheap to implement, and many of the potential hard-to-debug pitfalls due to concurrent access are simply non-existent. Also, it has built-in primitives for writing asynchronous code, which eliminates some of the problems we face when using custom coroutine code in qemu’s C code.
From discussions at KVM Forums, it has long been clear that there is interest in bringing Rust code into qemu. The benefits are obvious, and so is the main drawback: a large existing code base written in C, which “works just fine”. Year after year, the result was the same: If we want to see Rust in qemu, someone needs to start.
Meanwhile, projects like virtiofsd (the virtio-fs vhost-user back-end) have already shown a perfectly feasible way to integrate Rust with qemu, namely by putting the Rust code into a separate process and connecting it to qemu via an interface like vhost-user.
In the block layer, we not only have such an interface (vhost-user-blk), but with the qemu-storage-daemon (QSD), we also have an existing back-end for it. Thus came the idea to write another such back-end, with an interface as compatible as possible and sensible. This allows us to directly compare and see how the theoretical benefits of a switch to Rust hold up in practice, and what previously unknown obstacles may emerge.

Overview

RSD has effectively been an experiment so far, and its feature set is in a proof-of-concept state. This is what it offers in 0.1:
  • Arbitrary multi-threading, i.e. any export can do I/O to any node through any thread(s), which allows the vhost-user-blk export to assign any queue to an arbitrary thread, by default putting each queue into its own thread
  • QMP over network sockets (unix/TCP) or stdio
  • Using libblkio for file access
  • Limited qcow2 support (notably no external data files, no sub-clusters, no zstd compression, no persistent dirty bitmaps, and basically no performance optimizations)
  • Limited vhost-user-blk exporting capabilities (notably no vectored writes, no zero writes, no discards)
  • Copying through a copy block job (unifies backup and mirror, no commit/stream yet; missing features like sync=top or throttling)
  • Limited NBD exporting, mostly for testing
  • Block graph reconfiguration via blockdev-reopen
Some (of the many) notable things that are missing:
  • Resizing volumes
  • Creating images
  • Zero writes, discards
Note that the interface is also sometimes slightly different from qemu’s. For example, RSD does not (yet) provide a key-value syntax for the command line, so all values must be specified in JSON (e.g. --blockdev driver=null,node-name=node0 does not work, it has to be --blockdev '{"driver": "null", "node-name": "node0"}'). Also, node-names are not auto-generated, so the user has to specify one for every node. Finally, nested nodes are not auto-deleted when their root node disappears; blockdev-del has to be issued for every single node.
These differences are not necessarily intentional, but are rather things that were just implemented this way for simplicity. They should probably be changed in the future for compatibility, but so far doing so has not been sufficiently high on the priority list.

Results

The main purpose of RSD as an experiment has been to evaluate the use of Rust for an implementation of the qemu block layer, to identify benefits, drawbacks, and potential peculiarities.
To summarize some of the results (which are laid out in more detail in the sections below):
  • It is certainly no surprise that it is fundamentally possible to rewrite the qemu block layer in a different language like Rust. On the flip side, rewriting such a large existing code base is not only hard work, but also prone to introducing new bugs. Mitigating this by partially reusing existing code does not seem feasible.
  • Writing multi-threaded code comes naturally in Rust. Compared to C, it seems nearly effortless, and the result is much easier to maintain. The language also helps guide your design, e.g. by having types be designated as shareable or not (via the Send/Sync traits).
  • Rust has built-in support for writing asynchronous code, which is good. Its peculiarities, however, sometimes make it difficult to write elegant code. It uses a different model to retain async state than we do in qemu (stored in an often heap-allocated object rather than implicitly on the stack), but this does not seem to be detrimental to performance.
  • Being an experiment, not much time could be spent on optimizing performance. Still, comparing its performance to that of QSD/qemu shows very good results.

Reusing/Sharing C Code

One of the first questions was whether code sharing between RSD and the qemu block layer would be possible and feasible. The short answer is “No”.
The long answer is: Of course you can share code between Rust and C, that is what Rust’s FFI is for, but it does not seem to be worth it here. First, what existing C code might we want to reuse? These things come to mind:
  • The block layer core: If we can gain any benefits from rewriting parts of the block layer in Rust, it is probably here. The block layer core is complex and difficult to grasp in its entirety, which makes it error-prone. Bugs reported here are often difficult to debug, and with true multi-threading, it is not going to get simpler (data plane has already led to bugs that were difficult to reproduce and fix). Rust might help with these problems, so we probably want to rewrite this core rather than reuse existing code.
  • Some or all block drivers (including block jobs): In contrast, block driver code is rather stable and bugs seem to be rarer. We could try to reuse existing code here, and we did try, but it turns out that providing a usable FFI interface into which the C drivers could be plugged, and adjusting the drivers for it, seems to be as much work as just rewriting them. This is especially true given that qcow2 is the only driver for which reuse is really interesting; the others have such low complexity that a rewrite should not be problematic.
Therefore, we have decided not to reuse C code in RSD. As a result, RSD’s feature set is only a small subset of qemu’s at this point in time.

Asynchronous Programming: State Objects vs. Coroutines

Performance

In qemu’s block layer, we use coroutines to write asynchronous code. Here, each coroutine has its own stack, so when it wants to yield, it can basically do so by simply switching to the caller’s stack. The coroutine’s state is stored on the stack, so this costs nearly nothing. One problem that qemu has faced is that there is no native coroutine support in the C language, so this has been implemented by hand, relying on effectively undefined behavior, which has already caused conflicts with compiler optimizations (for TLS variables).
Rust has built-in language features to facilitate writing asynchronous code (async and await), which instead have asynchronous functions return custom objects that describe their state. These objects must implement the Future trait, i.e. a non-blocking poll() function that tries to make progress until the request completes and a result is returned. On the first call to an async function, such an object must be constructed, and poll() will need to modify and finally destroy it. This can cause overhead compared to the coroutine model, which simply uses the stack that innately holds the state.
This becomes especially apparent when recursively nesting asynchronous functions: For qemu’s coroutines, the stack simply grows, and unwinding a function when it completes is a simple arithmetic operation on the stack pointer. In contrast, Rust’s Future objects must encapsulate nested asynchronous functions’ states. Because those objects must also have a size known at compile time, recursion is only possible by placing them on the heap, so that nesting can be done via pointers. Thus, setting up these objects and unwinding them requires a heap allocation and deallocation.
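As a minimal illustration (not RSD code), consider a recursive async function. The nested() name is made up for this sketch:

    use std::future::Future;
    use std::pin::Pin;

    // Each recursion level's async state must live somewhere.  With qemu's
    // coroutines that is simply the stack; here, the recursive call's Future
    // has no size known at compile time, so it must be boxed: every nesting
    // level costs one heap allocation (and a deallocation on unwind).
    fn nested(depth: u32) -> Pin<Box<dyn Future<Output = u32>>> {
        Box::pin(async move {
            if depth == 0 {
                0
            } else {
                // The nested future lives on the heap and is polled through
                // the Box until it completes.
                nested(depth - 1).await + 1
            }
        })
    }

Each call to nested() allocates a new boxed Future on the heap, whereas the equivalent coroutine code would simply use another stack frame.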
To circumvent this, I have experimented with a stack-like custom allocator so that allocation and deallocation for nested functions could be done at no overhead. It turned out, however, that in practice heap allocations are so quick that this brought no benefit. Instead, the stack allocator actually reduced performance (issue 1), perhaps because a reference to the allocator object always had to be passed around.
Therefore, overall, I didn’t find that Rust’s asynchronous programming style had any negative impact on real-world performance when compared to qemu’s coroutines.

Convenience

One thing to note is that Rust’s model can be slightly inconvenient to use. The problem is that for qemu’s coroutines, code is actually run as written. But in Rust, any async function (and block) is actually transformed into an abstract object that implements this code in the Future::poll() function. The original function only constructs this object and returns it. Because of this, manually defining these objects and implementing Future::poll() is sometimes tempting, simply because it is more powerful and allows defining exactly what will happen at runtime.
Doing this makes the code harder to read, though. Therefore, you have to put in extra effort to stick with native (non-poll()) code so that it stays readable. That isn’t ideal.
Even worse, writing code in this more obtuse way can indeed bring significant performance benefits: Doing so has allowed for easy read and write request submission batching, which can bring +50 % IOPS in the write case. As described in commit e52d1061, the batching’s implementation heavily relies on doing actual work in the synchronous path that creates the Future object, which is not possible when writing native async code; there, all work is done in the Future::poll() function.
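To make the contrast concrete, here is a minimal sketch of the two styles. The Queue, read_native, and ReadRequest names are hypothetical and do not reflect libblkio-async’s actual API:

    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, Poll};

    // Hypothetical queue type, only for illustration.
    struct Queue;
    impl Queue {
        fn enqueue(&self, _req: u64) { /* place the request in a submission batch */ }
        fn is_complete(&self, _req: u64) -> bool { true }
    }

    // Native async style: nothing at all happens until the returned Future
    // is polled for the first time.
    async fn read_native(queue: &Queue, req: u64) {
        queue.enqueue(req);
        // ... wait for completion ...
    }

    // Hand-written style: the request is already enqueued in the synchronous
    // constructor, before the Future is ever polled.  That is what makes
    // submission batching easy: several requests can be constructed (and thus
    // enqueued) back to back and only then submitted and awaited.
    struct ReadRequest<'a> {
        queue: &'a Queue,
        req: u64,
    }

    impl<'a> ReadRequest<'a> {
        fn new(queue: &'a Queue, req: u64) -> Self {
            queue.enqueue(req); // eager work, outside of poll()
            ReadRequest { queue, req }
        }
    }

    impl Future for ReadRequest<'_> {
        type Output = ();

        fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
            if self.queue.is_complete(self.req) {
                Poll::Ready(())
            } else {
                // A real implementation would register the waker here.
                Poll::Pending
            }
        }
    }

In the hand-written variant, constructing several ReadRequest objects back to back already places all of them into the batch before anything is awaited; the native variant cannot do that, because its body only runs once poll() is called.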

Initial Polling

That last point is so important that we should repeat it: In Rust, async functions and blocks are transformed such that when they are called (but not yet awaited), they will construct and return a Future object, but nothing more. They will not yet actually do anything to make progress, because that is supposed to happen only in poll(). This is in contrast to qemu’s coroutines, which will generally do something before yielding for the first time.
Notably, when you run an async I/O function, this means that it will not even submit any I/O before you await it. It is therefore not an option to create an asynchronous I/O request, do some other stuff, and then await the request, hoping it already made progress in the background; if you want both to run concurrently, you must put the other stuff into a dedicated async block, and then use futures::join!(). If you want to run multiple requests simultaneously, you cannot just create them sequentially and then await them sequentially, but again, you must use futures::join!().
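For example (a sketch using a hypothetical read_block() function and the futures crate’s join! macro):

    // Hypothetical async read function, standing in for an actual I/O call.
    async fn read_block(_offset: u64) -> Vec<u8> {
        // Nothing here runs until the returned Future is polled.
        vec![0; 512]
    }

    async fn example() {
        // Not concurrent: creating the futures does not submit anything, and
        // awaiting them one after the other runs them strictly sequentially.
        let fut_a = read_block(0);
        let fut_b = read_block(512);
        let _a = fut_a.await; // request A is only submitted here
        let _b = fut_b.await; // request B starts only after A has completed

        // Concurrent: join! polls both futures alternately, so both requests
        // can be in flight at the same time.
        let (_a, _b) = futures::join!(read_block(0), read_block(512));
    }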
As described in the last section, this can be different for manually designed Future objects. libblkio-async, which RSD uses to get an async interface to libblkio, enqueues a request as soon as the corresponding object is created, and this is what allows for read and write request submission batching.

Runtime Framework

Rust’s model effectively requires using a runtime framework, for example so that poll() can register event FDs on which to wait. I have decided to use tokio, for no particular reason, and have not yet compared it to other frameworks.
Tokio offers interesting features like automatically spreading Future objects across a thread pool for load balancing, but this is of no relevance to RSD, because I/O requests are generally so fast that the overhead of sending them to other threads is prohibitively high. Essentially, all such objects in RSD are !Send, i.e. bound to the creating thread. Doing this also allows optimizations like having them use Rc or Cell instead of the more costly Arc or Mutex internally (MR 13 to libblkio-async).
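As a rough sketch of this pattern (generic tokio usage, not RSD’s actual setup), a single-threaded runtime plus a LocalSet allows spawning !Send tasks that can use Rc and Cell:

    use std::cell::Cell;
    use std::rc::Rc;

    fn main() {
        // One single-threaded runtime; each I/O thread could run one of these.
        let rt = tokio::runtime::Builder::new_current_thread()
            .build()
            .unwrap();

        // A LocalSet accepts !Send futures, i.e. tasks that are bound to this
        // thread and may therefore use Rc/Cell instead of Arc/Mutex.
        let local = tokio::task::LocalSet::new();
        local.block_on(&rt, async {
            let counter = Rc::new(Cell::new(0u64)); // !Send, no atomics needed
            let c = Rc::clone(&counter);
            tokio::task::spawn_local(async move {
                c.set(c.get() + 1);
            })
            .await
            .unwrap();
            println!("requests handled: {}", counter.get());
        });
    }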

Multi-Threading

In general, while I find multi-threading in C to be a dangerous and treacherous area, and also something that has to be thought about and implemented explicitly and with intent, in Rust it comes naturally. Thanks to its ownership rules, things can often easily be run in a multi-threaded environment once you’ve implemented them, and when there are exceptions, the compiler will error out at compile time (X doesn’t implement Send or Sync).
This has indeed been my experience with Rust. In fact, the code I originally wrote was automatically multi-threaded (tokio wanting to send Future objects to various threads in a pool), and I had to manually make it single-threaded to improve performance (because sending these objects around caused too much overhead).
It is hard to convey how much of a relief it is to know that your compiler keeps track of what can be shared between threads and what cannot, so you don’t have to find out at runtime by hitting a bug that’s extremely hard to reproduce. Conversely, this is also a great design tool: just by knowing whether a type needs to implement Send or Sync, you know what you can optimize for single-threaded use.
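As a trivial illustration of the compiler catching this (not RSD code):

    use std::rc::Rc;
    use std::sync::Arc;
    use std::thread;

    fn main() {
        // Rc is !Send, so moving it into another thread is rejected at
        // compile time instead of failing unpredictably at runtime:
        //
        //     let rc = Rc::new(42);
        //     thread::spawn(move || println!("{rc}"));
        //     // error[E0277]: `Rc<i32>` cannot be sent between threads safely
        //
        // Arc is Send + Sync, so the equivalent code compiles:
        let arc = Arc::new(42);
        thread::spawn(move || println!("{arc}")).join().unwrap();
    }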

Potential Use Cases

Naturally, RSD could be used (with feature additions as required for the specific use case) as a drop-in replacement wherever QSD can be/is used, for example in a Kubernetes CSI plugin that would use QSD to provide qemu block layer functionality. If such cases require new functionality not currently present in QSD or RSD, we are free to decide whether to implement it in QSD or RSD first, perhaps giving preference to RSD.
A more extensive (and much more hypothetical) case would be to integrate RSD directly into qemu, as an optional replacement of the existing block layer. This would involve looking into a tight integration between qemu and RSD on a linking level, i.e. much closer than vhost-user-blk.