This "quiescent state based reclamation" module provides support for
the qp-trie module in dns/qp. It is a replacement for liburcu, written
without reference to the urcu source code, and in fact it works in a
significantly different way.
A few specifics of BIND make this variant of QSBR somewhat simpler:
* We can require that wait-free access to a qp-trie only happens in
an isc_loop callback. The loop provides a natural quiescent state,
after the callbacks are done, when no qp-trie access occurs.
* We can dispense with any API like synchronize_rcu(). In practice,
it takes far too long to wait for a grace period to elapse for each
write to a data structure.
* We use the idea of "phases" (aka epochs or eras) from EBR to
reduce the amount of bookkeeping needed to track memory that is no
longer needed, knowing that the qp-trie does most of that work
already.
I considered hazard pointers for safe memory reclamation. They have
more read-side overhead (updating the hazard pointers) and it wasn't
clear to me how to nicely schedule the cleanup work. Another
alternative, epoch-based reclamation, is designed for fine-grained
lock-free updates, so it needs some rethinking to work well with the
heavily read-biased design of the qp-trie. QSBR has the fastest read
side of the basic SMR algorithms (with no barriers), and fits well
into a libuv loop. More recent hybrid SMR algorithms do not appear to
have enough benefits to justify the extra complexity.
QSBR: quiescent state based reclamation
QSBR is a safe memory reclamation (SMR) algorithm for lock-free data
structures such as a qp-trie. (See doc/dev/qp.md.)
When an object is unlinked from a lock-free data structure, it
cannot be free()ed immediately, because there can still be readers
accessing the object via an old version of the data structure. SMR
algorithms determine when it is safe to reclaim memory after it has
been unlinked.
Introductions and overviews
There is a terse overview in include/isc/qsbr.h.
Jeff Preshing has a nice introduction to QSBR, https://preshing.com/20160726/using-quiescent-states-to-reclaim-memory/
At the end of this note is a copy of a blog post about writing BIND's
isc_qsbr, https://dotat.at/@/2023-01-10-qsbr.html
Paul McKenney's web page has links to his book on concurrent programming, the Userspace RCU library, and more. McKenney invented RCU and QSBR. RCU is the Linux kernel's machinery for lock-free data structures and safe memory reclamation, based on QSBR.
Example code
If you are implementing a lock-free data structure that needs safe
memory reclamation, here's a guide to using isc_qsbr, based on how
QSBR is used by dns_qp.
registration
When the program starts up you need to register a global callback function that will reclaim unused memory. You can do so using an ISC_CONSTRUCTOR function that runs automatically at startup.
static void
qp_qsbr_register(void) ISC_CONSTRUCTOR;

static void
qp_qsbr_register(void) {
        isc_qsbr_register(qp_qsbr_reclaimer);
}
work list
Your module will need somewhere that your callback can find the work
it needs to do. The qp-trie has an atomic list of dns_qpmulti_t
objects for this purpose.
/* a global variable */
static ISC_ASTACK(dns_qpmulti_t) qsbr_work;
The reason for using global variables is so that we don't need to allocate a thunk every time we have memory reclamation work to do.
read-only access
You should design your data structure so that it has a single atomic
root pointer referring to its current version. A lock-free reader
must run in an isc_loop callback. It gains access to the data
structure by taking a copy of this pointer:
qp_node_t *reader = atomic_load_acquire(&multi->reader);
During an isc_loop callback, a reader should keep using the same
pointer to get a consistent view of the data structure. If it reloads
the pointer it may get a different version, changed by concurrent
writers.
A reader must stop using the root pointer and any interior pointers
obtained via the root pointer before it returns to the isc_loop.
modifications and writes
All changes to the data structure must be copy-on-write (aka read-copy-update) so that concurrent readers are not disturbed.
When a new version of the data structure has been prepared, it is committed by overwriting the atomic root pointer,
atomic_store_release(&multi->reader, reader); /* COMMIT */
scheduling cleanup
After committing a change, your data structure may have memory that will become free, after concurrent readers have stopped accessing it. To reclaim the memory when it is safe, use code like:
isc_qsbr_phase_t phase = isc_qsbr_phase(multi->loopmgr);
if (defer_chunk_reclamation(qp, phase)) {
        ISC_ASTACK_ADD(qsbr_work, multi, cleanup);
        isc_qsbr_activate(multi->loopmgr, phase);
}
- First, get the current QSBR phase.
- Second, mark free memory with the phase number. The qp-trie scans its
  chunks and marks those that will become free, and returns true if
  there is cleanup work to do.
- If so, the qp-trie is added to the work list.
  (ISC_ASTACK_ADD() is idempotent.)
- Finally, QSBR is informed that there is work to do.
In other cases it might not make sense to scan the data structure after committing, and instead you might make note of which memory to clean up while making changes before you know what the phase will be. You can then have per-phase work lists, like:
static ISC_ASTACK(my_work_t) qsbr_work[ISC_QSBR_PHASES];
isc_qsbr_phase_t phase = isc_qsbr_phase(loopmgr);
ISC_ASTACK_ADD(qsbr_work[phase], cleanup_work, link);
isc_qsbr_activate(loopmgr, phase);
In general, there will be several (maybe many) write operations during a grace period. Your lock-free data structure should collect its reclamation work from all these writes into a batch per phase, i.e. per grace period.
reclaiming
Inside the reclaimer callback, we iterate over the work list and clean up each item on it. If there is more cleanup work to do in another phase, we put the qp-trie back on the work list for another go.
static void
qp_qsbr_reclaimer(void *arg, isc_qsbr_phase_t phase) {
        UNUSED(arg);
        ISC_STACK(dns_qpmulti_t) drain = ISC_ASTACK_TO_STACK(qsbr_work);
        while (!ISC_STACK_EMPTY(drain)) {
                dns_qpmulti_t *multi = ISC_STACK_POP(drain, cleanup);
                INSIST(QPMULTI_VALID(multi));
                LOCK(&multi->mutex);
                if (reclaim_chunks(&multi->writer, phase)) {
                        /* more to do next time */
                        ISC_ASTACK_ADD(qsbr_work, multi, cleanup);
                }
                UNLOCK(&multi->mutex);
        }
}
reclaim marks
In the qp-trie data structure, each chunk has some metadata which includes a bitfield for the reclaim phase:
isc_qsbr_phase_t phase : ISC_QSBR_PHASE_BITS;
We use a bitfield so that all the metadata fits in a single word.
Safe memory reclamation for BIND
At the end of October 2022, I finally got my multithreaded qp-trie working! It could be built with two different concurrency control mechanisms:
- A reader/writer lock

  This has poor read-side scalability, because every thread is hammering
  on the same shared location. But its write performance is reasonably
  good: concurrent readers don't slow it down too much.

- liburcu, userland read-copy-update

  RCU has a fast and scalable read side, nice! But on the write side I
  used synchronize_rcu(), which is blocking and rather slow, so my write
  performance was terrible.
OK, but I want the best of both worlds! To fix it, I needed to change
the qp-trie code to use safe memory reclamation more effectively:
instead of blocking inside synchronize_rcu() before cleaning up, use
call_rcu() to clean up asynchronously. I expect I'll write about the
qp-trie changes another time.
Another issue is that I want the best of both worlds by default,
but liburcu is LGPL and we don't want BIND to depend on
code whose licence demands more from our users than the MPL.
So I set out to write my own safe memory reclamation support code.
lock freedom
In a multithreaded qp-trie, there can be many concurrent readers, but there can be only one writer at a time and modifications are strictly serialized. When I have got it working properly, readers are completely wait-free, unaffected by other readers, and almost unaffected by writers. Writers need to get a mutex to ensure there is only one at a time, but once the mutex is acquired, a writer is not obstructed by readers.
The way this works is that readers use an atomic load to get a pointer to the root of the current version of the trie. Readers can make multiple queries using this root pointer and the results will be consistent wrt that particular version, regardless of what changes writers might be making concurrently. Writers do not affect readers because all changes are made by copy-on-write. When a writer is ready to commit a new version of the trie, it uses an atomic store to flip the root pointer.
safe memory reclamation
We can't copy-on-write indefinitely: we need to reclaim the memory
used by old versions of the trie. And we must do so "safely", i.e.
without free()ing memory that readers are still using.
So, before free()ing memory, a writer must wait for a "grace
period", which is a jargon term meaning "until readers are not using
the old version". There are a bunch of algorithms for determining when
a grace period is over, with varying amounts of over-approximation,
CPU overhead, and memory backlog.
The RCU function synchronize_rcu() is slow because it blocks
waiting for a grace period; the call_rcu() function runs a callback
asynchronously after a grace period has passed. I wanted to avoid
blocking my writers, so I needed to implement something like
call_rcu().
aversions
When I started trying to work out how to do safe memory reclamation, it all seemed quite intimidating. But as I learned more, I found that my circumstances make it easier than it appeared at first.
The liburcu homepage has a long list of supported CPU
architectures and operating systems. Do I have to care about those
details too? No! The RCU code dates back to before the age of
standardized concurrent memory models, so the RCU developers had to
invent their own atomic primitives and correctness rules. Twenty-ish
years later the state of the art has advanced, so I can use
<stdatomic.h> without having to re-do it like liburcu.
You can also choose between several algorithms implemented by
liburcu, involving questions about kernel support, specially
reserved signals, and intrusiveness in application code. But while I
was working out how to schedule asynchronous memory reclamation work,
I realised that BIND is already well-suited to the fastest flavour of
RCU, called "QSBR".
QSBR
QSBR stands for "quiescent state based reclamation". A "quiescent state" is a fancy name for a point when a thread is not accessing a lock-free data structure, and does not retain any root pointers or interior pointers.
When a thread has passed through a quiescent state, it no longer has access to older versions of the data structures. When all threads have passed through quiescent states, then nothing in the program has access to old versions. This is how QSBR detects grace periods: after a writer commits a new version, it waits for all threads to pass through quiescent states, and therefore a grace period has definitely elapsed, and so it is then safe to reclaim the old version's memory.
QSBR is fast because readers do not need to explicitly mark the critical section surrounding the atomic load that I mentioned earlier. Threads just need to pass through a quiescent state frequently enough that there isn't a huge build-up of unreclaimed memory.
Inside an operating system kernel (RCU's native environment), a
context switch provides a natural quiescent state. In a userland
application, you need to find a good place to call
rcu_quiescent_state(). You could call it every time you have
finished using a root pointer, but marking a quiescent state is not
completely free, so there are probably more efficient ways.
libuv
BIND is multithreaded, and (basically) each thread runs an event loop.
Recent versions of BIND use libuv for the event loops.
A lot of things started falling into place when I realised that the
libuv event loop gives BIND a natural quiescent state:
when the event callbacks have finished running, and libuv is about
to call select() or poll() or whatever, we can mark a quiescent
state. We can require that event-handling functions do not stash root
pointers in the heap, but only use them via local variables, so we
know that old versions are inaccessible after the callback returns.
My design marks a quiescent state once per loop, so on a busy server where each loop has lots to do, the cost of marking a quiescent state is amortized across several I/O events.
fuzzy barrier
So, how do we mark a quiescent state? Using a "fuzzy barrier".
When a thread reaches a normal barrier, it blocks until all the other threads have reached the barrier, after which exactly one of the threads can enter a protected section of code, and the others are unblocked and can proceed as normal.
When a thread encounters a fuzzy barrier, it never blocks. It either proceeds immediately as normal, or if it is the last thread to reach the barrier, it enters the protected code.
RCU does not actually use a fuzzy barrier as I have described it. Like
a fuzzy barrier, each thread keeps track of whether it has passed
through a quiescent state in the current grace period, without
blocking; but unlike a fuzzy barrier, no thread is diverted to the
protected code. Instead, code that wants to enter a protected section
uses the blocking synchronize_rcu() function.
EBR-ish
As in the paper "performance of memory reclamation for lockless synchronization", my implementation of QSBR uses a fuzzy barrier designed for another safe memory reclamation algorithm, EBR, epoch based reclamation. (EBR was invented here in Cambridge by Keir Fraser.)
Actually, my fuzzy barrier is slightly different to EBR's. In EBR, the fuzzy barrier is used every time the program enters a critical section. (In qp-trie terms, that would be every time a reader fetches a root pointer.) So it is vital that EBR's barrier avoids mutating shared state, because that would wreck multithreaded performance.
Because BIND will only pass through the fuzzy barrier when it is about to use a blocking system call, my version mutates shared state more frequently (typically, once per CPU per grace period, instead of once per grace period). If this turns out to be a problem, it won't be too hard to make it work more like EBR.
More trivially, I'm using the term "phase" instead of "epoch", because it's nothing to do with the unix epoch, because there are three phases, and because I can talk about phase transitions and threads being out of phase with each other.
coda
While reading various RCU-related papers, I was amused by "user-level implementations of read-copy update", which says:
BIND, a major domain-name server used for Internet domain-name resolution, is facing scalability issues. Since domain names are read often but rarely updated, using user-level RCU might be beneficial.
Yes, I think it might :-)