This is the core implementation of the SIEVE algorithm described in the
following paper:
Zhang, Yazhuo, Juncheng Yang, Yao Yue, Ymir Vigfusson, and K V
Rashmi. “SIEVE Is Simpler than LRU: An Efficient Turn-Key Eviction
Algorithm for Web Caches,” n.d.. available online from
https://junchengyang.com/publication/nsdi24-SIEVE.pdf
This merge request resolves some performance regressions introduced
with the change from isc_symtab_t to isc_hashmap_t.
The key improvements are:
1. Using a faster hash function than both isc_hashmap_t and
isc_symtab_t. The previous implementation used SipHash, but the
hashflood resistance properties of SipHash are unneeded for config
parsing.
2. Shrinking the initial size of the isc_hashmap_t used inside
isc_symtab_t. Symtab is mainly used for config parsing, and the
when used that way it will have between 1 and ~50 keys, but the
previous implementation initialized a map with 128 slots.
By initializing a smaller map, we speed up mallocs and optimize for
the typical case of few config keys.
3. Slight optimization of the string matching in the hashmap, so that
the tail is handled in a single load + comparison, instead of byte
by byte.
Of the three improvements, this is the least important.
Since algorithm fetching is handled purely in libisc, FIPS mode toggling
can be purely done in within the library instead of provider fetching in
the binary for OpenSSL >=3.0.
Disabling FIPS mode isn't a realistic requirement and isn't done
anywhere in the codebase. Make the FIPS mode toggle enable-only to
reflect the situation.
As the default_call_rcu_thread can't be forced to flush all the work
during the executable shutdown, create one call_rcu_thread explicitly
and assign it to the all created threads.
This allows this explicit call_rcu_thread to be unassociated from the
main thread and freed before the executable destructor exits.
Instead of relying on unreliable order of execution of the library
constructors and destructors, move them to individual binaries. The
advantage is that the execution time and order will remain constant and
will not depend on the dynamic load dependency solver.
This requires more work, but that was mitigated by a simple requirement,
any executable using libisc and libdns, must include <isc/lib.h> and
<dns/lib.h> respectively (in this particular order). In turn, these two
headers must not be included from within any library as they contain
inlined functions marked with constructor/destructor attributes.
Since BIND 9 headers are not longer public, there's no reason to keep
the ISC_LANG_BEGINDECL and ISC_LANG_ENDDECL macros to support including
them from C++ projects.
Unify libcrypto initialization and explicit digest fetching in a single
place and move relevant code to the isc__crypto namespace instead of
isc__tls.
It will remove the remaining implicit fetching and deduplicate explicit
fetching inside the codebase.
Add an extra thread that can be used to offload operations that would
affect latency, but are not long-running tasks; those are handled by
isc_work API.
Each isc_loop now has matching isc_helper thread that also built on top
of uv_loop. In fact, it matches most of the isc_loop functionality, but
only the `isc_helper_run()` asynchronous call is exposed.
Add an isc_queue implementation that hides the gory details of cds_wfcq
into more neat API. The same caveats as with cds_wfcq.
TODO: Add documentation to the API.
As it was pointed out, the alignas() can't be used on objects larger
than `max_align_t` otherwise the compiler might miscompile the code to
use auto-vectorization on unaligned memory.
As we were only using alignas() as a way to prevent false memory
sharing, we can use manual padding in the affected structures.
This commit adds a new transport that supports PROXYv2 over UDP. It is
built on top of PROXYv2 handling code (just like PROXY Stream). It
works by processing and stripping the PROXYv2 headers at the beginning
of a datagram (when accepting a datagram) or by placing a PROXYv2
header to the beginning of an outgoing datagram.
The transport is built in such a way that incoming datagrams are being
handled with minimal memory allocations and copying.
This commit adds a new stream-based transport with an interface
compatible with TCP. The transport is built on top of TCP transport
and the new PROXYv2 handling code. Despite being built on top of TCP,
it can be easily extended to work on top of any TCP-like stream-based
transport. The intention of having this transport is to add PROXYv2
support into all existing stream-based DNS transport (DNS over TCP,
DNS over TLS, DNS over HTTP) by making the work on top of this new
transport.
The idea behind the transport is simple after accepting the connection
or connecting to a remote server it enters PROXYv2 handling mode: that
is, it either attempts to read (when accepting the connection) or send
(when establishing a connection) a PROXYv2 header. After that it works
like a mere wrapper on top of the underlying stream-based
transport (TCP).
This commit adds a set of utilities for dealing with PROXYv2 headers,
both parsing and generating them. The code has no dependencies from
the networking code and is (for the most part) a "separate library".
The part responsible for handling incoming PROXYv2 headers is
structured as a state machine which accepts data as input and calls a
callback to notify the upper-level code about the data processing
status.
Such a design, among other things, makes it easy to write a thorough
unit test suite for that, as there are fewer dependencies as well as
will not stand in the way of any changes in the networking code.
The AES algorithm for DNS cookies was being kept for legacy reasons, and
it can be safely removed in the next major release. Remove both the AES
usage for DNS cookies and the AES implementation itself.
When inserting items into hashtables (hashmaps), we might have a
fragmented key (as an example we might want to hash DNS name + class +
type). We either need to construct continuous key in the memory and
then hash it en bloc, or incremental hashing is required.
This incremental version of SipHash 2-4 algorithm is the first building
block.
As SipHash 2-4 is often used in the hot paths, I've turned the
implementation into header-only version in the process.
This adds support for User Statically Defined Tracing (USDT). On
Linux, this uses the header from SystemTap and dtrace utility, but the
support is universal as long as dtrace is available.
Also add the required infrastructure to add probes to libisc, libdns and
libns libraries, where most of the probes will be.
The `ISC_OVERFLOW_XXX()` macros are usually wrappers around
`__builtin_xxx_overflow()`, with alternative implementations
for compilers that lack the builtins.
Replace the overflow checks in `isc/time.c` with the new macros.
The Userspace-RCU headers are now needed for more parts of the libisc
and libdns, thus we need to add it globally to prevent compilation
failures on systems with non-standard Userspace-RCU installation path.
Instead of writing complicated wrappers for every thread, move the
initialization back to isc_random unit and check whether the random seed
was initialized with a thread_local variable.
Ensure that isc_entropy_get() returns a non-zero seed.
This avoids problems with thread sanitizer tests getting stuck in an
infinite loop.
This commit allows BIND 9 to be compiled with different flavours of
Userspace RCU, and improves the integration between Userspace RCU and
our event loop:
- In the RCU QSBR, the thread is put offline when polling and online
when rcu_dereference, rcu_assign_pointer (or friends) are called.
- In other RCU modes, we check that we are not reading when reaching the
quiescent callback in the event loop.
- We register the thread before uv_work_run() callback is called and
after it has finished. The rcu_(un)register_thread() has a large
overhead, but that's fine in this case.
The spinlock is small (atomic_uint_fast32_t at most), lightweight
synchronization primitive and should only be used for short-lived and
most of the time a isc_mutex should be used.
Add a isc_spinlock unit which is either (most of the time) a think
wrapper around pthread_spin API or an efficient shim implementation of
the simple spinlock.
This is an adaptation of my `hg64` experiments for use in BIND.
As well as renaming everything according to ISC style, I have
written some more extensive tests that ensure the edge cases are
correct and the fenceposts are in the right places.
I have added utility functions for working with precision in terms of
decimal significant figures as well as this code's native binary.
The `isc_trampoline` module had a lot of machinery to support stable
thread IDs for use by hazard pointers. But the hazard pointer code
is gone, and the `isc_loop` module now has its own per-loop thread
IDs.
The trampoline machinery seems over-complicated for its remaining
tasks, so move the per-thread initialization into `isc/thread.c`,
and delete the rest.
The isc_fsaccess API was created to hide the implementation details
between POSIX and Windows APIs. As we are not supporting the Windows
APIs anymore, it's better to drop this API used in the DST part.
Moreover, the isc_fsaccess was setting the permissions in an insecure
manner - it operated on the filename, and not on the file descriptor
which can lead to all kind of attacks if unpriviledged user has read (or
even worse write) access to key directory.
Replace the code that operates on the private keys with code that uses
mkstemp(), fchmod() and atomic rename() at the end, so at no time the
private key files have insecure permissions.
Change the isc_job_run() to not-make any allocations. The caller must
make sure that it allocates isc_job_t - usually as part of the argument
passed to the callback.
For simple jobs, using isc_async_run() is advised as it allocates its
own separate isc_job_t.
BIND needs a collection of standard lock-free data structures,
which we can find in liburcu, along with its RCU safe memory
reclamation machinery. We will use liburcu's QSBR variant instead
of the home-grown isc_qsbr.
This "quiescent state based reclamation" module provides support for
the qp-trie module in dns/qp. It is a replacement for liburcu, written
without reference to the urcu source code, and in fact it works in a
significantly different way.
A few specifics of BIND make this variant of QSBR somewhat simpler:
* We can require that wait-free access to a qp-trie only happens in
an isc_loop callback. The loop provides a natural quiescent state,
after the callbacks are done, when no qp-trie access occurs.
* We can dispense with any API like rcu_synchronize(). In practice,
it takes far too long to wait for a grace period to elapse for each
write to a data structure.
* We use the idea of "phases" (aka epochs or eras) from EBR to
reduce the amount of bookkeeping needed to track memory that is no
longer needed, knowing that the qp-trie does most of that work
already.
I considered hazard pointers for safe memory reclamation. They have
more read-side overhead (updating the hazard pointers) and it wasn't
clear to me how to nicely schedule the cleanup work. Another
alternative, epoch-based reclamation, is designed for fine-grained
lock-free updates, so it needs some rethinking to work well with the
heavily read-biased design of the qp-trie. QSBR has the fastest read
side of the basic SMR algorithms (with no barriers), and fits well
into a libuv loop. More recent hybrid SMR algorithms do not appear to
have enough benefits to justify the extra complexity.
the isc_glob module was originally needed to support posix-style glob
processing on Windows, but is now just an unnecessary wrapper around
glob(3). this commit removes it.
Add a singly-linked stack that supports lock-free prepend and drain (to
empty the list and clean up its elements). Intended for use with QSBR
to collect objects that need safe memory reclamation, or any other user
that works with adding objects to the stack and then draining them in
one go like various work queues.
In <isc/atomic.h>, add an `atomic_ptr()` macro to make type
declarations a little less abominable, and clean up a duplicate
definition of `atomic_compare_exchange_strong_acq_rel()`
as there is no further use of isc_task in BIND, this commit removes
it, along with isc_taskmgr, isc_event, and all other related types.
functions that accepted taskmgr as a parameter have been cleaned up.
as a result of this change, some functions can no longer fail, so
they've been changed to type void, and their callers have been
updated accordingly.
the tasks table has been removed from the statistics channel and
the stats version has been updated. dns_dyndbctx has been changed
to reference the loopmgr instead of taskmgr, and DNS_DYNDB_VERSION
has been udpated as well.
Unfortunately, C still lacks a standard function for pause (x86,
sparc) or yeild (arm) instructions, for use in spin lock or CAS loops.
BIND has its own based on vendor intrinsics or inline asm.
Previously, it was buried in the `isc_rwlock` implementation. This
commit renames `isc_rwlock_pause()` to `isc_pause()` and moves
it into <isc/pause.h>.
This commit also fixes the configure script so that it detects ARM
yield support on systems that identify as `aarch*` instead of `arm*`.
On 64-bit ARM systems we now use the ISB (instruction synchronization
barrier) instruction in preference to yield. The ISB instruction
pauses the CPU for longer, several nanoseconds, which is more like the
x86 pause instruction. There are more details in a Rust pull request,
which also refers to MySQL making the same change:
https://github.com/rust-lang/rust/pull/84725
isc_bind9 was a global bool used to indicate whether the library
was being used internally by BIND or by an external caller. external
use is no longer supported, but the variable was retained for use
by dyndb, which needed it only when being built without libtool.
building without libtool is *also* no longer supported, so the variable
can go away.