Remove the complicated mechanism that could be (in theory) used by
external libraries to register new categories and modules with
statically defined lists in <isc/log.h>. This is similar to what we
have done for <isc/result.h> result codes. All the libraries are now
internal to BIND 9, so we don't need to provide a mechanism to register
extra categories and modules.
The isc_log_write1() and isc_log_vwrite1() functions were meant to
de-duplicate the messages sent to the isc_log subsystem. However, they
were never used in an entire code base and the whole mechanism around it
was complicated and very inefficient. Just remove those, there are
better ways to deduplicate syslog messages inside syslog daemons now.
Add isc_logconfig_get() function to get the current logconfig and use
the getter to replace most of the little dancing around setting up
logging in the tools. Thus:
isc_log_create(mctx, &lctx, &logconfig);
isc_log_setcontext(lctx);
dns_log_setcontext(lctx);
...
...use lcfg...
...
isc_log_destroy();
is now only:
logconfig = isc_logconfig_get(lctx);
...use lcfg...
For thread-safety, isc_logconfig_get() should be surrounded by RCU read
lock, but since we never use isc_logconfig_get() in threaded context,
the only place where it is actually used (but not really needed) is
named_log_init().
Instead of juggling different logging context, use one single logging
context that gets initialized in the libisc constructor and destroyed in
the libisc destructor.
The application is still responsible for creating the logging
configuration before using the isc_log API.
This patch is first in the series in a way that it is transparent for
the users of the isc_log API as the isc_log_create() and
isc_log_destroy() are now thin shims that emulate the previous
functionality, but it isc_log_create() will always return internal
isc__lctx pointer and isc_log_destroy() will actually not destroy the
internal isc__lctx context.
Signed-off-by: Ondřej Surý <ondrej@isc.org>
The contexpr introduced in C23 standard makes perfect sense to be used
instead of preprocessor macros - the symbols are kept, etc. Define
ISC_CONSTEXPR to be `constexpr` for C23 and `static const` for the older
C standards. Use the newly introduced macro for the NS_PER_SEC and
friends time constants.
New version of clang (19) has introduced a stricter checks when mixing
integer (and float types) with enums. In this case, we used enum {}
as C17 doesn't have constexpr yet. Change the time conversion constants
to be static const unsigned int instead of enum values.
Since the enable_fips_mode() now resides inside the isc_tls unit, BIND 9
would fail to compile when FIPS mode was enabled as the DST subsystem
logging functions were missing.
Move the crypto library logging functions from the openssl_link unit to
isc_tls unit and enhance it, so it can now be used from both places
keeping the old dst__openssl_toresult* macros alive.
When putting the 48-bit number into a fixed-size buffer that's exactly 6
bytes, the assertion failure would occur as the 48-bit number is
internally represented as 64-bit number and the code was checking if
there is enough space for `sizeof(val)`. This causes assertion failure
when otherwise valid TSIG signature has a bad timing information.
Specify the size of the argument explicitly, so the 48-bit number
doesn't require 8-byte long buffer.
Add an isc_queue implementation that hides the gory details of cds_wfcq
into more neat API. The same caveats as with cds_wfcq.
TODO: Add documentation to the API.
The detach function declaration in `ISC__REFCOUNT_TRACE_DECL` had an
returned an accidental implicit int. While not allowed since C99, it
became an error by default in GCC 14.
`ISC_REFCOUNT_TRACE_IMPL` and `ISC_REFCOUNT_STATIC_TRACE_IMPL` expanded
into the wrong macros, trying to declare it again with the wrong number
of parameters.
Returning the value allows for better high-water tracking without
running into edge cases like the following:
0. The counter is at value X
1. Increment the value (X+1)
2. The value is decreased multiple times in another threads (X+1-Y)
3. Get the value (X+1-Y)
4. Update-if-greater misses the X+1 value which should have been the
high-water
isc_loop() can now take its place.
This also requires changes to the test harness - instead of running the
setup and teardown outside of th main loop, we now schedule the setup
and teardown to run on the loop (via isc_loop_setup() and
isc_loop_teardown()) - this is needed because the new the isc_loop()
call has to be run on the active event loop, but previously the
isc_loop_current() (and the variants like isc_loop_main()) would work
even outside of the loop because it needed just isc_tid() to work, but
not the full loop (which was mainly true for the main thread).
if we had a method to get the running loop, similar to how
isc_tid() gets the current thread ID, we can simplify loop
and loopmgr initialization.
remove most uses of isc_loop_current() in favor of isc_loop().
in some places where that was the only reason to pass loopmgr,
remove loopmgr from the function parameters.
After removing sockaddr_unix from isc_sockaddr, we can also remove
sockaddr_storage and reduce the isc_sockaddr size from 152 bytes to just
48 bytes needed to hold IPv6 addresses.
As it was pointed out, the alignas() can't be used on objects larger
than `max_align_t` otherwise the compiler might miscompile the code to
use auto-vectorization on unaligned memory.
As we were only using alignas() as a way to prevent false memory
sharing, we can use manual padding in the affected structures.
With _exit() instead of exit() in place, we don't need
isc__tls_setfatalmode() mechanism as the atexit() calls will not be
executed including OpenSSL atexit hooks.
cmocka.h and jemalloc.h/malloc_np.h has conflicting macro definitions.
While fixing them with push_macro for only malloc is done below, we only
need the non-standard mallocx interface which is easy to just define by
ourselves.
Because we don't use jemalloc functions directly, but only via the
libisc library, the dynamic linker might pull the jemalloc library
too late when memory has been already allocated via standard libc
allocator.
Add a workaround round isc_mem_create() that makes the dynamic linker
to pull jemalloc earlier than libc.
The statistics channel does not expose the current number of TCP clients
connected, only the highwater. Therefore, users did not have an easy
means to collect statistics about TCP clients served over time. This
information could only be measured as a seperate mechanism via rndc by
looking at the TCP quota filled.
In order to expose the exact current count of connected TCP clients
(tracked by the "tcp-clients" quota) as a statistics counter, an
extra, dedicated Network Manager callback would need to be
implemented for that purpose (a counterpart of ns__client_tcpconn()
that would be run when a TCP connection is torn down), which is
inefficient. Instead, track the number of currently-connected TCP
clients separately for IPv4 and IPv6, as Network Manager statistics.
This commits adds low-level wrappers on top of
'SSL_CTX_set_ciphersuites()'. These are going to be a foundation
behind the 'cipher-suites' option of the 'tls' statement.
This commit adds a new transport that supports PROXYv2 over UDP. It is
built on top of PROXYv2 handling code (just like PROXY Stream). It
works by processing and stripping the PROXYv2 headers at the beginning
of a datagram (when accepting a datagram) or by placing a PROXYv2
header to the beginning of an outgoing datagram.
The transport is built in such a way that incoming datagrams are being
handled with minimal memory allocations and copying.
This commit makes it possible to use PROXY Stream not only over TCP,
but also over TLS. That is, now PROXY Stream can work in two modes as
far as TLS is involved:
1. PROXY over (plain) TCP - PROXYv2 headers are sent unencrypted before
TLS handshake messages. That is the main mode as described in the
PROXY protocol specification (as it is clearly stated there), and most
of the software expects PROXYv2 support to be implemented that
way (e.g. HAProxy);
2. PROXY over (encrypted) TLS - PROXYv2 headers are sent after the TLS
handshake has happened. For example, this mode is being used (only ?)
by "dnsdist". As far as I can see, that is, in fact, a deviation from
the spec, but I can certainly see how PROXYv2 could end up being
implemented this way elsewhere.
This commit modifies TLS Stream to make it possible to use over PROXY
Stream. That is required to add PROVYv2 support into TLS-based
transports (DNS over HTTP, DNS over TLS).
This commit adds a new stream-based transport with an interface
compatible with TCP. The transport is built on top of TCP transport
and the new PROXYv2 handling code. Despite being built on top of TCP,
it can be easily extended to work on top of any TCP-like stream-based
transport. The intention of having this transport is to add PROXYv2
support into all existing stream-based DNS transport (DNS over TCP,
DNS over TLS, DNS over HTTP) by making the work on top of this new
transport.
The idea behind the transport is simple after accepting the connection
or connecting to a remote server it enters PROXYv2 handling mode: that
is, it either attempts to read (when accepting the connection) or send
(when establishing a connection) a PROXYv2 header. After that it works
like a mere wrapper on top of the underlying stream-based
transport (TCP).
This commit adds a set of utilities for dealing with PROXYv2 headers,
both parsing and generating them. The code has no dependencies from
the networking code and is (for the most part) a "separate library".
The part responsible for handling incoming PROXYv2 headers is
structured as a state machine which accepts data as input and calls a
callback to notify the upper-level code about the data processing
status.
Such a design, among other things, makes it easy to write a thorough
unit test suite for that, as there are fewer dependencies as well as
will not stand in the way of any changes in the networking code.
Previously, there were two methods of working with the overmem
condition:
1. hi/lo water callback - when the overmem condition was reached
for the first time, the water callback was called with HIWATER
mark and .is_overmem boolean was set internally. Similarly,
when the used memory went below the lo water mark, the water
callback would be called with LOWATER mark and .is_overmem
was reset. This check would be called **every** time memory
was allocated or freed.
2. isc_mem_isovermem() - a simple getter for the internal
.is_overmem flag
This commit refactors removes the first method and move the hi/lo water
checks to the isc_mem_isovermem() function, thus we now have only a
single method of checking overmem condition and the check for hi/lo
water is removed from the hot path for memory contexts that doesn't use
overmem checks.
The AES algorithm for DNS cookies was being kept for legacy reasons, and
it can be safely removed in the next major release. Remove both the AES
usage for DNS cookies and the AES implementation itself.
Add complementary macros to ISC_LIST_FOREACH(_SAFE) that walk the lists
in reverse.
* ISC_LIST_FOREACH_REV(list, elt, link) - walk the static list from
tail to head
* ISC_LIST_FOREACH_REV_SAFE(list, elt, link, next) - walk the list
from tail to head in a manner that's safe against list member
deletions
In units that support detailed reference tracing via ISC_REFCOUNT
macros, we were doing:
/* Define to 1 for detailed reference tracing */
#undef <unit>_TRACE
This would prevent using -D<unit>_TRACE=1 in the CFLAGS.
Convert the above mentioned snippet with just a comment how to enable
the detailed reference tracing:
/* Add -D<unit>_TRACE=1 to CFLAGS for detailed reference tracing */
Instead of copying address back and forth when hashing addr+port, we can
use incremental hashing. Additionally, switch from 64-bit
isc_hash_function to 32-bit isc_hash32() as the resulting value is
32-bit.
The Unix Domain Sockets support in BIND 9 has been completely disabled
since BIND 9.18 and it has been a fatal error since then. Cleanup the
code and the documentation that suggest that Unix Domain Sockets are
supported.
Instead of high number of dispatches (4 * named_g_udpdisp)[1], make the
dispatches bound to threads and make dns_dispatchset_t create a dispatch
for each thread (event loop).
This required couple of other changes:
1. The dns_dispatch_createudp() must be called on loop, so the isc_tid()
is already initialized - changes to nsupdate and mdig were required.
2. The dns_requestmgr had only a single dispatch per v4 and v6. Instead
of using single dispatch, use dns_dispatchset_t for each protocol -
this is same as dns_resolver.
Looking up unique message ID in the dns_dispatch has been using custom
hash tables. Rewrite the custom hashtable to use cds_lfht API, removing
one extra lock in the cold-cache resolver hot path.
Refactor isc_hashmap to allow custom matching functions. This allows us
to have better tailored keys that don't require fixed uint8_t arrays,
but can be composed of more fields from the stored data structure.
Add support for incremental hashing to the isc_hash unit, both 32-bit
and 64-bit incremental hashing is now supported.
This is commit second in series adding incremental hashing to libisc.
When inserting items into hashtables (hashmaps), we might have a
fragmented key (as an example we might want to hash DNS name + class +
type). We either need to construct continuous key in the memory and
then hash it en bloc, or incremental hashing is required.
This incremental version of SipHash 2-4 algorithm is the first building
block.
As SipHash 2-4 is often used in the hot paths, I've turned the
implementation into header-only version in the process.
This commit extends the internal memory management middleware code in
BIND so that memory contexts backed by dedicated jemalloc arenas can
be created. A new function (isc_mem_create_arena()) is added for that.
Moreover, it extends the existing code so that specialised memory
contexts can be created easily, should we need that functionality for
other future purposes. We have achieved that by passing the flags to
the underlying jemalloc-related calls. See the above
isc_mem_create_arena(), which can serve as an example of this.
Having this opens up possibilities for creating memory contexts tuned
for specific needs.
Use the new isc_mem_c*() calloc-like API for allocations that are
zeroed.
In turn, this also fixes couple of incorrect usage of the ISC_MEM_ZERO
for structures that need to be zeroed explicitly.
There are few places where isc_mem_cput() is used on structures with a
flexible member (or similar).
Add new isc_mem_cget(), isc_mem_creget(), and isc_mem_cput() macros to
complement the isc_mem_callocate() (which works like calloc()).
The overflow checks are implemented as macros in the <isc/mem.h>, so
that the compiler can see that the element size is constant: it should
always be `sizeof(something)`.
Before calling isc_buffer_putmem(), there is a condition to check
that 'buf_size' is greater than 0. At this point 'buf_size' is
guaranteed to be greater than zero, so either the condition is
redundant, or 'unprocessed_size' should be checked instead, which
seems more logical, because calling isc_buffer_putmem() with
'unprocessed_size' being zero is not useful, although harmless.
The isc_dnsstream_assembler_incoming() inline function expects that
when 'buf_size' is zero, then 'buf' must be NULL. The expectation is
not correct, because those values come from the libuv read callback,
and its documentation notes[1] that 'nread' ('buf_size' here) might
be 0, which does not indicate an error or EOF, but is equivalent to
EAGAIN or EWOULDBLOCK under read(2).
Change the isc_dnsstream_assembler_incoming() inline function to
remove the invalid expectation.
[1] https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb
There used to be an extra layer of indirection in the memory functions
for certain dynamic linking scenarios. This involved variant spellings
like isc__mem and isc___mem. The isc___mem variants were removed in
commit 7de846977b so the token pasting is no longer needed and
only serves to obfuscate.
The SET_IF_NOT_NULL() macro avoids a fair amount of tedious boilerplate,
checking pointer parameters to see if they're non-NULL and updating
them if they are. The macro was already in the dns_zone unit, and this
commit moves it to the <isc/util.h> header.
I have included a Coccinelle semantic patch to use SET_IF_NOT_NULL()
where appropriate. The patch needs an #include in `openssl_shim.c`
in order to work.
With ThreadSanitizer support added to the Userspace RCU, we no longer
need to wrap the call_rcu and caa_container_of with
__tsan_{acquire,release} hints. Remove the direct calls to
__tsan_{acquire,release} and the isc_urcu_{container,cleanup} macros.
The cds_lfht_for_each_entry and cds_lfht_for_each_entry_duplicate macros
had a code that operated on the NULL pointer, at the end of the list it
was calling caa_container_of() on the NULL pointer in the init-clause
and iteration-expression, but the result wasn't actually used anywhere
because the cond-expression in the for loop has prevented executing
loop-statement. This made AddressSanitizer notice the invalid operation
and rightfully complain.
This was reported to the upstream and fixed there. Pull the upstream
fix into our <isc/urcu.h> header, so our CI checks pass.
The isc_stats_create() can no longer return anything else than
ISC_R_SUCCESS. Refactor isc_stats_create() and its variants in libdns,
libns and named to just return void.
to reduce the amount of common code that will need to be shared
between the separated cache and zone database implementations,
clean up unused portions of dns_db.
the methods dns_db_dump(), dns_db_isdnssec(), dns_db_printnode(),
dns_db_resigned(), dns_db_expirenode() and dns_db_overmem() were
either never called or were only implemented as nonoperational stub
functions: they have now been removed.
dns_db_nodefullname() was only used in one place, which turned out
to be unnecessary, so it has also been removed.
dns_db_ispersistent() and dns_db_transfernode() are used, but only
the default implementation in db.c was ever actually called. since
they were never overridden by database methods, there's no need to
retain methods for them.
in rbtdb.c, beginload() and endload() methods are no longer defined for
the cache database, because that was never used (except in a few unit
tests which can easily be modified to use the zone implementation
instead). issecure() is also no longer defined for the cache database,
as the cache is always insecure and the default implementation of
dns_db_issecure() returns false.
for similar reasons, hashsize() is no longer defined for zone databases.
implementation functions that are shared between zone and cache are now
prepended with 'dns__rbtdb_' so they can become nonstatic.
serve_stale_ttl is now a common member of dns_db.
As well as clearing the fresh memory, `calloc()`-like functions must
ensure that the count and size do not overflow when multiplied.
Use `isc_mem_callocate()` in `isc__uv_calloc()`.
The `ISC_OVERFLOW_XXX()` macros are usually wrappers around
`__builtin_xxx_overflow()`, with alternative implementations
for compilers that lack the builtins.
Replace the overflow checks in `isc/time.c` with the new macros.
The isc_result_t enum was to sparse when each library code would skip to
next << 16 as a base. Remove the huge holes in the isc_result_t enum to
make the isc_result tables more compact.
This change required a rewrite how we map dns_rcode_t to isc_result_t
and back, so we don't ever return neither isc_result_t value nor
dns_rcode_t out of defined range.
In some cases, the inlined version rcu_dereference() would not compile
when working on pointer to opaque struct (namely Ubuntu Jammy). Detect
such condition in the autoconf and disable the inlining of the small
functions if it breaks the build.
Move registration and deregistration of the main thread from
`isc_loopmgr_run()` into `isc__initialize()` / `isc__shutdown()`:
liburcu-qsbr fails an assertion if we try to use it from an
unregistered thread, and we need to be able to use it when the
event loops are not running.
Use `rcu_assign_pointer()` and `rcu_dereference()` in qp-trie
transactions so that they properly mark threads as online. The
RCU-protected pointer is no longer declared atomic because
liburcu does not (yet) use standard C atomics.
Fix the definition of `isc_qsbr_rcu_dereference()` to return
the referenced value, and to call the right function inside
liburcu.
Change the thread sanitizer suppressions to match any variant of
`rcu_*_barrier()`
All the places the qp-trie code was using `call_rcu()` needed
`__tsan_release()` and `__tsan_acquire()` annotations, so
add a couple of wrappers to encapsulate this pattern.
With these wrappers, the tests run almost clean under thread
sanitizer. The remaining problems are due to `rcu_barrier()`
which can be suppressed using `.tsan-suppress`. It does not
suppress the whole of `liburcu`, because we would like thread
sanitizer to detect problems in `call_rcu()` callbacks, which
are called from `liburcu`.
The CI jobs have been updated to use `.tsan-suppress` by
default, except for a special-case job that needs the
additional suppressions in `.tsan-suppress-extra`.
We might be able to get rid of some of this after liburcu gains
support for thread sanitizer.
Note: the `rcu_barrier()` suppression is not entirely effective:
tsan sometimes reports races that originate inside `rcu_barrier()`
but tsan has discarded the stack so it does not have the
information required to suppress the report. These "races" can
be made much easier to reproduce by adding `atexit_sleep_ms=1000`
to `TSAN_OPTIONS`. The problem with tsan's short memory can be
addressed by increasing `history_size`: when it is large enough
(6 or 7) the `rcu_barrier()` stack usually survives long enough
for suppression to work.
It can be fairly long-winded to allocate space for a struct with a
flexible array member: in general we need the size of the struct, the
size of the member, and the number of elements. Wrap them all up in a
STRUCT_FLEX_SIZE() macro, and use the new macro for the flexible
arrays in isc_ht and dns_qp.
The isc_quota API was using locked list of isc_job_t objects to keep the
waiting TCP accepts. Change the isc_quota implementation to use
cds_wfcqueue internally - the enqueue is wait-free and only dequeue
needs to be locked.
The isc_async API was using lock-free stack (where enqueue operation was
not wait-free). Change the isc_async to use cds_wfcqueue internally -
enqueue and splice (move the queue members from one list to another) is
nonblocking and wait-free.
Instead of having a global hashtable with a global rwlock for the GLUE
cache, move the glue_list directly into rdatasetheader and use
Userspace-RCU to update the pointer when the glue_list is empty.
Additionally, the cached glue_lists needs to be stored in the RBTDB
version for early cleaning, otherwise the circular dependencies between
nodes and glue_lists will prevent nodes to be ever cleaned up.
Clang 16 LeakSanitizer reports a memory leak when dns_request_create()
returned a TLS error in the nsupdate system test. While technically a
memory leak on error handling, it's not a problem because the program is
immediately terminated; nsupdate is not expected to run for a prolonged
time.
All the per-loop `libuv` setup remains in `isc_loop`, but the per-thread
RCU setup is moved to `isc_thread` alongside the other per-thread setup.
This avoids repeating the per-thread setup for `call_rcu()` helpers,
and explains a little better why some parts of the per-thread setup
is missing for `call_rcu()` helpers.
This also removes the per-loop `call_rcu()` helpers as we refactored the
isc__random_initialize() in the previous commit.
Remove the `isc_threadarg_t` and `isc_threadresult_t`
typedefs which were unhelpful disguises for `void *`,
and free the dummy jemalloc allocation sooner.
When liburcu is not installed from a system package, its headers are
not treated as system headers by the compiler, so BIND's -Werror and
other warning options take effect. The liburcu headers have a lot
of inline functions, some of which do not use all their arguments,
which BIND's build treats as an error.
This commit allows BIND 9 to be compiled with different flavours of
Userspace RCU, and improves the integration between Userspace RCU and
our event loop:
- In the RCU QSBR, the thread is put offline when polling and online
when rcu_dereference, rcu_assign_pointer (or friends) are called.
- In other RCU modes, we check that we are not reading when reaching the
quiescent callback in the event loop.
- We register the thread before uv_work_run() callback is called and
after it has finished. The rcu_(un)register_thread() has a large
overhead, but that's fine in this case.
There's a recurring pattern walking the ISC_LISTs that just repeats over
and over. Add two macros:
* ISC_LIST_FOREACH(list, elt, link) - walk the static list
* ISC_LIST_FOREACH_SAFE(list, elt, link, next) - walk the list in
a manner that's safe against list member deletions
The spinlock is small (atomic_uint_fast32_t at most), lightweight
synchronization primitive and should only be used for short-lived and
most of the time a isc_mutex should be used.
Add a isc_spinlock unit which is either (most of the time) a think
wrapper around pthread_spin API or an efficient shim implementation of
the simple spinlock.