As well as clearing the fresh memory, `calloc()`-like functions must
ensure that the count and size do not overflow when multiplied.
Use `isc_mem_callocate()` in `isc__uv_calloc()`.
The `ISC_OVERFLOW_XXX()` macros are usually wrappers around
`__builtin_xxx_overflow()`, with alternative implementations
for compilers that lack the builtins.
Replace the overflow checks in `isc/time.c` with the new macros.
The free_all_cpu_call_rcu_data() call can consume hundreds of
milliseconds on shutdown. Don't try to be smart and let the RCU library
handle this internally.
In HTTP/1.0 and HTTP/1.1, RFC 9112 section 9.6 says the last response
in a connection should include a `Connection: close` header, but the
statschannel server omitted it.
In an HTTP/1.0 response, the statschannel server can sometimes send a
`Connection: keep-alive` header when it is about to close the
connection. There are two ways:
If the first request on a connection is keep-alive and the second
request is not, then _both_ responses have `Connection: keep-alive`
but the connection is (correctly) closed after the second response.
If a single request contains
Connection: close
Connection: keep-alive
then RFC 9112 section 9.3 says the keep-alive header is ignored, but
the statschannel sends a spurious keep-alive in its response, though
it correctly closes the connection.
To fix these bugs, make it more clear that the `httpd->flags` are part
of the per-request-response state. The Connection: flags are now
described in terms of the effect they have instead of what causes them
to be set.
The isc_result_t enum was to sparse when each library code would skip to
next << 16 as a base. Remove the huge holes in the isc_result_t enum to
make the isc_result tables more compact.
This change required a rewrite how we map dns_rcode_t to isc_result_t
and back, so we don't ever return neither isc_result_t value nor
dns_rcode_t out of defined range.
In some cases, the inlined version rcu_dereference() would not compile
when working on pointer to opaque struct (namely Ubuntu Jammy). Detect
such condition in the autoconf and disable the inlining of the small
functions if it breaks the build.
isc_mem_put NULL's the pointer to the memory being freed. The
equality test 'parent->r == node' was accidentally being turned
into a test against NULL.
RUNTIME_CHECK on the "wrap" variable avoids possible NULL dereference:
thread.c: In function 'thread_wrap':
thread.c:60:15: error: dereference of possibly-NULL 'wrap' [CWE-690] [-Werror=analyzer-possible-null-dereference]
60 | *wrap = (struct thread_wrap){
The RUNTIME_CHECK was there before
7d1ceaf35d.
Move registration and deregistration of the main thread from
`isc_loopmgr_run()` into `isc__initialize()` / `isc__shutdown()`:
liburcu-qsbr fails an assertion if we try to use it from an
unregistered thread, and we need to be able to use it when the
event loops are not running.
Use `rcu_assign_pointer()` and `rcu_dereference()` in qp-trie
transactions so that they properly mark threads as online. The
RCU-protected pointer is no longer declared atomic because
liburcu does not (yet) use standard C atomics.
Fix the definition of `isc_qsbr_rcu_dereference()` to return
the referenced value, and to call the right function inside
liburcu.
Change the thread sanitizer suppressions to match any variant of
`rcu_*_barrier()`
An omission pointed out by the following report from Coverity:
/lib/isc/loop.c: 483 in isc_loopmgr_pause()
>>> CID 455002: Error handling issues (CHECKED_RETURN)
>>> Calling "uv_async_send" without checking return value (as is done elsewhere 5 out of 6 times).
483 uv_async_send(&loop->pause_trigger);
when reading on a streamdns socket failed due to timeout, but
the dispatch was still waiting for other responses, it would
resume reading by calling isc_nm_read() again. this caused
an assertion because the socket was already reading.
we now check that either the socket is reading, or that it was
already reading on the same handle.
Create and free per-CPU helper threads from the main thread and tell
thread sanitizer to suppress leaking threads. (We are not leaking
threads ourselves and we can safely ignore the Userspace-RCU thread
leaks.)
All the places the qp-trie code was using `call_rcu()` needed
`__tsan_release()` and `__tsan_acquire()` annotations, so
add a couple of wrappers to encapsulate this pattern.
With these wrappers, the tests run almost clean under thread
sanitizer. The remaining problems are due to `rcu_barrier()`
which can be suppressed using `.tsan-suppress`. It does not
suppress the whole of `liburcu`, because we would like thread
sanitizer to detect problems in `call_rcu()` callbacks, which
are called from `liburcu`.
The CI jobs have been updated to use `.tsan-suppress` by
default, except for a special-case job that needs the
additional suppressions in `.tsan-suppress-extra`.
We might be able to get rid of some of this after liburcu gains
support for thread sanitizer.
Note: the `rcu_barrier()` suppression is not entirely effective:
tsan sometimes reports races that originate inside `rcu_barrier()`
but tsan has discarded the stack so it does not have the
information required to suppress the report. These "races" can
be made much easier to reproduce by adding `atexit_sleep_ms=1000`
to `TSAN_OPTIONS`. The problem with tsan's short memory can be
addressed by increasing `history_size`: when it is large enough
(6 or 7) the `rcu_barrier()` stack usually survives long enough
for suppression to work.
Memory reclamation by `call_rcu()` is asynchronous, so during shutdown
it can lose a race with the destruction of its memory context. When we
defer memory reclamation, we need to attach to the memory context to
indicate that it is still in use, but that is not enough to delay its
destruction. So, call `rcu_barrier()` in `isc_mem_destroy()` to wait
for pending RCU work to finish before proceeding to destroy the memory
context.
It can be fairly long-winded to allocate space for a struct with a
flexible array member: in general we need the size of the struct, the
size of the member, and the number of elements. Wrap them all up in a
STRUCT_FLEX_SIZE() macro, and use the new macro for the flexible
arrays in isc_ht and dns_qp.
The Userspace-RCU headers are now needed for more parts of the libisc
and libdns, thus we need to add it globally to prevent compilation
failures on systems with non-standard Userspace-RCU installation path.
The isc_quota API was using locked list of isc_job_t objects to keep the
waiting TCP accepts. Change the isc_quota implementation to use
cds_wfcqueue internally - the enqueue is wait-free and only dequeue
needs to be locked.
The isc_async API was using lock-free stack (where enqueue operation was
not wait-free). Change the isc_async to use cds_wfcqueue internally -
enqueue and splice (move the queue members from one list to another) is
nonblocking and wait-free.
Instead of having a global hashtable with a global rwlock for the GLUE
cache, move the glue_list directly into rdatasetheader and use
Userspace-RCU to update the pointer when the glue_list is empty.
Additionally, the cached glue_lists needs to be stored in the RBTDB
version for early cleaning, otherwise the circular dependencies between
nodes and glue_lists will prevent nodes to be ever cleaned up.
Clang 16 LeakSanitizer reports a memory leak when dns_request_create()
returned a TLS error in the nsupdate system test. While technically a
memory leak on error handling, it's not a problem because the program is
immediately terminated; nsupdate is not expected to run for a prolonged
time.
Stop deliberately breaking const rules by copying file->name into
dirbuf and truncating it there. Handle files located in the root
directory properly. Use unlinkat() from POSIX 200809.
Removing old timestamp or increment versions of log backup files did
not work when the file is an absolute path: only the entry name was
provided to the file remove function.
The dirname was also bogus, since the file separater was put back too
soon.
Fix these issues to make log file rotation work when the file is
configured to be an absolute path.
All the per-loop `libuv` setup remains in `isc_loop`, but the per-thread
RCU setup is moved to `isc_thread` alongside the other per-thread setup.
This avoids repeating the per-thread setup for `call_rcu()` helpers,
and explains a little better why some parts of the per-thread setup
is missing for `call_rcu()` helpers.
This also removes the per-loop `call_rcu()` helpers as we refactored the
isc__random_initialize() in the previous commit.
Instead of writing complicated wrappers for every thread, move the
initialization back to isc_random unit and check whether the random seed
was initialized with a thread_local variable.
Ensure that isc_entropy_get() returns a non-zero seed.
This avoids problems with thread sanitizer tests getting stuck in an
infinite loop.
Remove the `isc_threadarg_t` and `isc_threadresult_t`
typedefs which were unhelpful disguises for `void *`,
and free the dummy jemalloc allocation sooner.
When liburcu is not installed from a system package, its headers are
not treated as system headers by the compiler, so BIND's -Werror and
other warning options take effect. The liburcu headers have a lot
of inline functions, some of which do not use all their arguments,
which BIND's build treats as an error.
This commit allows BIND 9 to be compiled with different flavours of
Userspace RCU, and improves the integration between Userspace RCU and
our event loop:
- In the RCU QSBR, the thread is put offline when polling and online
when rcu_dereference, rcu_assign_pointer (or friends) are called.
- In other RCU modes, we check that we are not reading when reaching the
quiescent callback in the event loop.
- We register the thread before uv_work_run() callback is called and
after it has finished. The rcu_(un)register_thread() has a large
overhead, but that's fine in this case.
There's a recurring pattern walking the ISC_LISTs that just repeats over
and over. Add two macros:
* ISC_LIST_FOREACH(list, elt, link) - walk the static list
* ISC_LIST_FOREACH_SAFE(list, elt, link, next) - walk the list in
a manner that's safe against list member deletions
The spinlock is small (atomic_uint_fast32_t at most), lightweight
synchronization primitive and should only be used for short-lived and
most of the time a isc_mutex should be used.
Add a isc_spinlock unit which is either (most of the time) a think
wrapper around pthread_spin API or an efficient shim implementation of
the simple spinlock.
When shutting down TCP sockets, the read callback calling logic was
flawed, it would call either one less callback or one extra. Fix the
logic in the way:
1. When isc_nm_read() has been called but isc_nm_read_stop() hasn't on
the handle, the read callback will be called with ISC_R_CANCELED to
cancel active reading from the socket/handle.
2. When isc_nm_read() has been called and isc_nm_read_stop() has been
called on the on the handle, the read callback will be called with
ISC_R_SHUTTINGDOWN to signal that the dormant (not-reading) socket
is being shut down.
3. The .reading and .recv_read flags are little bit tricky. The
.reading flag indicates if the outer layer is reading the data (that
would be uv_tcp_t for TCP and isc_nmsocket_t (TCP) for TLSStream),
the .recv_read flag indicates whether somebody is interested in the
data read from the socket.
Usually, you would expect that the .reading should be false when
.recv_read is false, but it gets even more tricky with TLSStream as
the TLS protocol might need to read from the socket even when sending
data.
Fix the usage of the .recv_read and .reading flags in the TLSStream
to their true meaning - which mostly consist of using .recv_read
everywhere and then wrapping isc_nm_read() and isc_nm_read_stop()
with the .reading flag.
4. The TLS failed read helper has been modified to resemble the TCP code
as much as possible, clearing and re-setting the .recv_read flag in
the TCP timeout code has been fixed and .recv_read is now cleared
when isc_nm_read_stop() has been called on the streaming socket.
5. The use of Network Manager in the named_controlconf, isccc_ccmsg, and
isc_httpd units have been greatly simplified due to the improved design.
6. More unit tests for TCP and TLS testing the shutdown conditions have
been added.
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Co-authored-by: Artem Boldariev <artem@isc.org>
By inspecting the code, it was discovered that .sendbuf member of the
isc__nm_networker_t was unused and just consuming ~64k per worker.
Remove the member and the association allocation/deallocation.
In e185412872, the TCP accept quota code
became broken in a subtle way - the quota would get initialized on the
first accept for the server socket and then deleted from the server
socket, so it would never get applied again.
Properly fixing this required a bigger refactoring of the isc_quota API
code to make it much simpler. The new code decouples the ownership of
the quota and acquiring/releasing the quota limit.
After (during) the refactoring it became more clear that we need to use
the callback from the child side of the accepted connection, and not the
server side.
Instead of using isc_async_run() when closing StreamDNS handle, add
isc_job_t member to the isc_nmhandle_t structure and use isc_job_run()
to avoid allocation/deallocation on the StreamDNS hot-path.
If the quota callback is called on a thread matching the socket, call
the TCP accept function directly instead of using isc_async_run() which
allocates-deallocates memory.
The isc_tid() function is often called on the hot-path and it's the only
function is to return thread_local variable, make the isc_tid() function
a header-only to save several function calls during query-response
processing.
Only attempt to digest 'in' if it is non NULL. This will prevent
false positives about NULL pointer dereferences against 'in' and
should also speed up the processing.
This commit optimises isc_dnsstream_assembler_t in such a way that
memory copying and reallocation are avoided when receiving one or more
complete DNS messages at once. We try to handle the data from the
messages directly, without storing them in an intermediate memory
buffer.
The `isc_histosummary_t` functions were written in the early days of
`hg64` and carried over when I brought `hg64` into BIND. They were
intended to be useful for graphing cumulative frequency distributions
and the like, but in practice whatever draws charts is better off with
a raw histogram export. Especially because of the poor performance of
the old functions.
The replacement `isc_histo_quantiles()` function is intended for
providing a few quantile values in BIND's stats channel, when the user
does not want the full histogram. Unlike the old functions, the caller
provides all the query fractions up-front, so that the values can be
found in a single scan instead of a scan per value. The scan is from
larger values to smaller, since larger quantiles are usually more
interesting, so the scan can bail out early.
Although an `isc_histo_t` is thread-safe, it can suffer
from cache contention under heavy load. To avoid this,
an `isc_histomulti_t` contains a histogram per thread,
so updates are local and low-contention.
This is an adaptation of my `hg64` experiments for use in BIND.
As well as renaming everything according to ISC style, I have
written some more extensive tests that ensure the edge cases are
correct and the fenceposts are in the right places.
I have added utility functions for working with precision in terms of
decimal significant figures as well as this code's native binary.
Cleanup the remnants of MS Compiler bits from <isc/refcount.h>, printing
the information in named/main.c, and cleanup some comments about Windows
that no longer apply.
The bits in picohttpparser.{h,c} were left out, because it's not our
code.
OPENSSL_cleanup is supposed to free all remaining memory in use
provided the application has cleaned up properly. This is not the
case on some operating systems. Silently ignore memory that is
freed after OPENSSL_cleanup has been called.
When fatal is called we may be holding memory allocated by OpenSSL.
This may result in the reference count for the FIPS provider not
going to zero and the shared library not being unloaded during
OPENSSL_cleanup. When the shared library is ultimately unloaded,
when all remaining dynamically loaded libraries are freed, we have
already destroyed the memory context we where using to track memory
leaks / late frees resulting in INSIST being called.
Disable triggering the INSIST when fatal has being called.
The `isc_trampoline` module had a lot of machinery to support stable
thread IDs for use by hazard pointers. But the hazard pointer code
is gone, and the `isc_loop` module now has its own per-loop thread
IDs.
The trampoline machinery seems over-complicated for its remaining
tasks, so move the per-thread initialization into `isc/thread.c`,
and delete the rest.
The isc_time_now() and isc_time_now_hires() were used inconsistently
through the code - either with status check, or without status check,
or via TIME_NOW() macro with RUNTIME_CHECK() on failure.
Refactor the isc_time_now() and isc_time_now_hires() to always fail when
getting current time has failed, and return the isc_time_t value as
return value instead of passing the pointer to result in the argument.
The isc_fsaccess API was created to hide the implementation details
between POSIX and Windows APIs. As we are not supporting the Windows
APIs anymore, it's better to drop this API used in the DST part.
Moreover, the isc_fsaccess was setting the permissions in an insecure
manner - it operated on the filename, and not on the file descriptor
which can lead to all kind of attacks if unpriviledged user has read (or
even worse write) access to key directory.
Replace the code that operates on the private keys with code that uses
mkstemp(), fchmod() and atomic rename() at the end, so at no time the
private key files have insecure permissions.
As it's impossible to get the current umask without modifying it at the
same time, initialize the current umask at the program start and keep
the loaded value internally. Add isc_os_umask() function to access the
starttime umask.
This is a simple replacement using the semantic patch from the previous
commit and as added bonus, one removal of previously undetected unused
variable in named/server.c.
When the loopmanager is shutting down following a signal,
`dig` and `host` should stop cleanly. Before this commit
they were oblivious to ISC_R_SHUTTINGDOWN.
The `isc_signal` callbacks now report this kind of mistake
with a stack backtrace.
Instead of marking the unused entities with UNUSED(x) macro in the
function body, use a `ISC_ATTR_UNUSED` attribute macro that expans to
C23 [[maybe_unused]] or __attribute__((__unused__)) as fallback.
Use C23 attribute styles if available:
* Add new ISC_ATTR_UNUSED attribute macro that either expands to C23's
[[maybe_unused]] or __attribute__((__unused__));
* Add default expansion of the `noreturn` to [[noreturn]] if available;
* Move the FALLTHROUGH from <isc/util.h> to <isc/attributes.h>
With the changes to tls_try_handshake() made in
2846888c57 there are some incorrect
INSISTS() related to handshake handling which better to be removed.
When accepting a TCP connection in the higher layers (tlsstream,
streamdns, and http) attach to the socket the connection was accepted
on, and use this socket instead of the parent listening socket.
This has an advantage - accessing the sock->listener now doesn't break
the thread boundaries, so we can properly check whether the socket is
being closed without requiring .closing member to be atomic_bool.
The last atomic_bool variable sock->active was converted to non-atomic
bool by properly handling the listening socket case where we were
checking parent socket instead of children sockets.
This is no longer necessary as we properly set the .active to false on
the children sockets.
Additionally, cleanup the .rchildren - the atomic variable was used for
mutex+condition to block until all children were listening, but that's
now being handled by a barrier.
Finally, just remove dead .self and .active_child_connections members of
the netmgr socket.
Now that everything runs on their own loop and we don't cross the thread
boundaries (with few exceptions), most of the atomic_bool variables used
to track the socket state have been unatomicized because they are always
accessed from the matching thread.
The remaining few have been relaxed: a) the sock->active is now using
acquire/release memory ordering; b) the various global limits are now
using relaxed memory ordering - we don't really care about the
synchronization for those.
Previously, isc_job_run() could have been used to run the job on the
current loop and the isc_job_run() would take care of allocating and
deallocating the job. After the change in this MR, the isc_job_run()
is more complicated to use, so we introduce the isc_async_current()
macro to suplement isc_async_run() when we need to run the job on the
current loop.
Change the isc_job_run() to not-make any allocations. The caller must
make sure that it allocates isc_job_t - usually as part of the argument
passed to the callback.
For simple jobs, using isc_async_run() is advised as it allocates its
own separate isc_job_t.
Change the isc__nm_uvreq_t to have the idle callback as a separate
member as we always need to use it to properly close the uvreq.
Slightly refactor uvreq_put and uvreq_get to remove the unneeded
arguments - in uvreq_get(), we always use sock->worker, and in
uvreq_put, we always use req->sock, so there's not reason to pass those
extra arguments.
Instead of using isc_job_run() that's quite heavy as it allocates memory
for every new job, add uv_idle_t to uvreq union, and use uv_idle API
directly to execute the connect/read/send callback without any
additional allocations.
Put the comment back, so it's more obvious that we are only restarting
timer when there's a last handle attached to the socket; there has to be
always at least one.
It's sometimes helpful to get a quick idea of the call stack when
debugging. This change factors out the backtrace logging from named's
fatal error handler so that it's easy to use in other places too.
The isc_nm_httpconnect() would succeed even if the netmgr would be
already shuttingdown. This has been fixed and the unit test has been
updated to cope with fact that the handle would be NULL when
isc_nm_httpconnect() returns with an error.
when isc_nm_listenstreamdns() is called with a local port of 0,
a random port is chosen. call uv_getsockname() to determine what
the port is as soon as the socket is bound, and add a function
isc_nmsocket_getaddr() to retrieve it, so that the caller can
connect to the listening socket. this will be used in cases
where the same process is acting as both client and server.
add a public function ns_interface_create() allowing the caller
to set up a listening interface directly without having to set
up listen-on and scan network interfaces.
Simplify the stopping of the generic socket children by using the
isc_async API from the loopmgr instead of using the asychronous
netievent mechanism in the netmgr.
Simplify the setting of the TLS contexts by using the isc_async API
from the loopmgr instead of using the asychronous netievent mechanism in
the netmgr.
Simplify the canceling of the StreamDNS socket by using the isc_async API
from the loopmgr instead of using the asychronous netievent mechanism in
the netmgr.
Simplify the reading from the StreamDNS socket by using the isc_async API
from the loopmgr instead of using the asychronous netievent mechanism in
the netmgr.
Simplify the setting of the DoH endpoints by using the isc_async API
from the loopmgr instead of using the asychronous netievent mechanism in
the netmgr.
Simplify the acception the new TCP connection by using the isc_async API
from the loopmgr instead of using the asychronous netievent mechanism in
the netmgr.
Simplify the canceling of the UDP socket by using the isc_async API
from the loopmgr instead of using the asychronous netievent mechanism in
the netmgr.
Simplify the stopping of the TCP children by using the isc_async API
from the loopmgr instead of using the asychronous netievent mechanism in
the netmgr.
Simplify the starting of the TCP children by using the isc_async API
from the loopmgr instead of using the asychronous netievent mechanism in
the netmgr.
Simplify the stopping of the UDP children by using the isc_async API
from the loopmgr instead of using the asychronous netievent mechanism in
the netmgr.
Simplify the starting of the UDP children by using the isc_async API
from the loopmgr instead of using the asychronous netievent mechanism in
the netmgr.
The active handles accounting was both using atomic counter and ISC_LIST
to keep track of active handles. Remove the atomic counter that was in
use before the ISC_LIST was added for better tracking of the handles
attached to the socket.
Instead of calling isc__nmhandle_detach calling
nmhandle_detach_cb() asynchronously when there's closehandle_cb
initialized, convert the closehandle_cb to use isc_job, and make the
isc__nmhandle_detach() to be fully synchronous.
The netmgr connect, read and send callbacks can now only be executed on
the same loop, convert it from asynchronous netievent queue event to
more direct isc_job.
BIND needs a collection of standard lock-free data structures,
which we can find in liburcu, along with its RCU safe memory
reclamation machinery. We will use liburcu's QSBR variant instead
of the home-grown isc_qsbr.
ISC_REFCOUNT_TRACE_IMPL uses isc_tid(), but the corresponding header
file is not included, which breaks, for example, compiling BIND with
DNS_CATZ_TRACE defined in lib/dns/include/dns/catz.h.
Add '#include <isc/tid.h>' in lib/isc/include/isc/refcount.h.
In base32_decode_char the GCC 12 static analyser fails to determine
that ctx->val[1], ctx->val[3], ctx->val[4] and ctx->val[6] are
assigned values by the previous call to base32_decode_char. Initialise
ctx->val to zeros when initalising the rest of ctx to silence the
false positive.
This "quiescent state based reclamation" module provides support for
the qp-trie module in dns/qp. It is a replacement for liburcu, written
without reference to the urcu source code, and in fact it works in a
significantly different way.
A few specifics of BIND make this variant of QSBR somewhat simpler:
* We can require that wait-free access to a qp-trie only happens in
an isc_loop callback. The loop provides a natural quiescent state,
after the callbacks are done, when no qp-trie access occurs.
* We can dispense with any API like rcu_synchronize(). In practice,
it takes far too long to wait for a grace period to elapse for each
write to a data structure.
* We use the idea of "phases" (aka epochs or eras) from EBR to
reduce the amount of bookkeeping needed to track memory that is no
longer needed, knowing that the qp-trie does most of that work
already.
I considered hazard pointers for safe memory reclamation. They have
more read-side overhead (updating the hazard pointers) and it wasn't
clear to me how to nicely schedule the cleanup work. Another
alternative, epoch-based reclamation, is designed for fine-grained
lock-free updates, so it needs some rethinking to work well with the
heavily read-biased design of the qp-trie. QSBR has the fastest read
side of the basic SMR algorithms (with no barriers), and fits well
into a libuv loop. More recent hybrid SMR algorithms do not appear to
have enough benefits to justify the extra complexity.
the isc_glob module was originally needed to support posix-style glob
processing on Windows, but is now just an unnecessary wrapper around
glob(3). this commit removes it.
Previously, the async job queue would use a locked-list (ISC_LIST).
With introduction of atomic stack (that has to be drained at once), we
could use it to remove some contention between the threads and simplify
the async queue.
Fortunately, the reverse order still works for us - instead of append
and tail/prev operation on the list, we are now using prepend and
head/next operation on the atomic stack.
Add a singly-linked stack that supports lock-free prepend and drain (to
empty the list and clean up its elements). Intended for use with QSBR
to collect objects that need safe memory reclamation, or any other user
that works with adding objects to the stack and then draining them in
one go like various work queues.
In <isc/atomic.h>, add an `atomic_ptr()` macro to make type
declarations a little less abominable, and clean up a duplicate
definition of `atomic_compare_exchange_strong_acq_rel()`
removed references in code comments, doc/dev documentation, etc, to
isc_task, isc_timer_reset(), and isc_timertype_inactive. also removed a
coccinelle patch related to isc_timer_reset() that was no longer needed.
as there is no further use of isc_task in BIND, this commit removes
it, along with isc_taskmgr, isc_event, and all other related types.
functions that accepted taskmgr as a parameter have been cleaned up.
as a result of this change, some functions can no longer fail, so
they've been changed to type void, and their callers have been
updated accordingly.
the tasks table has been removed from the statistics channel and
the stats version has been updated. dns_dyndbctx has been changed
to reference the loopmgr instead of taskmgr, and DNS_DYNDB_VERSION
has been udpated as well.
change functions using isc_taskmgr_beginexclusive() to use
isc_loopmgr_pause() instead.
also, removed an unnecessary use of exclusive mode in
named_server_tcptimeouts().
most functions that were implemented as task events because they needed
to be running in a task to use exclusive mode have now been changed
into loop callbacks instead. (the exception is catz, which is being
changed in a separate commit because it's a particularly complex change.)
This changes the internal isc_rwlock implementation to:
Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra
J. Marathe, and Nir Shavit. 2013. NUMA-aware reader-writer locks.
SIGPLAN Not. 48, 8 (August 2013), 157–166.
DOI:https://doi.org/10.1145/2517327.24425
(The full article available from:
http://mcg.cs.tau.ac.il/papers/ppopp2013-rwlocks.pdf)
The implementation is based on the The Writer-Preference Lock (C-RW-WP)
variant (see the 3.4 section of the paper for the rationale).
The implemented algorithm has been modified for simplicity and for usage
patterns in rbtdb.c.
The changes compared to the original algorithm:
* We haven't implemented the cohort locks because that would require a
knowledge of NUMA nodes, instead a simple atomic_bool is used as
synchronization point for writer lock.
* The per-thread reader counters are not being used - this would
require the internal thread id (isc_tid_v) to be always initialized,
even in the utilities; the change has a slight performance penalty,
so we might revisit this change in the future. However, this change
also saves a lot of memory, because cache-line aligned counters were
used, so on 32-core machine, the rwlock would be 4096+ bytes big.
* The readers use a writer_barrier that will raise after a while when
readers lock can't be acquired to prevent readers starvation.
* Separate ingress and egress readers counters queues to reduce both
inter and intra-thread contention.
Unfortunately, C still lacks a standard function for pause (x86,
sparc) or yeild (arm) instructions, for use in spin lock or CAS loops.
BIND has its own based on vendor intrinsics or inline asm.
Previously, it was buried in the `isc_rwlock` implementation. This
commit renames `isc_rwlock_pause()` to `isc_pause()` and moves
it into <isc/pause.h>.
This commit also fixes the configure script so that it detects ARM
yield support on systems that identify as `aarch*` instead of `arm*`.
On 64-bit ARM systems we now use the ISB (instruction synchronization
barrier) instruction in preference to yield. The ISB instruction
pauses the CPU for longer, several nanoseconds, which is more like the
x86 pause instruction. There are more details in a Rust pull request,
which also refers to MySQL making the same change:
https://github.com/rust-lang/rust/pull/84725
removed some functions that are no longer used and unlikely to
be resurrected, and also some that were only used to support Windows
and can now be replaced with generic versions.
isc_bind9 was a global bool used to indicate whether the library
was being used internally by BIND or by an external caller. external
use is no longer supported, but the variable was retained for use
by dyndb, which needed it only when being built without libtool.
building without libtool is *also* no longer supported, so the variable
can go away.
libuv support for receiving multiple UDP messages in a single system
call (recvmmsg()) has been tweaked several times between libuv versions
1.35.0 and 1.40.0. Mixing and matching libuv versions within that span
may lead to assertion failures and is therefore considered harmful, so
try to limit potential damage be preventing users from mixing libuv
versions with distinct sets of recvmmsg()-related flags.
The implementation of UDP recvmmsg in libuv 1.35 and 1.36 is
incomplete and could cause assertion failure under certain
circumstances.
Modify the configure and runtime checks to report a fatal error when
trying to compile or run with the affected versions.
ISC_MEM_ZERO requires great care to use when the space returned by
the allocator is larger than the requested space, and when memory is
reallocated. You must ensure that _every_ call to allocate or
reallocate a particular block of memory uses ISC_MEM_ZERO, to ensure
that the extra space is zeroed as expected. (When ISC_MEMFLAG_FILL
is set, the extra space will definitely be non-zero.)
When BIND is built without jemalloc, ISC_MEM_ZERO is implemented in
`jemalloc_shim.h`. This had a bug on systems that have malloc_size()
or malloc_usable_size(): memory was only zeroed up to the requested
size, not the allocated size. When an oversized allocation was
returned, and subsequently reallocated larger, memory between the
original requested size and the original allocated size could
contain unexpected nonzero junk. The realloc call does not know the
original requested size and only zeroes from the original allocated
size onwards.
After this change, `jemalloc_shim.h` always zeroes up to the
allocated size, not the requested size.
the rate limter now uses loop callbacks rather than task events.
the API for isc_ratelimiter_enqueue() has been changed; we now pass
in a loop, a callback function and a callback argument, and
receive back a rate limiter event object (isc_rlevent_t). it
is no longer necessary for the caller to allocate the event.
the callback argument needs to include a pointer to the rlevent
object so that it can be freed using isc_rlevent_free(), or by
dequeueing.
This restores the Malloced memory counter and it's now always equal to
InUse counter. This is only for backwards compatibility reason and
there is no separate counter.
The commit also cleanups little things like structure with a single
item (summary.inuse), and shuts up a wrong cppcheck warning (the
notorious NULL check after assignment).
Instead of enforcing stronger synchronization between threads, make all
the atomic operations relaxed. We are not really interested in exact
numbers at all times - the single place where we need the exact number
is when the memory context is being destroyed. Even when there's a
overmem counter, we don't care about exact ordering or exact number.
The Lost memory counter would count the memory "lost" by external
libraries. There's really no such thing as `named` require the memory
contexts to be clean on destroy.
The stats buckets were again more useful for internal allocator, because
we would see the individual "block" caches where the allocations would
fall into. Remove the stats buckets, and if needed, we can pull more
detailed statistics out of the jemalloc.
The total memory counter had again little or no meaning when we removed
the internal memory allocator. It was just a monotonic counter that
would count add the allocation sizes but never subtracted anything, so
it would be just a "big number".
The maxinuse memory counter indicated the highest amount of
memory allocated in the past. Checking and updating this high-
water mark value every time memory was allocated had an impact
on server performance, so it has been removed. Memory size can
be monitored more efficiently via an external tool logging RSS.
The malloced and maxmalloced memory counters were mostly useless since
we removed the internal allocator blocks - it would only differ from
inuse by the memory context size itself.
Use isc_job_run() instead of isc_task_send() for dnssec-signzone
worker threads.
Also fix the issue where the additional assignwork() would be run only
from the main thread effectively serializing all the signing.
The old code didn't handle race conditions and errors on systems
with non load balancing sockets gracefully. Look for an error on
any child socket and if found close all the child sockets and return
an error.
The old code didn't handle race conditions and errors on systems
with non load balancing sockets gracefully. Look for an error on
any child socket and if found close all the child sockets and return
an error.
This commit ensures that BIND and supplementary tools still can be
built on newer versions of DragonFly BSD. It used to be the case, but
somewhere between versions 6.2 and 6.4 the OS developers rearranged
headers and moved some function definitions around.
Before that the fact that it worked was more like a coincidence, this
time we, at least, looked at the related man pages included with the
OS.
No in depth testing has been done on this OS as we do not really
support this platform - so it is more like a goodwill act. We can,
however, use this platform for testing purposes, too. Also, we know
that the OS users do use BIND, as it is included in its ports
directory.
Building with './configure' and './configure --without-jemalloc' have
been fixed and are known to work at the time the commit is made.
Return 'isc_result_t' type value instead of 'bool' to indicate
the actual failure. Rename the function to something not suggesting
a boolean type result. Make changes in the places where the API
function is being used to check for the result code instead of
a boolean value.
Cherry-pick small fixup commit from 9.18/9.16 branches needed for
thread-safety. This fixup commit is not needed for 9.19+ because of
reworked application setup, but it decouples isc_iterated_hash and
isc_md units and keeps all the branches in sync.
As this code is on hot path (NSEC3) this introduces an additional
optimization of the EVP_MD API - instead of calling EVP_MD_CTX_new() on
every call to isc_iterated_hash(), we create two thread_local objects
for each thread - a basectx and mdctx, initialize basectx once and then
use EVP_MD_CTX_copy_ex() to flip the initialized state into mdctx. This
saves us couple more valuable microseconds from the isc_iterated_hash()
call.
If the OpenSSL SHA1_{Init,Update,Final} API is still available, use it.
The API has been deprecated in OpenSSL 3.0, but it is significantly
faster than EVP_MD API, so make an exception here and keep using it
until we can't.
Instead of going through another layer, use OpenSSL EVP_MD API directly
in the isc_iterated_hash() implementation. This shaves off couple of
microseconds in the microbenchmark.
The implicit algorithm fetch causes a lock contention and significant
slowdown for small input buffers. For more details, see:
https://github.com/openssl/openssl/issues/19612
Instead of using EVP_DigestInit_ex() initialize empty MD_CTX objects for
each algorithm and use EVP_MD_CTX_copy_ex() to initialize MD_CTX from a
static copy. Additionally avoid implicit algorithm fetching by using
EVP_MD_fetch() for OpenSSL 3.0.
Prefer the pthread_barrier implementation on platforms where it is
available over uv_barrier implementation. This also solves the problem
with thread sanitizer builds on macOS that doesn't have pthread barrier.
We already have a synchronization mechanism when starting the UDP and
TCP listener children - barriers. Change how we start the first-born
child (tid == 0), so we don't have to race for sock->parent->result and
sock->parent->fd.
Change the per-socket inactive uvreq cache (implemented as isc_astack)
to per-worker memory pool.
Change the per-socket inactive nmhandle cache (implemented as
isc_astack) to unlocked per-socket ISC_LIST.
Always track the per-worker sockets in the .active_sockets field in the
isc__networker_t struct and always track the per-socket handles in the
.active_handles field ian the isc_nmsocket_t struct.
DSCP has not been fully working since the network manager was
introduced in 9.16, and has been completely broken since 9.18.
This seems to have caused very few difficulties for anyone,
so we have now marked it as obsolete and removed the
implementation.
To ensure that old config files don't fail, the code to parse
dscp key-value pairs is still present, but a warning is logged
that the feature is obsolete and should not be used. Nothing is
done with configured values, and there is no longer any
range checking.
On some platforms, when a synchronizing barrier is cleared, one
thread can progress while other threads are still in the process
of releasing the barrier. If a barrier is reused by the progressing
thread during this window, it can cause a deadlock. This can occur if,
for example, we stop listening immediately after we start, because the
stop and listen functions both use socket->barrier. This has been
addressed by using separate barrier objects for stop and listen.
This commit replaces ad-hoc code for send requests buffer management
within TLS with the one based on isc_buffer_t.
Previous version of the code was trying to use pre-allocated small
buffers to avoid extra allocations. The code would allocate a larger
dynamic buffer when needed. There is no need to have ad-hoc code for
this, as isc_buffer_t now provides this functionality internally.
Additionally to the above, the old version of the code lacked any
logic to reuse the dynamically allocated buffers. Now, as we do not
manage memory buffers, but isc_buffer_t objects, we can implement this
strategy. It can be in particular helpful for longer lasting
connections, as in this case the buffer will adjust itself to the size
of the messages being transferred. That is, it is in particular useful
for XoT, as Stream DNS happen to order send requests in such a way
that the send request will get reused.
Additionally to renaming, it changes the function definition so that
it accepts a pointer to pointer instead of returning a pointer to the
new object.
It is mostly done to make it in line with other functions in the
module.
Additionally to renaming, it changes the function definition so that
it accepts a pointer to pointer instead of returning a pointer to the
new object.
It is mostly done to make it in line with other functions in the
module.
This commit modifies the Stream DNS message so that it uses the
optimised code path (isc__nm_senddns()) for sending DNS messages over
the underlying transport. This way we avoid allocating any
intermediate memory buffers needed to render a DNS message with its
length pre-pended ahead of the contents (TCP DNS message format).
The new internal function works in the same way as isc_nm_send()
except that it sends a DNS message size ahead of the DNS message
data (the format used in DNS over TCP).
The intention is to provide a fast path for sending DNS messages over
streams protocols - that is, without allocating any intermediate
memory buffers.
This commit optimises TLS send request object allocation to enable
send request object reuse, somewhat reducing pressure on the memory
manager. It is especially helpful in the case when Stream DNS uses the
TLS implementation as the transport.
This commit unties generic TLS code (isc_nm_tlssocket) from DoH, so
that it will be available regardless of the fact if BIND was built
with DNS over HTTP support or not.
This commit ensures that Stream DNS code attempts to disable Nagle's
algorithm regardless of underlying stream transport (TCP or TLS), as
we are not interested in trading latency for throughout when dealing
with DNS messages.
This commit ensures that Nagle's algorithm is disabled by default for
TLS connections on best effort basis, just like other networking
software (e.g. NGINX) does, as, in the case of TLS, we are not
interested in trading latency for throughput, rather vice versa.
We attempt to disable it as early as we can, right after TCP
connections establishment, as an attempt to speed up handshake
handling.
This commit adds ability to turn the Nagle's algorithm on or off via
connections handle. It adds the isc_nmhandle_set_tcp_nodelay()
function as the public interface for this functionality.
This commit adds an initial implementation of isc_nm_streamdnssocket
transport: a unified transport for DNS over stream protocols messages,
which is capable of replacing both TCP DNS and TLS DNS
transports. Currently, the interface it provides is a unified set of
interfaces provided by both of the transports it attempts to replace.
The transport is built around "isc_dnsbuffer_t" and
"isc_dnsstream_assembler_t" objects and attempts to minimise both the
number of memory allocations during network transfers as well as
memory usage.
This commit adds the implementation for an "isc_dnsstream_assembler_t"
object. The object is built on top of "isc_dnsbuffer_t" and is
intended to encapsulate the state machine used for handling DNS
messages received in the format used for messages transmitted over
TCP.
The idea is that the object accepts the input data received from a
socket, tries to assemble DNS messages from the incoming data and
calls the callback which contains the status of the incoming data as
well as a pointer to the memory region referencing the data of the
assembled message. It is capable of assembling DNS messages no matter
how torn apart they are when sent over network.
The following statuses might be passed to the callback:
* ISC_R_SUCCESS - a message has been successfully assembled;
* ISC_R_NOMORE - not enough data has been processed to assemble a
message;
* ISC_R_RANGE - there was an attempt to process a zero-sized DNS
message (someone attempts to send us junk data).
One could say that the object replaces the implementation of
"isc__nm_*_processbuffer()" functions used by the old TCP DNS and TLS
DNS transports with a better defined state machine completely
decoupled from the networking code itself.
Such a design makes it trivial to write unit tests for it, leading to
better verification of its correctness.
Another important difference is directly related to the fact that it
is built on top of "isc_dnsbuffer_t", which tries to manage memory in
a smart way. In particular:
* It tries to use a static buffer for smaller messages, reducing
pressure on the memory manager (hot path);
* When allocating dynamic memory for larger messages, it tries to
allocate memory conservatively (generic path).
These characteristics is a significant upgrade over the older logic
where a 64KB(+2 bytes) buffer was allocated from dynamic memory
regardless of the fact if we need a buffer this large or not. That is,
lesser memory usage is expected in a generic case for DNS transports
built on top of "isc_dnsstream_assembler_t."
This commit adds "isc_dnsbuffer_t" object implementation, a thin
wrapper on top of "isc_buffer_t" which has the following
characteristics:
* provides interface specifically atuned for handling/generating DNS
messages, especially in the format used for DNS messages over TCP;
* avoids allocating dynamic memory when handling small DNS messages,
while transparently switching to using dynamic memory when handling
larger messages. This approach significantly reduces pressure on the
memory allocator, as most of the DNS messages are small.
The added function provides the interface for getting an ALPN tag
negotiated during TLS connection establishment.
The new function can be used by higher level transports.
This commit adds manual read timer control mode, similarly to TCP.
This way the read timer can be controlled manually using:
* isc__nmsocket_timer_start();
* isc__nmsocket_timer_stop();
* isc__nmsocket_timer_restart().
The change is required to make it possible to implement more
sophisticated read timer control policies in DNS transports, built on
top of TLS.
This commit adds a manual read timer control mode to the TCP
code (adding isc__nmhandle_set_manual_timer() as the interface to it).
Manual read timer control mode suppresses read timer restarting the
read timer when receiving any amount of data. This way the read timer
can be controlled manually using:
* isc__nmsocket_timer_start();
* isc__nmsocket_timer_stop();
* isc__nmsocket_timer_restart().
The change is required to make it possible to implement more
sophisticated read timer control policies in DNS transports, built on
top of TCP.
This commit adds implementation of isc__nmsocket_timer_restart() and
isc__nmsocket_timer_stop() for generic TLS code in order to make its
interface more compatible with that of TCP.
This commit adds implementations of isc_nm_bad_request() and
isc__nmsocket_reset() to the generic TLS stream code in order to make
it more compatible with TCP code.
The purpose of this commit is to aid compiler in generating better
code when working with `isc_buffer_t` objects by using restricted
pointers (and, to a lesser extent, 'const' modifier for read-only
arguments).
This way we, basically, instruct the compiler that the members of
structured passed by pointers into the functions can be treated as
local variables in the scope of a function. That should reduce the
number of load/store operations emitted by compilers when accessing
objects (e.g. 'isc_buffer_t') via pointers.
Add two extra functions needed by StreamDNS:
1. isc_buffer_setmctx() sets the buffer internal memory context, so we
can use isc_buffer_reserve() on the buffer. For this, we also need
to track whether the .base was dynamically allocated or not. This
needs to be called after isc_buffer_init() and before first
isc_buffer_reserve() call.
2. isc_buffer_clearmctx() clears the buffer internal memory context, and
frees any dynamically allocated buffer. This needs to be called
after the last isc_buffer_reserve() call and before calling the
isc_buffer_invalidate()
When the buffer is allocated via isc_buffer_allocate() and the size is
smaller or equal ISC_BUFFER_STATIC_SIZE (currently 512 bytes), the
buffer will be allocated as a flexible array member in the buffer
structure itself instead of allocating it on the heap. This should help
when the buffer is used on the hot-path with small allocations.
When isc_buffer_t buffer is created with isc_buffer_allocate() assume
that we want it to always auto-reallocate instead of having an extra
call to enable auto-reallocation.
The isc_buffer_putdecint() could be easily replaced with
isc_buffer_printf() with just a small overhead of calling vsnprintf()
twice instead once. This is not on a hot-path (dns_catz unit), so we
can ignore the overhead and instead have less single-use code in favor
of using reusable more generic function.
The Stream DNS implementation needs a peek methods that read the value
from the buffer, but it doesn't advance the current position. Add
isc_buffer_peekuintX methods, refactor the isc_buffer_{get,put}uintN
methods to modern integer types, and move the isc_buffer_getuintN to the
header as static inline functions.
Move the U8TO{32,64}_LE and U{32,64}TO8_LE macros to endian.h and extend
the macros for 16-bit and Big-Endian variants.
Use the macros both in isc_siphash (LE) and isc_buffer (BE) units.
The isc_buffer_reserve() would be passed a reference to the buffer
pointer, which was unnecessary as the pointer would never be changed
in the current implementation. Remove the extra dereference.
Add internal logging functions isc__netmgr_log, isc__nmsocket_log(), and
isc__nmhandle_log() that can be used to add logging messages to the
netmgr, and change all direct use of isc_log_write() to use those
logging functions to properly prefix them with netmgr, nmsocket and
nmsocket+nmhandle.
In case, we are trying to hash the empty key into the hashmap, the key
is going to have zero length. This might happen in the unit test.
Allow this and add a unit test to ensure the empty zero-length key
doesn't hash to slot 0 as SipHash 2-4 (our hash function of choice) has
no problem with zero-length inputs.
This commit fixes TLS session resumption via session IDs when
client certificates are used. To do so it makes sure that session ID
contexts are set within server TLS contexts. See OpenSSL documentation
for 'SSL_CTX_set_session_id_context()', the "Warnings" section.
The only function left in the isc_resource API was setting the file
limit. Replace the whole unit with a simple getrlimit to check the
maximum value of RLIMIT_NOFILE and set the maximum back to rlimit_cur.
This is more compatible than trying to set RLIMIT_UNLIMITED on the
RLIMIT_NOFILE as it doesn't work on Linux (see man 5 proc on
/proc/sys/fs/nr_open), neither it does on Darwin kernel (see man 2
getrlimit).
The only place where the maximum value could be raised under privileged
user would be BSDs, but the `named_os_adjustnofile()` were not called
there before. We would apply the increased limits only on Linux and Sun
platforms.
This commit adds a check if 'sock->recv_cb' might have been nullified
during the call to 'sock->recv_cb'. That could happen, e.g. by an
indirect call to 'isc_nmhandle_close()' from within the callback when
wrapping up.
In this case, let's close the TLS connection.
This commit ensures that the non-atomic flags inside a DoH listener
socket object (and associated worker) are accessed when doing accept
for a connection only from within the context of the dedicated thread,
but not other worker threads.
The purpose of this commit is to avoid TSAN errors during
isc__nmsocket_closing() calls. It is a continuation of
4b5559cd8f.
This commit ensures that the non-atomic flags inside a TLS listener
socket object (and associated worker) are accessed when doing
handshake for a connection only from within the context of the
dedicated thread, but not other worker threads.
The purpose of this commit is to avoid TSAN errors during
isc__nmsocket_closing() calls. It is a continuation of
4b5559cd8f.
This commit ensures that the flags inside a TLS listener socket
object (and associated worker) are accessed when accepting a
connection only from within the context of the dedicated thread, but
not other worker threads.
The TLSDNS transport was not honouring the single read callback for
TLSDNS client. It would call the read callbacks repeatedly in case the
single TLS read would result in multiple DNS messages in the decoded
buffer.
This commit ensures that send callbacks are always called from within
the context of its worker thread even in the case of
shuttigdown/inactive socket, just like TCP transport does and with
which TLS attempts to be as compatible as possible.
This commit changes ISC_R_NOTCONNECTED error code to ISC_R_CANCELLED
when attempting to start reading data on the shutting down socket in
order to make its behaviour compatible with that of TCP and not break
the common code in the unit tests.
It turned out that after the latest Network Manager refactoring
'sock->reading' flag was not processed correctly. Due to this
isc_nm_read_stop() might not work as expected because reading from the
underlying TCP socket could have been resume in 'tls_do_bio()'
regardless of the 'sock->reading' value.
This bug did not seem to cause problems with DoH, so it was not
noticed, but Stream DNS has more strict expectations regarding the
underlying transport.
Additionally to the above, the 'sock->recv_read' flag was completely
ignored and corresponding logic was completely unimplemented. That did
not allow to implement one fine detail compared to TCP: once reading
is started, it could be satisfied by one datum reading.
This commit fixes the issues above.
The dns_adb unit has been refactored to be much simpler. Following
changes have been made:
1. Simplify the ADB to always allow GLUE and hints
There were only two places where dns_adb_createfind() was used - in
the dns_resolver unit where hints and GLUE addresses were ok, and in
the dns_zone where dns_adb_createfind() would be called without
DNS_ADBFIND_HINTOK and DNS_ADBFIND_GLUEOK set.
Simplify the logic by allowing hint and GLUE addresses when looking
up the nameserver addresses to notify. The difference is negligible
and would cause a difference in the notified addresses only when
there's mismatch between the parent and child addresses and we
haven't cached the child addresses yet.
2. Drop the namebuckets and entrybuckets
Formerly, the namebuckets and entrybuckets were used to reduced the
lock contention when accessing the double-linked lists stored in each
bucket. In the previous refactoring, the custom hashtable for the
buckets has been replaced with isc_ht/isc_hashmap, so only a single
item (mostly, see below) would end up in each bucket.
Removing the entrybuckets has been straightforward, the only matching
was done on the isc_sockaddr_t member of the dns_adbentry.
Removing the zonebuckets required GLUEOK and HINTOK bits to be
removed because the find could match entries with-or-without the bits
set, and creating a custom key that stores the
DNS_ADBFIND_STARTATZONE in the first byte of the key, so we can do a
straightforward lookup into the hashtable without traversing a list
that contains items with different flags.
3. Remove unassociated entries from ADB database
Previously, the adbentries could live in the ADB database even after
unlinking them from dns_adbnames. Such entries would show up as
"Unassociated entries" in the ADB dump. The benefit of keeping such
entries is little - the chance that we link such entry to a adbname
is small, and it's simpler to evict unlinked entries from the ADB
cache (and the hashtable) than create second LRU cleaning mechanism.
Unlinked ADB entries are now directly deleted from the hash
table (hashmap) upon destruction.
4. Cleanup expired entries from the hash table
When buckets were still in place, the code would keep the buckets
always allocated and never shrink the hash table (hashmap). With
proper reference counting in place, we can delete the adbnames from
the hash table and the LRU list.
5. Stop purging the names early when we hit the time limit
Because the LRU list is now time ordered, we can stop purging the
names when we find a first entry that doesn't fullfil our time-based
eviction criteria because no further entry on the LRU list will meet
the criteria.
Future work:
1. Lock contention
In this commit, the focus was on correctness of the data structure,
but in the future, the lock contention in the ADB database needs to
be addressed. Currently, we use simple mutex to lock the hash
tables, because we almost always need to use a write lock for
properly purging the hashtables. The ADB database needs to be
sharded (similar to the effect that buckets had in the past). Each
shard would contain own hashmap and own LRU list.
2. Time-based purging
The ADB names and entries stay intact when there are no lookups.
When we add separate shards, a timer needs to be added for time-based
cleaning in case there's no traffic hashing to the inactive shard.
3. Revisit the 30 minutes limit
The ADB cache is capped at 30 minutes. This needs to be revisited,
and at least the limit should be configurable (in both directions).
The new ISC_REFCOUNT_TRACE_{IMPL,DECL} macros can be used to add a
reference tracing capability to any unit using the reference counting.
It requires a little bit of extra work in each header as you can't have
a define from inside a define (see rpz.h), but it's fairly easy to add
tracing to any struct using reference counting with these macros.
This commit make TCP code use uv_try_write() on best effort basis,
just like TCP DNS and TLS DNS code does.
This optimisation was added in
'caa5b6548a11da6ca772d6f7e10db3a164a18f8d' but, similar change was
mistakenly omitted for generic TCP code. This commit fixes that.
Don't restart reading in the send callback after the httpdmgr has been
shut down, and call httpd_request(..., ISC_R_SHUTDOWN, ...) when
shutting down the httpdmgr to reduce code duplication.
Previously, the send callback would be synchronous only on success. Add
an option (similar to what other callbacks have) to decide whether we
need the asynchronous send callback on a higher level.
On a general level, we need the asynchronous callbacks to happen only
when we are invoking the callback from the public API. If the path to
the callback went through the libuv callback or netmgr callback, we are
already on asynchronous path, and there's no need to make the call to
the callback asynchronous again.
For the send callback, this means we need the asynchronous path for
failure paths inside the isc_nm_send() (which calls isc__nm_udp_send(),
isc__nm_tcp_send(), etc...) - all other invocations of the send callback
could be synchronous, because those are called from the respective libuv
send callbacks.
Previously, the read callback would be synchronous only on success or
timeout. Add an option (similar to what other callbacks have) to decide
whether we need the asynchronous read callback on a higher level.
On a general level, we need the asynchronous callbacks to happen only
when we are invoking the callback from the public API. If the path to
the callback went through the libuv callback or netmgr callback, we are
already on asynchronous path, and there's no need to make the call to
the callback asynchronous again.
For the read callback, this means we need the asynchronous path for
failure paths inside the isc_nm_read() (which calls isc__nm_udp_read(),
isc__nm_tcp_read(), etc...) - all other invocations of the read callback
could be synchronous, because those are called from the respective libuv
or netmgr read callbacks.
The various factors like NS_PER_MS are now defined in a single place
and the names are no longer inconsistent. I chose the _PER_SEC names
rather than _PER_S because it is slightly more clear in isolation;
but the smaller units are always NS, US, and MS.
Firefox 90+ apparently sends more than 10 headers, so we need to bump
the number to some higher number. Bump it to 100 just to be on a save
side, this is for internal use only anyway.
Add new isc_hashmap API that differs from the current isc_ht API in
several aspects:
1. It implements Robin Hood Hashing which is open-addressing hash table
algorithm (e.g. no linked-lists)
2. No memory allocations - the array to store the nodes is made of
isc_hashmap_node_t structures instead of just pointers, so there's
only allocation on resize.
3. The key is not copied into the hashmap node and must be also stored
externally, either as part of the stored value or in any other
location that's valid as long the value is stored in the hashmap.
This makes the isc_hashmap_t a little less universal because of the key
storage requirements, but the inserts and deletes are faster because
they don't require memory allocation on isc_hashmap_add() and memory
deallocation on isc_hashmap_delete().
While using mutrace, the phtread-rwlock based isc_rwlock implementation
would be all tracked in the rwlock.c unit losing all useful information
as all rwlocks would be traced in a single place. Rewrite the
pthread_rwlock based implementation to be header-only macros, so we can
use mutrace to properly track the rwlock contention without heavily
patching mutrace to understand the libisc synchronization primitives.
Instead of checking the PTHREAD_RUNTIME_CHECK from the header, move it
to the pthread_rwlock implementation functions. The internal isc_rwlock
actually cannot fail, so the checks in the header was useless anyway.
when more than one event was scheduled in the isc_aysnc queue,
they were executed in reverse order. we need to pull events
off the back of queue instead the front, so that uv_loop will
run them in the right order.
note that isc_job_run() has the same behavior, because it calls
uv_idle_start() directly. in that case we just document it so
it'll be less surprising in the future.
dns_rdata_checksvcb performs data entry checks on SVCB records.
In particular that _dns SVBC record have an 'alpn' and if that 'alpn'
parameter indicates HTTP is in use that 'dophath' is present.
The netievent handler for isc_nmsocket_set_tlsctx() was inadvertently
ifdef'd out when BIND was built with --disable-doh, resulting in an
assertion failure on startup when DoT was configured.
The statschannel truncated test still terminates abruptly sometimes and
it doesn't return the answer for the first query. This might happen
when the second process_request() discovers there's not enough space
before the sending is complete and the connection is terminated before
the client gets the data.
Change the isc_http, so it pauses the reading when it receives the data
and resumes it only after the sending has completed or there's
incomplete request waiting for more data.
This makes the request processing slightly less efficient, but also less
taxing for the server, because previously all requests that has been
received via single TCP read would be processed in the loop and the
sends would be queued after the read callback has processed a full
buffer.
In non-developer build, a wrong condition prevented the
isc__tls_malloc_ex, isc__tls_realloc_ex and isc__tls_free_ex to be
defined. This was causing FTBFS on platforms with OpenSSL 1.0.2.
It was possible that accept callback can be called after listener
shutdown. In such a case the callback pointer equals NULL, leading to
segmentation fault. This commit fixes that.
Since we are using designated initializers, we were missing initializers
for ISC_LIST and ISC_LINK, add them, so you can do
*foo = (foo_t){ .list = ISC_LIST_INITIALIZER };
Instead of:
*foo = (foo_t){ 0 };
ISC_LIST_INIT(foo->list);
This commit introduces a primitive isc__nmsocket_stop() which performs
shutting down on a multilayered socket ensuring the proper order of
the operations.
The shared data within the socket object can be destroyed after the
call completed, as it is guaranteed to not be used from within the
context of other worker threads.
I.e. print the name of the function in BIND that called the system
function that returned an error. Since it was useful for pthreads
code, it seems worthwhile doing so everywhere.
Mostly generated automatically with the following semantic patch,
except where coccinelle was confused by #ifdef in lib/isc/net.c
@@ expression list args; @@
- UNEXPECTED_ERROR(__FILE__, __LINE__, args)
+ UNEXPECTED_ERROR(args)
@@ expression list args; @@
- FATAL_ERROR(__FILE__, __LINE__, args)
+ FATAL_ERROR(args)
During loop manager refactoring isc_nmsocket_set_tlsctx() was not
properly adapted. The function is expected to broadcast the new TLS
context for every worker, but this behaviour was accidentally broken.
Replace all uses of RUNTIME_CHECK() in lib/isc/include/isc/once.h with
PTHEADS_RUNTIME_CHECK(), in order to improve error reporting for any
once-related run-time failures (by augmenting error messages with
file/line/caller information and the error string corresponding to
errno).
Rewrite the isc_httpd to be more robust.
1. Replace the hand-crafted HTTP request parser with picohttpparser for
parsing the whole HTTP/1.0 and HTTP/1.1 requests. Limit the number
of allowed headers to 10 (arbitrary number).
2. Replace the hand-crafted URL parser with isc_url_parse for parsing
the URL from the HTTP request.
3. Increase the receive buffer to match the isc_netmgr buffers, so we
can at least receive two full isc_nm_read()s. This makes the
truncation processing much simpler.
4. Process the received buffer from single isc_nm_read() in a single
loop and schedule the sends to be independent of each other.
The first two changes makes the code simpler and rely on already
existing libraries that we already had (isc_url based on nodejs) or are
used elsewhere (picohttpparser).
The second two changes remove the artificial "truncation" limit on
parsing multiple request. Now only a request that has too many
headers (currently 10) or is too big (so, the receive buffer fills up
without reaching end of the request) will end the connection.
We can be benevolent here with the limites, because the statschannel
channel is by definition private and access must be allowed only to
administrators of the server. There are no timers, no rate-limiting, no
upper limit on the number of requests that can be served, etc.
PicoHTTPParser is a tiny, primitive, fast HTTP request/response parser.
Unlike most parsers, it is stateless and does not allocate memory by
itself. All it does is accept pointer to buffer and the output
structure, and setups the pointers in the latter to point at the
necessary portions of the buffer.
The isc_nm_udpconnect() erroneously set the reuse port with
load-balancing on the outgoing connected UDP sockets. This socket
option makes only sense for the listening sockets. Don't set the
load-balancing reuse port option on the outgoing UDP sockets.
This commit fixes TLS DNS verification error message reporting which
we probably broke during one of the recent networking code
refactorings.
This prevent e.g. dig from producing useful error messages related to
TLS certificates verification.
Ensure that TLS error is empty before calling SSL_get_error() or doing
SSL I/O so that the result will not get affected by prior error
statuses.
In particular, the improper error handling led to intermittent unit
test failure and, thus, could be responsible for some of the system
test failures and other intermittent TLS-related issues.
See here for more details:
https://www.openssl.org/docs/man3.0/man3/SSL_get_error.html
In particular, it mentions the following:
> The current thread's error queue must be empty before the TLS/SSL
> I/O operation is attempted, or SSL_get_error() will not work
> reliably.
As we use the result of SSL_get_error() to decide on I/O operations,
we need to ensure that it works reliably by cleaning the error queue.
TLS DNS: empty error queue before attempting I/O
The check is left from when tcp_connect_direct() called isc__nm_socket()
and it was uncertain whether it had succeeded, but now isc__nm_socket()
is called before tcp_connect_direct(), so sock->fd cannot be -1.
*** CID 357292: (REVERSE_NEGATIVE)
/lib/isc/netmgr/tcp.c: 309 in isc_nm_tcpconnect()
303
304 atomic_store(&sock->active, true);
305
306 result = tcp_connect_direct(sock, req);
307 if (result != ISC_R_SUCCESS) {
308 atomic_store(&sock->active, false);
>>> CID 357292: (REVERSE_NEGATIVE)
>>> You might be using variable "sock->fd" before verifying that it is >= 0.
309 if (sock->fd != (uv_os_sock_t)(-1)) {
310 isc__nm_tcp_close(sock);
311 }
312 isc__nm_connectcb(sock, req, result, true);
313 }
314
The value of `sign_bit` is platform-dependent but constant at compile
time. Use a cast to convert the boolean `sign_bit` to 0 or 1 instead of
ternary `?:` because one branch of the conditional is dead code. (We
could leave out the cast to `size_t` but our style prefers to handle
booleans more explicitly, hence the `?:` that caused the issue.)
*** CID 358310: Possible Control flow issues (DEADCODE)
/lib/isc/resource.c: 118 in isc_resource_setlimit()
112 * rlim_t, and whether rlim_t has a sign bit.
113 */
114 isc_resourcevalue_t rlim_max = UINT64_MAX;
115 size_t wider = sizeof(rlim_max) - sizeof(rlim_t);
116 bool sign_bit = (double)(rlim_t)-1 < 0;
117
>>> CID 358310: Possible Control flow issues (DEADCODE)
>>> Execution cannot reach the expression "1" inside this statement: "rlim_max >>= 8UL * wider + ...".
118 rlim_max >>= CHAR_BIT * wider + (sign_bit ? 1 : 0);
119 rlim_value = ISC_MIN(value, rlim_max);
120 }
121
122 rl.rlim_cur = rl.rlim_max = rlim_value;
123 unixresult = setrlimit(unixresource, &rl);
In several places, the structures were cleaned with memset(...)) and
thus the semantic patch converted the isc_mem_get(...) to
isc_mem_getx(..., ISC_MEM_ZERO). Use the designated initializer to
initialized the structures instead of zeroing the memory with
ISC_MEM_ZERO flag as this better matches the intended purpose.
Add new semantic patch to replace the straightfoward uses of:
ptr = isc_mem_{get,allocate}(..., size);
memset(ptr, 0, size);
with the new API call:
ptr = isc_mem_{get,allocate}x(..., size, ISC_MEM_ZERO);
Previously, the isc_mem_get_aligned() and friends took alignment size as
one of the arguments. Replace the specific function with more generic
extended variant that now accepts ISC_MEM_ALIGN(alignment) for aligned
allocations and ISC_MEM_ZERO for allocations that zeroes
the (re-)allocated memory before returning the pointer to the caller.
Formerly, the isc_hash32() would have to change the key in a local copy
to make it case insensitive. Change the isc_siphash24() and
isc_halfsiphash24() functions to lowercase the input directly when
reading it from the memory and converting the uint8_t * array to
64-bit (respectively 32-bit numbers).
On systems with signed rlim_t the old code calculated its maximum
value by shifting 1 into the sign bit, which is undefined behaviour.
Avoid the bug by using an unsigned shift.
Because the dns_zonemgr_create() was run before the loopmgr was started,
the isc_ratelimiter API was more complicated that it had to be. Move
the dns_zonemgr_create() to run_server() task which is run on the main
loop, and simplify the isc_ratelimiter API implementation.
The isc_timer is now created in the isc_ratelimiter_create() and
starting the timer is now separate async task as is destroying the timer
in case it's not launched from the loop it was created on. The
ratelimiter tick now doesn't have to create and destroy timer logic and
just stops the timer when there's no more work to do.
This should also solve all the races that were causing the
isc_ratelimiter to be left dangling because the timer was stopped before
the last reference would be detached.
Unlike standard free(), isc_mem_free() is not a no-op when passed a
NULL pointer. For size accounting purposes it calls sallocx(), which
crashes when passed a NULL pointer. To get more helpful diagnostics,
REQUIRE() that the pointer is not NULL so that when the programmer
makes a mistake they get a backtrace that shows what went wrong.
The isc__nm_udp_send() callback would be called synchronously when
shutting down or when the socket has been closed. This could lead to
double locking in the calling code and thus those callbacks needs to be
called asynchronously.
There's a known memory leak in the engine_pkcs11 at the time of writing
this and it interferes with the named ability to check for memory leaks
in the OpenSSL memory context by default.
Add an autoconf option to explicitly enable the memory leak detection,
and use it in the CI except for pkcs11 enabled builds. When this gets
fixed in the engine_pkc11, the option can be enabled by default.
The libxml2 library provides a way to replace the default allocator with
user supplied allocator (malloc, realloc, strdup and free).
Create a memory context specifically for libxml2 to allow tracking the
memory usage that has originated from within libxml2. This will provide
a separate memory context for libxml2 to track the allocations and when
shutting down the application it will check that all libxml2 allocations
were returned to the allocator.
Additionally, move the xmlInitParser() and xmlCleanupParser() calls from
bin/named/main.c to library constructor/destructor in libisc library.
The OpenSSL library provides a way to replace the default allocator with
user supplied allocator (malloc, realloc, and free).
Create a memory context specifically for OpenSSL to allow tracking the
memory usage that has originated from within OpenSSL. This will provide
a separate memory context for OpenSSL to track the allocations and when
shutting down the application it will check that all OpenSSL allocations
were returned to the allocator.
The libuv library provides a way to replace the default allocator with
user supplied allocator (malloc, realloc, calloc and free).
Create a memory context specifically for libuv to allow tracking the
memory usage that has originated from within libuv. This requires
libuv >= 1.38.0 which provides uv_library_shutdown() function that
assures no more allocations will be made.
Instead of using generic HAVE_BUILTIN_OVERFLOW, we need to check whether
the overflow functions actually work as there was a bug in GCC that it
would not detect mul overflow when compiled with `-m32` option without
optimizations and the bug was fixed only for GCC 6.5+ and 7.3+/8+.
For further details see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82274
Previously, the isc_mem_debugging would be single global variable that
would affect the behavior of the memory context whenever it would be
changed which could be after some allocation were already done.
Change the memory debugging options to be local to the memory context
and immutable, so all allocations within the same memory context are
treated the same.
By bumping the minimum libuv version to 1.34.0, it allows us to remove
all libuv shims we ever had and makes the code much cleaner. The
up-to-date libuv is available in all distributions supported by BIND
9.19+ either natively or as a backport.
previously, when ISC_BUFFER_USEINLINE was defined, macros were
used to implement isc_buffer primitives (isc_buffer_init(),
isc_buffer_region(), etc). these macros were missing the DbC
assertions for those primitives, which made it possible for
coding errors to go undetected.
adding the assertions to the macros caused compiler warnings on
some platforms. therefore, this commit converts the ISC__BUFFER
macros to static inline functions instead, with assertions included,
and eliminates the non-inline implementation from buffer.c.
the --enable-buffer-useinline configure option has been removed.
The RAND_bytes() implementation differs between the OpenSSL versions and
uses the system entropy only for seeding its internal CSPRNG. The
uv_random() on the other hand uses the system provided CSPRNG.
Switch from RAND_bytes() to uv_random() to use system provided CSPRNG.
After the loopmgr work has been merged, we can now cleanup the TCP and
TLS protocols a little bit, because there are stronger guarantees that
the sockets will be kept on the respective loops/threads. We only need
asynchronous call for listening sockets (start, stop) and reading from
the TCP (because the isc_nm_read() might be called from read callback
again.
This commit does the following changes (they are intertwined together):
1. Cleanup most of the asynchronous events in the TCP code, and add
comments for the events that needs to be kept asynchronous.
2. Remove isc_nm_resumeread() from the netmgr API, and replace
isc_nm_resumeread() calls with existing isc_nm_read() calls.
3. Remove isc_nm_pauseread() from the netmgr API, and replace
isc_nm_pauseread() calls with a new isc_nm_read_stop() call.
4. Disable the isc_nm_cancelread() for the streaming protocols, only the
datagram-like protocols can use isc_nm_cancelread().
5. Add isc_nmhandle_close() that can be used to shutdown the socket
earlier than after the last detach. Formerly, the socket would be
closed only after all reading and sending would be finished and the
last reference would be detached. The new isc_nmhandle_close() can
be used to close the underlying socket earlier, so all the other
asynchronous calls would call their respective callbacks immediately.
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Co-authored-by: Artem Boldariev <artem@isc.org>
Each isc_timer needs to be created, started and destroyed on the current
loop. The isc_timer_stop() can be run on any loop, but when run from
different loop than the one associated with the timer, the request to
stop the timer will be recorded in atomic variable and the underlying
uv_timer_t will be stopped on next uv_timer_t callback call. This
allows any thread to stop the timer.
In preparation for the on-loop timers, the isc_ratelimiter API was
converted to use the timer on main loop and start and stop the timer
asynchronously on the main loop.
As it sometimes happens that the object using isc_timer_t is destroyed
via detaching all the references with no guarantee that the last thread
will be matching thread, add a helper isc_timer_async_destroy() function
that stops the timer and runs the destroy function via isc_async_run()
on the matching thread.
- use isc_buffer functions when appropriate, rather than converting
to and from isc_region unnecessarily
- use the zlib total_out value instead of calculating it
- use c99 struct initialization
In fuzzing mode, `isc_random` uses a fixed seed for reproducibility.
The particular seed chosen happened to produce zero as its first
number, however commit bd251de0 introduced an initialization check in
`random_test` that required it to be non-zero. This change adjusts the
seed to avoid spurious test failures.
Also, remove the temporary variable that was used for initialization
because it did not match the type of the thread-local seed array.
Instead of checking if we need to re-seed for every isc_random call,
seed the random number generator in the libisc global initializer
and the per-thread initializer.
The destructor for the isc__nmsocket_t was missing call to the
isc_refcount_destroy() on the reference counter, which might lead to
spurious ThreadSanitizer data race warnings if we ever change the
acquire-release memory order in the isc_refcount_decrement().
Simplify the closing code - during the loopmgr implementation, it was
discovered that the various lists used by the uv_loop_t aren't FIFO, but
LIFO. See doc/dev/libuv.md for more details.
With this knowledge, we can close the protocol handles (uv_udp_t and
uv_tcp_t) and uv_timer_t at the same time by reordering the uv_close()
calls, and thus making sure that after calling the
isc__nm_stoplistening(), the code will not issue any additional callback
calls (accept, read) on the socket that stopped listening.
This might help with the TLS and DoH shutting down sequence as described
in the [GL #3509] as we now stop the reading, stop the timer and call
the uv_close() as earliest as possible.
The network manager UDP code was misinterpreting when the libuv called
the udp_recv_cb with nrecv == 0 and addr == NULL -> this doesn't really
mean that the "stream" has ended, but the libuv indicates that the
receive buffer can be freed. This could lead to assertion failure in
the code that calls isc_nm_read() from the network manager read callback
due to the extra spurious callbacks.
Properly handle the extra callback calls from the libuv in the client
read callback, and refactor the UDP isc_nm_read() implementation to be
synchronous, so no datagram is lost between the time that we stop the
reading from the UDP socket and we restart it again in the asychronous
udpread event.
Add a unit test that tests the isc_nm_read() call from the read
callback to receive two datagrams.
An assertion failure would be triggered when the TCP connection
is canceled during sending the data back to the client.
Don't require the state to be `RECV` on non successful read to
gracefully handle canceled TCP connection during the SEND state of the
HTTPD channel.
When converting a string to lower case, the compiler is able to
autovectorize nicely, so a nice simple implementation is also very
fast, comparable to memcpy().
Comparisons are more difficult for the compiler, so we convert eight
bytes at a time using "SIMD within a register" tricks. Experiments
indicate it's best to stick to simple loops for shorter strings and
the remainder of long strings.
There were a number of places that had copies of various ASCII
tables (case conversion, hex and decimal conversion) that are intended
to be faster than the ctype.h macros, or avoid locale pollution.
Move them into libisc, and wrap the lookup tables with macros that
avoid the ctype.h gotchas.
Commit 3608abc8fa6a33046e1d34a0789cf7c9547f09ad inadvertently carried
over a mistake in logging pthread_cond_init() errors to the
ERRNO_CHECK() preprocessor macro: instead of passing the value returned
by a given pthread_*() function to strerror_r(), ERRNO_CHECK() passes
the errno variable to strerror_r(). This causes bogus error reports
because POSIX Threads API functions do not set the errno variable.
Fix by passing the value returned by a given pthread_*() function
instead of the errno variable to strerror_r(). Since this change makes
the name of the affected macro (ERRNO_CHECK()) confusing, rename the
latter to PTHREADS_RUNTIME_CHECK(). Also log the integer error value
returned by a given pthread_*() function verbatim to rule out any
further confusion in runtime error reporting.
when the compression buffer was reused for multiple statistics
requests, responses could grow beyond the correct size. this was
because the buffer was not cleared before reuse; compressed data
was still written to the beginning of the buffer, but then the size
of used region was increased by the amount written, rather than set
to the amount written. this caused responses to grow larger and
larger, potentially reading past the end of the allocated buffer.
Commit b69e783164 inadvertently caused
builds using the --disable-doh switch to fail, by putting the
declaration of the isc__nm_async_settlsctx() function inside an #ifdef
block that is only evaluated when DNS-over-HTTPS support is enabled.
This results in the following compilation errors being triggered:
netmgr/netmgr.c:2657:1: error: no previous prototype for 'isc__nm_async_settlsctx' [-Werror=missing-prototypes]
2657 | isc__nm_async_settlsctx(isc__networker_t *worker, isc__netievent_t *ev0) {
| ^~~~~~~~~~~~~~~~~~~~~~~
Fix by making the declaration of the isc__nm_async_settlsctx() function
in lib/isc/netmgr/netmgr-int.h visible regardless of whether
DNS-over-HTTPS support is enabled or not.
The isc_nm_listentlsdns() function erroneously calls
isc__nm_tcpdns_stoplistening() instead of isc__nm_tlsdns_stoplistening()
when something goes wrong, which can cause an assertion failure.
When we are closing the listening sockets, there's a time window in
which the TCP connection could be accepted although the respective
stoplistening function has already returned to control to the caller.
Clear the accept callback function early, so it doesn't get called when
we are not interested in the incoming connections anymore.
Previously:
* applications were using isc_app as the base unit for running the
application and signal handling.
* networking was handled in the netmgr layer, which would start a
number of threads, each with a uv_loop event loop.
* task/event handling was done in the isc_task unit, which used
netmgr event loops to run the isc_event calls.
In this refactoring:
* the network manager now uses isc_loop instead of maintaining its
own worker threads and event loops.
* the taskmgr that manages isc_task instances now also uses isc_loopmgr,
and every isc_task runs on a specific isc_loop bound to the specific
thread.
* applications have been updated as necessary to use the new API.
* new ISC_LOOP_TEST macros have been added to enable unit tests to
run isc_loop event loops. unit tests have been updated to use this
where needed.
* isc_timer was rewritten using the uv_timer, and isc_timermgr_t was
completely removed; isc_timer objects are now directly created on the
isc_loop event loops.
* the isc_timer API has been simplified. the "inactive" timer type has
been removed; timers are now stopped by calling isc_timer_stop()
instead of resetting to inactive.
* isc_manager now creates a loop manager rather than a timer manager.
* modules and applications using isc_timer have been updated to use the
new API.
This commit introduces new APIs for applications and signal handling,
intended to replace isc_app for applications built on top of libisc.
* isc_app will be replaced with isc_loopmgr, which handles the
starting and stopping of applications. In isc_loopmgr, the main
thread is not blocked, but is part of the working thread set.
The loop manager will start a number of threads, each with a
uv_loop event loop running. Setup and teardown functions can be
assigned which will run when the loop starts and stops, and
jobs can be scheduled to run in the meantime. When
isc_loopmgr_shutdown() is run from any the loops, all loops
will shut down and the application can terminate.
* signal handling will now be handled with a separate isc_signal unit.
isc_loopmgr only handles SIGTERM and SIGINT for application
termination, but the application may install additional signal
handlers, such as SIGHUP as a signal to reload configuration.
* new job running primitives, isc_job and isc_async, have been added.
Both units schedule callbacks (specifying a callback function and
argument) on an event loop. The difference is that isc_job unit is
unlocked and not thread-safe, so it can be used to efficiently
run jobs in the same thread, while isc_async is thread-safe and
uses locking, so it can be used to pass jobs from one thread to
another.
* isc_tid will be used to track the thread ID in isc_loop worker
threads.
* unit tests have been added for the new APIs.
When the HTTP request has a body part after the HTTP headers, it is
not getting processed and is being prepended to the next request's data,
which results in an error when trying to parse it.
Improve the httpd.c:process_request() function with the following
additions:
1. Require that HTTP POST requests must have Content-Length header.
2. When Content-Length header is set, extract its value, and make sure
that it is valid and that the whole request's body is received before
processing the request.
3. Discard the request's body by consuming Content-Length worth of data
in the buffer.
In some circumstances generic TLS code could have resumed data reading
unexpectedly on the TCP layer code. Due to this, the behaviour of
isc_nm_pauseread() and isc_nm_resumeread() might have been
unexpected. This commit fixes that.
The bug does not seems to have real consequences in the existing code
due to the way the code is used. However, the bug could have lead to
unexpected behaviour and, at any rate, makes the TLS code behave
differently from the TCP code, with which it attempts to be as
compatible as possible.
Sometimes tls_do_bio() might be called when there is no new data to
process (most notably, when resuming reads), in such a case internal
TLS session state will remain untouched and old value in 'errno' will
alter the result of SSL_get_error() call, possibly making it to return
SSL_ERROR_SYSCALL. This value will be treated as an error, and will
lead to closing the connection, which is not what expected.
The STATID_CONNECT and STATID_CONNECTFAIL statistics were used
incorrectly. The STATID_CONNECT was incremented twice (once in
the *_connect_direct() and once in the callback) and STATID_CONNECTFAIL
would not be incremented at all if the failure happened in the callback.
Closes: #3452
On FreeBSD (and perhaps other *BSD) systems, the TCP connect() call (via
uv_tcp_connect()) can fail with transient UV_EADDRINUSE error. The UDP
code already handles this by trying three times (is a charm) before
giving up. Add a code for the TCP, TCPDNS and TLSDNS layers to also try
three times before giving up by calling uv_tcp_connect() from the
callback two more time on UV_EADDRINUSE error.
Additionally, stop the timer only if we succeed or on hard error via
isc__nm_failed_connect_cb().
uv_barrier_init() errors are currently ignored. Use UV_RUNTIME_CHECK()
to catch them and to improve error reporting for any uv_barrier_init()
run-time failures (by augmenting error messages with file/line
information and the error string corresponding to the value returned).
Replace direct uses of implementation-specific rwlock functions in
lib/isc/include/isc/rwlock.h with preprocessor macros that use
ERRNO_CHECK(), in order to augment rwlock-related error messages with
file/line/caller information and the error string corresponding to
errno. Adjust the implementation-specific functions for pthreads-based
rwlocks so that they return any errors encountered to the caller instead
of aborting execution immediately using RUNTIME_CHECK().
To keep code modifications simple, make the non-pthreads-based
implementation-specific rwlock functions always return 0; these
functions continue to handle errors using less verbose run-time
assertions as they do not set errno anyway.
Replace all uses of RUNTIME_CHECK() in lib/isc/include/isc/condition.h
with ERRNO_CHECK(), in order to improve error reporting for any
condition-variable-related run-time failures (by augmenting error
messages with file/line/caller information and the error string
corresponding to errno).
Replace all uses of RUNTIME_CHECK() in lib/isc/include/isc/mutex.h with
ERRNO_CHECK(), in order to improve error reporting for any mutex-related
run-time failures (by augmenting error messages with file/line/caller
information and the error string corresponding to errno).
Some POSIX threads implementations (e.g. FreeBSD's libthr) allocate
memory on the heap when pthread_barrier_init() is called. Every call to
that function must be accompanied by a corresponding call to
pthread_barrier_destroy() or else the memory allocated for the barrier
will leak.
jemalloc can be used for detecting memory allocations which are not
released by a process when it exits. Unfortunately, since jemalloc is
also the system allocator on FreeBSD and a special (profiling-enabled)
build of jemalloc is required for memory leak detection, this method
cannot be used for detecting leaked memory allocated by libthr on a
stock FreeBSD installation.
However, libthr's behavior can be emulated on any platform by
implementing alternative versions of libisc functions for creating and
destroying barriers that allocate memory using malloc() and release it
using free(). This enables using jemalloc for detecting missing
pthread_barrier_destroy() calls on any platform on which it works
reliably.
When the newly introduced ISC_TRACK_PTHREADS_OBJECTS preprocessor macro
is set, allocate isc_barrier_t structures on the heap in
isc_barrier_init() and free them in isc_barrier_destroy(). Reuse
existing barrier macros (after renaming them appropriately) for other
operations.
Some POSIX threads implementations (e.g. FreeBSD's libthr) allocate
memory on the heap when pthread_rwlock_init() is called. Every call to
that function must be accompanied by a corresponding call to
pthread_rwlock_destroy() or else the memory allocated for the rwlock
will leak.
jemalloc can be used for detecting memory allocations which are not
released by a process when it exits. Unfortunately, since jemalloc is
also the system allocator on FreeBSD and a special (profiling-enabled)
build of jemalloc is required for memory leak detection, this method
cannot be used for detecting leaked memory allocated by libthr on a
stock FreeBSD installation.
However, libthr's behavior can be emulated on any platform by
implementing alternative versions of libisc functions for creating and
destroying rwlocks that allocate memory using malloc() and release it
using free(). This enables using jemalloc for detecting missing
pthread_rwlock_destroy() calls on any platform on which it works
reliably.
When the newly introduced ISC_TRACK_PTHREADS_OBJECTS preprocessor macro
is set (and --enable-pthread-rwlock is used), allocate isc_rwlock_t
structures on the heap in isc_rwlock_init() and free them in
isc_rwlock_destroy(). Reuse existing functions defined in
lib/isc/rwlock.c for other operations, but rename them first, so that
they contain triple underscores (to indicate that these functions are
implementation-specific, unlike their mutex and condition variable
counterparts, which always use the pthreads implementation). Define the
isc__rwlock_init() macro so that it is a logical counterpart of
isc__mutex_init() and isc__condition_init(); adjust isc___rwlock_init()
accordingly. Remove a redundant function prototype for
isc__rwlock_lock() and rename that (static) function to rwlock_lock() in
order to avoid having to use quadruple underscores.
Some POSIX threads implementations (e.g. FreeBSD's libthr) allocate
memory on the heap when pthread_cond_init() is called. Every call to
that function must be accompanied by a corresponding call to
pthread_cond_destroy() or else the memory allocated for the condition
variable will leak.
jemalloc can be used for detecting memory allocations which are not
released by a process when it exits. Unfortunately, since jemalloc is
also the system allocator on FreeBSD and a special (profiling-enabled)
build of jemalloc is required for memory leak detection, this method
cannot be used for detecting leaked memory allocated by libthr on a
stock FreeBSD installation.
However, libthr's behavior can be emulated on any platform by
implementing alternative versions of libisc functions for creating and
destroying condition variables that allocate memory using malloc() and
release it using free(). This enables using jemalloc for detecting
missing pthread_cond_destroy() calls on any platform on which it works
reliably.
When the newly introduced ISC_TRACK_PTHREADS_OBJECTS preprocessor macro
is set, allocate isc_condition_t structures on the heap in
isc_condition_init() and free them in isc_condition_destroy(). Reuse
existing condition variable macros (after renaming them appropriately)
for other operations.
Some POSIX threads implementations (e.g. FreeBSD's libthr) allocate
memory on the heap when pthread_mutex_init() is called. Every call to
that function must be accompanied by a corresponding call to
pthread_mutex_destroy() or else the memory allocated for the mutex will
leak.
jemalloc can be used for detecting memory allocations which are not
released by a process when it exits. Unfortunately, since jemalloc is
also the system allocator on FreeBSD and a special (profiling-enabled)
build of jemalloc is required for memory leak detection, this method
cannot be used for detecting leaked memory allocated by libthr on a
stock FreeBSD installation.
However, libthr's behavior can be emulated on any platform by
implementing alternative versions of libisc functions for creating and
destroying mutexes that allocate memory using malloc() and release it
using free(). This enables using jemalloc for detecting missing
pthread_mutex_destroy() calls on any platform on which it works
reliably.
Introduce a new ISC_TRACK_PTHREADS_OBJECTS preprocessor macro, which
causes isc_mutex_t structures to be allocated on the heap by
isc_mutex_init() and freed by isc_mutex_destroy(). Reuse existing mutex
macros (after renaming them appropriately) for other operations.
Instead of returning error values from isc_rwlock_*(), isc_mutex_*(),
and isc_condition_*() macros/functions and subsequently carrying out
runtime assertion checks on the return values in the calling code,
trigger assertion failures directly in those macros/functions whenever
any pthread function returns an error, as there is no point in
continuing execution in such a case anyway.
Instead of using isc_once_do() on every isc_mutex_init() call, use the
global library constructor to initialize the default mutex attr
object (optionally with PTHREAD_MUTEX_ADAPTIVE_NP if supported) just
once when the library is loaded.
isc_rwlock_init() currently detects pthread_rwlock_init() failures using
a REQUIRE() assertion. Use the ERRNO_CHECK() macro for that purpose
instead, so that read-write lock initialization failures are handled
identically as condition variable (pthread_cond_init()) and mutex
(pthread_mutex_init()) initialization failures.
In a number of situations in pthreads-related code, a common sequence of
steps is taken: if the value returned by a library function is not 0,
pass errno to strerror_r(), log the string returned by the latter, and
immediately abort execution. Add an ERRNO_CHECK() preprocessor macro
which takes those exact steps and use it wherever (conveniently)
possible.
Notes:
1. The "log the return value of strerror_r() and abort" pattern is used
in a number of other places that this commit does not touch; only
"!= 0" checks followed by isc_error_fatal() calls with
non-customized error messages are replaced here.
2. This change temporarily breaks file name & line number reporting for
isc__mutex_init() errors, to prevent breaking the build. This issue
will be rectified in a subsequent change.
Before this change the TLS code would ignore the accept callback result,
and would not try to gracefully close the connection. This had not been
noticed, as it is not really required for DoH. Now the code tries to
shut down the TLS connection gracefully when accepting it is not
successful.
Otherwise the code path will lead to a call to SSL_get_error()
returning SSL_ERROR_SSL, which in turn might lead to closing
connection to early in an unexpected way, as it is clearly not what is
intended.
The issue was found when working on loppmgr branch and appears to
be timing related as well. Might be responsible for some unexpected
transmission failures e.g. on zone transfers.
In some operations - most prominently when establishing connection -
it might be beneficial to bail out earlier when the network manager
is stopping.
The issue is backported from loopmgr branch, where such a change is
not only beneficial, but required.
In some cases - in particular, in case of errors, NULL might be passed
to a connection callback instead of a handle that could have led to
an abort. This commit ensures that such a situation will not occur.
The issue was found when working on the loopmgr branch.
This commit ensures that the underlying TCP socket of a TLS connection
gets closed earlier whenever there are no pending operations on it.
In the loop-manager branch, in some circumstances the connection
could have remained opened for far too long for no reason. This
commit ensures that will not happen.
it's a style violation to have REQUIRE or INSIST contain code that
must run for the server to work. this was being done with some
atomic_compare_exchange calls. these have been cleaned up. uses
of atomic_compare_exchange in assertions have been replaced with
a new macro atomic_compare_exchange_enforced, which uses RUNTIME_CHECK
to ensure that the exchange was successful.
Before the changes from this commit were introduced, the accept
callback function will get called twice when accepting connection
during two of these stages:
* when accepting the TCP connection;
* when handshake has completed.
That is clearly an error, as it should have been called only once. As
far as I understand it the mistake is a result of TLS DNS transport
being essentially a fork of TCP transport, where calling the accept
callback immediately after accepting TCP connection makes sense.
This commit fixes this mistake. It did not have any very serious
consequences because in BIND the accept callback only checks an ACL
and updates stats.
Under specific rare timing circumstances the uv_read_start() could
fail with UV_EINVAL when the connection is reset between the connect (or
accept) and the uv_read_start() call on the nmworker loop. Handle such
situation gracefully by propagating the errors from uv_read_start() into
upper layers, so the socket can be internally closed().
The unit tests are now using a common base, which means that
lib/dns/tests/ code now has to include lib/isc/include/isc/test.h and
link with lib/isc/test.c and lib/ns/tests has to include both libisc and
libdns parts.
Instead of cross-linking code between the directories, move the
/lib/<foo>/test.c to /tests/<foo>.c and /lib/<foo>/include/<foo>test.h
to /tests/include/tests/<foo>.h and create a single libtest.la
convenience library in /tests/.
At the same time, move the /lib/<foo>/tests/ to /tests/<foo>/ (but keep
it symlinked to the old location) and adjust paths accordingly. In few
places, we are now using absolute paths instead of relative paths,
because the directory level has changed. By moving the directories
under the /tests/ directory, the test-related code is kept in a single
place and we can avoid referencing files between libns->libdns->libisc
which is unhealthy because they live in a separate Makefile-space.
In the future, the /bin/tests/ should be merged to /tests/ and symlink
kept, and the /fuzz/ directory moved to /tests/fuzz/.
The unit tests contain a lot of duplicated code and here's an attempt
to reduce code duplication.
This commit does several things:
1. Remove #ifdef HAVE_CMOCKA - we already solve this with automake
conditionals.
2. Create a set of ISC_TEST_* and ISC_*_TEST_ macros to wrap the test
implementations, test lists, and the main test routine, so we don't
have to repeat this all over again. The macros were modeled after
libuv test suite but adapted to cmocka as the test driver.
A simple example of a unit test would be:
ISC_RUN_TEST_IMPL(test1) { assert_true(true); }
ISC_TEST_LIST_START
ISC_TEST_ENTRY(test1)
ISC_TEST_LIST_END
ISC_TEST_MAIN (Discussion: Should this be ISC_TEST_RUN ?)
For more complicated examples including group setup and teardown
functions, and per-test setup and teardown functions.
3. The macros prefix the test functions and cmocka entries, so the name
of the test can now match the tested function name, and we don't have
to append `_test` because `run_test_` is automatically prepended to
the main test function, and `setup_test_` and `teardown_test_` is
prepended to setup and teardown function.
4. Update all the unit tests to use the new syntax and fix a few bits
here and there.
5. In the future, we can separate the test declarations and test
implementations which are going to greatly help with uncluttering the
bigger unit tests like doh_test and netmgr_test, because the test
implementations are not declared static (see `ISC_RUN_TEST_DECLARE`
and `ISC_RUN_TEST_IMPL` for more details.
NOTE: This heavily relies on preprocessor macros, but the result greatly
outweighs all the negatives of using the macros. There's less
duplicated code, the tests are more uniform and the implementation can
be more flexible.
Previously, tasks could be created either unbound or bound to a specific
thread (worker loop). The unbound tasks would be assigned to a random
thread every time isc_task_send() was called. Because there's no logic
that would assign the task to the least busy worker, this just creates
unpredictability. Instead of random assignment, bind all the previously
unbound tasks to worker 0, which is guaranteed to exist.
This commit separates TLS context creation code from xfrin_start() as
it has become too large and hard to follow into a new
function (similarly how it is done in dighost.c)
The dead code has been removed from the cleanup section of the TLS
creation code:
* there is no way 'tlsctx' can equal 'found';
* there is no way 'sess_cache' can be non-NULL in the cleanup section.
Also, it fixes a bug in the older version of the code, where TLS
client session context fetched from the cache would not get passed to
isc_nm_tlsdnsconnect().
Typing from libuv structure to isc_region_t is not possible, because
their sizes differ on 64 bit architectures. Little endian machines seems
to be lucky and still result in test passed. But big endian machine such
as s390x fails the test reliably.
Fix by directly creating the buffer as isc_region_t and skipping the
type conversion. More readable and still more correct.
The recently added TLS client session cache used
SSL_SESSION_is_resumable() to avoid polluting the cache with
non-resumable sessions. However, it turned out that we cannot provide
a shim for this function across the whole range of OpenSSL versions
due to the fact that OpenSSL 1.1.0 does uses opaque pointers for
SSL_SESSION objects.
The commit replaces the shim for SSL_SESSION_is_resumable() with a non
public approximation of it on systems shipped with OpenSSL 1.1.0. It
is not turned into a proper shim because it does not fully emulate the
behaviour of SSL_SESSION_is_resumable(), but in our case it is good
enough, as it still helps to protect the cache from pollution.
For systems shipped with OpenSSL 1.0.X and derivatives (e.g. older
versions of LibreSSL), the provided replacement perfectly mimics the
function it is intended to replace.
The commit fixes a corner case in client-side DoH code, when a write
attempt is done on a closing socket (session).
The change ensures that the write call-back will be called with a
proper error code (see failed_send_cb() call in client_httpsend()).
This commit extends TLS context cache with TLS client session cache so
that an associated session cache can be stored alongside the TLS
context within the context cache.
This commit adds an implementation of a client TLS session cache. TLS
client session cache is an object which allows efficient storing and
retrieval of previously saved TLS sessions so that they can be
resumed. This object is supposed to be a foundation for implementing
TLS session resumption - a standard technique to reduce the cost of
re-establishing a connection to the remote server endpoint.
OpenSSL does server-side TLS session caching transparently by
default. However, on the client-side, a TLS session to resume must be
manually specified when establishing the TLS connection. The TLS
client session cache is precisely the foundation for that.
Setting the sock->write_timeout from the TCP, TCPDNS, and TLSDNS send
functions could lead to (harmless) data race when setting the value for
the first time when the isc_nm_send() function would be called from
thread not-matching the socket we are sending to. Move the setting the
sock->write_timeout to the matching async function which is always
called from the matching thread.