If uv_tcp_close_reset() returns an error code, this means the
reset_shutdown callback has not been issued, so do it now.
(cherry picked from commit c40e5c8653)
Failing to accept TCP/TLS connections in 9.18 detaches the quota in
isc__nm_failed_accept_cb, causing TCP4Clients and TCP6Clients statistics
to not decrease inside cleanup.
Fix by increasing the counter after the point of no failure but before
handling statistics through the client's socket is no longer valid.
When isc_task_purgeevent() is called for and 'event', the event, in
the meanwhile, could in theory get processed, unlinked, and freed.
So when the function then operates on the 'event', it causes a
segmentation fault.
The only place where isc_task_purgeevent() is called is from
timer_purge().
In order to resolve the data race, call isc_task_purgeevent() inside
the 'timer->lock' locked block, so that timerevent_destroy() won't
be able to destroy the event if it was processed in the meanwhile,
before isc_task_purgeevent() had a chance to purge it.
In order to be able to do that, move the responsibility of calling
isc_event_free() (upon a successful purge) out from the
isc_task_purgeevent() function to its caller instead, so that it can
be called outside of the timer->lock locked block.
Let basic_tick() of 'task1' and 'basic_quick' of 'task4' run in
different threads, and insert an artificial delay in timer_purge()
to cause an existing race condition to appear.
The statistics channel does not expose the current number of TCP clients
connected, only the highwater. Therefore, users did not have an easy
means to collect statistics about TCP clients served over time. This
information could only be measured as a seperate mechanism via rndc by
looking at the TCP quota filled.
In order to expose the exact current count of connected TCP clients
(tracked by the "tcp-clients" quota) as a statistics counter, an
extra, dedicated Network Manager callback would need to be
implemented for that purpose (a counterpart of ns__client_tcpconn()
that would be run when a TCP connection is torn down), which is
inefficient. Instead, track the number of currently-connected TCP
clients separately for IPv4 and IPv6, as Network Manager statistics.
(cherry picked from commit 2690dc48d3)
The case insensitive matching in isc_ht was basically completely broken
as only the hashvalue computation was case insensitive, but the key
comparison was always case sensitive.
(cherry picked from commit ec11aa2836)
The case insensitive matching in isc_ht was basically completely broken
as only the hashvalue computation was case insensitive, but the key
comparison was always case sensitive.
(cherry picked from commit 34ae6916f115fc291865857509433f95c2bc0871)
Change the taskmgr (and thus netmgr) in a way that it supports fast and
slow task queues. The fast queue is used for incoming DNS traffic and
it will pass the processing to the slow queue for sending outgoing DNS
messages and processing resolver messages.
In the future, more tasks might get moved to the slow queues, so the
cached and authoritative DNS traffic can be handled without being slowed
down by operations that take longer time to process.
cmocka.h and jemalloc.h/malloc_np.h has conflicting macro definitions.
While fixing them with push_macro for only malloc is done below, we only
need the non-standard mallocx interface which is easy to just define by
ourselves.
(cherry picked from commit 197de93bdc)
Because we don't use jemalloc functions directly, but only via the
libisc library, the dynamic linker might pull the jemalloc library
too late when memory has been already allocated via standard libc
allocator.
Add a workaround round isc_mem_create() that makes the dynamic linker
to pull jemalloc earlier than libc.
(cherry picked from commit 41a0ee1071)
When connecting to a remote party the TLS DNS code could process more
than one message at a time despite the fact that it is expected that
we should stop after every DNS message.
Every DNS message is handled and consumed from the input buffer by
isc__nm_process_sock_buffer(). However, as opposed to TCP DNS code, it
can be called more than once when processing incoming data from a
server (see tls_cycle_input()). That, in turn means that we can
process more than one message at a time. Some higher level code might
not expect that, as it breaks the contract.
In particular, in the original report that happened during
isc__nm_async_tlsdnsshutdown() call: when shutting down multiple calls
to tls_cycle() are possible (each possibly leading to a
isc__nm_process_sock_buffer()). If there are any non processed
messages left, for any of the messages left the read callback will be
called even when it is not expected as there were no preceding
isc_nm_read().
To keep TCP DNS and TLS DNS code in sync, we make a similar change to
it as well, although it should not matter.
When jemalloc is linked into BIND 9 binaries (rather than preloaded or
used as the system allocator), depending on the decisions made by the
linker, the malloc() symbol may be resolved to a non-jemalloc
implementation at runtime. Such a scenario foils the workaround added
in commit 2da371d005 as it relies on the
jemalloc implementation of malloc() to be executed.
Handle the above scenario properly by calling mallocx() explicitly
instead of relying on the runtime resolution of the malloc() symbol.
Use trivial wrapper functions to avoid the need to copy multiple #ifdef
lines from lib/isc/mem.c to lib/isc/trampoline.c. Using a simpler
alternative, e.g. calling isc_mem_create() & isc_mem_destroy(), was
already considered before and rejected, as described in the log message
for commit 2da371d005.
ADJUST_ZERO_ALLOCATION_SIZE() is only used in isc__mem_free_noctx() to
concisely avoid compilation warnings about its 'size' parameter not
being used when building against jemalloc < 4.0.0 (as sdallocx() is then
redefined to dallocx(), which has a different signature).
Apply the semantic patch to catch all the places where we pass 'char' to
the <ctype.h> family of functions (isalpha() and friends, toupper(),
tolower()).
(cherry picked from commit 29caa6d1f0)
In TLS DNS, it was possible for an 'isc__nm_uvreq_t' object to be used
twice when it was not implied, leading to use after free errors,
leading to abort()'s and/or crashes.
In particular, in the older version of the code, the following was
happening:
* 'tlsdns_send_direct()' is eventually called from
'isc__nm_async_tlsdnssend()' when sending data;
* 'tlsdns_send_direct()' could fail. If it happens early closer to the
beginning of the function, then the 'isc__nm_uvreq_t' object is passed
to 'isc__nm_failed_send_cb()' and eventually gets destroyed normally
without any negative consequences;
* However, if 'tlsdns_send_direct()' fails due to the call to
'tls_cycle()', then the send request object is re-queued to be
processed asynchronously later by a call to 'tlsdns_send_enqueue()'.
* The error processing code in 'tlsdns_send_direct()' is written
without an assumption that the send request object could be re-queued
in 'tlsdns_send_direct()' and, thus, it was passed to
'isc__nm_failed_send_cb()' for asynchronous error processing.
* As a result of asynchronous processing initiated by
`tlsdns_send_enqueue()` the 'isc__nm_uvreq_t' send request object gets
properly destroyed eventually.
* Then, as a result of asynchronous error processing initiated by the
'isc__nm_failed_send_cb()' call, the send request object is used a
second time. However, it was destroyed at the previous step.
So we have a (relatively) obvious logical mistake here. Why
'tls_cycle()' was failing is of no particular importance, as I can see
many ways for it to fail both on TLS and networking levels. In
particular, during testing with 'dnsperf', I have encountered the
following two cases:
1. 'result == ISC_R_TLSERROR' - the remote side was wrapping up,
causing errors on TLS level (e.g. handshake has not been completed
yet);
2. 'result == ISC_R_CONNECTIONRESET' - the remote side closed the
connection after we have initiated the asynchronous send request
operation and reached the point when its processing started.
One could ask: why haven't we encountered the error before? That
happened because 'send()' is, in general, a faster operation than
'read()' so we were lucky enough not to encounter this, as it took
dnsperf to make our code fail in a particular way to trigger the
problematic code path. It requires closing the connection shortly
after we have made an asynchronous 'send()' attempt and passed the
data for processing by the libuv code (or, to be precise, down our
technologies stack).
This commit changes TLS DNS so that partial writes by the
SSL_write_ex() function are taken into account properly. Now, before
doing encryption, we are flushing the buffers for outgoing encrypted
data.
The problem is fairly complicated and originates from the fact that it
is somewhat hard to understand by reading the documentation if and
when partial writes are supported/enabled or not, and one can get a
false impression that they are not supported or enabled by
default (https://www.openssl.org/docs/man3.1/man3/SSL_write_ex.html). I
have added a lengthy comment about that into the code because it will
be more useful there. The documentation on this topic is vague and
hard to follow.
The main point is that when SSL_write_ex() fails with
SSL_ERROR_WANT_WRITE, the OpenSSL code tells us that we need to flush
the outgoing buffers and then call SSL_write_ex() again with exactly
the same arguments in order to continue as partial write could have
happened on the previous call to SSL_write_ex() (that is not hard to
verify by calling BIO_pending(sock->tls.app_rbio) before and after the
call to SSL_write_ex() and comparing the returned values). This aspect
was not taken into account in the code.
Now, one can wonder how that could have led to the behaviour that we
saw in the #4255 bug report. In particular, how could we lose one
message and duplicate one twice? That is where things get interesting.
One needs to keep two things in mind (that is important):
Firstly, the possibility that two (or more) subsequent SSL_write_ex()
calls will be done with exactly the same arguments is very high (the
code does not guarantee that in any way, but in practice, that happens
a lot).
Secondly, the dnsperf (the software that helped us to trigger the bug)
bombed the test server with messages that contained exactly the same
data. The only difference in the responses is message IDs, which can
be found closer to the start of a message.
So, that is what was going on in the older version of the code:
1. During one of the isc_nm_send() calls, the SSL_write_ex() call
fails with SSL_ERROR_WANT_WRITE. Partial writing has happened, though,
and we wrote a part of the message with the message
ID (e.g. 2014). Nevertheless, we have rescheduled the complete send
operation asynchronously by a call to tlsdns_send_enqueue().
2. While the asynchronous request has not been completed, we try to
send the message (e.g. with ID 2015). The next isc_nm_send() or
re-queued send happens with a call to SSL_write_ex() with EXACTLY the
same arguments as in the case of the previous call. That is, we are
acting as if we want to complete the previously failed SSL_write_ex()
attempt (according to the OpenSSL documentation:
https://www.openssl.org/docs/man3.1/man3/SSL_write_ex.html, the
"Warnings" section). This way, we already have a start of the message
containing the previous ID (2014 in our case) but complete the write
request with the rest of the data given in the current write
attempt. However, as responses differ only in message ID, we end up
sending a valid (properly structured) DNS message but with the ID of
the previous one. This way, we send a message with ID from the
previous isc_nm_send() attempt. The message with the ID from the send
request from this attempt will never be sent, as the code thinks that
it is sending it now (that is how we send the message with ID 2014
instead of 2015, as in our example, thus making the message with ID
2015 never to be sent).
3. At some point later, the asynchronous send request (the rescheduled
on the first step) completes without an error, sending a second
message with the same ID (2014).
It took exhausting SSL write buffers (so that a data encryption
attempt cannot be completed in one operation) via long DoT streams in
order to exhibit the behaviour described above. The exhaustion
happened because we have not been trying to flush the buffers often
enough (especially in the case of multiple subsequent writes).
In my opinion, the origin of the problem can be described as follows:
It happened due to making wrong guesses caused by poorly written
documentation.
This commit adds couple of functions to change "dirty_decay_ms" and
"muzzy_decay_ms" settings on arenas associated with memory contexts.
(cherry picked from commit 6e98b58d15)
This commit extends the internal memory management middleware code in
BIND so that memory contexts backed by dedicated jemalloc arenas can
be created. A new function (isc_mem_create_arena()) is added for that.
Moreover, it extends the existing code so that specialised memory
contexts can be created easily, should we need that functionality for
other future purposes. We have achieved that by passing the flags to
the underlying jemalloc-related calls. See the above
isc_mem_create_arena(), which can serve as an example of this.
Having this opens up possibilities for creating memory contexts tuned
for specific needs.
(cherry picked from commit 8550c52588)
This commit ensures that BIND and supplementary tools still can be
built on newer versions of DragonFly BSD. It used to be the case, but
somewhere between versions 6.2 and 6.4 the OS developers rearranged
headers and moved some function definitions around.
Before that the fact that it worked was more like a coincidence, this
time we, at least, looked at the related man pages included with the
OS.
No in depth testing has been done on this OS as we do not really
support this platform - so it is more like a goodwill act. We can,
however, use this platform for testing purposes, too. Also, we know
that the OS users do use BIND, as it is included in its ports
directory.
Building with './configure' and './configure --without-jemalloc' have
been fixed and are known to work at the time the commit is made.
(cherry picked from commit 942569a1bb)
A negative or excessively large Content-Length could cause a crash
by making `INSIST(httpd->consume != 0)` fail.
(cherry picked from commit 26e10e8fb5)
Oracle Linux 7 sets __STDC_VERSION__ to 201112L, but doesn't define
__STDC_NO_ATOMICS__, so we try to include <stdatomic.h> without the
header present in the system. Since we are already detecting the header
in the autoconf, use the HAVE_STDATOMIC_H for more reliable detecting
whether <stdatomic.h> header is present.
The detach (and possibly close) netmgr events can cause additional
callbacks to be called when under exclusive mode. The detach can
trigger next queued TCP query to be processed and close will call
configured close callback.
Move the detach and close netmgr events from the priority queue to the
normal queue as the detaching and closing the sockets can wait for the
exclusive mode to be over.
In HTTP/1.0 and HTTP/1.1, RFC 9112 section 9.6 says the last response
in a connection should include a `Connection: close` header, but the
statschannel server omitted it.
In an HTTP/1.0 response, the statschannel server can sometimes send a
`Connection: keep-alive` header when it is about to close the
connection. There are two ways:
If the first request on a connection is keep-alive and the second
request is not, then _both_ responses have `Connection: keep-alive`
but the connection is (correctly) closed after the second response.
If a single request contains
Connection: close
Connection: keep-alive
then RFC 9112 section 9.3 says the keep-alive header is ignored, but
the statschannel sends a spurious keep-alive in its response, though
it correctly closes the connection.
To fix these bugs, make it more clear that the `httpd->flags` are part
of the per-request-response state. The Connection: flags are now
described in terms of the effect they have instead of what causes them
to be set.
(manually picked from commit e18ca83a3b)
Tasks can block for a long time, especially when used by tools in
interactive mode. Update the event loop's time to avoid unexpected
errors when processing later events during the same callback.
For example, newly started timers can fire too early, because the
current time was stale. See the note about uv_update_time() in the
https://docs.libuv.org/en/v1.x/timer.html#c.uv_timer_start page.
The isc_result_t enum was to sparse when each library code would skip to
next << 16 as a base. Remove the huge holes in the isc_result_t enum to
make the isc_result tables more compact.
This change required a rewrite how we map dns_rcode_t to isc_result_t
and back, so we don't ever return neither isc_result_t value nor
dns_rcode_t out of defined range.
(cherry picked from commit a8e6c3b8f7)
Report "permission denied" instead of "unexpected error"
when trying to update a zone file on a read-only file system.
(cherry picked from commit dd6acc1cac)
isc_mem_put NULL's the pointer to the memory being freed. The
equality test 'parent->r == node' was accidentally being turned
into a test against NULL.
(cherry picked from commit ac2e0bc3ff)
Seemingly by omission, sockstop netievent used by multi-layer sockets
was not a high priority event, like it should be (similarly to other
socket types).
In particular, that could make BIND stuck on reconfiguration after a
DoH-listener is removed from the configuration.
This commit fixes that.