The task exclusive mode stops all processing (tasks and networking IO)
except the designated exclusive task events. This has impact on the
operation of the server. Add log messages indicating when we start the
exclusive mode, and when we end exclusive task mode.
There are reported occurences where the statitic counters underflows and
starts reporting non-sense.
Add a check for the underflow, when ``named`` is compiled in the
developer mode.
The isc_thread_setaffinity call was removed in !5265 and we are not
going to restore it because it was proven that the performance is better
without it. Additionally, remove the already disabled cpu system test.
The isc_thread_setconcurrency function is unused and also calling
pthread_setconcurrency() on Linux has no meaning, formerly it was
added because of Solaris in 2001 and it was removed when taskmgr was
refactored to run on top of netmgr in !4918.
When isc_quota_attach_cb() API returns ISC_R_QUOTA (meaning hard quota
was reached) the accept_connection() would return without logging a
message about quota reached.
Change the connection callback to log the quota reached message.
IBM power architecture has L1 cache line size equal to 128. Take
advantage of that on that architecture, do not force more common value
of 64. When it is possible to detect higher value, use that value
instead. Keep the default to be 64.
TLS clients can have their clock a short time in the past which will
result in not being able to validate the certificate.
Setting the "not before" property 5 minutes in the past will
accommodate with some possible clock skew across systems.
On some systems, the glibc can return 0 instead of cache-line size to
indicate the cache line sizes cannot be determined. This is comment
from glibc source code:
/* In general we cannot determine these values. Therefore we
return zero which indicates that no information is
available. */
As the goal of the check is to determine whether the L1 cache line size
is still 64 and we would use this value in case the sysconf() call is
not available, we can also ignore the invalid values returned by the
sysconf() call.
Some operating systems (OpenBSD and DragonFly BSD) don't restrict the
IPv6 sockets to sending and receiving IPv6 packets only. Explicitly
enable the IPV6_V6ONLY socket option on the IPv6 sockets to prevent
failures from using the IPv4-mapped IPv6 address.
The server_send_error_response() function is supposed to be used only
in case of failures and never in case of legitimate requests. Ensure
that ISC_HTTP_ERROR_SUCCESS is never passed there by mistake.
Previously, the netmgr/udp.c tried to detect the recvmmsg detection in
libuv with #ifdef UV_UDP_<foo> preprocessor macros. However, because
the UV_UDP_<foo> are not preprocessor macros, but enum members, the
detection didn't work. Because the detection didn't work, the code
didn't have access to the information when we received the final chunk
of the recvmmsg and tried to free the uvbuf every time. Fortunately,
the isc__nm_free_uvbuf() had a kludge that detected attempt to free in
the middle of the receive buffer, so the code worked.
However, libuv 1.37.0 changed the way the recvmmsg was enabled from
implicit to explicit, and we checked for yet another enum member
presence with preprocessor macro, so in fact libuv recvmmsg support was
never enabled with libuv >= 1.37.0.
This commit changes to the preprocessor macros to autoconf checks for
declaration, so the detection now works again. On top of that, it's now
possible to cleanup the alloc_cb and free_uvbuf functions because now,
the information whether we can or cannot free the buffer is available to
us.
Clients can cache the TLS certificates and refuse to accept
another one with the same serial number from the same issuer.
Generate a random serial number for the self-signed certificates
instead of using a fixed value.
GnuTLS, NSS, and possibly other TLS libraries currently fail to work
with compressed point conversion form supported by OpenSSL.
Use uncompressed point conversion form for better compatibility.
This commit converts the license handling to adhere to the REUSE
specification. It specifically:
1. Adds used licnses to LICENSES/ directory
2. Add "isc" template for adding the copyright boilerplate
3. Changes all source files to include copyright and SPDX license
header, this includes all the C sources, documentation, zone files,
configuration files. There are notes in the doc/dev/copyrights file
on how to add correct headers to the new files.
4. Handle the rest that can't be modified via .reuse/dep5 file. The
binary (or otherwise unmodifiable) files could have license places
next to them in <foo>.license file, but this would lead to cluttered
repository and most of the files handled in the .reuse/dep5 file are
system test files.
The isc__nm_tcp_resumeread() was using maybe_enqueue function to enqueue
netmgr event which could case the read callback to be executed
immediately if there was enough data waiting in the TCP queue.
If such thing would happen, the read callback would be called before the
previous read callback was finished and the worker receive buffer would
be still marked "in use" causing a assertion failure.
This would affect only raw TCP channels, e.g. rndc and http statistics.
The isc_queue_new() was using dirty tricks to allocate the head and tail
members of the struct aligned to the cacheline. We can now use
isc_mem_get_aligned() to allocate the structure to the cacheline
directly.
Use ISC_OS_CACHELINE_SIZE (64) instead of arbitrary ALIGNMENT (128), one
cacheline size is enough to prevent false sharing.
Cleanup the unused max_threads variable - there was actually no limit on
the maximum number of threads. This was changed a while ago.
The hazard pointers implementation was bit of frivolous with memory
usage allocating memory based on maximum constants rather than on the
usage.
Make the retired list bit use exactly the memory needed for specified
number of hazard pointers. This reduced the memory used by hazard
pointers to one quarter in our specific case because we only use single
HP in the queue implementation (as opposed to allocating memory for
HP_MAX_HPS = 4).
Previously, the alignment to prevent false sharing was double the
cacheline size. This was copied from the ConcurrencyFreaks
implementation, but one cacheline size is enough to prevent false
sharing, so we are using this now to save few bits of memory.
The top level hazard pointers and retired list arrays are now not
aligned to the cacheline size - they are read-only for the whole
life-time of the isc_hp object. Only hp (hazard pointer) and
rl (retired list) array members are allocated aligned to the cacheline
size to avoid false sharing between threads.
Cleanup HP_MAX_HPS and HP_THRESHOLD_R constants from the paper, because
we don't use them in the code. HP_THRESHOLD_R was 0, so the check
whether the retired list size was smaller than the value was basically a
dead code.
There are some situations where having aligned allocations would be
useful, so we don't have to play tricks with padding the data to the
cacheline sizes.
Add isc_mem_{get,put,reget,putanddetach}_aligned() functions that has
alignment and size as last argument mimicking the POSIX posix_memalign()
functions on systems with jemalloc (see the documentation on
MALLOX_ALIGN() for more details). On systems without jemalloc, those
functions are same as non-aligned variants.
Add library ctor and dtor for isc_os compilation unit which initializes
the numbers of the CPUs and also checks whether L1 cacheline size is
really 64 if the sysconf() call is available.
While doing code review, it was found that the taskmgr->exiting is set
under taskmgr->lock, but accessed under taskmgr->excl_lock in the
isc_task_beginexclusive().
Additionally, before the change that moved running the tasks to the
netmgr, the task_ready() subrouting of isc_task_detach() would lock
mgr->lock, requiring the mgr->excl to be protected mgr->excl_lock
to prevent deadlock in the code. After !4918 has been merged, this is
no longer true, and we can remove taskmgr->excl_lock and use
taskmgr->lock in its stead.
Solve both issues by removing the taskmgr->excl_lock and exclusively use
taskmgr->lock to protect both taskmgr->excl and taskmgr->exiting which
now doesn't need to be atomic_bool, because it's always accessed from
within the locked section.
The isc_taskmgr_excltask() would return ISC_R_NOTFOUND either when the
exclusive task was not set (yet) or when the taskmgr is shutting down
and the exclusive task has been already cleared.
Distinguish between the two states and return ISC_R_SHUTTINGDOWN when
the taskmgr is being shut down instead of ISC_R_NOTFOUND.
For consistency with similar functions, rename `pcache` to `cachep`,
call a separate destroy function when references reach 0, and add
a missing call to isc_refcount_destroy().
Using the TLS context cache for server-side contexts could reduce the
number of contexts to initialise in the configurations when e.g. the
same 'tls' entry is used in multiple 'listen-on' statements for the
same DNS transport, binding to multiple IP addresses.
In such a case, only one TLS context will be created, instead of a
context per IP address, which could reduce the initialisation time, as
initialising even a non-ephemeral TLS context introduces some delay,
which can be *visually* noticeable by log activity.
Also, this change lays down a foundation for Mutual TLS (when the
server validates a client certificate, additionally to a client
validating the server), as the TLS context cache can be extended to
store additional data required for validation (like intermediates CA
chain).
Additionally to the above, the change ensures that the contexts are
not being changed after initialisation, as such a practice is frowned
upon. Previously we would set the supported ALPN tags within
isc_nm_listenhttp() and isc_nm_listentlsdns(). We do not do that for
client-side contexts, so that appears to be an overlook. Now we set
the supported ALPN tags right after server-side contexts creation,
similarly how we do for client-side ones.
This commit adds a TLS context object cache implementation. The
intention of having this object is manyfold:
- In the case of client-side contexts: allow reusing the previously
created contexts to employ the context-specific TLS session resumption
cache. That will enable XoT connection to be reestablished faster and
with fewer resources by not going through the full TLS handshake
procedure.
- In the case of server-side contexts: reduce the number of contexts
created on startup. That could reduce startup time in a case when
there are many "listen-on" statements referring to a smaller amount of
`tls` statements, especially when "ephemeral" certificates are
involved.
- The long-term goal is to provide in-memory storage for additional
data associated with the certificates, like runtime
representation (X509_STORE) of intermediate CA-certificates bundle for
Strict TLS/Mutual TLS ("ca-file").
Commit 9ee60e7a17 erroneously introduced
duplicate conditions to several existing conditional statements
responsible for determining error codes passed to connection callbacks
upon failure. Fix the affected expressions to ensure connection
callbacks are invoked with:
- the ISC_R_SHUTTINGDOWN error code when a global netmgr shutdown is
in progress,
- the ISC_R_CANCELED error code when a specific operation has been
canceled.
This does not fix any known bugs, it only adjusts the changes introduced
by commit 9ee60e7a17 so that they match
its original intent.
The SSL_CTX_set_keylog_callback() function is a fairly recent OpenSSL
addition, having first appeared in version 1.1.1. Add a configure.ac
check for the availability of that function to prevent build errors on
older platforms. Sort similar checks alphabetically.
This makes the SSLKEYLOGFILE mechanism a silent no-op on unsupported
platforms, which is considered acceptable for a debugging feature.
Generate log messages containing TLS pre-master secrets when the
SSLKEYLOGFILE environment variable is set. This only ensures such
messages are prepared using the right logging category and passed to
libisc for further processing.
The TLS pre-master secret logging callback needs to be set on a
per-context basis, so ensure it happens for both client-side and
server-side TLS contexts.
TLS pre-master secrets will be dumped to disk using the logging
framework provided by libisc. Add a new logging category for this type
of debugging data in order to enable exporting it to a dedicated
channel. Derive the name of the new category from the name of the
relevant environment variable, SSLKEYLOGFILE.
ECDSA P-256 performs considerably better than the previously used
4096-bit RSA (can be observed using `openssl speed`), and, according
to RFC 6605, provides a security level comparable to 3072-bit RSA.
Previously, whole isc_mempool_get() and isc_mempool_set() would be
replaced by simpler version when run with address sanitizer.
Change the code to limit the fillcount to 1 and freemax to 0. This
change will make isc_mempool_get() to always allocate and use a single
new item and isc_mempool_put() will always return the item to the
allocator.
OpenSSL 3.0.1 does not accept 0 as a digest buffer length when
calling EVP_DigestSignFinal as it now checks that the digest buffer
length is large enough for the digest. Pass the digest buffer
length instead.
Surprising error IO error is returned when directory name
is given instead of named.conf file. It can be passed to named-checkconf
or include statement. Make a simple change to return Invalid file
instead. Still not precise, but much better error message is returned.
Fix of rhbz#490837.
Mutex debugging code (used when the ISC_MUTEX_DEBUG preprocessor macro
is set to 1 and PTHREAD_MUTEX_ERRORCHECK is defined) has been broken for
the past 3 years (since commit 2f3eee5a4f)
and nobody complained, which is a strong indication that this code is
not being used these days any more. External tools for detecting
locking issues are already wired into various GitLab CI checks. Drop
all code depending on the ISC_MUTEX_DEBUG preprocessor macro being set.
Mutex profiling code (used when the ISC_MUTEX_PROFILE preprocessor macro
is set to 1) has been broken for the past 3 years (since commit
0bed9bfc28) and nobody complained, which
is a strong indication that this code is not being used these days any
more. External tools for both measuring performance and detecting
locking issues are already wired into various GitLab CI checks. Drop
all code depending on the ISC_MUTEX_PROFILE preprocessor macro being
set.
On FreeBSD, the pthread primitives are not solely allocated on stack,
but part of the object lives on the heap. Missing pthread_*_destroy
causes the heap memory to grow and in case of fast lived object it's
possible to run out-of-memory.
Properly destroy the leaking mutex (worker->lock) and
the leaking condition (sock->cond).
Previously, we set the number of the hazard pointers to be 4 times the
number of workers because the dispatch ran on the old socket code.
Since the old socket code was removed there's a smaller number of
threads, namely:
- 1 main thread
- 1 timer thread
- <n> netmgr threads
- <n> threadpool threads
Set the number of hazard pointers to 2 + 2 * workers.
Previously, the isc_hp_init() could not lower the value of
isc__hp_max_threads, but because of a mistake the isc__hp_max_threads
would be set to HP_MAX_THREADS (e.g. 128 threads) thus it would be
always set to 128. This would result in increased memory usage even
when small number of workers were in use.
Change the default value of isc__hp_max_threads to be 1.
Additionally, enforce the max_hps value in isc_hp_new() to be smaller or
equal to HP_MAX_HPS. The only user is isc_queue which uses just 1
hazard pointer, so it's only theoretical issue.
Previously, when TCP accept failed, we have logged a message with
ISC_LOG_ERROR level. One common case, how this could happen is that the
client hits TCP client quota and is put on hold and when resumed, the
client has already given up and closed the TCP connection. In such
case, the named would log:
TCP connection failed: socket is not connected
This message was quite confusing because it actually doesn't say that
it's related to the accepting the TCP connection and also it logs
everything on the ISC_LOG_ERROR level.
Change the log message to "Accepting TCP connection failed" and for
specific error states lower the severity of the log message to
ISC_LOG_INFO.
There was a logical bug when setting a list of enabled TLS protocols,
which may lead to a crash (an abort()) on systems with ancient OpenSSL
versions.
The problem was due to the fact that we were INSIST()ing on supporting
all of the TLS versions, while checking only for mentioned in the
configuration was implied.
This commit adds an isc_nm_socket_type() function which can be used to
obtain a handle's socket type.
This change obsoletes isc_nm_is_tlsdns_handle() and
isc_nm_is_http_handle(). However, it was decided to keep the latter as
we eventually might end up supporting multiple HTTP versions.
This commit makes the TLS stream code to not issue mostly useless
debug log message on error during TLS I/O. This message was cluttering
logs a lot, as it can be generated on (almost) any non-clean TLS
connection termination, even in the cases when the actual query
completed successfully. Nor does it provide much value for end-users,
yet it can occasionally be seen when using dig and quite often when
running BIND over a publicly available network interface.
This commit removes unneeded isc__nmsocket_prep_destroy() call on ALPN
negotiation failure, which was eventually causing the TLS handle to
leak.
This call is not needed, as not attaching to the transport (TLS)
handle should be enough. At this point it seems like a kludge from
earlier days of the TLS code.
This prevents a direct leak in OPENSSL_init_crypto (called from
OPENSSL_init_ssl).
Add shim version of OPENSSL_cleanup because it is missing in LibreSSL on
OpenBSD.
This commit fixes a peculiar corner case in the client-side DoT code
because of which a crash could occur during a zone transfer. A junk
DNS message should be sent at the end of a zone transfer via TLS to
trigger the crash (abort).
This commit, hopefully, fixes that.
Also, this commit adds similar changes to the TCP DNS code, as it
shares the same origin and most of the logic.
Change 5756 (GL #2854) introduced build errors when using
'configure --disable-doh'. To fix this, isc_nm_is_http_handle() is
now defined in all builds, not just builds that have DoH enabled.
Missing code comments were added both for that function and for
isc_nm_is_tlsdns_handle().
This commit adds an isc_nm_set_min_answer_ttl() function which is
intended to to be used to give a hint to the underlying transport
regarding the answer TTL.
The interface is intentionally kept generic because over time more
transports might benefit from this functionality, but currently it is
intended for DoH to set "max-age" value within "Cache-Control" HTTP
header (as recommended in the RFC8484, section 5.1 "Cache
Interaction").
It is no-op for other DNS transports for the time being.
Check to see whether there are outstanding requests in the
httpd receive buffer after sending the response, and if so,
process them.
Test that pipelined requests are handled by sending multiple
minimal HTTP/1.1 using netcat (nc) and checking that we get
back the same number of responses.
Remember the amount of space consumed by the HTTP headers, then
move any trailing data to the start of the httpd->recvbuf once
we have finished processing the request.
if an incoming HTTP request is incomplete, but nothing else is clearly
wrong with it, the stats channel continues reading to see if there's
more coming. the buffer length was not being processed correctly in
this case. also, the server state was not reset correctly when the
request was complete, so that subsequent requests could be appended to
the first buffer instead of being treated as new.
in addition fixing the above problems, this commit also increases the
size of the httpd request buffer from 1024 to 4096, because some
browsers send a lot of headers.
Unless being configured with the `no-deprecated` option, OpenSSL 3.0.0
still has the deprecated APIs present and will throw warnings during
compilation, when using them.
Make sure that the old APIs are being used only with the older versions
of OpenSSL.
The EVP_MD_CTX_new() and EVP_MD_CTX_free() functions are renamed APIs
which were previously available as EVP_MD_CTX_create() and
EVP_MD_CTX_destroy() respectively, which means that we can use them
instead of providing our own shim functions.
OpenSSL 3.0.0 deprecates the EVP_MD_CTX_md() function.
Use EVP_MD_CTX_md() instead of EVP_MD_CTX_get0_md() and create a shim
to use the old variant for the older OpenSSL versions which don't have
the newer EVP_MD_CTX_get0_md().
The isc_time_add() and isc_time_subtract() didn't have a unit test, add
the unit test with couple of edge case vectors to check whether overflow
and underflow is correctly handled.
Use the __builtin_uadd_overflow() and __builtin_usub_overflow() for
overflow checks in isc_time_add() and isc_time_subtract(). This
generates more efficient and safe code.
The isc_time_add() could overflow when t.seconds + i.seconds == UINT_MAX
and t.nanoseconds + i.nanoseconds >= NS_PER_S.
Fix the overflow in isc_time_add(), and simplify the ISC_R_RANGE checks
both in isc_time_add() and isc_time_subtract() functions.
it is possible for udp_recv_cb() to fire after the socket
is already shutting down and statichandle is NULL; we need to
create a temporary handle in this case.
route/netlink sockets don't have stats counters associated with them,
so it's now necessary to check whether socket stats exist before
incrementing or decrementing them. rather than relying on the caller
for this, we now just pass the socket and an index, and the correct
stats counter will be updated if it exists.
isc_nm_routeconnect() opens a route/netlink socket, then calls a
connect callback, much like isc_nm_udpconnect(), with a handle that
can then be monitored for network changes.
Internally the socket is treated as a UDP socket, since route/netlink
sockets follow the datagram contract.
After support for route/netlink sockets is merged, not all sockets
will have stats counters associated with them, so it's now necessary
to check whether socket stats exist before incrementing or decrementing
them. rather than relying on the caller for this, we now just pass the
socket and an index, and the correct stats counter will be updated if
it exists.
The __builtin_expect() can be used to provide the compiler with branch
prediction information. The Gcc manual says[1] on the subject:
In general, you should prefer to use actual profile feedback for
this (-fprofile-arcs), as programmers are notoriously bad at
predicting how their programs actually perform.
Stop using __builtin_expect() and ISC_LIKELY() and ISC_UNLIKELY() macros
to provide the branch prediction information as the performance testing
shows that named performs better when the __builtin_expect() is not
being used.
1. https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_005f_005fbuiltin_005fexpect
due to comparing logfile suffixes as 32 bit rather than 64 bit
integers, logfiles with timestamp suffixes that should have been
removed when rolling could be left in place. this has been fixed.
Unify the header guard style and replace the inconsistent include guards
with #pragma once.
The #pragma once is widely and very well supported in all compilers that
BIND 9 supports, and #pragma once was already in use in several new or
refactored headers.
Using simpler method will also allow us to automate header guard checks
as this is simpler to programatically check.
For reference, here are the reasons for the change taken from
Wikipedia[1]:
> In the C and C++ programming languages, #pragma once is a non-standard
> but widely supported preprocessor directive designed to cause the
> current source file to be included only once in a single compilation.
>
> Thus, #pragma once serves the same purpose as include guards, but with
> several advantages, including: less code, avoidance of name clashes,
> and sometimes improvement in compilation speed. On the other hand,
> #pragma once is not necessarily available in all compilers and its
> implementation is tricky and might not always be reliable.
1. https://en.wikipedia.org/wiki/Pragma_once
Replace some "master/slave" terminology in the code with the preferred
"primary/secondary" keywords. This also changes user output such as
log messages, and fixes a typo ("seconary") in cfg_test.c.
There are still some references to "master" and "slave" for various
reasons:
- The old syntax can still be used as a synonym.
- The master syntax is kept when it refers to master files and formats.
- This commit replaces mainly keywords that are local. If "master" or
"slave" is used in for example a structure that is all over the
place, it is considered out of scope for the moment.
The AX_CHECK_JEMALLOC() m4 macro sets the JEMALLOC_CFLAGS variable, not
JEMALLOC_CPPFLAGS. Furthermore, the JEMALLOC_CFLAGS and JEMALLOC_LIBS
variables should only be included in the build flags if jemalloc was
successfully configured. Tweak lib/isc/Makefile.am accordingly.
Remove the dynamic registration of result codes. Convert isc_result_t
from unsigned + #defines into 32-bit enum type in grand unified
<isc/result.h> header. Keep the existing values of the result codes
even at the expense of the description and identifier tables being
unnecessary large.
Additionally, add couple of:
switch (result) {
[...]
default:
break;
}
statements where compiler now complains about missing enum values in the
switch statement.
Previously, when using compiler without support for static assertions,
the STATIC_ASSERT() macro would be replaced with runtime assertion.
Change the STATIC_ASSERT() macro to a version that's compile time
assertion even when using pre-C11 compilers.
Courtesy of Joseph Quinsey: https://godbolt.org/z/K9RvWS
This commit fixes a crash in DoT code when it was attempting to call a
read callback on the later stages of the connection when it is not
available.
It also fixes [GL #2884] (back-trace provided in the bug report is
exactly the same as was seen when fixing this problem).
This commit makes BIND verify that zone transfers are allowed to be
done over the underlying connection. Currently, it makes sense only
for DoT, but the code is deliberately made to be protocol-agnostic.
The intention of having this function is to have a predicate to check
if a zone transfer could be performed over the given handle. In most
cases we can assume that we can do zone transfers over any stream
transport except DoH, but this assumption will not work for zone
transfers over DoT (XoT), as the RFC9103 requires ALPN to happen,
which might not be the case for all deployments of DoT.
- Responses received by the dispatch are no longer sent to the caller
via a task event, but via a netmgr-style recv callback. the 'action'
parameter to dns_dispatch_addresponse() is now called 'response' and
is called directly from udp_recv() or tcp_recv() when a valid response
has been received.
- All references to isc_task and isc_taskmgr have been removed from
dispatch functions.
- All references to dns_dispatchevent_t have been removed and the type
has been deleted.
- Added a task to the resolver response context, to be used for fctx
events.
- When the caller cancels an operation, the response handler will be
called with ISC_R_CANCELED; it can abort immediately since the caller
will presumably have taken care of cleanup already.
- Cleaned up attach/detach in resquery and request.
- The `timeout_action` parameter to dns_dispatch_addresponse() been
replaced with a netmgr callback that is called when a dispatch read
times out. this callback may optionally reset the read timer and
resume reading.
- Added a function to convert isc_interval to milliseconds; this is used
to translate fctx->interval into a value that can be passed to
dns_dispatch_addresponse() as the timeout.
- Note that netmgr timeouts are accurate to the millisecond, so code to
check whether a timeout has been reached cannot rely on microsecond
accuracy.
- If serve-stale is configured, then a timeout received by the resolver
may trigger it to return stale data, and then resume waiting for the
read timeout. this is no longer based on a separate stale timer.
- The code for canceling requests in request.c has been altered so that
it can run asynchronously.
- TCP timeout events apply to the dispatch, which may be shared by
multiple queries. since in the event of a timeout we have no query ID
to use to identify the resp we wanted, we now just send the timeout to
the oldest query that was pending.
- There was some additional refactoring in the resolver: combining
fctx_join() and fctx_try_events() into one function to reduce code
duplication, and using fixednames in fetchctx and fetchevent.
- Incidental fix: new_adbaddrinfo() can't return NULL anymore, so the
code can be simplified.
The flow of operations in dispatch is changing and will now be similar
for both UDP and TCP queries:
1) Call dns_dispatch_addresponse() to assign a query ID and register
that we'll be listening for a response with that ID soon. the
parameters for this function include callback functions to inform the
caller when the socket is connected and when the message has been
sent, as well as a task action that will be sent when the response
arrives. (later this could become a netmgr callback, but at this
stage to minimize disruption to the calling code, we continue to use
isc_task for the response event.) on successful completion of this
function, a dispatch entry object will be instantiated.
2) Call dns_dispatch_connect() on the dispatch entry. this runs
isc_nm_udpconnect() or isc_nm_tcpdnsconnect(), as needed, and begins
listening for responses. the caller is informed via a callback
function when the connection is established.
3) Call dns_dispatch_send() on the dispatch entry. this runs
isc_nm_send() to send a request.
4) Call dns_dispatch_removeresponse() to terminate listening and close
the connection.
Implementation comments below:
- As we will be using netmgr buffers now. code to send the length in
TCP queries has also been removed as that is handled by the netmgr.
- TCP dispatches can be used by multiple simultaneous queries, so
dns_dispatch_connect() now checks whether the dispatch is already
connected before calling isc_nm_tcpdnsconnect() again.
- Running dns_dispatch_getnext() from a non-network thread caused a
crash due to assertions in the netmgr read functions that appear to be
unnecessary now. the assertions have been removed.
- fctx->nqueries was formerly incremented when the connection was
successful, but is now incremented when the query is started and
decremented if the connection fails.
- It's no longer necessary for each dispatch to have a pool of tasks, so
there's now a single task per dispatch.
- Dispatch code to avoid UDP ports already in use has been removed.
- dns_resolver and dns_request have been modified to use netmgr callback
functions instead of task events. some additional changes were needed
to handle shutdown processing correctly.
- Timeout processing is not yet fully converted to use netmgr timeouts.
- Fixed a lock order cycle reported by TSAN (view -> zone-> adb -> view)
by by calling dns_zt functions without holding the view lock.
- The read timer must always be stopped when reading stops.
- Read callbacks can now call isc_nm_read() again in TCP, TCPDNS and
TLSDNS; previously this caused an assertion.
- The wrong failure code could be sent after a UDP recv failure because
the if statements were in the wrong order. the check for a NULL
address needs to be after the check for an error code, otherwise the
result will always be set to ISC_R_EOF.
- When aborting a read or connect because the netmgr is shutting down,
use ISC_R_SHUTTINGDOWN. (ISC_R_CANCELED is now reserved for when the
read has been canceled by the caller.)
- A new function isc_nmhandle_timer_running() has been added enabling a
callback to check whether the timer has been reset after processing a
timeout.
- Incidental netmgr fix: always use isc__nm_closing() instead of
referencing sock->mgr->closing directly
- Corrected a few comments that used outdated function names.
Previously isc_nm_read() required references on the handle to be at
least 2, under the assumption that it would only ever be called from a
connect or accept callback. however, it can also be called from a read
callback, in which case the reference count might be only 1.
- Many dispatch attributes can be set implicitly instead of being passed
in. we can infer whether to set DNS_DISPATCHATTR_TCP or _UDP from
whether we're calling dns_dispatch_createtcp() or _createudp(). we
can also infer DNS_DISPATCHATTR_IPV4 or _IPV6 from the addresses or
the socket that were passed in.
- We no longer use dup'd sockets in UDP dispatches, so the 'dup_socket'
parameter has been removed from dns_dispatch_createudp(), along with
the code implementing it. also removed isc_socket_dup() since it no
longer has any callers.
- The 'buffersize' parameter was ignored and has now been removed;
buffersize is now fixed at 4096.
- Maxbuffers and maxrequests don't need to be passed in on every call to
dns_dispatch_createtcp() and _createudp().
In all current uses, the value for mgr->maxbuffers will either be
raised once from its default of 20000 to 32768, or else left
alone. (passing in a value lower than 20000 does not lower it.) there
isn't enough difference between these values for there to be any need
to configure this.
The value for disp->maxrequests controls both the quota of concurrent
requests for a dispatch and also the size of the dispatch socket
memory pool. it's not clear that this quota is necessary at all. the
memory pool size currently starts at 32768, but is sometimes lowered
to 4096, which is definitely unnecessary.
This commit sets both values permanently to 32768.
- Previously TCP dispatches allocated their own separate QID table,
which didn't incorporate a port table. this commit removes
per-dispatch QID tables and shares the same table between all
dispatches. since dispatches are created for each TCP socket, this may
speed up the dispatch allocation process. there may be a slight
increase in lock contention since all dispatches are sharing a single
QID table, but since TCP sockets are used less often than UDP
sockets (which were already sharing a QID table), it should not be a
substantial change.
- The dispatch port table was being used to determine whether a port was
already in use; if so, then a UDP socket would be bound with
REUSEADDR. this commit removes the port table, and always binds UDP
sockets that way.
This commit adds the ability to enable or disable stateless TLS
session resumption tickets (see RFC5077). Having this ability is
twofold.
Firstly, these tickets are encrypted by the server, and the algorithm
might be weaker than the algorithm negotiated during the TLS session
establishment (it is in general the case for TLSv1.2, but the generic
principle applies to TLSv1.3 as well, despite it having better ciphers
for session tickets). Thus, they might compromise Perfect Forward
Secrecy.
Secondly, disabling it might be necessary if the same TLS key/cert
pair is supposed to be used by multiple servers to achieve, e.g., load
balancing because the session ticket by default gets generated in
runtime, while to achieve successful session resumption ability, in
this case, would have required using a shared key.
The proper alternative to having the ability to disable stateless TLS
session resumption tickets is to implement a proper session tickets
key rollover mechanism so that key rotation might be performed
often (e.g. once an hour) to not compromise forward secrecy while
retaining the associated performance benefits. That is much more work,
though. On the other hand, having the ability to disable session
tickets allows having a deployable configuration right now in the
cases when either forward secrecy is wanted or sharing the TLS
key/cert pair between multiple servers is needed (or both).
This commit adds support for enforcing the preference of server
ciphers over the client ones. This way, the server attains control
over the ciphers priority and, thus, can choose more strong cyphers
when a client prioritises less strong ciphers over the more strong
ones, which is beneficial when trying to achieve Perfect Forward
Secrecy.
This commit adds support for setting TLS cipher list string in the
format specified in the OpenSSL
documentation (https://www.openssl.org/docs/man1.1.1/man1/ciphers.html).
The syntax of the cipher list is verified so that specifying the wrong
string will prevent the configuration from being loaded.
This commit adds support for loading DH-parameters (Diffie-Hellman
parameters) via the new "dhparam-file" option within "tls" clause. In
particular, Diffie-Hellman parameters are needed to enable the range
of forward-secrecy enabled cyphers for TLSv1.2, which are getting
silently disabled otherwise.
This commit adds the ability to specify allowed TLS protocols versions
within the "tls" clause. If an unsupported TLS protocol version is
specified in a file, the configuration file will not pass
verification.
Also, this commit adds strict checks for "tls" clauses verification,
in particular:
- it ensures that loading configuration files containing duplicated
"tls" clauses is not allowed;
- it ensures that loading configuration files containing "tls" clauses
missing "cert-file" or "key-file" is not allowed;
- it ensures that loading configuration files containing "tls" clauses
named as "ephemeral" or "none" is not allowed.
It was discovered that named could crash due to a segmentation fault
when jemalloc was in use and memory allocation failed. This was not
intended to happen as jemalloc's "xmalloc" option was set to "true" in
the "malloc_conf" configuration variable. However, that variable was
only set after jemalloc was already done with parsing it, which
effectively caused setting that variable to have no effect.
While investigating this issue, it was also discovered that enabling the
"xmalloc" option makes jemalloc use a slow processing path, decreasing
its performance by about 25%. [1]
Additionally, further testing (carried out after fixing the way
"malloc_conf" was set) revealed that the non-default configuration
options do not have any measurable effect on either authoritative or
recursive DNS server performance.
Replace code setting various jemalloc options to non-default values with
assertion checks of mallocx()/rallocx() return values.
[1] https://github.com/jemalloc/jemalloc/pull/523
On TCPDNS/TLSDNS read callback, the socket buffer could be reallocated
if the received contents would be larger than the buffer. The existing
code would not preserve the contents of the existing buffer which lead
to the loss of the already received data.
This commit changes the isc_mem_put()+isc_mem_get() with isc_mem_reget()
to preserve the existing contents of the socket buffer.
The netmgr, has an internal cache for freed active handles. This cache
was allocated using isc_mem_allocate()/isc_mem_free() API because it was
simpler to reallocate the cache when we needed to grow it. The new
isc_mem_reget() function could be used here reducing the need to use
isc_mem_allocate() API which is tad bit slower than isc_mem_get() API.
Previously, we cannot use isc_mem_reallocate() for growing the buffer
dynamically, because the memory was allocated using the
isc_mem_get()/isc_mem_put() API. With the introduction of the
isc_mem_reget() function, we can use grow/shrink the memory directly
without always moving the memory around as the allocator might have
reserved some extra space after the initial allocation.
Previously, the zero-sized allocations would return NULL pointer and the
caller had to make sure to not dereference such pointer. The C standard
defines the zero-sized calls to malloc() as implementation specific and
jemalloc mallocx() with zero size would be undefined behaviour. This
complicated the code as it had to handle such cases in a special manner
in all allocator and deallocator functions.
Now, for realloc(), the situation is even more complicated. In C
standard up to C11, the behavior would be implementation defined, and
actually some implementation would free to orig ptr and some would not.
Since C17 (via DR400) would deprecate such usage and since C23, the
behaviour would be undefined.
This commits changes helper mem_get(), mem_put() and mem_realloc()
functions to grow the zero-allocation from 0 to sizeof(void *).
This way we get a predicable behaviour that all the allocations will
always return valid pointer.
The isc_mem_get() and isc_mem_put() functions are leaving the memory
allocation size tracking to the users of the API, while
isc_mem_allocate() and isc_mem_free() would track the sizes internally.
This allowed to have isc_mem_rellocate() to manipulate the memory
allocations by the later set, but not the former set of the functions.
This commit introduces isc_mem_reget(ctx, old_ptr, old_size, new_size)
function that operates on the memory allocations with external size
tracking completing the API.
The native PKCS#11 support has been removed in favour of better
maintained, more performance and easier to use OpenSSL PKCS#11 engine
from the OpenSC project.
This commit adds new function isc_nm_http_makeuri() which is supposed
to unify DoH URI construction throughout the codebase.
It handles IPv6 addresses, hostnames, and IPv6 addresses given as
hostnames properly, and replaces similar ad-hoc code in the codebase.
The previous versions of BIND 9 exported its internal libraries so that
they can be used by third-party applications more easily. Certain
library functions were altered from specific BIND-only behavior to more
generic behavior when used by other applications.
This commit removes the function isc_lib_register() that was used by
external applications to enable the functionality.
this commit removes isc__nm_tcpdns_keepalive() and
isc__nm_tlsdns_keepalive(); keepalive for these protocols and
for TCP will now be set directly from isc_nmhandle_keepalive().
protocols that have an underlying TCP socket (i.e., TLS stream
and HTTP), now have protocol-specific routines, called by
isc_nmhandle_keeaplive(), to set the keepalive value on the
underlying socket.
previously, receiving a keepalive option had no effect on how
long named would keep the connection open; there was a place to
configure the keepalive timeout but it was never used. this commit
corrects that.
this also fixes an error in isc__nm_{tcp,tls}dns_keepalive()
in which the sense of a REQUIRE test was reversed; previously this
error had not been noticed because the functions were not being
used.
- fix some duplicated and out-of-order prototypes declared in
netmgr-int.h
- rename isc_nm_tcpdns_keepalive to isc__nm_tcpdns_keepalive as
it's for internal use
This commit changes the DoH code in such a way that it makes no
assumptions regarding which headers are expected to be processed
first. In particular, the code expected the :method: pseudo-header to
be processed early, which might not be true.
Add a new function to resize the number of counters in a statistics
counter structure. This will be needed when we keep track of DNSSEC
sign statistics and new keys are introduced due to a rollover.
Add a simple stats unit test that tests the existing library functions
isc_stats_ncounters, isc_stats_increment, isc_stats_decrement,
isc_stats_set, and isc_stats_update_if_greater.
Instead of disabling the fragmentation on the UDP sockets, we now
disable the Path MTU Discovery by setting IP(V6)_MTU_DISCOVER socket
option to IP_PMTUDISC_OMIT on Linux and disabling IP(V6)_DONTFRAG socket
option on FreeBSD. This option sets DF=0 in the IP header and also
ignores the Path MTU Discovery.
As additional mitigation on Linux, we recommend setting
net.ipv4.ip_no_pmtu_disc to Mode 3:
Mode 3 is a hardend pmtu discover mode. The kernel will only accept
fragmentation-needed errors if the underlying protocol can verify
them besides a plain socket lookup. Current protocols for which pmtu
events will be honored are TCP, SCTP and DCCP as they verify
e.g. the sequence number or the association. This mode should not be
enabled globally but is only intended to secure e.g. name servers in
namespaces where TCP path mtu must still work but path MTU
information of other protocols should be discarded. If enabled
globally this mode could break other protocols.
The commit fixes the doh_recv_send() because occasionally it would
fail because it did not wait for all responses to be sent, making the
check for ssends value to nit pass.
This commit changes TLS stream behaviour in such a way, that it is now
optimised for small writes. In the case there is a need to write less
or equal to 512 bytes, we could avoid calling the memory allocator at
the expense of possibly slight increase in memory usage. In case of
larger writes, the behviour remains unchanged.
At least at this point doing memory copying is not required. Probably
it was a workaround for some problem in the earlier days of DoH, at
this point it appears to be a waste of CPU cycles.
This commit significantly simplifies the code in http_send_outgoing()
as it was unnecessary complicated, because it was dealing with
multiple statically and dynamically allocated buffers, making it
extremely hard to follow, as well as making it to do unnecessary
memory copying in some situations. This commit fixes these issues,
while retaining the high level buffering logic.
When an HTTP/2 client terminates a session it means that it is about
to close the underlying connection. However, we were not doing that.
As a result, with the latest changes to the test suite, which made it
to limit amount of requests per a transport connection, the tests
using quota would hang for quite a while. This commit fixes that.
This commit ensures that only a limited number of requests is going to
be sent over a single HTTP/2 connection. Before that change was
introduced, it was possible to complete all of the planned sends via
only one transport connection, which undermines the purpose of the
tests using the quota facility.
The function should not be called here because it is, in general,
supposed to be called at the end of the transport level callbacks to
perform I/O, and thus, calling it here is clearly a mistake because it
breaks other code expectations. As a result of the call to
http_do_bio() from within isc__nm_http_request() the unit tests were
running slower than expected in some situations.
In this particular situation http_do_bio() is going to be called at
the end of the transport_connect_cb() (initially), or http_readcb(),
sending all of the scheduled requests at once.
This change affects only the test suite because it is the only place
in the codebase where isc__nm_http_request() is used in order to
ensure that the server is able to handle multiple HTTP/2 streams at
once.
This commit fixes a crash in DoH caused by transport handle to be
detached too early when sending outgoing data.
We need to attach to the session->handle earlier because as an
indirect result of the nghttp2_session_mem_send() the session might
get closed and the handle detached. However, there is still might be
some outgoing data to handle. Besides, even when the underlying socket
was closed via the handle, we still should try to attempt to send
outgoing data via isc_nm_send() to let it call write callback, passed
to the http_send_outgoing().
This commit gets rid of custom code taking care of response buffering
by replacing the custom code with isc_buffer_t. Also, it gets rid of
an unnecessary memory copying when sending a response.
This commit replaces the ad-hoc 64K buffer for incoming POST data with
isc_buffer_t backed by dynamically allocated buffer sized accordingly
to the value in the "Content-Length" header.
The commit replaces an ad-hoc incoming DNS-message buffer in the
client-side DoH code with isc_buffer_t.
The commit also fixes a timing issue in the unit tests revealed by the
change.
This commit replaces a static ad-hoc HTTP/2 session's temporary buffer
with a realloc-able isc_buffer_t object, which is being allocated on
as needed basis, lowering the memory consumption somewhat. The buffer
is needed in very rare cases, so allocating it prematurely is not
wise.
Also, it fixes a bug in http_readcb() where the ad-hoc buffer appeared
to be improperly used, leading to a situation when the processed data
from the receiving regions can be processed twice, while unprocessed
data will never be processed.
This commit gets rid of RW locks in a hot path of the DoH code. In the
original design, it was implied that we add new endpoints after the
HTTP listener was created. Such a design implies some locking. We do
not need such flexibility, though. Instead, we could build a set of
endpoints before the HTTP listener gets created. Such a design does
not need RW locks at all.
On the isc_mem water change the old water_t structure could be used
after free. Instead of introducing reference counting on the hot-path
we are going to introduce additional constraints on the
isc_mem_setwater. Once it's set for the first time, the additional
calls have to be made with the same water and water_arg arguments.
This commit makes number of concurrent HTTP/2 streams per connection
configurable as a mean to fight DDoS attacks. As soon as the limit is
reached, BIND terminates the whole session.
The commit adds a global configuration
option (http-streams-per-connection) which can be overridden in an
http <name> {...} statement like follows:
http local-http-server {
...
streams-per-connection 100;
...
};
For now the default value is 100, which should be enough (e.g. NGINX
uses 128, but it is a full-featured WEB-server). When using lower
numbers (e.g. ~70), it is possible to hit the limit with
e.g. flamethrower.
This commit adds support for http-listener-clients global options as
well as ability to override the default in an HTTP server description,
like:
http local-http-server {
...
listener-clients 100;
...
};
This way we have ability to specify per-listener active connections
quota globally and then override it when required. This is exactly
what AT&T requested us: they wanted a functionality to specify quota
globally and then override it for specific IPs. This change
functionality makes such a configuration possible.
It makes sense: for example, one could have different quotas for
internal and external clients. Or, for example, one could use BIND's
internal ability to serve encrypted DoH with some sane quota value for
internal clients, while having un-encrypted DoH listener without quota
to put BIND behind a load balancer doing TLS offloading for external
clients.
Moreover, the code no more shares the quota with TCP, which makes
little sense anyway (see tcp-clients option), because of the nature of
interaction of DoH clients: they tend to keep idle opened connections
for longer periods of time, preventing the TCP and TLS client from
being served. Thus, the need to have a separate, generally larger,
quota for them.
Also, the change makes any option within "http <name> { ... };"
statement optional, making it easier to override only required default
options.
By default, the DoH connections are limited to 300 per listener. I
hope that it is a good initial guesstimate.
This commit adds the code (and some tests) which allows verifying
validity of HTTP paths both in incoming HTTP requests and in BIND's
configuration file.
An unhandled code path left GET query string data uninitialised (equal
to NULL) and led to a crash during the requests' base64 data
decoding. This commit fixes that.
It was discovered that setting the thread affinity on both the netmgr
and netthread threads lead to inconsistent recursive performance because
sometimes the netmgr and netthread threads would compete over single
resource and sometimes not.
Removing setting the affinity causes a slight dip in the authoritative
performance around 5% (the measured range was from 3.8% to 7.8%), but
the recursive performance is now consistently good.
On OpenBSD and more generally on platforms without either jemalloc or
malloc_(usable_)size, we need to increase the alignment for the memory
to sizeof(max_align_t) as with plain sizeof(void *), the compiled code
would be crashing when accessing the returned memory.
It was discovered that on some platforms (f.e. Alpine Linux with MUSL)
the result of isc_os_ncpus() call differ when called before and after we
drop privileges. This commit changes the isc_os_ncpus() call to cache
the result from the first call and thus always return the same value
during the runtime of the named. The first call to isc_os_ncpus() is
made as soon as possible on the library initalization.
The isc_mem_get(), isc_mem_allocate() and isc_mem_reallocate() can
return NULL ptr in case where the allocation size is NULL. Remove the
nonnull attribute from the functions' declarations.
This stems from the following definition in the C11 standard:
> If the size of the space requested is zero, the behavior is
> implementation-defined: either a null pointer is returned, or the
> behavior is as if the size were some nonzero value, except that the
> returned pointer shall not be used to access an object.
In this case, we return NULL as it's easier to detect errors when
accessing pointer from zero-sized allocation which should obviously
never happen.
In the rallocx() shim for OpenBSD (that's the only platform that doesn't
have malloc_size() or malloc_usable_size() equivalent), the newly
allocated size was missing the extra size_t member for storing the
allocation size leading to size_t sized overflow at the end of the
reallocated memory chunk.
In the jemalloc merge request, we missed the fact that ah_frees and ah_handles
are reallocated which is not compatible with using isc_mem_get() for allocation
and isc_mem_put() for deallocation. This commit reverts that part and restores
use of isc_mem_allocate() and isc_mem_free().
Previously, the isc_mem_allocate() and isc_mem_free() would be used for
isc_mem_total test, but since we now use the real allocation
size (sallocx, malloc_size, malloc_usable_size) to track the allocation
size, it's impossible to get the test value right. Changing the test to
use isc_mem_get() and isc_mem_put() will use the exact size provided, so
the test would work again on all the platforms even when jemalloc is not
being used.
This commit refactors the water mechanism in the isc_mem API to use
single pointer to a water_t structure that can be swapped with
atomic_exchange operation instead of having four different
values (water, water_arg, hi_water, lo_water) in the flat namespace.
This reduces the need for locking and prevents a race when water and
water_arg could be desynchronized.
Calls to jemalloc extended API with size == 0 ends up in undefined
behaviour. This commit makes the isc_mem_get() and friends calls
more POSIX aligned:
If size is 0, either a null pointer or a unique pointer that can be
successfully passed to free() shall be returned.
We picked the easier route (which have been already supported in the old
code) and return NULL on calls to the API where size == 0.
This commit adds support for systems where the jemalloc library is not
available as a package, here's the quick summary:
* On Linux - the jemalloc is usually available as a package, if
configured --without-jemalloc, the shim would be used around
malloc(), free(), realloc() and malloc_usable_size()
* On macOS - the jemalloc is available from homebrew or macports, if
configured --without-jemalloc, the shim would be used around
malloc(), free(), realloc() and malloc_size()
* On FreeBSD - the jemalloc is *the* system allocator, we just need
to check for <malloc_np.h> header to get access to non-standard API
* On NetBSD - the jemalloc is *the* system allocator, we just need to
check for <jemalloc/jemalloc.h> header to get access to non-standard
API
* On a system hostile to users and developers (read OpenBSD) - the
jemalloc API is emulated by using ((size_t *)ptr)[-1] field to hold
the size information. The OpenBSD developers care only for
themselves, so why should we care about speed on OpenBSD?
- isc_mempool_get() can no longer fail; when there are no more objects
in the pool, more are always allocated. checking for NULL return is
no longer necessary.
- the isc_mempool_setmaxalloc() and isc_mempool_getmaxalloc() functions
are no longer used and have been removed.
Current mempools are kind of hybrid structures - they serve two
purposes:
1. mempool with a lock is basically static sized allocator with
pre-allocated free items
2. mempool without a lock is a doubly-linked list of preallocated items
The first kind of usage could be easily replaced with jemalloc small
sized arena objects and thread-local caches.
The second usage not-so-much and we need to keep this (in
libdns:message.c) for performance reasons.
Previously, we only had capability to trace the mempool gets and puts,
but for debugging, it's sometimes also important to keep track how many
and where do the memory pools get created and destroyed. This commit
adds such tracking capability.
The isc_mem_allocate() comes with additional cost because of the memory
tracking. In this commit, we replace the usage with isc_mem_get()
because we track the allocated sizes anyway, so it's possible to also
replace isc_mem_free() with isc_mem_put().
The jemalloc non-standard API fits nicely with our memory contexts, so
just rewrite the memory context internals to use the non-public API.
There's just one caveat - since we no longer track the size of the
allocation for isc_mem_allocate/isc_mem_free combination, we need to use
sallocx() to get real allocation size in both allocator and deallocator
because otherwise the sizes would not match.
The ISC_MEM_DEBUGSIZE and ISC_MEM_DEBUGCTX did sanity checks on matching
size and memory context on the memory returned to the allocator. Those
will no longer needed when most of the allocator will be replaced with
jemalloc.
There's global variable called `malloc_conf` that can be used to
configure jemalloc behaviour at the program startup. We use following
configuration:
* xmalloc:true - abort-on-out-of-memory enabled.
* background_thread:true - Enable internal background worker threads
to handle purging asynchronously.
* metadata_thp:auto - allow jemalloc to use transparent huge page
(THP) for internal metadata initially, but may begin to do so when
metadata usage reaches certain level.
* dirty_decay_ms:30000 - Approximate time in milliseconds from the
creation of a set of unused dirty pages until an equivalent set of
unused dirty pages is purged and/or reused.
* muzzy_decay_ms:30000 - Approximate time in milliseconds from the
creation of a set of unused muzzy pages until an equivalent set of
unused muzzy pages is purged and/or reused.
More information about the specific meaning can be found in the jemalloc
manpage or online at http://jemalloc.net/jemalloc.3.html
The jemalloc allocator is scalable high performance allocator, this is
the first in the series of commits that will add jemalloc as a memory
allocator for BIND 9.
This commit adds configure.ac check and Makefile modifications to use
jemalloc as BIND 9 allocator.
Previously, we only had capability to trace the memory gets and puts,
but for debugging, it's sometimes also important to keep track how many
and where do the memory contexts get created and destroyed. This commit
adds such tracking capability.
This commit makes BIND return HTTP status codes for malformed or too
small requests.
DNS request processing code would ignore such requests. Such an
approach works well for other DNS transport but does not make much
sense for HTTP, not allowing it to complete the request/response
sequence.
Suppose execution has reached the point where DNS message handling
code has been called. In that case, it means that the HTTP request has
been successfully processed, and, thus, we are expected to respond to
it either with a message containing some DNS payload or at least to
return an error status code. This commit ensures that BIND behaves
this way.
This error code fits better than the more generic "Internal Server
Error" (500) which implies that the problem is on the server.
Also, do not end the whole HTTP/2 session on a bad request.
We were too strict regarding the value and presence of "Accept" HTTP
header, slightly breaking compatibility with the specification.
According to RFC8484 client SHOULD add "Accept" header to the requests
but MUST be able to handle "application/dns-message" media type
regardless of the value of the header. That basically suggests we
ignore its value.
Besides, verifying the value of the "Accept" header is a bit tricky
because it could contain multiple media types, thus requiring proper
parsing. That is doable but does not provide us with any benefits.
Among other things, not verifying the value also fixes compatibility
with clients, which could advertise multiple media types as supported,
which we should accept. For example, it is possible for a perfectly
valid request to contain "application/dns-message", "application/*",
and "*/*" in the "Accept" header value. Still, we would treat such a
request as invalid.
The commit fixes BIND hanging when browsers end HTTP/2 streams
prematurely (for example, by sending RST_STREAM). It ensures that
isc__nmsocket_prep_destroy() will be called for an HTTP/2 stream,
allowing it to be properly disposed.
The problem was impossible to reproduce using dig or DoH benchmarking
software (e.g. flamethrower) because these do not tend to end HTTP/2
streams prematurely.
This commit adds two new autoconf options `--enable-doh` (enabled by
default) and `--with-libnghttp2` (mandatory when DoH is enabled).
When DoH support is disabled the library is not linked-in and support
for http(s) protocol is disabled in the netmgr, named and dig.
The isc/platform.h header was left empty which things either already
moved to config.h or to appropriate headers. This is just the final
cleanup commit.
The last remaining defines needed for platforms without NAME_MAX and
PATH_MAX (I'm looking at you, GNU Hurd) were moved to isc/dir.h where
it's prevalently used.
The ISC_STRERRORSIZE was defined in isc/platform.h header as the
value was different between Windows and POSIX platforms. Now that
Windows is gone, move the define to where it belongs.
The Makefile.tests was modifying global AM_CFLAGS and LDADD and could
accidentally pull /usr/include to be listed before the internal
libraries, which is known to cause problems if the headers from the
previous version of BIND 9 has been installed on the build machine.
In DNS Flag Day 2020, we started setting the DF (Don't Fragment socket
option on the UDP sockets. It turned out, that this code was incomplete
leading to dropping the outgoing UDP packets.
This has been now remedied, so it is possible to disable the
fragmentation on the UDP sockets again as the sending error is now
handled by sending back an empty response with TC (truncated) bit set.
This reverts commit 66eefac78c.
When the fragmentation is disabled on UDP sockets, the uv_udp_send()
call can fail with UV_EMSGSIZE for messages larger than path MTU.
Previously, this error would end with just discarding the response. In
this commit, a proper handling of such case is added and on such error,
a new DNS response with truncated bit set is generated and sent to the
client.
This change allows us to disable the fragmentation on the UDP
sockets again.
Previously, each protocol (TCPDNS, TLSDNS) has specified own function to
disable pipelining on the connection. An oversight would lead to
assertion failure when opcode is not query over non-TCPDNS protocol
because the isc_nm_tcpdns_sequential() function would be called over
non-TCPDNS socket. This commit removes the per-protocol functions and
refactors the code to have and use common isc_nm_sequential() function
that would either disable the pipelining on the socket or would handle
the request in per specific manner. Currently it ignores the call for
HTTP sockets and causes assertion failure for protocols where it doesn't
make sense to call the function at all.
The requirements for BIND 9.17+ now requires C11 support from the
compiler, so we can safely drop most of the stdatomic.h shims from
lib/isc/unix/include/stdatomic.h.
This commit removes support for clang atomic builtins (clang >= 3.6.0
includes stdatomic.h header) and for Gcc __sync builtins.
The only compatibility shim that remains is support for __atomic
builtins for Gcc >= 4.7.0 since CentOS 7 still includes only Gcc 4.8.1
and the proper stdatomic.h header was only introduced in Gcc >= 4.9.
The warning was produced by an ASAN build:
runtime error: null pointer passed as argument 2, which is declared to
never be null
This commit fixes it by checking if nghttp2_session_mem_send() has
actually returned anything.
This change sets the mentioned fields properly and gets rid of klusges
added in the times when we were keeping pointers to isc_sockaddr_t
instead of copies. Among other things it helps to avoid a situation
when garbage instead of an address appears in dig output.
We cannot use DoH for zone transfers. According to RFC8484 a DoH
request contains exactly one DNS message (see Section 6: Definition of
the "application/dns-message" Media Type,
https://datatracker.ietf.org/doc/html/rfc8484#section-6). This makes
DoH unsuitable for zone transfers as often (and usually!) these need
more than one DNS message, especially for larger zones.
As zone transfers over DoH are not (yet) standardised, nor discussed
in RFC8484, the best thing we can do is to return "not implemented."
Technically DoH can be used to transfer small zones which fit in one
message, but that is not enough for the generic case.
Also, this commit makes the server-side DoH code ensure that no
multiple responses could be attempted to be sent over one HTTP/2
stream. In HTTP/2 one stream is mapped to one request/response
transaction. Now the write callback will be called with failure error
code in such a case.
Support a situation in header processing callback when client side
code could receive a belated response or part of it. That could
happen when the HTTP/2 session was already closed, but there were some
response data from server in flight. Other client-side nghttp2
callbacks code already handled this case.
The bug became apparent after HTTP/2 write buffering was supported,
leading to rare unit test failures.
This commit ensures that sock->h2.connect.cstream gets nullified when
the object in question is deleted. This fixes a nasty crash in dig
exposed when receiving large responses leading to double free()ing.
Also, it refactors how the client-side code keeps track of client
streams (hopefully) preventing from similar errors appearing in the
future.
This commit makes NM code to report HTTP as a stream protocol. This
makes it possible to handle large responses properly. Like:
dig +https @127.0.0.1 A cmts1-dhcp.longlines.com
The Windows support has been completely removed from the source tree
and BIND 9 now no longer supports native compilation on Windows.
We might consider reviewing mingw-w64 port if contributed by external
party, but no development efforts will be put into making BIND 9 compile
and run on Windows again.
While cleaning up the usage of HAVE_UV_<func> macros, we forgot to
cleanup the HAVE_UV_UDP_CONNECT in the actual code and
HAVE_UV_TRANSLATE_SYS_ERROR and this was causing Windows build to fail
on uv_udp_send() because the socket was already connected and we were
falsely assuming that it was not.
The platforms with autoconf support were not affected, because we were
still checking for the functions from the configure.
This commit adds the ability to consolidate HTTP/2 write requests if
there is already one in flight. If it is the case, the code will
consolidate multiple subsequent write request into a larger one
allowing to utilise the network in a more efficient way by creating
larger TCP packets as well as by reducing TLS records overhead (by
creating large TLS records instead of multiple small ones).
This optimisation is especially efficient for clients, creating many
concurrent HTTP/2 streams over a transport connection at once. This
way, the code might create a small amount of multi-kilobyte requests
instead of many 50-120 byte ones.
In fact, it turned out to work so well that I had to add a work-around
to the code to ensure compatibility with the flamethrower, which, at
the time of writing, does not support TLS records larger than two
kilobytes. Now the code tries to flush the write buffer after 1.5
kilobyte, which is still pretty adequate for our use case.
Essentially, this commit implements a recommendation given by nghttp2
library:
https://nghttp2.org/documentation/nghttp2_session_mem_send.html
The libuv has a support for running long running tasks in the dedicated
threadpools, so it doesn't affect networking IO.
This commit adds isc_nm_work_enqueue() wrapper that would wraps around
the libuv API and runs it on top of associated worker loop.
The only limitation is that the function must be called from inside
network manager thread, so the call to the function should be wrapped
inside a (bound) task.
Instead of having a configure check for every missing function that has
been added in later version of libuv, we now use UV_VERSION_HEX to
decide whether we need the shim or not.
The uv_req_get_data() and uv_req_set_data() functions were introduced in
libuv >= 1.19.0, so we need to add compatibility shims with older libuv
versions.
On some platforms, the __attribute__ constructor and destructor won't
take priorities and the compilation failed. On such platform would be
macOS. For this reason, the constructor/destructor in the libisc was
reworked to not use priorities, but have a single constructor and
destructor that calls the appropriate routines in correct order.
This commit removes the extra priority because it's now not needed and
it also breaks a compilation on macOS with GCC 10.
configuring with --enable-mutex-atomics flagged these incorrectly
initialised variables on systems where pthread_mutex_init doesn't
just zero out the structure.
The isc_nmiface_t type was holding just a single isc_sockaddr_t,
so we got rid of the datatype and use plain isc_sockaddr_t in place
where isc_nmiface_t was used before. This means less type-casting and
shorter path to access isc_sockaddr_t members.
At the same time, instead of keeping the reference to the isc_sockaddr_t
that was passed to us when we start listening, we will keep a local
copy. This prevents the data race on destruction of the ns_interface_t
objects where pending nmsockets could reference the sockaddr of already
destroyed ns_interface_t object.
Previously, as a way of reducing the contention between threads a
clientmgr object would be created for each interface/IP address.
We tasks being more strictly bound to netmgr workers, this is no longer
needed and we can just create clientmgr object per worker queue (ncpus).
Each clientmgr object than would have a single task and single memory
context.
Since a client object is bound to a netmgr handle, each client
will always be processed by the same netmgr worker, so we can
simplify the code by binding client->task to the same thread as
the client. Since ns__client_request() now runs in the same event
loop as client->task events, is no longer necessary to pause the
task manager before launching them.
Also removed some functions in isc_task that were not used.
Instead of using fixed quantum, this commit adds atomic counter for
number of items on each queue and uses the number of netievents
scheduled to run as the limit of maximum number of netievents for a
single process_queue() run.
This prevents the endless loops when the netievent would schedule more
netievents onto the same loop, but we don't have to pick "magic" number
for the quantum.
This commit adds a new configuration option to set the receive and send
buffer sizes on the TCP and UDP netmgr sockets. The default is `0`
which doesn't set any value and just uses the value set by the operating
system.
There's no magic value here - set it too small and the performance will
drop, set it too large, the buffers can fill-up with queries that have
already timeouted on the client side and nobody is interested for the
answer and this would just make the server clog up even more by making
it produce useless work.
The `netstat -su` can be used on POSIX systems to monitor the receive
and send buffer errors.
The outgoing UDP socket selection would pick unintialized children
socket on Windows, because we have more netmgr workers than we have
listening sockets. This commit fixes the selection by keeping the
outgoing socket the same, so it's always run on existing socket.
The initial intent was to limit the number of concurrent streams by
the value of 100 but due to the error when reading the documentation
it was set to the maximum possible number of streams per session.
This could lead to security issues, e.g. a remote attacker could have
taken down the BIND instance by creating lots of sessions via low
number of transport connections. This commit fixes that.
We should not call nghttp2_session_terminate_session() in server-side
code after all of the active HTTP/2 streams are processed. The
underlying transport connection is expected to remain opened at least
for some time in this case for new HTTP/2 requests to arrive. That is
what flamethrower was expecting and it makes perfect sense from the
HTTP/2 perspective.
During the stress testing, it was discovered that the default netmgr
quantum of 128 is not enough and there was a performance drop for TCP on
FreeBSD. Bumping the default quantum to 1024 solves the performance
issue and is still enough to prevent the endless loops.
We were clearing the pointer to taskmgr as soon as isc_taskmgr_destroy()
would be called and before all tasks were finished. Unfortunately, some
tasks would use global named_g_taskmgr objects from inside the events
and this would cause either a data race or NULL pointer dereference.
This commit fixes the data race by moving the destruction of the
referenced pointer to the time after all tasks are finished.
Network manager events that require interlock (pause, resume, listen)
are now always executed in the same worker thread, mgr->workers[0],
to prevent races.
"stoplistening" events no longer require interlock.
- ensure isc_nm_pause() and isc_nm_resume() work the same whether
run from inside or outside of the netmgr.
- promote 'stop' events to the priority event level so they can
run while the netmgr is pausing or paused.
- when pausing, drain the priority queue before acquiring an
interlock; this prevents a deadlock when another thread is waiting
for us to complete a task.
- release interlock after pausing, reacquire it when resuming, so
that stop events can happen.
some incidental changes:
- use a function to enqueue pause and resume events (this was part of a
different change attempt that didn't work out; I kept it because I
thought was more readable).
- make mgr->nworkers a signed int to remove some annoying integer casts.
The netmgr listening, stoplistening, pausing and resuming functions
now use barriers for synchronization, which makes the code much simpler.
isc/barrier.h defines isc_barrier macros as a front-end for uv_barrier
on platforms where that works, and pthread_barrier where it doesn't
(including TSAN builds).
When isc__nm_http_stoplistening() is run from inside the netmgr, we need
to make sure it's run synchronously. This commit is just a band-aid
though, as the desired behvaior for isc_nm_stoplistening() is not always
the same:
1. When run from outside user of the interface, the call must be
synchronous, e.g. the calling code expects the call to really stop
listening on the interfaces.
2. But if there's a call from listen<proto> when listening fails,
that needs to be scheduled to run asynchronously, because
isc_nm_listen<proto> is being run in a paused (interlocked)
netmgr thread and we could get stuck.
The proper solution would be to make isc_nm_stoplistening()
behave like uv_close(), i.e., to have a proper callback.
all zone loading tasks have the privileged flag, but we only want
them to run as privileged tasks when the server is being initialized;
if we privilege them the rest of the time, the server may hang for a
long time after a reload/reconfig. so now we call isc_taskmgr_setmode()
to turn privileged execution mode on or off in the task manager.
isc_task_privileged() returns true if the task's privilege flag is
set *and* the taskmgr is in privileged execution mode. this is used
to determine in which netmgr event queue the task should be run.
There was a theoretical possibility of clogging up the queue processing
with an endless loop where currently processing netievent would schedule
new netievent that would get processed immediately. This wasn't such a
problem when only netmgr netievents were processed, but with the
addition of the tasks, there are at least two situation where this could
happen:
1. In lib/dns/zone.c:setnsec3param() the task would get re-enqueued
when the zone was not yet fully loaded.
2. Tasks have internal quantum for maximum number of isc_events to be
processed, when the task quantum is reached, the task would get
rescheduled and then immediately processed by the netmgr queue
processing.
As the isc_queue doesn't have a mechanism to atomically move the queue,
this commit adds a mechanism to quantize the queue, so enqueueing new
netievents will never stop processing other uv_loop_t events.
The default quantum size is 128.
Since the queue used in the network manager allows items to be enqueued
more than once, tasks are now reference-counted around task_ready()
and task_run(). task_ready() now has a public API wrapper,
isc_task_ready(), that the netmgr can use to reschedule processing
of a task if the quantum has been reached.
Incidental changes: Cleaned up some unused fields left in isc_task_t
and isc_taskmgr_t after the last refactoring, and changed atomic
flags to atomic_bools for easier manipulation.
With taskmgr running on top of netmgr, the ordering of how the tasks and
netmgr shutdown interacts was wrong as previously isc_taskmgr_destroy()
was waiting until all tasks were properly shutdown and detached. This
responsibility was moved to netmgr, so we now need to do the following:
1. shutdown all the tasks - this schedules all shutdown events onto
the netmgr queue
2. shutdown the netmgr - this also makes sure all the tasks and
events are properly executed
3. Shutdown the taskmgr - this now waits for all the tasks to finish
running before returning
4. Shutdown the netmgr - this call waits for all the netmgr netievents
to finish before returning
This solves the race when the taskmgr object would be destroyed before
all the tasks were finished running in the netmgr loops.
Previously, netmgr, taskmgr, timermgr and socketmgr all had their own
isc_<*>mgr_create() and isc_<*>mgr_destroy() functions. The new
isc_managers_create() and isc_managers_destroy() fold all four into a
single function and makes sure the objects are created and destroy in
correct order.
Especially now, when taskmgr runs on top of netmgr, the correct order is
important and when the code was duplicated at many places it's easy to
make mistake.
The former isc_<*>mgr_create() and isc_<*>mgr_destroy() functions were
made private and a single call to isc_managers_create() and
isc_managers_destroy() is required at the program startup / shutdown.
Under some circumstances a situation might occur when server-side
session gets finished while there are still active HTTP/2
streams. This would lead to isc_nm_httpsocket object leaks.
This commit fixes this behaviour as well as refactors failed_read_cb()
to allow better code reuse.
This commit fixes a situation when a cstream object could get unlinked
from the list as a result of a cstream->read_cb call. Thus, unlinking
it after the call could crash the program.
... the last handle has been detached after calling write
callback. That makes it possible to detach from the underlying socket
and not to keep the socket object alive for too long. This issue was
causing TLS tests with quota to fail because quota might not have been
detached on time (because it was still referenced by the underlying
TCP socket).
One could say that this commit is an ideological continuation of:
513cdb52ec.
This way we create less netievent objects, not bombarding NM with the
messages in case of numerous low-level errors (like too many open
files) in e.g. unit tests.
This change ensures that a TCP connect callback is called from within
the context of a worker thread in case of a low-level error when
descriptors cannot be created (e.g. when there are too many open file
descriptors).
When looking for key files, we could use isdigit rather than checking
if the character is within the range [0-9].
Use (unsigned char) cast to ensure the value is representable in the
unsigned char type (as suggested by the isdigit manpage).
Change " & 0xff" occurrences to the recommended (unsigned char) type
cast.
This commit adds support for generating backtraces on Windows and
refactors the isc_backtrace API to match the Linux/BSD API (without
the isc_ prefix)
* isc_backtrace_gettrace() was renamed to isc_backtrace(), the third
argument was removed and the return type was changed to int
* isc_backtrace_symbols() was added
* isc_backtrace_symbols_fd() was added and used as appropriate
On Windows, the iocompletionport_createthreads() didn't use
isc_thread_create() to create new threads for processing IO, but just a
simple CreateThread() function that completely circumvent the
isc_trampoline mechanism to initialize global isc_tid_v. This lead to
segmentation fault in isc_hp API because '-1' isn't valid index to the
hazard pointer array.
This commit changes the iocompletionport_createthreads() to use
isc_thread_create() instead of CreateThread() to properly initialize
isc_tid_v.
The malloc attribute allows compiler to do some optmizations on
functions that behave like malloc/calloc, like assuming that the
returned pointer do not alias other pointers.
There is no possibility for mpctx->items to be NULL at the point where
the code was removed, since we enforce that fillcount > 0, if
mpctx->items == NULL when isc_mempool_get is called, then we will
allocate fillcount more items and add to the mpctx->items list.
as with TLS, the destruction of a client stream on failed read
needs to be conditional: if we reached failed_read_cb() as a
result of a timeout on a timer which has subsequently been
reset, the stream must not be closed.
the destruction of the socket in tls_failed_read_cb() needs to be
conditional; if reached due to a timeout on a timer that has
subsequently been reset, the socket must not be destroyed.
this is similar in structure to the UDP timeout recovery test.
this commit adds a new mechanism to the netmgr test allowing the
listen socket to accept incoming TCP connections but never send
a response. this forces the client to time out on read.
when running read callbacks, if the event result is not ISC_R_SUCCESS,
the callback is always run asynchronously. this is a problem on timeout,
because there's no chance to reset the timer before the socket has
already been destroyed. this commit allows read callbacks to run
synchronously for both ISC_R_SUCCESS and ISC_R_TIMEDOUT result codes.
this test sets up a server socket that listens for UDP connections
but never responds. the client will always time out; it should retry
five times before giving up.
This commit changes the taskmgr to run the individual tasks on the
netmgr internal workers. While an effort has been put into keeping the
taskmgr interface intact, couple of changes have been made:
* The taskmgr has no concept of universal privileged mode - rather the
tasks are either privileged or unprivileged (normal). The privileged
tasks are run as a first thing when the netmgr is unpaused. There
are now four different queues in in the netmgr:
1. priority queue - netievent on the priority queue are run even when
the taskmgr enter exclusive mode and netmgr is paused. This is
needed to properly start listening on the interfaces, free
resources and resume.
2. privileged task queue - only privileged tasks are queued here and
this is the first queue that gets processed when network manager
is unpaused using isc_nm_resume(). All netmgr workers need to
clean the privileged task queue before they all proceed normal
operation. Both task queues are processed when the workers are
finished.
3. task queue - only (traditional) task are scheduled here and this
queue along with privileged task queues are process when the
netmgr workers are finishing. This is needed to process the task
shutdown events.
4. normal queue - this is the queue with netmgr events, e.g. reading,
sending, callbacks and pretty much everything is processed here.
* The isc_taskmgr_create() now requires initialized netmgr (isc_nm_t)
object.
* The isc_nm_destroy() function now waits for indefinite time, but it
will print out the active objects when in tracing mode
(-DNETMGR_TRACE=1 and -DNETMGR_TRACE_VERBOSE=1), the netmgr has been
made a little bit more asynchronous and it might take longer time to
shutdown all the active networking connections.
* Previously, the isc_nm_stoplistening() was a synchronous operation.
This has been changed and the isc_nm_stoplistening() just schedules
the child sockets to stop listening and exits. This was needed to
prevent a deadlock as the the (traditional) tasks are now executed on
the netmgr threads.
* The socket selection logic in isc__nm_udp_send() was flawed, but
fortunatelly, it was broken, so we never hit the problem where we
created uvreq_t on a socket from nmhandle_t, but then a different
socket could be picked up and then we were trying to run the send
callback on a socket that had different threadid than currently
running.
Since all the libraries are internal now, just cleanup the ISCAPI remnants
in isc_socket, isc_task and isc_timer APIs. This means, there's one less
layer as following changes have been done:
* struct isc_socket and struct isc_socketmgr have been removed
* struct isc__socket and struct isc__socketmgr have been renamed
to struct isc_socket and struct isc_socketmgr
* struct isc_task and struct isc_taskmgr have been removed
* struct isc__task and struct isc__taskmgr have been renamed
to struct isc_task and struct isc_taskmgr
* struct isc_timer and struct isc_timermgr have been removed
* struct isc__timer and struct isc__timermgr have been renamed
to struct isc_timer and struct isc_timermgr
* All the associated code that dealt with typing isc_<foo>
to isc__<foo> and back has been removed.
Previously, the taskmgr, timermgr and socketmgr had a constructor
variant, that would create the mgr on top of existing appctx. This was
no longer true and isc_<*>mgr was just calling isc_<*>mgr_create()
directly without any extra code.
This commit just cleans up the extra function.
It fixes a corner case which was causing dig to print annoying
messages like:
14-Apr-2021 18:48:37.099 SSL error in BIO: 1 TLS error (errno:
0). Arguments: received_data: (nil), send_data: (nil), finish: false
even when all the data was properly processed.
Before this fix underlying TCP sockets could remain opened for longer
than it is actually required, causing unit tests to fail with lots of
ISC_R_TOOMANYOPENFILES errors.
The change also enables graceful SSL shutdown (before that it would
happen only in the case when isc_nm_cancelread() were called).
This commit merges TLS tests into the common Network Manager unit
tests suite and extends the unit test framework to include support for
additional "ping-pong" style tests where all data could be sent via
lesser number of connections (the behaviour of the old test
suite). The tests for TCP and TLS were extended to make use of the new
mode, as this mode better translates to how the code is used in DoH.
Both TLS and TCP tests now share most of the unit tests' code, as they
are expected to function similarly from a users's perspective anyway.
Additionally to the above, the TLS test suite was extended to include
TLS tests using the connections quota facility.
The isc_nm_tlsdnsconnect() call could end up with two connect callbacks
called when the timeout fired and the TCP connection was aborted,
but the TLS handshake was not complete yet. isc__nm_connecttimeout_cb()
forgot to clean up sock->tls.pending_req when the connect callback was
called with ISC_R_TIMEDOUT, leading to a second callback running later.
A new argument has been added to the isc__nm_*_failed_connect_cb and
isc__nm_*_failed_read_cb functions, to indicate whether the callback
needs to run asynchronously or not.
We already skip most of the recv_send tests in CI because they are
too timing-related to be run in overloaded environment. This commit
adds a similar change to tls_test before we merge tls_test into
netmgr_test.
if a test failed at the beginning of nm_teardown(), the function
would abort before isc_nm_destroy() or isc_tlsctx_free() were reached;
we would then abort when nm_setup() was run for the next test case.
rearranging the teardown function prevents this problem.
The isc_nm_*connect() functions were refactored to always return the
connection status via the connect callback instead of sometimes returning
the hard failure directly (for example, when the socket could not be
created, or when the network manager was shutting down).
This commit changes the connect functions in all the network manager
modules, and also makes the necessary refactoring changes in places
where the connect functions are called.
dig previously ran isc_nm_udpconnect() three times before giving
up, to work around a freebsd bug that caused connect() to return
a spurious transient EADDRINUSE. this commit moves the retry code
into the network manager itself, so that isc_nm_udpconnect() no
longer needs to return a result code.
The TCP module has been updated to use the generic functions from
netmgr.c instead of its own local copies. This brings the module
mostly up to par with the TCPDNS and TLSDNS modules.
Serveral problems were discovered and fixed after the change in
the connection timeout in the previous commits:
* In TLSDNS, the connection callback was not called at all under some
circumstances when the TCP connection had been established, but the
TLS handshake hadn't been completed yet. Additional checks have
been put in place so that tls_cycle() will end early when the
nmsocket is invalidated by the isc__nm_tlsdns_shutdown() call.
* In TCP, TCPDNS and TLSDNS, new connections would be established
even when the network manager was shutting down. The new
call isc__nm_closing() has been added and is used to bail out
early even before uv_tcp_connect() is attempted.
Similarly to the read timeout, it's now possible to recover from
ISC_R_TIMEDOUT event by restarting the timer from the connect callback.
The change here also fixes platforms that missing the socket() options
to set the TCP connection timeout, by moving the timeout code into user
space. On platforms that support setting the connect timeout via a
socket option, the timeout has been hardcoded to 2 minutes (the maximum
value of tcp-initial-timeout).
Previously, when the client timed out on read, the client socket would
be automatically closed and destroyed when the nmhandle was detached.
This commit changes the logic so that it's possible for the callback to
recover from the ISC_R_TIMEDOUT event by restarting the timer. This is
done by calling isc_nmhandle_settimeout(), which prevents the timeout
handling code from destroying the socket; instead, it continues to wait
for data.
One specific use case for multiple timeouts is serve-stale - the client
socket could be created with shorter timeout (as specified with
stale-answer-client-timeout), so we can serve the requestor with stale
answer, but keep the original query running for a longer time.
The full netmgr test suite is unstable when run in CI due to various
timing issues. Previously, we enabled the full test suite only when
CI_ENABLE_ALL_TESTS environment variable was set, but that went against
original intent of running the full suite when an individual developer
would run it locally.
This change disables the full test suite only when running in the CI and
the CI_ENABLE_ALL_TESTS is not set.
See "BUGS" section at:
https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html
It is mentioned there that when TLS status equals SSL_ERROR_SYSCALL
AND errno == 0 it means that underlying transport layer returned EOF
prematurely. However, we are managing the transport ourselves, so we
should just resume reading from the TCP socket.
It seems that this case has been handled properly on modern versions
of OpenSSL. That being said, the situation goes in line with the
manual: it is briefly mentioned there that SSL_ERROR_SYSCALL might be
returned not only in a case of low-level errors (like system call
failures).
This commit fixes crash in dig when it encounters non-expected header
value. The bug was introduced at some point late in the last DoH
development cycle. Also, refactors the relevant code a little bit to
ensure better incoming data validation for client-side DoH
connections.
util.h requires ISC_CONSTRUCTOR definition, which depends on config.h
inclusion. It does not include it from isc/util.h (or any other header).
Using isc/util.h fails hard when isc/util.h is used without including
bind's config.h.
Move the check to c file, where ISC_CONSTRUCTOR is used. Ensure config.h
is included there.
The current isc_time_now uses CLOCK_REALTIME_COARSE which only updates
on a timer tick. This clock is generally fine for millisecond accuracy,
but on servers with 100hz clocks, this clock is nowhere near accurate
enough for microsecond accuracy.
This commit adds a new isc_time_now_hires function that uses
CLOCK_REALTIME, which gives the current time, though it is somewhat
expensive to call. When microsecond accuracy is required, it may be
required to use extra resources for higher accuracy.
When NETMGR_TRACE(_VERBOSE) is enabled, the build would fail on some
non-Linux non-glibc platforms because:
* Use <stdint.h> print macros because uint_fast32_t is not always
unsigned long
* The header <execinfo.h> is not available on non-glibc, thus commit
adds dummy backtrace() and backtrace_symbols_fd() functions for
platforms without HAVE_BACKTRACE
The netmgr unit tests were designed to push the system limits to maximum
by sending as many queries as possible in the busy loop from multiple
threads. This mostly works with UDP, but in the stateful protocol where
establishing the connection takes more time, it failed quite often in
the CI. On FreeBSD, this happened more often, because the socket() call
would fail spuriosly making the problem even worse.
This commit does several things to improve reliability:
* return value of isc_nm_<proto>connect() is always checked and retried
when scheduling the connection fails
* The busy while loop has been slowed down with usleep(1000); so the
netmgr threads could schedule the work and get executed.
* The isc_thread_yield() was replaced with usleep(1000); also to allow
the other threads to do any work.
* Instead of waiting on just one variable, we wait for multiple
variables to reach the final value
* We are wrapping the netmgr operations (connects, reads, writes,
accepts) with reference counting and waiting for all the callbacks to
be accounted for.
This has two effects:
a) the isc_nm_t is always clean of active sockets and handles when
destroyed, so it will prevent the spurious INSIST(references == 1)
from isc_nm_destroy()
b) the unit test now ensures that all the callbacks are always called
when they should be called, so any stuck test means that there was
a missing callback call and it is always a real bug
These changes allows us to remove the workaround that would not run
certain tests on systems without port load-balancing.
In tls_error(), we now call isc__nm_tlsdns_failed_read() instead of just
stopping timer and reading from the socket. This allows us to properly
cleanup any pending operation on the socket.
When shutting down, calling the isc__nm_failed_connect_cb() was delayed
until the connect callback would be called. It turned out that the
connect callback might not get called at all when the socket is being
shut down. Call the failed_connect_cb() directly in the
tlsdns_shutdown() instead of waiting for the connect callback to call it.
After a partial write the tls.senddata buffer would be rearranged to
contain only the data tha wasn't sent and the len part would be made
shorter, which would lead to attempt to free only part of a socket's
tls.senddata buffer.
The tlsdns_cycle() might call uv_write() to write data to the socket,
when this happens and the socket is shutdown before the callback
completes, the uvreq structure was not freed because the callback would
be called with non-zero status code.
The RFC7828 specifies the keepalive interval to be 16-bit, specified in
units of 100 milliseconds and the configuration options tcp-*-timeouts
are following the suit. The units of 100 milliseconds are very
unintuitive and while we can't change the configuration and presentation
format, we should not follow this weird unit in the API.
This commit changes the isc_nm_(get|set)timeouts() functions to work
with milliseconds and convert the values to milliseconds before passing
them to the function, not just internally.
The udp, tcpdns and tlsdns contained lot of cut&paste code or code that
was very similar making the stack harder to maintain as any change to
one would have to be copied to the the other protocols.
In this commit, we merge the common parts into the common functions
under isc__nm_<foo> namespace and just keep the little differences based
on the socket type.
After the TCPDNS refactoring the initial and idle timers were broken and
only the tcp-initial-timeout was always applied on the whole TCP
connection.
This broke any TCP connection that took longer than tcp-initial-timeout,
most often this would affect large zone AXFRs.
This commit changes the timeout logic in this way:
* On TCP connection accept the tcp-initial-timeout is applied
and the timer is started
* When we are processing and/or sending any DNS message the timer is
stopped
* When we stop processing all DNS messages, the tcp-idle-timeout
is applied and the timer is started again
This commit fixes loading the certificate chain files so that the full
chain could be sent to the clients which require that for
verification. Before that fix only the top most certificate would be
loaded from the chain and sent to clients preventing some of them to
perform certificate validation (e.g. Windows 10 DoH client).
It is advisable to disable Nagle's algorithm for HTTP/2 connections
because multiple HTTP/2 streams could be multiplexed over one
transport connection. Thus, delays when delivering small packets could
bring down performance for the whole session. HTTP/2 is meant to be
used this way.
when called from within the context of a network thread,
isc_nm_tlsconnect() hangs. it is waiting for the socket's
result code to be updated, but that update is supposed to happen
asynchronously in the network thread, and if we're already blocking
in the network thread, it can never occur.
we can kluge around this by setting the socket result code
early; this works for most clients (including "dig"), but it causes
inconsistent behaviors that manifest as test failures in the DoH unit
test.
so we kluged around it even more by setting the socket result code
early *only when running in the network thread*. we need a better
solution for this problem, but this will do for now.
This commit makes the server-side code polite.
It fixes the error handling code on the server side and fixes
returning error code in responses (there was a nasty bug which could
potentially crash the server).
Also, in this commit we limit max size POST request data to 96K, max
processed data size in headers to 128K (should be enough to handle any
GET requests).
If these limits are surpassed, server will terminate the request with
RST_STREAM without responding with error code. Otherwise it politely
responds with error code.
This commit also limits number of concurrent HTTP/2 streams per
transport connection on server to 100 (as nghttp2 advises by default).
Ideally, these parameters should be configurable both globally and per
every HTTP endpoint description in the configuration file, but for now
putting sane limits should be enough.
- style, cleanup, and removal of unnecessary code.
- combined isc_nm_http_add_endpoint() and isc_nm_http_add_doh_endpoint()
into one function, renamed isc_http_endpoint().
- moved isc_nm_http_connect_send_request() into doh_test.c as a helper
function; remove it from the public API.
- renamed isc_http2 and isc_nm_http2 types and functions to just isc_http
and isc_nm_http, for consistency with other existing names.
- shortened a number of long names.
- the caller is now responsible for determining the peer address.
in isc_nm_httpconnect(); this eliminates the need to parse the URI
and the dependency on an external resolver.
- the caller is also now responsible for creating the SSL client context,
for consistency with isc_nm_tlsdnsconnect().
- added setter functions for HTTP/2 ALPN. instead of setting up ALPN in
isc_tlsctx_createclient(), we now have a function
isc_tlsctx_enable_http2client_alpn() that can be run from
isc_nm_httpconnect().
- refactored isc_nm_httprequest() into separate read and send functions.
isc_nm_send() or isc_nm_read() is called on an http socket, it will
be stored until a corresponding isc_nm_read() or _send() arrives; when
we have both halves of the pair the HTTP request will be initiated.
- isc_nm_httprequest() is renamed isc__nm_http_request() for use as an
internal helper function by the DoH unit test. (eventually doh_test
should be rewritten to use read and send, and this function should
be removed.)
- added implementations of isc__nm_tls_settimeout() and
isc__nm_http_settimeout().
- increased NGHTTP2 header block length for client connections to 128K.
- use isc_mem_t for internal memory allocations inside nghttp2, to
help track memory leaks.
- send "Cache-Control" header in requests and responses. (note:
currently we try to bypass HTTP caching proxies, but ideally we should
interact with them: https://tools.ietf.org/html/rfc8484#section-5.1)
Simple typecast to size_t should be enough to silence the warning on
ARMv7, even though the code is in fact correct, because the readlen is
checked for being < 0 in the block before the warning.
Call the libisc isc__initialize() constructor and isc__shutdown()
destructor from DllMain instead of having duplicate code between
those and DllMain() code.
When AddressSanitizer is in use, disable the internal mempool
implementation and redirect the isc_mempool_get to isc_mem_get
(and similarly for isc_mempool_put). This is the method recommended
by the AddressSanitizer authors for tracking allocations and
deallocations instead of custom poison/unpoison code (see
https://github.com/google/sanitizers/wiki/AddressSanitizerManualPoisoning).
The pthread_self(), thrd_current() or GetCurrentThreadId() could
actually be a pointer, so we should rather convert the value into
uintptr_t instead of unsigned long.
Convert the isc_hp API to use the globally available isc_tid_v instead
of locally defined tid_v. This should solve most of the problems on
machines with many number of cores / CPUs.
The current isc_hp API uses internal tid_v variable that gets
incremented for each new thread using hazard pointers. This tid_v
variable is then used as a index to global shared table with hazard
pointers state. Since the tid_v is only incremented and never
decremented the table could overflow very quickly if we create set of
threads for short period of time, they finish the work and cease to
exist. Then we create identical set of threads and so on and so on.
This is not a problem for a normal `named` operation as the set of
threads is stable, but the problematic place are the unit tests where we
test network manager or other APIs (task, timer) that create threads.
This commits adds a thin wrapper around any function called from
isc_thread_create() that adds unique-but-reusable small digit thread id
that can be used as index to f.e. hazard pointer tables. The trampoline
wrapper ensures that the thread ids will be reused, so the highest
thread_id number doesn't grow indefinitely when threads are created and
destroyed and then created again. This fixes the hazard pointer table
overflow on machines with many cores. [GL #2396]
The BIND 9 libraries on Windows define DllMain() optional entry point
into a dynamic-link library (DLL). When the system starts or terminates
a process or thread, it calls the entry-point function for each loaded
DLL using the first thread of the process.
When the DLL is being loaded into the virtual address space of the
current process as a result of the process starting up, we make a call
to DisableThreadLibraryCalls() which should disable the
DLL_THREAD_ATTACH and DLL_THREAD_DETACH notifications for the specified
dynamic-link library (DLL).
This seems not be the case because we never check the return value of
the DisableThreadLibraryCalls() call, and it could in fact fail. The
DisableThreadLibraryCalls() function fails if the DLL specified by
hModule has active static thread local storage, or if hModule is an
invalid module handle.
In this commit, we remove the safe-guard assertion put in place for the
DLL_THREAD_ATTACH and DLL_THREAD_DETACH events and we just ignore them.
BIND 9 doesn't create/destroy enough threads for it actually to make any
difference, and in fact we do use static thread local storage in the
code.
The addition of lib/isc/tls_p.h to the source tree was not accounted for
in the relevant variable in lib/isc/Makefile.am and thus the former file
is not being included in release tarballs prepared using "make dist".
Fix by tweaking the libisc_la_SOURCES list in lib/isc/Makefile.am
accordingly.
Instead of calling isc_tls_initialize()/isc_tls_destroy() explicitly use
gcc/clang attributes on POSIX and DLLMain on Windows to initialize and
shutdown OpenSSL library.
This resolves the issue when isc_nm_create() / isc_nm_destroy() was
called multiple times and it would call OpenSSL library destructors from
isc_nm_destroy().
At the same time, since we now have introduced the ctor/dtor for libisc,
this commit moves the isc_mem API initialization (the list of the
contexts) and changes the isc_mem_checkdestroyed() to schedule the
checking of memory context on library unload instead of executing the
code immediately.
Disables the DLL_THREAD_ATTACH and DLL_THREAD_DETACH notifications for
the specified dynamic-link library (DLL). This can reduce the size of
the working set for some applications.
Although harmless, the memmove() in tlsdns and tcpdns was guarded by a
current message length variable that was always bigger than 0 instead of
correct current buffer length remainder variable.
Since we now require both libcrypto and libssl to be initialized for
netmgr, we move all the OpenSSL initialization code except the engine
initialization to isc_tls API.
The isc_tls_initialize() and isc_tls_destroy() has been made idempotent,
so they could be called multiple time. However when isc_tls_destroy()
has been called, the isc_tls_initialize() could not be called again.
The ISC_MEM_CHECKOVERRUN would add canary byte at the end of every
allocations and check whether the canary byte hasn't been changed at the
free time. The AddressSanitizer and valgrind memory checks surpases
simple checks like this, so there's no need to actually keep the code
inside the allocator.
Previously, the mem_{get,put} benchmark would pass the allocation size
as thread_create argument. This has been now changed, so the allocation
size is stored and decremented (divided) in atomic variable and the
thread create routing is given a memory context. This will allow to
write tests where each thread is given different memory context and do
the same for mempool benchmarking.
The two memory debugging features: ISC_MEM_DEFAULTFILL
(ISC_MEMFLAG_FILL) and ISC_MEM_TRACKLINES were always enabled in all
builds and the former was only disabled in `named`.
This commits disables those two features in non-developer build to make
the memory allocator significantly faster.
This is yet another step into unlocking some parts of the memory
contexts. All the regularly updated variables has been turned into
atomic types, so we can later remove the locks when updating various
counters.
Also unlock as much code as possible without breaking anything.
On 24-core machine, the tests would crash because we would run out of
the hazard pointers. We now adjust the number of hazard pointers to be
in the <128,256> interval based on the number of available cores.
Note: This is just a band-aid and needs a proper fix.
Previously, the applications using libisc would be able to override the
internal memory methods with own implementation. This was no longer
possible, but the extra level of indirection was not removed. This
commit removes the extra level of indirection for the memory methods and
the default_memalloc() and default_memfree().
The internal memory allocator had an extra code to keep a list of blocks
for small size allocation. This would help to reduce the interactions
with the system malloc as the memory would be already allocated from the
system, but there's an extra cost associated with that - all the
allocations/deallocations must be locked, effectively eliminating any
optimizations in the system allocator targeted at multi-threaded
applications. While the isc_mem API is still using locks pretty heavily,
this is a first step into reducing the memory allocation/deallocation
contention.
In DNS Flag Day 2020, the development branch started setting the
IP_DONTFRAG option on the UDP sockets. It turned out, that this
code was incomplete leading to dropping the outgoing UDP packets.
Henceforth this commit rolls back this setting until we have a
proper fix that would send back empty response with TC flag set.
Commit fa505bfb0e omitted two unit tests
while introducing the SKIP_TEST_EXIT_CODE preprocessor macro. Fix the
outliers to make use of SKIP_TEST_EXIT_CODE consistent across all unit
tests. Also make sure lib/dns/tests/dnstap_test returns an exit code
that indicates a skipped test when dnstap is not enabled.
The <isc/readline.h> header provided a compatibility shim to use when
other non-GNU readline libraries are in use. The two places where
readline library is being used is nslookup and nsupdate, so the header
file has been moved to bin/dig directory and it's directly included from
bin/nsupdate.
This also conceals any readline headers exposed from the libisc headers.
This commit contains fixes to unit tests to make them work well on
various platforms (in particular ones shipping old versions of
OpenSSL) and for different configurations.
It also updates the generated manpage to include DoH configuration
options.
This commit completes the support for DNS-over-HTTP(S) built on top of
nghttp2 and plugs it into the BIND. Support for both GET and POST
requests is present, as required by RFC8484.
Both encrypted (via TLS) and unencrypted HTTP/2 connections are
supported. The latter are mostly there for debugging/troubleshooting
purposes and for the means of encryption offloading to third-party
software (as might be desirable in some environments to simplify TLS
certificates management).
This commit includes work-in-progress implementation of
DNS-over-HTTP(S).
Server-side code remains mostly untested, and there is only support
for POST requests.
This commit resurrects the old TLS code from
8f73c70d23.
It also includes numerous stability fixes and support for
isc_nm_cancelread() for the TLS layer.
The code was resurrected to be used for DoH.
The BIND 9 libraries are considered to be internal only and hence the
API and ABI changes a lot. Keeping track of the API/ABI changes takes
time and it's a complicated matter as the safest way to make everything
stable would be to bump any library in the dependency chain as in theory
if libns links with libdns, and a binary links with both, and we bump
the libdns SOVERSION, but not the libns SOVERSION, the old libns might
be loaded by binary pulling old libdns together with new libdns loaded
by the binary. The situation gets even more complicated with loading
the plugins that have been compiled with few versions old BIND 9
libraries and then dynamically loaded into the named.
We are picking the safest option possible and usable for internal
libraries - instead of using -version-info that has only a weak link to
BIND 9 version number, we are using -release libtool option that will
embed the corresponding BIND 9 version number into the library name.
That means that instead of libisc.so.1701 (as an example) the library
will now be named libisc-9.17.10.so.
* Following the example set in 634bdfb16d, the tlsdns netmgr
module now uses libuv and SSL primitives directly, rather than
opening a TLS socket which opens a TCP socket, as the previous
model was difficult to debug. Closes#2335.
* Remove the netmgr tls layer (we will have to re-add it for DoH)
* Add isc_tls API to wrap the OpenSSL SSL_CTX object into libisc
library; move the OpenSSL initialization/deinitialization from dstapi
needed for OpenSSL 1.0.x to the isc_tls_{initialize,destroy}()
* Add couple of new shims needed for OpenSSL 1.0.x
* When LibreSSL is used, require at least version 2.7.0 that
has the best OpenSSL 1.1.x compatibility and auto init/deinit
* Enforce OpenSSL 1.1.x usage on Windows
* Added a TLSDNS unit test and implemented a simple TLSDNS echo
server and client.
On Windows, we were limiting the number of listening children to just 1,
but we were then iterating on mgr->nworkers. That lead to scheduling
more async_*listen() than actually allocated and out-of-bound read-write
operation on the heap.
When we were in nmthread, the isc__nm_async_<proto>connect() function
executes in the same thread as the isc__nm_<proto>connect() and on a
failure, it would block indefinitely because the failure branch was
setting sock->active to false before the condition around the wait had a
chance to skip the WAIT().
This also fixes the zero system test being stuck on FreeBSD 11, so we
re-enable the test in the commit.
On FreeBSD, the option to configure connection timeout is called
TCP_KEEPINIT, use it to configure the connection timeout there.
This also fixes the dangling socket problems in the unit test, so
re-enable them.
On platforms without load-balancing socket all the queries would be
handle by a single thread. Currently, the support for load-balanced
sockets is present in Linux with SO_REUSEPORT and FreeBSD 12 with
SO_REUSEPORT_LB.
This commit adds workaround for such platforms that:
1. setups single shared listening socket for all listening nmthreads for
UDP, TCP and TCPDNS netmgr transports
2. Calls uv_udp_bind/uv_tcp_bind on the underlying socket just once and
for rest of the nmthreads only copy the internal libuv flags (should
be just UV_HANDLE_BOUND and optionally UV_HANDLE_IPV6).
3. start reading on UDP socket or listening on TCP socket
The load distribution among the nmthreads is uneven, but it's still
better than utilizing just one thread for processing all the incoming
queries
On FreeBSD, the stack is destroyed more aggressively than on Linux and
that revealed a bug where we were allocating the 16-bit len for the
TCPDNS message on the stack and the buffer got garbled before the
uv_write() sendback was executed. Now, the len is part of the uvreq, so
we can safely pass it to the uv_write() as the req gets destroyed after
the sendcb is executed.
On Windows, WSAStartup() needs to be called to initialize Winsock before
any sockets are created or else socket() calls will return error code
10093 (WSANOTINITIALISED). Since BIND's Network Manager is intended to
work as a reusable networking library, it should take care of calling
WSAStartup() - and its cleanup counterpart, WSACleanup() - itself rather
than relying on external code to do it. Add the necessary WSAStartup()
and WSACleanup() calls to isc_nm_start() and isc_nm_destroy(),
respectively.
uv_wrap.h is included in tcp_test.c and udp_test.c and therefore should
be listed in lib/isc/tests/Makefile.am, otherwise unit test run from
distribution tarball fails to compile:
tcp_test.c:37:10: fatal error: uv_wrap.h: No such file or directory
#include "uv_wrap.h"
^~~~~~~~~~~
udp_test.c:37:10: fatal error: uv_wrap.h: No such file or directory
#include "uv_wrap.h"
^~~~~~~~~~~
After turning the users callbacks to be asynchronous, there was a
visible performance drop. This commit prevents the unnecessary
allocations while keeping the code paths same for both asynchronous and
synchronous calls.
The same change was done to the isc__nm_udp_{read,send} as those two
functions are in the hot path.
The new netmgr tests are not-yet fine-tuned for non-Linux platforms.
Disable them now, so we can move forward and fix the tests of *BSD
in the next iteration.
This commit will get reverted when we add support for netmgr
multi-threading.
The isc/util.h header redefine the DbC checks (REQUIRE, INSIST, ...) to
be cmocka "fake" assertions. However that means that cmocka.h needs to
be included after UNIT_TESTING is defined but before isc/util.h is
included. Because isc/util.h is included in most of the project headers
this means that the sequence MUST be:
#define UNIT_TESTING
#include <cmocka.h>
#include <isc/_anything_.h>
See !2204 for other header requirements for including cmocka.h.
This is a part of the works that intends to make the netmgr stable,
testable, maintainable and tested. It contains a numerous changes to
the netmgr code and unfortunately, it was not possible to split this
into smaller chunks as the work here needs to be committed as a complete
works.
NOTE: There's a quite a lot of duplicated code between udp.c, tcp.c and
tcpdns.c and it should be a subject to refactoring in the future.
The changes that are included in this commit are listed here
(extensively, but not exclusively):
* The netmgr_test unit test was split into individual tests (udp_test,
tcp_test, tcpdns_test and newly added tcp_quota_test)
* The udp_test and tcp_test has been extended to allow programatic
failures from the libuv API. Unfortunately, we can't use cmocka
mock() and will_return(), so we emulate the behaviour with #define and
including the netmgr/{udp,tcp}.c source file directly.
* The netievents that we put on the nm queue have variable number of
members, out of these the isc_nmsocket_t and isc_nmhandle_t always
needs to be attached before enqueueing the netievent_<foo> and
detached after we have called the isc_nm_async_<foo> to ensure that
the socket (handle) doesn't disappear between scheduling the event and
actually executing the event.
* Cancelling the in-flight TCP connection using libuv requires to call
uv_close() on the original uv_tcp_t handle which just breaks too many
assumptions we have in the netmgr code. Instead of using uv_timer for
TCP connection timeouts, we use platform specific socket option.
* Fix the synchronization between {nm,async}_{listentcp,tcpconnect}
When isc_nm_listentcp() or isc_nm_tcpconnect() is called it was
waiting for socket to either end up with error (that path was fine) or
to be listening or connected using condition variable and mutex.
Several things could happen:
0. everything is ok
1. the waiting thread would miss the SIGNAL() - because the enqueued
event would be processed faster than we could start WAIT()ing.
In case the operation would end up with error, it would be ok, as
the error variable would be unchanged.
2. the waiting thread miss the sock->{connected,listening} = `true`
would be set to `false` in the tcp_{listen,connect}close_cb() as
the connection would be so short lived that the socket would be
closed before we could even start WAIT()ing
* The tcpdns has been converted to using libuv directly. Previously,
the tcpdns protocol used tcp protocol from netmgr, this proved to be
very complicated to understand, fix and make changes to. The new
tcpdns protocol is modeled in a similar way how tcp netmgr protocol.
Closes: #2194, #2283, #2318, #2266, #2034, #1920
* The tcp and tcpdns is now not using isc_uv_import/isc_uv_export to
pass accepted TCP sockets between netthreads, but instead (similar to
UDP) uses per netthread uv_loop listener. This greatly reduces the
complexity as the socket is always run in the associated nm and uv
loops, and we are also not touching the libuv internals.
There's an unfortunate side effect though, the new code requires
support for load-balanced sockets from the operating system for both
UDP and TCP (see #2137). If the operating system doesn't support the
load balanced sockets (either SO_REUSEPORT on Linux or SO_REUSEPORT_LB
on FreeBSD 12+), the number of netthreads is limited to 1.
* The netmgr has now two debugging #ifdefs:
1. Already existing NETMGR_TRACE prints any dangling nmsockets and
nmhandles before triggering assertion failure. This options would
reduce performance when enabled, but in theory, it could be enabled
on low-performance systems.
2. New NETMGR_TRACE_VERBOSE option has been added that enables
extensive netmgr logging that allows the software engineer to
precisely track any attach/detach operations on the nmsockets and
nmhandles. This is not suitable for any kind of production
machine, only for debugging.
* The tlsdns netmgr protocol has been split from the tcpdns and it still
uses the old method of stacking the netmgr boxes on top of each other.
We will have to refactor the tlsdns netmgr protocol to use the same
approach - build the stack using only libuv and openssl.
* Limit but not assert the tcp buffer size in tcp_alloc_cb
Closes: #2061
Make sure pointer checks in unit tests use cmocka assertion macros
dedicated for use with pointers instead of those dedicated for use with
integers or booleans.
cppcheck 2.2 reports the following false positive:
lib/isc/tests/quota_test.c:71:21: error: Array 'quotas[101]' accessed at index 110, which is out of bounds. [arrayIndexOutOfBounds]
isc_quota_t *quotas[110];
^
The above is not even an array access, so this report is obviously
caused by a cppcheck bug. Yet, it seems to be triggered by the presence
of the add_quota() macro, which should really be a function. Convert
the add_quota() macro to a function in order to make the code cleaner
and to prevent the above cppcheck 2.2 false positive from being
triggered.
When calling the high level netmgr functions, the callback would be
sometimes called synchronously if we catch the failure directly, or
asynchronously if it happens later. The synchronous call to the
callback could create deadlocks as the caller would not expect the
failed callback to be executed directly.
This commit adds couple of additional safeguards against running
sends/reads on inactive sockets. The changes was modeled after the
changes we made to netmgr/tcpdns.c
Parse the configuration of tls objects into SSL_CTX* objects. Listen on
DoT if 'tls' option is setup in listen-on directive. Use DoT/DoH ports
for DoT/DoH.
Add server-side TLS support to netmgr - that includes moving some of the
isc_nm_ functions from tcp.c to a wrapper in netmgr.c calling a proper
tcp or tls function, and a new isc_nm_listentls() function.
Add DoT support to tcpdns - isc_nm_listentlsdns().
there were two failures during observed in testing, both occurring
when 'rndc halt' was run rather than 'rndc stop' - the latter dumps
zone contents to disk and presumably introduced enough delay to
prevent the races:
- a failure when the zone was shut down and called dns_xfrin_detach()
before the xfrin had finished connecting; the connect timeout
terminated without detaching its handle
- a failure when the tcpdns socket timer fired after the outerhandle
had already been cleared.
this commit incidentally addresses a failure observed in mutexatomic
due to a variable having been initialized incorrectly.
This commit extends the perl Configure script to also check for libssl
in addition to libcrypto and change the vcxproj source files to link
with both libcrypto and libssl.
socket() call can return an error - e.g. EMFILE, so we need to handle
this nicely and not crash.
Additionally wrap the socket() call inside a platform independent helper
function as the Socket data type on Windows is unsigned integer:
> This means, for example, that checking for errors when the socket and
> accept functions return should not be done by comparing the return
> value with –1, or seeing if the value is negative (both common and
> legal approaches in UNIX). Instead, an application should use the
> manifest constant INVALID_SOCKET as defined in the Winsock2.h header
> file.
Because we use result earlier for setting the loadbalancing on the
socket, we could be left with a ISC_R_NOTIMPLEMENTED value stored in the
variable and when the UDP connection would succeed, we would
errorneously return this value instead of ISC_R_SUCCESS.
this function sets the read timeout for the socket associated
with a netmgr handle and, if the timer is running, resets it.
for TCPDNS sockets it also sets the read timeout and resets the
timer on the outer TCP socket.
When we are operating on the tcpdns socket, we need to double check
whether the socket or its outerhandle or its listener or its mgr is
still active and when not, bail out early.
If dnslisten_readcb gets a read callback it needs to verify that the
outer socket wasn't closed in the meantime, and issue a CANCELED callback
if it was.
tests of UDP and TCP cases including:
- sending and receiving
- closure sockets without reading or sending
- closure of sockets at various points while sending and receiving
- since the teste is multithreaded, cmocka now aborts tests on the
first failure, so that failures in subthreads are caught and
reported correctly.
There were more races that could happen while connecting to a
socket while closing or shutting down the same socket. This
commit introduces a .closing flag to guard the socket from
being closed twice.
There was a data race where a new event could be scheduled after
isc__nm_async_shutdown() had cleaned up all the dangling UDP/TCP
sockets from the loop.
- more logical code flow.
- propagate errors back to the caller.
- add a 'reading' flag and call the callback from failed_read_cb()
only when it the socket was actively reading.
- don't bother closing sockets that are already closing.
- UDP read timeout timer was not stopped after reading.
- improve handling of TCP connection failures.
- isc_nm_tcpdnsconnect() sets up up an outgoing TCP DNS connection.
- isc_nm_tcpconnect(), _udpconnect() and _tcpdnsconnect() now take a
timeout argument to ensure connections time out and are correctly
cleaned up on failure.
- isc_nm_read() now supports UDP; it reads a single datagram and then
stops until the next time it's called.
- isc_nm_cancelread() now runs asynchronously to prevent assertion
failure if reading is interrupted by a non-network thread (e.g.
a timeout).
- isc_nm_cancelread() can now apply to UDP sockets.
- added shim code to support UDP connection in versions of libuv
prior to 1.27, when uv_udp_connect() was added
all these functions will be used to support outgoing queries in dig,
xfrin, dispatch, etc.
If the connection is closed while we're processing the request
we might access TCPDNS outerhandle which is already reset. Check
for this condition and call the callback with ISC_R_CANCELED result.
While libltdl is a feature-rich library, BIND 9 code only uses its basic
capabilities, which are also provided by libuv and which BIND 9 already
uses for other purposes. As libuv's cross-platform shared library
handling interface is modeled after the POSIX dlopen() interface,
converting code using the latter to the former is simple. Replace
libltdl function calls with their libuv counterparts, refactoring the
code as necessary. Remove all use of libltdl from the BIND 9 source
tree.
When client disconnects before the connection can be accepted, the named
would log a spurious log message:
error: Accepting TCP connection failed: socket is not connected
We now ignore the ISC_R_NOTCONNECTED result code and log only other
errors
1. The isc__nm_tcp_send() and isc__nm_tcp_read() was not checking
whether the socket was still alive and scheduling reads/sends on
closed socket.
2. The isc_nm_read(), isc_nm_send() and isc_nm_resumeread() have been
changed to always return the error conditions via the callbacks, so
they always succeed. This applies to all protocols (UDP, TCP and
TCPDNS).
There were two problems how tcp_send_direct() was used:
1. The tcp_send_direct() can return ISC_R_CANCELED (or translated error
from uv_tcp_send()), but the isc__nm_async_tcpsend() wasn't checking
the error code and not releasing the uvreq in case of an error.
2. In isc__nm_tcp_send(), when the TCP send is already in the right
netthread, it uses tcp_send_direct() to send the TCP packet right
away. When that happened the uvreq was not freed, and the error code
was returned to the caller. We need to return ISC_R_SUCCESS and
rather use the callback to report an error in such case.
When closing the socket that is actively reading from the stream, the
read_cb() could be called between uv_close() and close callback when the
server socket has been already detached hence using sock->statichandle
after it has been already freed.
There were two problems how udp_send_direct() was used:
1. The udp_send_direct() can return ISC_R_CANCELED (or translated error
from uv_udp_send()), but the isc__nm_async_udpsend() wasn't checking
the error code and not releasing the uvreq in case of an error.
2. In isc__nm_udp_send(), when the UDP send is already in the right
netthread, it uses udp_send_direct() to send the UDP packet right
away. When that happened the uvreq was not freed, and the error code
was returned to the caller. We need to return ISC_R_SUCCESS and
rather use the callback to report an error in such case.
When networking statistics was added to the netmgr (in commit
5234a8e00a), two lines were added that
increment the 'STATID_RECVFAIL' statistic: One if 'uv_read_start'
fails and one at the end of the 'read_cb'. The latter happens
if 'nread < 0'.
According to the libuv documentation, I/O read callbacks (such as for
files and sockets) are passed a parameter 'nread'. If 'nread' is less
than 0, there was an error and 'UV_EOF' is the end of file error, which
you may want to handle differently.
In other words, we should not treat EOF as a RECVFAIL error.
isc_nmhandle_detach() needs to complete in the same thread
as shutdown_walk_cb() to avoid a race. Clear the caller's
pointer then pass control to the worker if necessary.
WARNING: ThreadSanitizer: data race
Write of size 8 at 0x000000000001 by thread T1:
#0 isc_nmhandle_detach lib/isc/netmgr/netmgr.c:1258:15
#1 control_command bin/named/controlconf.c:388:3
#2 dispatch lib/isc/task.c:1152:7
#3 run lib/isc/task.c:1344:2
Previous read of size 8 at 0x000000000001 by thread T2:
#0 isc_nm_pauseread lib/isc/netmgr/netmgr.c:1449:33
#1 recv_data lib/isccc/ccmsg.c:109:2
#2 isc__nm_tcp_shutdown lib/isc/netmgr/tcp.c:1157:4
#3 shutdown_walk_cb lib/isc/netmgr/netmgr.c:1515:3
#4 uv_walk <null>
#5 process_queue lib/isc/netmgr/netmgr.c:659:4
#6 process_normal_queue lib/isc/netmgr/netmgr.c:582:10
#7 process_queues lib/isc/netmgr/netmgr.c:590:8
#8 async_cb lib/isc/netmgr/netmgr.c:548:2
#9 <null> <null>
If we clone the csock (children socket) in TCP accept_connection()
instead of passing the ssock (server socket) to the call back and
cloning it there we unbreak the assumption that every socket is handled
inside it's own worker thread and therefore we can get rid of (at least)
callback locking.
The isc__nm_tcpdns_stoplistening() would call isc__nmsocket_clearcb()
that would clear the .accept_cb from non-netmgr thread. Change the
tcpdns_stoplistening to enqueue ievent that would get processed in the
right netmgr thread to avoid locking.
The DNS Flag Day 2020 aims to remove the IP fragmentation problem from
the UDP DNS communication. In this commit, we implement the required
changes and simplify the logic for picking the EDNS Buffer Size.
1. The defaults for `edns-udp-size`, `max-udp-size` and
`nocookie-udp-size` have been changed to `1232` (the value picked by
DNS Flag Day 2020).
2. The probing heuristics that would try 512->4096->1432->1232 buffer
sizes has been removed and the resolver will always use just the
`edns-udp-size` value.
3. Instead of just disabling the PMTUD mechanism on the UDP sockets, we
now set IP_DONTFRAG (IPV6_DONTFRAG) flag. That means that the UDP
packets won't get ever fragmented. If the ICMP packets are lost the
UDP will just timeout and eventually be retried over TCP.
The SO_REUSEADDR, SO_REUSEPORT and SO_REUSEPORT_LB has different meaning
on different platform. In this commit, we split the function to set the
reuse of address/port and setting the load-balancing into separate
functions.
The libuv library already have multiplatform support for setting
SO_REUSEADDR and SO_REUSEPORT that allows binding to the same address
and port, but unfortunately, when used after the load-balancing socket
options have been already set, it overrides the previous setting, so we
need our own helper function to enable the SO_REUSEADDR/SO_REUSEPORT
first and then enable the load-balancing socket option.
On POSIX based systems both uv_os_sock_t and uv_os_fd_t are both typedef
to int. That's not true on Windows, where uv_os_sock_t is SOCKET and
uv_os_fd_t is HANDLE and they differ in level of indirection.
The isc__nm_socket_freebind() has been refactored to match other
isc__nm_socket_...() helper functions and take uv_os_fd_t and
sa_family_t as function arguments.
The isc_nm_pause(), isc_nm_resume() and finishing the nm_thread() from
nm_destroy() has been refactored, so all use the netievents instead of
directly touching the worker structure members. This allows us to
remove most of the locking as the .paused and .finished members are
always accessed from the matching nm_thread.
When shutting down the nm_thread(), instead of issuing uv_stop(), we
just shutdown the .async handler, so all uv_loop_t events are properly
finished first and uv_run() ends gracefully with no outstanding active
handles in the loop.
Since Mac OS X 10.1, Mach-O object files are by default built with a
so-called two-level namespace which prevents symbol lookups in BIND unit
tests that attempt to override the implementations of certain library
functions from working as intended. This feature can be disabled by
passing the "-flat_namespace" flag to the linker. Fix unit tests
affected by this issue on macOS by adding "-flat_namespace" to LDFLAGS
used for building all object files on that operating system (it is not
enough to only set that flag for the unit test executables).
If NETMGR_TRACE is defined, we now maintain a list of active sockets
in the netmgr object and a list of active handles in each socket
object; by walking the list and printing `backtrace` in a debugger
we can see where they were created, to assist in in debugging of
reference counting errors.
On shutdown, if netmgr finds there are still active sockets after
waiting, isc__nm_dump_active() will be called to log the list of
active sockets and their underlying handles, along with some details
about them.
if more than 10 seconds pass while we wait for netmgr events to
finish running on shutdown, something is almost certainly wrong
and we should assert and crash.
Attaching and detaching handle pointers will make it easier to
determine where and why reference counting errors have occurred.
A handle needs to be referenced more than once when multiple
asynchronous operations are in flight, so callers must now maintain
multiple handle pointers for each pending operation. For example,
ns_client objects now contain:
- reqhandle: held while waiting for a request callback (query,
notify, update)
- sendhandle: held while waiting for a send callback
- fetchhandle: held while waiting for a recursive fetch to
complete
- updatehandle: held while waiting for an update-forwarding
task to complete
control channel connection objects now contain:
- readhandle: held while waiting for a read callback
- sendhandle: held while waiting for a send callback
- cmdhandle: held while an rndc command is running
httpd connections contain:
- readhandle: held while waiting for a read callback
- sendhandle: held while waiting for a send callback
- rename isc_nmsocket_t->tcphandle to statichandle
- cancelread functions now take handles instead of sockets
- add a 'client' flag in socket objects, currently unused, to
indicate whether it is to be used as a client or server socket
As generated documentation files are no longer stored in the BIND Git
repository, put a copy of the PDF version of the BIND ARM generated by
the "docs" GitLab CI job into the Windows zips to make it easily
available to the end users on that platform.
Make sure Windows zips also contain certain documentation files included
in source tarballs to make the contents of each release more consistent
across different platforms.
While
if (isc_refcount_decrement() == 1) { // memory_order_release
isc_refcount_destroy(); // memory_order_acquire
...
}
is theoretically the most efficent in practice, using
memory_order_acq_rel produces the same code on x86_64 and doesn't
trigger tsan data races (which use a idealistic model) if
isc_refcount_destroy() is not called immediately. In fact
isc_refcount_destroy() could be removed if we didn't want
to check for the count being 0 when isc_refcount_destroy() is
called.
https://stackoverflow.com/questions/49112732/memory-order-in-shared-pointer-destructor
It was discovered, that some systems might set EPROTO instead of EACCESS
on recvmsg() call causing spurious syslog messages from the socket
code. This commit returns soft handling of EPROTO errno code to the
socket code. [GL #1928]
sockaddr.c:147:49: error: pointer targets in passing argument 2 of ‘isc__buffer_putmem’ differ in signedness
rdata.c:1780:30: error: pointer targets in passing argument 2 of ‘isc__buffer_putmem’ differ in signedness
This commit updates and simplifies the checks for the readline support
in nslookup and nsupdate:
* Change the autoconf checks to pkg-config only, all supported
libraries have accompanying .pc files now.
* Add editline support in addition to libedit and GNU readline
* Add isc/readline.h shim header that defines dummy readline()
function when no readline library is available
base32_decode_char() added a extra zero octet to the output
if the fifth character was a pad character. The length
of octets to copy to the output was set to 3 instead of 2.
When pk11_numbits() is passed a user provided input that contains all
zeroes (via crafted DNS message), it would crash with assertion
failure. Fix that by properly handling such input.
Each worker has a receive buffer with space for 20 DNS messages of up
to 2^16 bytes each, and the allocator function passed to uv_read_start()
or uv_udp_recv_start() will reserve a portion of it for use by sockets.
UDP can use recvmmsg() and so it needs that entire space, but TCP reads
one message at a time.
This commit introduces separate allocator functions for TCP and UDP
setting different buffer size limits, so that libuv will provide the
correct buffer sizes to each of them.
When a new IPv6 interface/address appears it's first in a tentative
state - in which we cannot bind to it, yet it's already being reported
by the route socket. Because of that BIND9 is unable to listen on any
newly detected IPv6 addresses. Fix it by setting IP_FREEBIND option (or
equivalent option on other OSes) and then retrying bind() call.
Created isc_refcount_decrement_expect macro to test conditionally
the return value to ensure it is in expected range. Converted
unchecked isc_refcount_decrement to use isc_refcount_decrement_expect.
Converted INSIST(isc_refcount_decrement()...) to isc_refcount_decrement_expect.
When silencing the Coverity warning in remove_old_tsversions(), the code
was refactored to reduce the indentation levels and break down the long
code into individual functions. This improve fix for [GL #1989].
when building without ISC_BUFFER_USEINLINE (which is the default on
Windows) an assertion failure could occur when setting up a new
isc_httpd_t object for the statistics channel.
Creation of EVP_MD_CTX and EVP_PKEY is quite expensive, so until we fix the code
to reuse the OpenSSL contexts and keys we'll use our own implementation of
siphash instead of trying to integrate with OpenSSL.
If too many versions of log / dnstap files to be saved where requests
the memory after to_keep could be overwritten. Force the number of
versions to be saved to a save level. Additionally the memmove length
was incorrect.
The stdatomic shims for non-C11 compilers (Windows, old gcc, ...) and
mutexatomic implemented only and minimal subset of the atomic types.
This commit adds 16-bit operations for Windows and all atomic types as
defined in standard.
We erroneously tried to destroy a socket after issuing
isc__nm_tcp{,dns}_close. Under some (race) circumstances we could get
nm_socket_cleanup to be called twice for the same socket, causing an
access to a dead memory.
There's a possibility of race in isc__nm_tcpconnect if the asynchronous
connect operation finishes with all the callbacks before we exit the
isc__nm_tcpconnect itself we might access an already freed memory.
Fix it by creating an additional reference to the socket freed at the
end of isc__nm_tcpconnect.
the blackhole ACL was accidentally disabled with respect to client
queries during the netmgr conversion.
in order to make this work for TCP, it was necessary to add a return
code to the accept callback functions passed to isc_nm_listentcp() and
isc_nm_listentcpdns().
I'd like to use the same functionality (pretty print the datetime
of keytime metadata) in the 'rndc dnssec -status' command. So it is
better that this logic is done in a separate function.
Since the stdtime.c code have differernt files for unix and win32,
I think the "#ifdef WIN32" define can be dropped.
isc__nm_tcpdns_send() was not asynchronous and accessed socket
internal fields in an unsafe manner, which could lead to a race
condition and subsequent crash. Fix it by moving tcpdns processing
to a proper netmgr thread.
We need to mark the socket as inactive early (and synchronously)
in the stoplistening process; otherwise we might destroy the
callback argument before we actually stop listening, and call
the callback on bad memory.
Assign and then check node for NULL to address another thread
changing radix->head in the meantime.
Move 'node != NULL' check into while loop test to silence cppcheck
false positive.
Fix pointer != NULL style.
The isc_nm_cancelread() function cancels reading on a connected
socket and calls its read callback function with a 'result'
parameter of ISC_R_CANCELED.
when isc_nm_destroy() is called, there's a loop that waits for
other references to be detached, pausing and unpausing the netmgr
to ensure that all the workers' events are run, followed by a
1-second sleep. this caused a delay on shutdown which will be
noticeable when netmgr is used in tools other than named itself,
so the delay has now been reduced to a hundredth of a second.
the isc_nm_tcpconnect() function establishes a client connection via
TCP. once the connection is esablished, a callback function will be
called with a newly created network manager handle.
A TCPDNS socket creates a handle for each complete DNS message.
Previously, when all the handles were disconnected, the socket
would be closed, but the wrapped TCP socket might still have
more to read.
Now, when a connection is established, the TCPDNS socket creates
a reference to itself by attaching itself to sock->self. This
reference isn't cleared until the connection is closed via
EOF, timeout, or server shutdown. This allows the socket to remain
open even when there are no active handles for it.
- isc__nmhandle_get() now attaches to the sock in the nmhandle object.
the caller is responsible for dereferencing the original socket
pointer when necessary.
- tcpdns listener sockets attach sock->outer to the outer tcp listener
socket. tcpdns connected sockets attach sock->outerhandle to the handle
for the tcp connected socket.
- only listener sockets need to be attached/detached directly. connected
sockets should only be accessed and reference-counted via their
associated handles.
there is no need for a caller to reference-count socket objects.
they need tto be able tto close listener sockets (i.e., those
returned by isc_nm_listen{udp,tcp,tcpdns}), and an isc_nmsocket_close()
function has been added for that. other sockets are only accessed via
handles.
The ThreadSanitizer uses system synchronization primitives to check for
data race. The netmgr handle->references was missing acquire memory
barrier before resetting and reusing the memory occupied by isc_nmhandle_t.
There's a possibility of a race in TCP accepting code:
T1 accepts a connection C1
T2 accepts a connection C2
T1 tries to accept a connection C3, but we hit a quota,
isc_quota_cb_init() sets quota_accept_cb for the socket,
we return from accept_connection
T2 drops C2, but we race in quota_release with accepting C3 so
we don't see quota->waiting is > 0, we don't launch the callback
T1 accepts a connection C4, we are able to get the quota we clear
the quota_accept_cb from sock->quotacb
T1 drops C1, tries to call the callback which is zeroed, sigsegv.
Make various adjustments necessary to enable "make dist" to build a BIND
source tarball whose contents are complete enough to build binaries, run
unit & system tests, and generate documentation on Unix systems.
Known outstanding issues:
- "make distcheck" does not work yet.
- Tests do not work for out-of-tree source-tarball-based builds.
- Source tarballs are not complete enough for building on Windows.
All of the above will be addressed in due course.
Merge lib/isc/unix/ifiter_getifaddrs.c into lib/isc/unix/interfaceiter.c
and lib/isc/xoshiro128starstar.c into lib/isc/random.c. This avoids the
need for extra Automake directives required to process the "helper" *.c
files properly and makes the code more localized.
Turn the static check_bad_bits() function used by both Unix and Windows
systems into a "private" function and extract the "private" parts of
lib/isc/fsaccess.c to lib/isc/fsaccess_common_p.h. Instead of including
lib/isc/fsaccess.c from lib/isc/{unix,win32}/fsaccess.c, make the former
an independent C source file.
Rename lib/isc/fsaccess.c to lib/isc/fsaccess_common.c to prevent build
issues on Windows caused by multiple source files (lib/isc/fsaccess.c,
lib/isc/win32/fsaccess.c) being compiled into the same object file.
These changes improve consistency with the way "private" functions and
macros are treated elsewhere in the source tree.
The release notes were previously built as a separate document
(including the PDF version). It was agreed that this doesn't make much
sense, so the release notes are now included only as an appendix to the
BIND 9 ARM.
As a leftover from old TCP accept code isc_uv_import passed TCP_SERVER
flag when importing a socket on Windows.
Since now we're importing/exporting accepted connections it needs to
pass TCP_CONNECTION flag.
The SO_INCOMING_CPU is available since Linux 3.19 for getting the value,
but only since Linux 4.4 for setting the value (see below for a full
description). BIND 9 should not fail when setting the option on the
socket fails, as this is only an optimization and not hard requirement
to run BIND 9.
SO_INCOMING_CPU (gettable since Linux 3.19, settable since Linux 4.4)
Sets or gets the CPU affinity of a socket. Expects an integer flag.
int cpu = 1;
setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));
Because all of the packets for a single stream (i.e., all
packets for the same 4-tuple) arrive on the single RX queue that
is associated with a particular CPU, the typical use case is to
employ one listening process per RX queue, with the incoming
flow being handled by a listener on the same CPU that is
handling the RX queue. This provides optimal NUMA behavior and
keeps CPU caches hot.
when built with "configure --enable-singletrace", named will produce
detailed query logging at the highest debug level for any query with
query ID zero.
this enables monitoring of the progress of a single query by specifying
the QID using "dig +qid=0". the "client" logging category should be set
to a low severity level to suppress logging of other queries. (the
chance of another query using QID=0 at the same time is only 1 in 2^16.)
"--enable-singletrace" turns on "--enable-querytrace" as well, so if the
logging severity is not lowered, all other queries will be logged
verbosely as well. compiling with either of these options will impair
query performance; they should only be turned on when testing or
troubleshooting.
This adds a unit test driver for BIND with Automake. It runs the unit
test program provided as its sole command line argument and then looks
for a core dump generated by that test program. If one is found, the
driver prints the backtrace into the test log.
In process_fd we lock sock->lock and then internal_accept locks mgr->lock,
in isc_sockmgr_render* functions we lock mgr->lock and then lock sock->lock,
that can cause a deadlock when accessing stats. Unlock sock->lock early in
all the internal_{send,recv,connect,accept} functions instead of late
in process_fd.
Instead of using bind() and passing the listening socket to the children
threads using uv_export/uv_import use one thread that does the accepting,
and then passes the connected socket using uv_export/uv_import to a random
worker. The previous solution had thundering herd problems (all workers
waking up on one connection and trying to accept()), this one avoids this
and is simpler.
The tcp clients quota is simplified with isc_quota_attach_cb - a callback
is issued when the quota is available.
Add recursive "test" and "unit" rules, which execute "make check"
in specific directories - "make test" runs the system tests, and
"make unit" runs the unit tests.
The rewrite of BIND 9 build system is a large work and cannot be reasonable
split into separate merge requests. Addition of the automake has a positive
effect on the readability and maintainability of the build system as it is more
declarative, it allows conditional and we are able to drop all of the custom
make code that BIND 9 developed over the years to overcome the deficiencies of
autoconf + custom Makefile.in files.
This squashed commit contains following changes:
- conversion (or rather fresh rewrite) of all Makefile.in files to Makefile.am
by using automake
- the libtool is now properly integrated with automake (the way we used it
was rather hackish as the only official way how to use libtool is via
automake
- the dynamic module loading was rewritten from a custom patchwork to libtool's
libltdl (which includes the patchwork to support module loading on different
systems internally)
- conversion of the unit test executor from kyua to automake parallel driver
- conversion of the system test executor from custom make/shell to automake
parallel driver
- The GSSAPI has been refactored, the custom SPNEGO on the basis that
all major KRB5/GSSAPI (mit-krb5, heimdal and Windows) implementations
support SPNEGO mechanism.
- The various defunct tests from bin/tests have been removed:
bin/tests/optional and bin/tests/pkcs11
- The text files generated from the MD files have been removed, the
MarkDown has been designed to be readable by both humans and computers
- The xsl header is now generated by a simple sed command instead of
perl helper
- The <irs/platform.h> header has been removed
- cleanups of configure.ac script to make it more simpler, addition of multiple
macros (there's still work to be done though)
- the tarball can now be prepared with `make dist`
- the system tests are partially able to run in oot build
Here's a list of unfinished work that needs to be completed in subsequent merge
requests:
- `make distcheck` doesn't yet work (because of system tests oot run is not yet
finished)
- documentation is not yet built, there's a different merge request with docbook
to sphinx-build rst conversion that needs to be rebased and adapted on top of
the automake
- msvc build is non functional yet and we need to decide whether we will just
cross-compile bind9 using mingw-w64 or fix the msvc build
- contributed dlz modules are not included neither in the autoconf nor automake
With the introduction of dnssec-policy, the aforementioned tools were
either rendered obsolete, or they will be replaced with dnssec-policy
based tools. Remove the tools and the requirement to have Python
installed. Python 3 is still being used for tests, so keep the autoconf
test, but make it much simpler.
The pk11/constants.h header contained static CK_BYTE arrays and
we had to use #defines to pull only those we need. This commit
changes the constants to only define byte arrays with the content
and either use them directly or define the CK_BYTE arrays locally
where used.
All our MSVS Project files share the same intermediate directory. We
know that this doesn't cause any problems, so we can just disable the
detection in the project files.
Example of the warning:
warning MSB8028: The intermediate directory (.\Release\) contains files shared from another project (dnssectool.vcxproj). This can lead to incorrect clean and rebuild behavior.
MSVC documentation states: "This warning can be caused when a pointer to
a const or volatile item is assigned to a pointer not declared as
pointing to const or volatile."
Unfortunately, this happens when we dynamically allocate and deallocate
block of atomic variables using isc_mem_get and isc_mem_put.
Couple of examples:
lib\isc\hp.c(134): warning C4090: 'function': different 'volatile' qualifiers [C:\builds\isc-projects\bind9\lib\isc\win32\libisc.vcxproj]
lib\isc\hp.c(144): warning C4090: 'function': different 'volatile' qualifiers [C:\builds\isc-projects\bind9\lib\isc\win32\libisc.vcxproj]
lib\isc\stats.c(55): warning C4090: 'function': different 'volatile' qualifiers [C:\builds\isc-projects\bind9\lib\isc\win32\libisc.vcxproj]
lib\isc\stats.c(87): warning C4090: 'function': different 'volatile' qualifiers [C:\builds\isc-projects\bind9\lib\isc\win32\libisc.vcxproj]
The InterlockedOr8() and InterlockedAnd8() first argument was cast
to (atomic_int_fast8_t) instead of (atomic_int_fast8_t *), this was
reported by MSVC as:
warning C4024: '_InterlockedOr8': different types for formal and actual parameter 1
warning C4024: '_InterlockedAnd8': different types for formal and actual parameter 1
Our vcxproj files set the WarningLevel to Level3, which is too verbose
for a code that needs to be portable. That basically leads to ignoring
all the errors that MSVC produces. This commits downgrades the
WarningLevel to Level1 and enables treating warnings as errors for
Release builds. For the Debug builds the WarningLevel got upgraded to
Level4, and treating warnings as errors is explicitly disabled.
We should eventually make the code clean of all MSVC warnings, but it's
a long way to go for Level4, so it's more reasonable to start at Level1.
For reference[1], these are the warning levels as described by MSVC
documentation:
* /W0 suppresses all warnings. It's equivalent to /w.
* /W1 displays level 1 (severe) warnings. /W1 is the default setting
in the command-line compiler.
* /W2 displays level 1 and level 2 (significant) warnings.
* /W3 displays level 1, level 2, and level 3 (production quality)
warnings. /W3 is the default setting in the IDE.
* /W4 displays level 1, level 2, and level 3 warnings, and all level 4
(informational) warnings that aren't off by default. We recommend
that you use this option to provide lint-like warnings. For a new
project, it may be best to use /W4 in all compilations. This option
helps ensure the fewest possible hard-to-find code defects.
* /Wall displays all warnings displayed by /W4 and all other warnings
that /W4 doesn't include — for example, warnings that are off by
default.
* /WX treats all compiler warnings as errors. For a new project, it
may be best to use /WX in all compilations; resolving all warnings
ensures the fewest possible hard-to-find code defects.
1. https://docs.microsoft.com/en-us/cpp/build/reference/compiler-option-warning-level?view=vs-2019
The assembly code generated by MSVC for at least some signed comparisons
involving atomic variables incorrectly uses unsigned conditional jumps
instead of signed ones. In particular, the checks in isc_log_wouldlog()
are affected in a way which breaks logging on Windows and thus also all
system tests involving a named instance. Work around the issue by
assigning the values returned by atomic_load_acquire() calls in
isc_log_wouldlog() to local variables before performing comparisons.
The rwlock introduced to protect the .logconfig member of isc_log_t
structure caused a significant performance drop because of the rwlock
contention. It was also found, that the debug_level member of said
structure was not protected from concurrent read/writes.
The .dynamic and .highest_level members of isc_logconfig_t structure
were actually just cached values pulled from the assigned channels.
We introduced an even higher cache level for .dynamic and .highest_level
members directly into the isc_log_t structure, so we don't have to
access the .logconfig member in the isc_log_wouldlog() function.
We could have a race between handle closing and processing async
callback. Deactivate the handle before issuing the callback - we
have the socket referenced anyway so it's not a problem.
We introduce a isc_quota_attach_cb function - if ISC_R_QUOTA is returned
at the time the function is called, then a callback will be called when
there's quota available (with quota already attached). The callbacks are
organized as a LIFO queue in the quota structure.
It's needed for TCP client quota - with old networking code we had one
single place where tcp clients quota was processed so we could resume
accepting when the we had spare slots, but it's gone with netmgr - now
we need to notify the listener/accepter that there's quota available so
that it can resume accepting.
Remove unused isc_quota_force() function.
The isc_quote_reserve and isc_quota_release were used only internally
from the quota.c and the tests. We should not expose API we are not
using.
tcpdns used transport-specific functions to operate on the outer socket.
Use generic ones instead, and select the proper call in netmgr.c.
Make the missing functions (e.g. isc_nm_read) generic and add type-specific
calls (isc__nm_tcp_read). This is the preparation for netmgr TLS layer.
In isc_log_woudlog() the .logconfig member of isc_log_t structure was
accessed unlocked on the merit that there could be just a race when
.logconfig would be NULL, so the message would not be logged. This
turned not to be true, as there's also data race deeper. The accessed
isc_logconfig_t object could be in the middle of destruction, so the
pointer would be still non-NULL, but the structure members could point
to a chunk of memory no longer belonging to the object. Since we are
only accessing integer types (the log level), this would never lead to
a crash, it leads to memory access to memory area no longer belonging to
the object and this a) wrong, b) raises a red flag in thread-safety tools.
The isc_mem API now crashes on memory allocation failure, and this is
the next commit in series to cleanup the code that could fail before,
but cannot fail now, e.g. isc_result_t return type has been changed to
void for the isc_log API functions that could only return ISC_R_SUCCESS.
On Windows, C11 localtime_r() and gmtime_r() functions are not
available. While localtime() and gmtime() functions are already thread
safe because they use Thread Local Storage, it's quite ugly to #ifdef
around every localtime_r() and gmtime_r() usage to make the usage also
thread-safe on POSIX platforms.
The commit adds wrappers around Windows localtime_s() and gmtime_s()
functions.
NOTE: The implementation of localtime_s and gmtime_s in Microsoft CRT
are incompatible with the C standard since it has reversed parameter
order and errno_t return type.
The <isc/md.h> header directly included <openssl/evp.h> header which
enforced all users of the libisc library to explicitly list the include
path to OpenSSL and link with -lcrypto. By hiding the specific
implementation into the private namespace, we no longer enforce this.
In the long run, this might also allow us to switch cryptographic
library implementation without affecting the downstream users.
While making the isc_md_type_t type opaque, the API using the data type
was changed to use the pointer to isc_md_type_t instead of using the
type directly.
The <isc/md.h> header directly included <openssl/hmac.h> header which
enforced all users of the libisc library to explicitly list the include
path to OpenSSL and link with -lcrypto. By hiding the specific
implementation into the private namespace, we no longer enforce this.
In the long run, this might also allow us to switch cryptographic
library implementation without affecting the downstream users.
The two "functions" that isc/safe.h declared before were actually simple
defines to matching OpenSSL functions. The downside of the approach was
enforcing all users of the libisc library to explicitly list the include
path to OpenSSL and link with -lcrypto. By hiding the specific
implementation into the private namespace changing the defines into
simple functions, we no longer enforce this. In the long run, this
might also allow us to switch cryptographic library implementation
without affecting the downstream users.
The previous commit removed the code related to the internal symbol
table. On platforms where available, we can now use backtrace_symbols()
to print more verbose symbols table to the output.
As there's now general availability of backtrace() and
backtrace_symbols() functions (see below), the commit also removes the
usage of glibc internals and the custom stack tracing.
* backtrace(), backtrace_symbols(), and backtrace_symbols_fd() are
provided in glibc since version 2.1.
* backtrace(), backtrace_symbols(), and backtrace_symbols_fd() first
appeared in Mac OS X 10.5.
* The backtrace() library of functions first appeared in NetBSD 7.0 and
FreeBSD 10.0.
A data race was happening while BIND was starting due to
isc_log_wouldlog function accessing lctx->logconfig without a lock.
To prevent that without incurring much costs, that variable was made
atomic.
This commit fixes isc_glob function on windows environments.
The file_list_t * object pointed to by pglob->reserved was missing
ISC_LIST_INIT intialization macro.
There was a very slim chance of a race between isc_socket_detach and
process_fd: isc_socket_detach decrements references to 0, and before it
calls destroy gets preempted. Second thread calls process_fd, increments
socket references temporarily to 1, and then gets preempted, first thread
then hits assertion in destroy() as the reference counter is now 1 and
not 0.
When --with-zlib is passed to ./configure (or when the latter
autodetects zlib's presence), libisc uses certain zlib functions and
thus libisc's users should be linked against zlib in that case. Adjust
Makefile variables appropriately to prevent shared build failures caused
by underlinking.
mem.c:add_trace_entry() -> isc_hash_function() -> isc_siphash24()
129 for (; in != end; in += 8) {
6. byte_swapping: Performing a byte swapping operation on
in implies that it came from an external source, and is
therefore tainted.
130 uint64_t m = U8TO64_LE(in);
We were using our own versions of isc_uv_{export,import} functions
for multithreaded TCP listeners. Upcoming libuv version will
contain proper uv_{export,import} functions - use them if they're
available.
Upcoming version of libuv will suport uv_recvmmsg and uv_sendmmsg. To
use uv_recvmmsg we need to provide a larger buffer and be able to
properly free it.
isc_task_pause/unpause were inherently thread-unsafe - a task
could be paused only once by one thread, if the task was running
while we paused it it led to races. Fix it by making sure that
the task will pause if requested to, and by using a 'pause reference
counter' to count task pause requests - a task will be unpaused
iff all threads unpause it.
Don't remove from queue when pausing task - we lock the queue lock
(expensive), while it's unlikely that the task will be running -
and we'll remove it anyway in dispatcher
this corrects some style glitches such as:
```
long_function_call(arg, arg2, arg3, arg4, arg5, "str"
"ing");
```
...by adjusting the penalties for breaking strings and call
parameter lists.
While testing BIND 9 on arm64 8+ core machine, it was discovered that
the weak variants in fact does spuriously fail, we haven't observed that
on other architectures.
This commit replaces all non-loop usage of atomic_compare_exchange_weak
with atomic_compare_exchange_strong.
This commit simplifies a bit the lock management within dns_resolver_prime()
and prime_done() functions by means of turning resolver's attribute
"priming" into an atomic_bool and by creating only one dependent object on the
lock "primelock", namely the "primefetch" attribute.
By having the attribute "priming" as an atomic type, it save us from having to
use a lock just to test if priming is on or off for the given resolver context
object, within "dns_resolver_prime" function.
The "primelock" lock is still necessary, since dns_resolver_prime() function
internally calls dns_resolver_createfetch(), and whenever this function
succeeds it registers an event in the task manager which could be called by
another thread, namely the "prime_done" function, and this function is
responsible for disposing the "primefetch" attribute in the resolver object,
also for resetting "priming" attribute to false.
It is important that the invariant "priming == false AND primefetch == NULL"
remains constant, so that any thread calling "dns_resolver_prime" knows for sure
that if the "priming" attribute is false, "primefetch" attribute should also be
NULL, so a new fetch context could be created to fulfill this purpose, and
assigned to "primefetch" attribute under the lock protection.
To honor the explanation above, dns_resolver_prime is implemented as follow:
1. Atomically checks the attribute "priming" for the given resolver context.
2. If "priming" is false, assumes that "primefetch" is NULL (this is
ensured by the "prime_done" implementation), acquire "primelock"
lock and create a new fetch context, update "primefetch" pointer to
point to the newly allocated fetch context.
3. If "priming" is true, assumes that the job is already in progress,
no locks are acquired, nothing else to do.
To keep the previous invariant consistent, "prime_done" is implemented as follow:
1. Acquire "primefetch" lock.
2. Keep a reference to the current "primefetch" object;
3. Reset "primefetch" attribute to NULL.
4. Release "primefetch" lock.
5. Atomically update "priming" attribute to false.
6. Destroy the "primefetch" object by using the temporary reference.
This ensures that if "priming" is false, "primefetch" was already reset to NULL.
It doesn't make any difference in having the "priming" attribute not protected
by a lock, since the visible state of this variable would depend on the calling
order of the functions "dns_resolver_prime" and "prime_done".
As an example, suppose that instead of using an atomic for the "priming" attribute
we employed a lock to protect it.
Now suppose that "prime_done" function is called by Thread A, it is then preempted
before acquiring the lock, thus not reseting "priming" to false.
In parallel to that suppose that a Thread B is scheduled and that it calls
"dns_resolver_prime()", it then acquires the lock and check that "priming" is true,
thus it will consider that this resolver object is already priming and it won't do
any more job.
Conversely if the lock order was acquired in the other direction, Thread B would check
that "priming" is false (since prime_done acquired the lock first and set "priming" to false)
and it would initiate a priming fetch for this resolver.
An atomic variable wouldn't change this behavior, since it would behave exactly the
same, depending on the function call order, with the exception that it would avoid
having to use a lock.
There should be no side effects resulting from this change, since the previous
implementation employed use of the more general resolver's "lock" mutex, which
is used in far more contexts, but in the specifics of the "dns_resolver_prime"
and "prime_done" it was only used to protect "primefetch" and "priming" attributes,
which are not used in any of the other critical sections protected by the same lock,
thus having zero dependency on those variables.
Both clang-tidy and uncrustify chokes on statement like this:
for (...)
if (...)
break;
This commit uses a very simple semantic patch (below) to add braces around such
statements.
Semantic patch used:
@@
statement S;
expression E;
@@
while (...)
- if (E) S
+ { if (E) { S } }
@@
statement S;
expression E;
@@
for (...;...;...)
- if (E) S
+ { if (E) { S } }
@@
statement S;
expression E;
@@
if (...)
- if (E) S
+ { if (E) { S } }
There was a hard limit set on number of uvreq and nmhandles
that can be allocated by a pool, but we don't handle a situation
where we can't get an uvreq. Don't limit the number at all,
let the OS deal with it.
The memory ordering in the rwlock was all wrong, I am copying excerpts
from the https://en.cppreference.com/w/c/atomic/memory_order#Relaxed_ordering
for the convenience of the reader:
Relaxed ordering
Atomic operations tagged memory_order_relaxed are not synchronization
operations; they do not impose an order among concurrent memory
accesses. They only guarantee atomicity and modification order
consistency.
Release-Acquire ordering
If an atomic store in thread A is tagged memory_order_release and an
atomic load in thread B from the same variable is tagged
memory_order_acquire, all memory writes (non-atomic and relaxed atomic)
that happened-before the atomic store from the point of view of thread
A, become visible side-effects in thread B. That is, once the atomic
load is completed, thread B is guaranteed to see everything thread A
wrote to memory.
The synchronization is established only between the threads releasing
and acquiring the same atomic variable. Other threads can see different
order of memory accesses than either or both of the synchronized
threads.
Which basically means that we had no or weak synchronization between
threads using the same variables in the rwlock structure. There should
not be a significant performance drop because the critical sections were
already protected by:
while(1) {
if (relaxed_atomic_operation) {
break;
}
LOCK(lock);
if (!relaxed_atomic_operation) {
WAIT(sem, lock);
}
UNLOCK(lock)l
}
I would add one more thing to "Don't do your own crypto, folks.":
- Also don't do your own locking, folks.
Also disable the semantic patch as the code needs tweaks here and there because
some destroy functions might not destroy the object and return early if the
object is still in use.
Creation of EVP_MD_CTX and EVP_PKEY is quite expensive, until
we fix the code to reuse the context and key we'll use our own
implementation of siphash.
In system tests on Windows tool's local port can sometimes clash with
'named'. On Unix the system is poked for the minimal local port,
otherwise is set to 32768 as a sane minimum. For Windows we don't
poke but set a hardcoded limit; this change aligns the limit with
Unix and changes it to 32768.
389 else
CID 1452695 (#1 of 1): Dereference before null check (REVERSE_INULL)
check_after_deref: Null-checking lcfg suggests that it may
be null, but it has already been dereferenced on all paths
leading to the check.
390 if (lcfg != NULL)
391 isc_logconfig_destroy(&lcfg);
The isc_buffer_allocate() function now cannot fail with ISC_R_MEMORY.
This commit removes all the checks on the return code using the semantic
patch from previous commit, as isc_buffer_allocate() now returns void.
The isc_mempool_create() function now cannot fail with ISC_R_MEMORY.
This commit removes all the checks on the return code using the semantic
patch from previous commit, as isc_mempool_create() now returns void.
To reproduce the race - create a task, send two events to it, first one
must take some time. Then, from the outside, pause(), unpause() and detach()
the task.
When the long-running event is processed by the task it is in
task_state_running state. When we called pause() the state changed to
task_state_paused, on unpause we checked that there are events in the task
queue, changed the state to task_state_ready and enqueued the task on the
workers readyq. We then detach the task.
The dispatch() is done with processing the event, it processes the second
event in the queue, and then shuts down the task and frees it (as it's not
referenced anymore). Dispatcher then takes the, already freed, task from
the queue where it was wrongly put, causing an use-after free and,
subsequently, either an assertion failure or a segmentation fault.
The probability of this happening is very slim, yet it might happen under a
very high load, more probably on a recursive resolver than on an
authoritative.
The fix introduces a new 'task_state_pausing' state - to which tasks
are moved if they're being paused while still running. They are moved
to task_state_paused state when dispatcher is done with them, and
if we unpause a task in paused state it's moved back to task_state_running
and not requeued.
When two threads unreferenced handles coming from one socket while
the socket was being destructed we could get a use-after-free:
Having handle H1 coming from socket S1, H2 coming from socket S2,
S0 being a parent socket to S1 and S2:
Thread A Thread B
Unref handle H1 Unref handle H2
Remove H1 from S1 active handles Remove H2 from S2 active handles
nmsocket_maybe_destroy(S1) nmsocket_maybe_destroy(S2)
nmsocket_maybe_destroy(S0) nmsocket_maybe_destroy(S0)
LOCK(S0->lock)
Go through all children, figure
out that we have no more active
handles:
sum of S0->children[i]->ah == 0
UNLOCK(S0->lock)
destroy(S0)
LOCK(S0->lock)
- but S0 is already gone
We weren't consistent about who should unreference the handle in
case of network error. Make it consistent so that it's always the
client code responsibility to unreference the handle - either
in the callback or right away if send function failed and the callback
will never be called.
In tcp and udp stoplistening code we accessed libuv structures
from a different thread, which caused a shutdown crash when named
was under load. Also added additional DbC checks making sure we're
in a proper thread when accessing uv_ functions.
We had a race in which n UDP socket could have been already closing
by libuv but we still sent data to it. Mark socket as not-active
when stopping listening and verify that socket is not active when
trying to send data to it.
We pass interface as an opaque argument to tcpdns listening socket.
If we stop listening on an interface but still have in-flight connections
the opaque 'interface' is not properly reference counted, and we might
hit a dead memory. We put just a single source of truth in a listening
socket and make the child sockets use that instead of copying the
value from listening socket. We clean the callback when we stop listening.
- isc__netievent_storage_t was to small to contain
isc__netievent__socket_streaminfo_t on Windows
- handle isc_uv_export and isc_uv_import errors properly
- rewrite isc_uv_export and isc_uv_import on Windows
hp implementation requires an object for each thread accessing
a hazard pointer. previous implementation had a hardcoded
HP_MAX_THREAD value of 128, which failed on machines with lots of
CPU cores (named uses 3n threads). We make isc__hp_max_threads
configurable at startup, with the value set to 4*named_g_cpus.
It's also important for this value not to be too big as we do
linear searches on a list.
The isc_refcount API that provides reference counting lost DbC checks for
overflows and underflows in the isc_refcount_{increment,decrement} functions.
The commit restores the overflow check in the isc_refcount_increment and
underflows check in the isc_refcount_decrement by checking for the previous
value to not be on the boundary.
This commits removes superfluous checks when using the isc_refcount API.
Examples of superfluous checks:
1. The isc_refcount_decrement function ensures there was not underflow,
so this check is superfluous:
INSIST(isc_refcount_decrement(&r) > 0);
2 .The isc_refcount_destroy() includes check whether the counter
is zero, therefore this is superfluous:
INSIST(isc_refcount_decrement(&r) == 1 && isc_refcount_destroy(&r));
If a connection was closed early (right after accept()) an assertion
that assumed that the connection was still alive could be triggered
in accept_connection. Handle those errors properly and not with
assertions, free all the resources afterwards.
- the socket stat counters have been moved from socket.h to stats.h.
- isc_nm_t now attaches to the same stats counter group as
isc_socketmgr_t, so that both managers can increment the same
set of statistics
- isc__nmsocket_init() now takes an interface as a paramter so that
the address family can be determined when initializing the socket.
- based on the address family and socket type, a group of statistics
counters will be associated with the socket - for example, UDP4Active
with IPv4 UDP sockets and TCP6Active with IPv6 TCP sockets. note
that no counters are currently associated with TCPDNS sockets; those
stats will be handled by the underlying TCP socket.
- the counters are not actually used by netmgr sockets yet; counter
increment and decrement calls will be added in a later commit.
If pktinfo were supported then we could listen on :: for ipv6 and get
the information about the destination address from pktinfo structure passed
in recvmsg but this method is not portable and libuv doesn't support it - so
we need to listen on all interfaces.
We should verify that this doesn't impact performance (we already do it for
ipv4) and either remove all the ipv6pktinfo detection code or think of fixing
libuv.
- use UV_{TC,UD}P_IPV6ONLY for IPv6 sockets, keeping the pre-netmgr
behaviour.
- add a new listening_error bool flag which is set if the child
listener fails to start listening. This fixes a bug where named would
hang if, e.g., we failed to bind to a TCP socket.
For multithreaded TCP listening we need to pass a bound socket to all
listening threads. Instead of using uv_pipe handle passing method which
is quite complex (lots of callbacks, each of them with its own error
handling) we now use isc_uv_export() to export the socket, pass it as a
member of the isc__netievent_tcpchildlisten_t structure, and then
isc_uv_import() it in the child thread, simplifying the process
significantly.
These functions can be used to pass a uv handle between threads in a
safe manner. The other option is to use uv_pipe and pass the uv_handle
via IPC, which is way more complex. uv_export() and uv_import() functions
existed in libuv at some point but were removed later. This code is
based on the original removed code.
The Windows version of the code uses two functions internal to libuv;
a patch for libuv is attached for exporting these functions.
After the network manager rewrite, tcp-higwater stats was only being
updated when a valid DNS query was received over tcp.
It turns out tcp-quota is updated right after a tcp connection is
accepted, before any data is read, so in the event that some client
connect but don't send a valid query, it wouldn't be taken into
account to update tcp-highwater stats, that is wrong.
This commit fix tcp-highwater to update its stats whenever a tcp connection
is established, independent of what happens after (timeout/invalid
request, etc).
- make tcp listening IPC pipe name saner
- put the pipe in /tmp on unices
- add pid to the pipe name to avoid conflicts between processes
- fsync directory in which the pipe resides to make sure that the
child threads will see it and be able to open it
even when worker is paused (e.g. interface reconfiguration). This is
needed to prevent deadlocks when reconfiguring interfaces - as network
manager is paused then, but we still need to stop/start listening.
- Proper handling of TCP listen errors in netmgr - bind to the socket first,
then return the error code.
When listening for TCP connections we create a socket, bind it
and then pass it over IPC to all threads - which then listen on
in and accept connections. This sounds broken, but it's the
official way of dealing with multithreaded TCP listeners in libuv,
and works on all platforms supported by libuv.
For BIND 9.16+, TLS aware compiler is required, and using
ISC_THREAD_LOCAL is preferred way of using Thread Local Storage. The
isc_thread_key API is no longer used anywhere and hence was removed from
BIND 9.
The new ISC_THREAD_LOCAL macro unifies usage of platform dependent
Thread Local Storage definition thread_local vs __thread vs
__declspec(thread) to a single macro.
The commit also unifies the required level of support for TLS as for
some parts of the code it was mandatory and for some parts of the code
it wasn't.
FCTX_ATTR_SHUTTINGDOWN needs to be set and tested while holding the node
lock but the rest of the attributes don't as they are task locked. Making
fctx->attributes atomic allows both behaviours without races.
- restore support for tcp-initial-timeout, tcp-idle-timeout,
tcp-keepalive-timeout and tcp-advertised-timeout configuration
options, which were ineffective previously.
- add timeout support for TCP and TCPDNS connections to protect against
slowloris style attacks. currently, all timeouts are hard-coded.
- rework and simplify the TCPDNS state machine.
isc_mem_traceflag_test messes with stdout/stderr, which can cause
problems with subsequent tests (no output, libuv problems). Moving that
test case to the end ensures there are no side effects.
close the uv_handle for the worker async channel, and call
uv_loop_close() on shutdown to ensure that the event loop's
internal resources are properly freed.
when the TCPDNS_CLIENTS_PER_CONN limit has been exceeded for a TCP
DNS connection, switch to sequential mode to ensure that memory cannot
be exhausted by too many simultaneous queries.
- the netmgr was not correctly being specified when creating the task
manager, and was cleaned up in the wrong order when shutting down.
- on freebsd, timer_test appears to be prone to failure if the
netmgr is set up and torn down before and after ever test case, but
less so if it's only set up once at the beginning and once at the
end.
This commit renames isctest {mctx,lctx} to test_{mctx,lctx} and cleans
up their usage in the individual unit tests. This allows embedding
library .c files directly into the unit tests.
This allows a task to be temporary disabled so that objects won't be
processed simultaneously by libuv events and isc_task events. When a
task is paused, currently running events may complete, but no further
event will added to the run queue will be executed until the task is
unpaused.
When a task manager is created, we can now specify an `isc_nm`
object to associate with it; thereafter when the task manager is
placed into exclusive mode, the network manager will be paused.
This is a replacement for the existing isc_socket and isc_socketmgr
implementation. It uses libuv for asynchronous network communication;
"networker" objects will be distributed across worker threads reading
incoming packets and sending them for processing.
UDP listener sockets automatically create an array of "child" sockets
so each worker can listen separately.
TCP sockets are shared amongst worker threads.
A TCPDNS socket is a wrapper around a TCP socket, which handles the
the two-byte length field at the beginning of DNS messages over TCP.
(Other wrapper socket types can be implemented in the future to handle
DNS over TLS, DNS over HTTPS, etc.)
The double-locked queue implementation is still currently in use
in ns_client, but will be replaced by a fetch-and-add array queue.
This commit moves it from queue.h to list.h so that queue.h can be
used for the new data structure, and clean up dependencies between
list.h and types.h. Later, when the ISC_QUEUE is no longer is use,
it will be removed completely.
glibc 2.30 deprecated the <sys/sysctl.h> header [1]. However, that
header is still used on other Unix-like systems, so only prevent it from
being used on Linux, in order to prevent compiler warnings from being
triggered.
[1] https://sourceware.org/ml/libc-alpha/2019-08/msg00029.html
This variable will report the maximum number of simultaneous tcp clients
that BIND has served while running.
It can be verified by running rndc status, then inspect "tcp high-water:
count", or by generating statistics file, rndc stats, then inspect the
line with "TCP connection high-water" text.
The tcp-highwater variable is atomically updated based on an existing
tcp-quota system handled in ns/client.c.
cppcheck 1.89 enabled certain value flow analysis mechanisms [1] which
trigger null pointer dereference false positives in lib/dns/rpz.c:
lib/dns/rpz.c:582:7: warning: Possible null pointer dereference: tgt_ip [nullPointer]
if (KEY_IS_IPV4(tgt_prefix, tgt_ip)) {
^
lib/dns/rpz.c:1419:44: note: Calling function 'adj_trigger_cnt', 4th argument 'NULL' value is 0
adj_trigger_cnt(rpzs, rpz_num, rpz_type, NULL, 0, true);
^
lib/dns/rpz.c:582:7: note: Null pointer dereference
if (KEY_IS_IPV4(tgt_prefix, tgt_ip)) {
^
lib/dns/rpz.c:596:7: warning: Possible null pointer dereference: tgt_ip [nullPointer]
if (KEY_IS_IPV4(tgt_prefix, tgt_ip)) {
^
lib/dns/rpz.c:1419:44: note: Calling function 'adj_trigger_cnt', 4th argument 'NULL' value is 0
adj_trigger_cnt(rpzs, rpz_num, rpz_type, NULL, 0, true);
^
lib/dns/rpz.c:596:7: note: Null pointer dereference
if (KEY_IS_IPV4(tgt_prefix, tgt_ip)) {
^
lib/dns/rpz.c:610:7: warning: Possible null pointer dereference: tgt_ip [nullPointer]
if (KEY_IS_IPV4(tgt_prefix, tgt_ip)) {
^
lib/dns/rpz.c:1419:44: note: Calling function 'adj_trigger_cnt', 4th argument 'NULL' value is 0
adj_trigger_cnt(rpzs, rpz_num, rpz_type, NULL, 0, true);
^
lib/dns/rpz.c:610:7: note: Null pointer dereference
if (KEY_IS_IPV4(tgt_prefix, tgt_ip)) {
^
It seems that cppcheck no longer treats at least some REQUIRE()
assertion failures as fatal, so add extra assertion macro definitions to
lib/isc/include/isc/util.h that are only used when the CPPCHECK
preprocessor macro is defined; these definitions make cppcheck 1.89
behave as expected.
There is an important requirement for these custom definitions to work:
cppcheck must properly treat abort() as a function which does not
return. In order for that to happen, the __GNUC__ macro must be set to
a high enough number (because system include directories are used and
system headers compile attributes away if __GNUC__ is not high enough).
__GNUC__ is thus set to the major version number of the GCC compiler
used, which is what that latter does itself during compilation.
[1] aaeec462e6
The coccinellery repository provides many little semantic patches to fix common
problems in the code. The number of semantic patches in the coccinellery
repository is high and most of the semantic patches apply only for Linux, so it
doesn't make sense to run them on regular basis as the processing takes a lot of
time.
The list of issue found in BIND 9, by no means complete, includes:
- double assignment to a variable
- `continue` at the end of the loop
- double checks for `NULL`
- useless checks for `NULL` (cannot be `NULL`, because of earlier return)
- using `0` instead of `NULL`
- useless extra condition (`if (foo) return; if (!foo) { ...; }`)
- removing & in front of static functions passed as arguments
Until now, the build process for BIND on Windows involved upgrading the
solution file to the version of Visual Studio used on the build host.
Unfortunately, the executable used for that (devenv.exe) is not part of
Visual Studio Build Tools and thus there is no clean way to make that
executable part of a Windows Server container.
Luckily, the solution upgrade process boils down to just adding XML tags
to Visual Studio project files and modifying certain XML attributes - in
files which we pregenerate anyway using win32utils/Configure. Thus,
extend win32utils/Configure with three new command line parameters that
enable it to mimic what "devenv.exe bind9.sln /upgrade" does. This
makes the devenv.exe build step redundant and thus facilitates building
BIND in Windows Server containers.
* CKR_CRYPTOKI_ALREADY_INITIALIZED: This value can only be returned by
`C_Initialize`. It means that the Cryptoki library has already been
initialized (by a previous call to `C_Initialize` which did not have a
matching `C_Finalize` call).
* CKR_FUNCTION_NOT_SUPPORTED: The requested function is not supported by this
Cryptoki library. Even unsupported functions in the Cryptoki API should have a
“stub” in the library; this stub should simply return the value
CKR_FUNCTION_NOT_SUPPORTED.
* CKR_LIBRARY_LOAD_FAILED: The Cryptoki library could not load a dependent
shared library.
The OASIS pkcs11.h header has a restrictive license. Replace the
pkcs11.h pkcs11f.h and pkcs11t.h headers with pkcs11.h from p11-kit.
For source distribution, the license for the OASIS headers itself
doesn't pose any licensing problem when combined with MPL license, but
it possibly creates problem for downstream distributors of BIND 9.
Previously the libisc allocator had ability to run unlocked when threading was
disabled. As the threading is now always on, remove the ISC_MEMFLAG_NOLOCK
memory flag as it serves no purpose.
The isc_mem_createx() function was only used in the tests to eliminate using the
default flags (which as of writing this commit message was ISC_MEMFLAG_INTERNAL
and ISC_MEMFLAG_FILL). This commit removes the isc_mem_createx() function from
the public API.
Previously, the isc_mem_create() and isc_mem_createx() functions took `max_size`
and `target_size` as first two arguments. Those values were never used in the
BIND 9 code. The refactoring removes those arguments and let BIND 9 always use
the default values.
Previously, the isc_mem_create() and isc_mem_createx() functions could have
failed because of failed memory allocation. As this was no longer true and the
functions have always returned ISC_R_SUCCESS, the have been refactored to return
void.
This commits adds an OpenSSL based isc_siphash24() implementation, which is
preferred when available.
The siphash_test has been modified to test both implementation with a trick that
renames the isc_siphash24() to openssl_ or native_ prefixed name and includes
the ../siphash.c two times (when the OpenSSL implementation is available).
The native implementation's conversion from the uint8_t buffers to uint64_t now
follows the reference implementation that doesn't require aligned buffers.
isc_event_allocate() calls isc_mem_get() to allocate the event structure. As
isc_mem_get() cannot fail softly (e.g. it never returns NULL), the
isc_event_allocate() cannot return NULL, hence we remove the (ret == NULL)
handling blocks using the semantic patch from the previous commit.
The change fixes the following build failure on sparc T3 and older CPUs:
```
sparc-unknown-linux-gnu-gcc ... -O2 -mcpu=niagara2 ... -c rwlock.c
{standard input}: Assembler messages:
{standard input}:398: Error: Architecture mismatch on "pause ".
{standard input}:398: (Requires v9e|v9v|v9m|m8; requested architecture is v9b.)
make[1]: *** [Makefile:280: rwlock.o] Error 1
```
`pause` insutruction exists only on `-mcpu=niagara4` (`T4`) and upper.
The change adds `pause` configure-time autodetection and uses it if available.
config.h.in got new `HAVE_SPARC_PAUSE` knob. Fallback is a fall-through no-op.
Build-tested on:
- sparc-unknown-linux-gnu-gcc (no `pause`, build succeeds)
- sparc-unknown-linux-gnu-gcc -mcpu=niagara4 (`pause`, build succeeds)
Reported-by: Rolf Eike Beer
Bug: https://bugs.gentoo.org/691708
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Previously isc_thread_join() would return ISC_R_UNEXPECTED on a failure to
create new thread. All such occurences were caught and wrapped into assert
function at higher level. The function was simplified to assert directly in the
isc_thread_join() function and all caller level assertions were removed.
Previously isc_thread_create() would return ISC_R_UNEXPECTED on a failure to
create new thread. All such occurences were caught and wrapped into assert
function at higher level. The function was simplified to assert directly in the
isc_thread_create() function and all caller level assertions were removed.
Using isc_mem_put(mctx, ...) + isc_mem_detach(mctx) required juggling with the
local variables when mctx was part of the freed object. The isc_mem_putanddetach
function can handle this case internally, but it wasn't used everywhere. This
commit apply the semantic patching plus bit of manual work to replace all such
occurrences with proper usage of isc_mem_putanddetach().
"PST8PDT" is a legacy time zone name whose use in modern code is
discouraged. It so happens that using this time zone with musl libc
time functions results in different output than for other libc
implementations, which breaks the lib/isc/tests/time_test unit test.
Use the "America/Los_Angeles" time zone instead in order to get
consistent output across all tested libc implementations.
Including <sys/errno.h> instead of <errno.h> raises a compiler warning
when building against musl libc. Always include <errno.h> instead of
<sys/errno.h> to prevent that compilation warning from being triggered
and to achieve consistency in this regard across the entire source tree.
Make sure all unit tests include headers in a similar order:
1. Three headers which must be included before <cmocka.h>.
2. System headers.
3. UNIT_TESTING definition, followed by the <cmocka.h> header.
4. libisc headers.
5. Headers from other BIND libraries.
6. Local headers.
Also make sure header file names are sorted alphabetically within each
block of #include directives.
All unit tests define the UNIT_TESTING macro, which causes <cmocka.h> to
replace malloc(), calloc(), realloc(), and free() with its own functions
tracking memory allocations. In order for this not to break
compilation, the system header declaring the prototypes for these
standard functions must be included before <cmocka.h>.
Normally, these prototypes are only present in <stdlib.h>, so we make
sure it is included before <cmocka.h>. However, musl libc also defines
the prototypes for calloc() and free() in <sched.h>, which is included
by <pthread.h>, which is included e.g. by <isc/mutex.h>. Thus, unit
tests including "dnstest.h" (which includes <isc/mem.h>, which includes
<isc/mutex.h>) after <cmocka.h> will not compile with musl libc as for
these programs, <sched.h> will be included after <cmocka.h>.
Always including <cmocka.h> after all other header files is not a
feasible solution as that causes the mock assertion macros defined in
<isc/util.h> to mangle the contents of <cmocka.h>, thus breaking
compilation. We cannot really use the __noreturn__ or analyzer_noreturn
attributes with cmocka assertion functions because they do return if the
tested condition is true. The problem is that what BIND unit tests do
is incompatible with Clang Static Analyzer's assumptions: since we use
cmocka, our custom assertion handlers are present in a shared library
(i.e. it is the cmocka library that checks the assertion condition, not
a macro in unit test code). Redefining cmocka's assertion macros in
<isc/util.h> is an ugly hack to overcome that problem - unfortunately,
this is the only way we can think of to make Clang Static Analyzer
properly process unit test code. Giving up on Clang Static Analyzer
being able to properly process unit test code is not a satisfactory
solution.
Undefining _GNU_SOURCE for unit test code could work around the problem
(musl libc's <sched.h> only defines the prototypes for calloc() and
free() when _GNU_SOURCE is defined), but doing that could introduce
discrepancies for unit tests including entire *.c files, so it is also
not a good solution.
All in all, including <sched.h> before <cmocka.h> for all affected unit
tests seems to be the most benign way of working around this musl libc
quirk. While quite an ugly solution, it achieves our goals here, which
are to keep the benefit of proper static analysis of unit test code and
to fix compilation against musl libc.
- removed some dead code
- dns_zone_setdbtype is now void as it could no longer return
anything but ISC_R_SUCCESS; calls to it no longer check for a result
- controlkeylist_fromconfig() is also now void
- fixed a whitespace error
This commit changes the BIND cookie algorithms to match
draft-sury-toorop-dnsop-server-cookies-00. Namely, it changes the Client Cookie
algorithm to use SipHash 2-4, adds the new Server Cookie algorithm using SipHash
2-4, and changes the default for the Server Cookie algorithm to be siphash24.
Add siphash24 cookie algorithm, and make it keep legacy aes as