Change the names of the node reference counting functions
and add comments to make the mechanism easier to understand:
- newref() and decref() are now called qpcnode_acquire()/
qpznode_acquire() and qpcnode_release()/qpznode_release()
respectively; this reflects the fact that they modify both
the internal and external reference counters for a node.
- qpcnode_newref() and qpznode_newref() are now called
qpcnode_erefs_increment() and qpznode_erefs_increment(), and
qpcnode_decref() and qpznode_decref() are now called
qpcnode_erefs_decrement() and qpznode_erefs_decrement(),
to reflect that they only increase and decrease the node's
external reference counters, not internal.
This removes the db_nodelock_t structure and changes the node_locks
array to be composed only of isc_rwlock_t pointers. The .reference
member has been moved to qpdb->references in addition to
common.references that's external to dns_db API users. The .exiting
members has been completely removed as it has no use when the reference
counting is used correctly.
The origin_node in qpcache was always NULL, so we can remove the
getoriginode() function and origin_node pointer as the
dns_db_getoriginnode() correctly returns ISC_R_NOTFOUND when the
function is not implemented.
Cleanup the pattern in the decref() functions in both qpcache.c and
qpzone.c, so it follows the similar patter as we already have in
newref() function.
In order to avoid to loop to find the next position to store an EDE in
a dns_edectx_t, add a "nextede" state which holds the next available
position.
Also, in order ot avoid to loop to find if an EDE is already existing in
a dns_edectx_t, and avoid a duplicate, use a bitmap to immediately know
if the EDE is there or not.
Those both changes applies for adding or copying EDE.
Also make the direction of dns_ede_copy more explicit/avoid errors by
making "edectx_from" a const pointer.
Migrate tests cases in client_test code which were exclusively testing
code which is now all wrapped inside ede compilation unit. Those are
testing maximum number of EDE, duplicate EDE as well as truncation of
text of an EDE.
Also add coverage for the copy of EDE from an edectx to another one, as
well as checking the assertion of the maximum EDE info code which can be
used.
Instead of mixing the dns_resolver and dns_validator units directly with
the EDE code, split-out the dns_ede functionality into own separate
compilation unit and hide the implementation details behind abstraction.
Additionally, the EDE codes are directly copied into the ns_client
buffers by passing the EDE context to dns_resolver_createfetch().
This makes the dns_ede implementation simpler to use, although sligtly
more complicated on the inside.
Co-authored-by: Colin Vidal <colin@isc.org>
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Extended DNS error 22 (No reachable authority) was previously detected
when `fctx_expired` fired. It turns out this function is used as a
"safety net" and the timeout detection should be caught earlier.
It was working though, because of another issue fixed by !9927. Since
this change, the recursive request timed out detection occurs before
`fctx_expired` so EDE 22 is not added to the response message anymore.
The fix of the problem is to add the EDE 22 code in two situations:
- When the dispatch code timed out (rctx_timedout) the resolver code
checks various properties to figure out if it needs to make another
fetch attempt. One of the paramters if the fetch expiration time. If
it expires, the whole recursion is canceled, so it now adds the EDE 22
code.
- If the fetch expiration time doesn't expires in the case above (and
other parameters allows it) a new fetch attempt is made (fctx_query).
But before the new request is actually made, the fetch expiration time
is re-checked. It might then has elapsed, and the whole recursion is
canceled. So it now also adds the EDE 22 code here as well.
Add support for EDE codes 1 (Unsupported DNSKEY Algorithm) and 2
(Unsupported DS Digest Type) which might occurs during DNSSEC
validation in case of unsupported DNSKEY algorithm or DS digest type.
Because DNSSEC internally kicks off various fetches, we need to copy
all encountered extended errors from fetch responses to the fetch
context. Upon an event, the errors from the fetch context are copied
to the client response.
the isc_mem allocation functions can no longer fail; as a result,
ISC_R_NOMEMORY is now rarely used: only when an external library
such as libjson-c or libfstrm could return NULL. (even in
these cases, arguably we should assert rather than returning
ISC_R_NOMEMORY.)
code and comments that mentioned ISC_R_NOMEMORY have been
cleaned up, and the following functions have been changed to
type void, since (in most cases) the only value they could
return was ISC_R_SUCCESS:
- dns_dns64_create()
- dns_dyndb_create()
- dns_ipkeylist_resize()
- dns_kasp_create()
- dns_kasp_key_create()
- dns_keystore_create()
- dns_order_create()
- dns_order_add()
- dns_peerlist_new()
- dns_tkeyctx_create()
- dns_view_create()
- dns_zone_setorigin()
- dns_zone_setfile()
- dns_zone_setstream()
- dns_zone_getdbtype()
- dns_zone_setjournal()
- dns_zone_setkeydirectory()
- isc_lex_openstream()
- isc_portset_create()
- isc_symtab_create()
(the exception is dns_view_create(), which could have returned
other error codes in the event of a crypto library failure when
calling isc_file_sanitize(), but that should be a RUNTIME_CHECK
anyway.)
Track inside the dns_dnsseckey structure whether we have seen the
private key, or if this key only has a public key file.
If the key only has a public key file, or a DNSKEY reference in the
zone, mark the key 'pubkey'. In dnssec-signzone, if the key only
has a public key available, consider the key to be offline. Any
signatures that should be refreshed for which the key is not available,
retain the signature.
So in the code, 'expired' becomes 'refresh', and the new 'expired'
is only used to determine whether we need to keep the signature if
the corresponding key is not available (retaining the signature if
it is not expired).
In the 'keysthatsigned' function, we can remove:
- key->force_publish = false;
- key->force_sign = false;
because they are redundant ('dns_dnsseckey_create' already sets these
values to false).
Extended DNS error mechanism (EDE) enables to have several EDE raised
during a DNS resolution (typically, a DNSSEC query will do multiple
fetches which each of them can have an error). Add support to up to 3
EDE errors in an DNS response. If duplicates occur (two EDEs with the
same code, the extra text is not compared), only the first one will be
part of the DNS answer.
Because the maximum number of EDE is statically fixed, `ns_client_t`
object own a static vector of `DNS_DE_MAX_ERRORS` (instead of a linked
list, for instance). The array can be fully filled (all slots point to
an allocated `dns_ednsopt_t` object) or partially filled (or
empty). In such case, the first NULL slot means there is no more EDE
objects.
When TCP is used, 'fctx_query()' adds one second to the rtt
(round-trip time) value, but there's a bug when the decision
about using TCP is made already after the calculation. Move the
block of the code which looks up the peers list to decide
whether to use TCP into a place that's before the rtt calculation
is performed. This commit doesn't add or remove any code, it just
moves the code and adds a comment block.
When 'ISC_R_TIMEDOUT' is received in 'tcp_recv()', it times out the
oldest response in the active responses queue, and only after that it
checks whether other active responses have also timed out. So when
setting a timeout value for a read operation after a successful
connection, it makes sense to take the timeout value from the oldest
response in the active queue too, because, theoretically, the responses
can have different timeout values, e.g. when the TCP dispatch is shared.
Currently 'resp' is always NULL. Previously when connect and read
timeouts were not separated in dispatch this affected only logging, but
now since we are setting a new timeout after a successful connection,
we need to choose a suitable response from the active queue.
Previously, this function always acquires a node write lock if it
might need node cleanup in case the reference decrements to 0. In
fact, the lock is unnecessary if the reference is larger than 1 and it
can be optimized as an "easy" case. This optimization could even be
"necessary". In some extreme cases, many worker threads could repeat
acquring and releasing the reference on the same node, resulting in
severe lock contention for nothing (as the ref wouldn't decrement to 0
in most cases). This change would prevent noticeable performance
drop like query timeout for such cases.
Co-authored-by: JINMEI Tatuya <jtatuya@infoblox.com>
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Currently, the fetch context will continue running even when the last
fetch (response) has been removed from the context, so named can process
and cache the answer. This can lead to a situation where the number of
outgoing recursing clients exceeds the the configured number for
recursive-clients.
Be more stringent about the recursive-clients limit and shutdown the
fetch context immediately after the last fetch has been canceled from
that particular fetch context.
Address Database (ADB) shares the memory for the short lived ADB
objects (finds, fetches, addrinfo) and the long lived ADB
objects (names, entries, namehooks). This could lead to a situation
where the resolver-heavy load would force evict ADB objects from the
database to point where ADB is completely empty, leading to even more
resolver-heavy load.
Make the short lived ADB objects use the other memory context that we
already created for the hashmaps. This makes the ADB overmem condition
to not be triggered by the ongoing resolver fetches.
In some places there was a limitation of the maximum timeout
value of INT16_MAX, which is only about 32 seconds. Refactor
the code to remove the limitation.
The network manager layer has two different timers with their
own timeout values for TCP connections: connect timeout and read
timeout. Separate the connect and the read TCP timeouts in the
dispatch module too.
struct fetchctx does have a list of pending validators as well as a
pointer to the HEAD validator. Remove the validator pointer to avoid
confusion, as there is no perticular reasons to have it directly
accessible outside of the list.
Limit the number of records appended to ADDITIONAL section to the names
that have less than 14 records in the RDATA. This limits the number
of the lookups into the database(s) during single client query.
Also don't append any additional data to ANY queries. The answer to ANY
is already big enough.
All the database implementations share the same names for the methods
implementing the database. That has some advantages like knowing what
to expect, but it turns out that any time such method shows up in any
kind of tracing - be it perf record, backtrace or anything else that
uses symbol names, it is very hard to distinguish whether the find()
belongs to qpcache, qpzone, builtin or sdlz implementation.
Make at least the names for qpzone and qpcache unique.
when adding a new NSEC3 record, dns_nsec3_addnsec3() uses a
dbiterator to seek to the newly created node and then find its
predecessor. dbiterators in the qpzone use snapshots, so changes
to the database are not reflected in an already-existing iterator.
consequently, when we add a new node, we have to create a new iterator
before we can seek to it.
when a requested name is found in the QP trie during a lookup, but its
records have been marked as nonexistent by a previous deletion, then
it's treated as a partial match, but the foundname could be left
pointing to the original qname rather than the parent. this could
lead to an assertion failure in query_findclosestnsec3().
The code in zone_startload() disables RPZ and CATZ for a zone if
dns_master_loadfile() returns anything other than ISC_R_SUCCESS,
which makes sense, but it's an error because zone_startload() can
also return DNS_R_SEENINCLUDE upon success when the zone had an
$INCLUDE statement.
Ensure the log prefixes passed to the dns_message_logpacketfrom()
function by its callers do not include the word "from" as the latter is
now emitted by the logfmtpacket() helper function.
Ensure the log prefixes passed to the dns_message_logpacketfromto()
function by its callers do not include the words "from" or "to" as those
are now emitted by the logfmtpacket() helper function.
Move dns_dispentry_getlocaladdress() calls around so that they are not
only invoked when dnstap support is compiled in. This function calls
isc_nmhandle_localaddr(), which may issue a system call, but only if the
ISC_SOCKET_DETAILS preprocessor macro is set at compile time.
Pass the value extracted by dns_dispentry_getlocaladdress() to
dns_message_logpacketfromto() so that it gets logged, adding useful
information to the relevant debug messages.
Since dns_message_logpacket() only takes a single socket address as a
parameter (and it is always the sending socket's address), rename it to
dns_message_logpacketfrom() so that its name better conveys its purpose
and so that the difference in purpose between this function and
dns_message_logpacketfromto() becomes more apparent.
Since dns_message_logfmtpacket() needs to be provided with both "from"
and "to" socket addresses, rename it to dns_message_logpacketfromto() so
that its name better conveys its purpose. Clean up the code comments
for that function.
Change the function prototype for dns_message_logfmtpacket() so that it
takes two isc_sockaddr_t parameters: one for the sending side and
another one for the receiving side. This enables debug messages to be
more precise.
Also adjust the function prototype for logfmtpacket() accordingly.
Unlike dns_message_logfmtpacket(), this function must not require both
'from' and 'to' parameters to be non-NULL as it is still going to be
used by dns_message_logpacket(), which only provides a single socket
address. Adjust its log format to handle both of these cases properly.
Adjust both dns_message_logfmtpacket() call sites accordingly, without
actually providing the second socket address yet. (This causes the
revised REQUIRE() assertion in dns_message_logfmtpacket() to fail; the
issue will be addressed in a separate commit.)
Both existing callers of the dns_message_logfmtpacket() function set the
argument passed as 'style' to &dns_master_style_comment. To simplify
these call sites, drop the 'style' parameter from the prototype for
dns_message_logfmtpacket() and use a fixed value of
&dns_master_style_comment in the function's body instead.
All callers of the logfmtpacket() helper function require the argument
passed as 'address' to be non-NULL. Meanwhile, the 'newline' and
'space' local variables in logfmtpacket() are only set to values
different than their initial values if the 'address' parameter is NULL.
Replace the 'newline' and 'space' local variables in logfmtpacket() with
fixed strings to improve code readability.
Extracting the exact address that each wildcard/TCP socket is bound to
locally requires issuing the getsockname() system call, which libuv
exposes via its uv_*_getsockname() functions. This is only required for
detailed logging and comes at a noticeable performance cost, so it
should not happen by default. However, it is useful for debugging
certain problems (e.g. cryptic system test failures), so a convenient
way of enabling that behavior should exist.
Update isc_nmhandle_localaddr() so that it calls uv_*_getsockname() when
the ISC_SOCKET_DETAILS preprocessor macro is set at compile time.
Ensure proper handling of sockets that wrap other sockets.
Set the new ISC_SOCKET_DETAILS macro by default when --enable-developer
is passed to ./configure. This enables detailed logging in the system
tests run in GitLab CI without affecting performance in non-development
BIND 9 builds.
Note that setting the ISC_SOCKET_DETAILS preprocessor macro at compile
time enables all callers of isc_nmhandle_localaddr() to extract the
exact address of a given local socket, which results e.g. in dnstap
captures containing more accurate information.
Mention the new preprocessor macro in the section of the ARM that
discusses why exact socket addresses may not be logged by default.
The dns_dispatch_gettcp() function is used for finding an existing TCP
connection that can be reused for sending a query from a specified local
address to a specified remote address. The logic for matching the
provided <local address, remote address> tuple to one of the existing
TCP connections is implemented in the dispatch_match() function:
- if the examined TCP connection already has a libuv handle assigned,
it means the connection has already been established; therefore,
compare the provided <local address, remote address> tuple against
the corresponding address tuple for the libuv handle associated with
the connection,
- if the examined TCP connection does not yet have a libuv handle
assigned, it means the connection has not yet been established;
therefore, compare the provided <local address, remote address>
tuple against the corresponding address tuple that the TCP
connection was originally created for.
This logic limits TCP connection reuse potential as the libuv handle
assigned to an existing dispatch object may have a more specific local
<address, port> tuple associated with it than the local <address, port>
tuple that the dispatch object was originally created for. That's
because the local address for outgoing connections can be set to a
wildcard <address, port> tuple (indicating that the caller does not care
what source <address, port> tuple will be used for establishing the
connection, thereby delegating the task of picking it to the operating
system) and then get "upgraded" to a specific <address, port> tuple when
the socket is bound (and a libuv handle gets associated with it). When
another dns_dispatch_gettcp() caller then tries to look for an existing
TCP connection to the same peer and passes a wildcard address in the
local part of the tuple, the function will not match that request to a
previously-established TCP connection (unless isc_nmhandle_localaddr()
returns a wildcard address as well).
Simplify dispatch_match() so that the libuv handle associated with an
existing dispatch object is not examined for the purpose of matching it
to the provided <local address, remote address> tuple; instead, always
examine the <local address, remote address> tuple that the dispatch
object was originally created for. This enables reuse of TCP
connections created without providing a specific local socket address
while still preventing other connections (created for a specific local
socket address) from being inadvertently shared.
This commit ensures that BIND enables TLS SNI support for outgoing DoT
connections (when possible) in order to improve compatibility with
other DNS server software.
This commit adds support for setting SNI hostnames in outgoing
connections over TLS.
Most of the changes are related to either adapting the code to accept
and extra argument in *connect() functions and a couple of changes to
the TLS Stream to actually make use of the new SNI hostname
information.
Since BIND 9 headers are not longer public, there's no reason to keep
the ISC_LANG_BEGINDECL and ISC_LANG_ENDDECL macros to support including
them from C++ projects.
This is a second attempt to rewrite the GLUE cache to not use per
database version hash table. Instead of keeping a hash table indexed by
the node, use a directly linked list of GLUE records for each
slabheader. This was attempted before, but there was a data race caused
by the fact that the thread cleaning the GLUE records could be slower
than accessing the slab headers again and reinitializing the wait-free
stack.
The improved design builds on the previous design, but adds a new
dns_gluelist structure that has a pointer to the database version.
If a dns_gluelist belonging to a different (old) version is detected, it
is just detached from the slabheader and left for the closeversion() to
clean it up later.