Unify libcrypto initialization and explicit digest fetching in a single
place and move relevant code to the isc__crypto namespace instead of
isc__tls.
It will remove the remaining implicit fetching and deduplicate explicit
fetching inside the codebase.
The <openssl/hmac.h> header was unused and including the
header might cause build failure when OpenSSL doesn't have
Engines support enabled.
See https://fedoraproject.org/wiki/Changes/OpensslDeprecateEngine
Removes unused hmac includes after Remove OpenSSL Engine support
(commit ef7aba7072) removed engine
support.
maxlabels is the suffix length that corresponds to the latest
NXDOMAIN response. minlabels is the suffix length that corresponds
to longest found existing name.
Prior to running the keymgr, first make sure that existing keys
are present in the new keylist. If not, treat this as an operational
error where the keys are made offline (temporarily), possibly unwanted.
Rename check_recursionquota() to acquire_recursionquota(), and
implement a new function called release_recursionquota() to
reverse the action. It helps with decreasing code duplication.
In two places, after linking the client to the manager's
"recursing-clients" list using the check_recursionquota()
function, the query.c module fails to unlink it on error
paths. Fix the bugs by unlinking the client from the list.
Also make sure that unlinking happens before detaching the
client's handle, as it is the logically correct order, e.g.
in case if it's the last handle and ns__client_reset_cb()
can be called because of the detachment.
The dns_zone_getxfrintime() function fails to lock the zone before
accessing its 'xfrintime' structure member, which can cause a data
race between soa_query() and the statistics channel. Add the missing
locking/unlocking pair, like it's done in numerous other similar
functions.
The 'nodetach' member is a leftover from the times when non-zero
'stale-answer-client-timeout' values were supported, and currently
is always 'false'. Clean up the member and its usage.
Currently, the outgoing UDP sockets have enabled
SO_REUSEADDR (SO_REUSEPORT on BSDs) which allows multiple UDP sockets to
bind to the same address+port. There's one caveat though - only a
single (the last one) socket is going to receive all the incoming
traffic. This in turn could lead to incoming DNS message matching to
invalid dns_dispatch and getting dropped.
Disable setting the SO_REUSEADDR on the outgoing UDP sockets. This
needs to be done explicitly because `uv_udp_open()` silently enables the
option on the socket.
When matching the TCP dispatch responses, we should skip the responses
that do not belong to our TCP connection. This can happen with faulty
upstream server that sends invalid QID back to us.
The dns_dispatch_add() function registers the 'resp' entry in
'disp->mgr->qids' hash table with 'resp->port' being 0, but in
tcp_recv_success(), when looking up an entry in the hash table
after a successfully received data the port is used, so if the
local port was set (i.e. it was not 0) it fails to find the
entry and results in an unexpected error.
Set the 'resp->port' to the given local port value extracted from
'disp->local'.
This commit adds support for timestamps in iso8601 format with timezone
when logging. This is exposed through the iso8601-tzinfo printtime
suboption.
It also makes the new logging format the default for -g output,
hopefully removing the need for custom timestamp parsing in scripts.
This commit nulls all type fields for the clausedef lists that are
declared ancient, and removes the corresponding cfg_type_t and parsing
functions when they are found to be unused after the change.
Static-stub address and addresses from other sources where being
mixed together resulting in static-stub queries going to addresses
not specified in the configuration or alternatively static-stub
addresses being used instead of the real addresses.
As the relaxed memory ordering doesn't ensure any memory
synchronization, it is possible that the increment will succeed even
in the case when it should not - there is a race between
atomic_fetch_sub(..., acq_rel) and atomic_fetch_add(..., relaxed).
Only the result is consistent, but the previous value for both calls
could be same when both calls are executed at the same time.
An exit path in the dns_dispatch_add() function fails to get out of
the RCU critical section when returning early. Add the missing
rcu_read_unlock() call.
This change uses uv_get_total_memory() to get the memory available to
BIND 9 with possible modification by uv_get_constrained_memory() if the
libuv version is recent enough to honour constraints created by
f.e. cgroups.
The OpenBSD doesn't have sysctlbyname(), but sysctl() can be used to
read the number of online/available CPUs by reading following MIB(s):
[CTL_HW, HW_NCPUONLINE] with fallback to [CTL_HW, HW_NCPU].
Cleanup various checks and cleanups that are available on the all
platforms like sysctlbyname() and various related <sys/*.h> headers
that are either defined in POSIX or available on Linux and all BSDs.
Instead of cooking up our own code for getting the number of available
CPUs for named to use, make use of uv_available_parallelism() from
libuv >= 1.44.0.
The dns_dispatch_add() call in the dns_xfrin unit had hardcoded 30
second limit. This meant that any incoming transfer would be stopped in
it didn't finish within 30 seconds limit. Additionally, dns_xfrin
callback was ignoring the return value from dns_dispatch_getnext() when
restarting the reading from the TCP stream; this could cause transfers
to get stuck waiting for a callback that would never come due to the
dns_dispatch having already been shut down.
Call the dns_dispatch_add() without a timeout and properly handle the
result code from the dns_dispatch_getnext().
The dns_dispatch_add() has timeout parameter that could not be 0 (for
not timeout). Modify the dns_dispatch implementation to accept a zero
timeout for cases where the timeouts are undesirable because they are
managed externally.
Query and response log shares the same flags. Move flags logging out of
log_query to share it with log_response. Use buffer instead of snprintf
to fill flags a bit faster.
Signed-off-by: Petr Menšík <pemensik@redhat.com>
Remove answer flag from log, log instead count of records for each
message section. Include EDNS version and few flags of response. Add
also status of result.
Still does not include body of responses rrset.
when the QP cache was adapted from the RBT database, some names
weren't changed. this could be confusing, so let's change them now.
also, we no longer need to include rbt.h.
DNSRPS was the API for a commercial implementation of Response-Policy
Zones that was supposedly better. However, it was never open-sourced
and has only ever been available from a single vendor. This goes against
the principle that the open-source edition of BIND 9 should contain only
features that are generally available and universal.
This commit removes the DNSRPS implementation from BIND 9. It may be
reinstated in the subscription edition if there's enough interest from
customers, but it would have to be rewritten as a plugin (hook) instead
of hard-wiring it again in so many places.
If the operating system UDP queue gets full and the outgoing UDP sending
starts to be delayed, BIND 9 could exhibit memory spikes as it tries to
enqueue all the outgoing UDP messages. As those are not going to be
delivered anyway (as we argued when we stopped enlarging the operating
system send and receive buffers), try to send the UDP messages directly
using `uv_udp_try_send()` and if that fails, drop the outgoing UDP
message.
We currently set SO_INCOMING_CPU incorrectly, and testing by Ondrej
shows that fixing the issue and setting affinities is worse than letting
the kernel schedule threads without constraints. So we should not set
SO_INCOMING_CPU anymore.
The 'all_spilled' local variable in resolver.c:fctx_getaddresses()
is 'true' by default, and only becomes false when there is at least
one successfully found NS address. However, when a 'forward only;'
configuration is used, the code jumps over the part where it looks
for NS addresses and doesn't reset the 'all_spilled' to false, which
results in incorretly increased 'serverquota' statistics variable,
and also in invalid return error code from the function. The result
code error didn't make any differences, because all codes other than
'ISC_R_SUCCESS' or 'DNS_R_WAIT' were treated in the same way, and
the result code was never logged anywhere.
Set the default value of 'all_spilled' to 'false', and only make it
'true' before actually starting to look up NS addresses.
Currently, the isc_work API is overloaded. It runs both the
CPU-intensive operations like DNSSEC validations and long-term tasks
like RPZ processing, CATZ processing, zone file loading/dumping and few
others.
Under specific circumstances, when many large zones are being loaded, or
RPZ zones processed, this stops the CPU-intensive tasks and the DNSSEC
validation is practically stopped until the long-running tasks are
finished.
As this is undesireable, this commit moves the CPU-intensive operations
from the isc_work API to the isc_helper API that only runs fast memory
cleanups now.
Add an extra thread that can be used to offload operations that would
affect latency, but are not long-running tasks; those are handled by
isc_work API.
Each isc_loop now has matching isc_helper thread that also built on top
of uv_loop. In fact, it matches most of the isc_loop functionality, but
only the `isc_helper_run()` asynchronous call is exposed.
When verifying a message in an offloaded thread there is a race with
the worker thread which writes to the same buffer. Clone the message
buffer before offloading.
Remove the use of "port" when configuring query-source(-v6),
transfer-source(-v6), notify-source(-v6), parental-source(-v6),
etc. Remove the use of source ports for parental-agents.
Also remove the deprecated options use-{v4,v6}-udp-ports and
avoid-{v4,v6}udp-ports.
This change allows fallback from an IXFR failure to AXFR when the
reason is DNS_R_TOOMANYRECORDS. This is because this error condition
could be temporary only in an intermediate version of IXFR
transactions and it's possible that the latest version of the zone
doesn't have that condition. In such a case, the secondary would never
be able to update the zone (even if it could) without this fallback.
This fallback behavior is particularly useful with the recently
introduced max-records-per-type and max-types-per-name options:
the primary may not have these limitations and may temporarily
introduce "too many" records, breaking IXFR. If the primary side
subsequently deletes these records, this fallback will help recover
the zone transfer failure automatically; without it, the secondary
side would first need to increase the limit, which requires more
operational overhead and has its own adverse effect.
This change also fixes a minor glitch that DNS_R_TOOMANYRECORDS wasn't
logged in xfrin_fail.
The rcu_xchg_pointer() function can be used outside of a critical
section, and usually must be followed by a synchronize_rcu() or
call_rcu() call to detach from the resource, unless if there are
some guarantees in place because of our own reference counting.
named-checkconf now takes "-n" to ignore "not configured" errors.
This allows named-checkconf to check the syntax of configurations
from other builds which have support for more options.
If the ZSK has lifetime unlimited, the timing metadata "Inactive" and
"Delete" cannot be found and is treated as an error. Fix by allowing
these metadata to not exist.
When a validator is already shut down, val->name becomes NULL. We
need to process and keep the ISC_R_CANCELED or ISC_R_SHUTTINGDOWN
result code before calling validate_async_done(), otherwise, when it
is called with the hardcoded DNS_R_NOVALIDSIG result code, it can
cause an assetion failure when val->name (being NULL) is used in
proveunsecure().
Administrators may wish to constrain the set of cores that BIND 9 runs
on via the 'taskset', 'cpuset' or 'numactl' programs (or equivalent on
other O/S), for example to achieve higher (or more stable) performance
by more closely associating threads with individual NIC rx queues. If
the admin has used taskset, it follows that BIND ought to
automatically use the given number of CPUs rather than the system wide
count.
Co-Authored-By: Ray Bellis <ray@isc.org>
Return partial match from dns_db_find/dns_db_find when requested
to short circuit the closest encloser discover process. Most of the
time this will be the actual closest encloser but may not be when
there yet to be committed / cleaned up versions of the zone with
names below the actual closest encloser.
fctx->state should be read with the lock held.
1559 /*
1560 * Caller must be holding the fctx lock.
1561 */
CID 468796: (#1 of 1): Data race condition (MISSING_LOCK)
1. missing_lock: Accessing fctx->state without holding lock fetchctx.lock.
Elsewhere, fetchctx.state is written to with fetchctx.lock held 2 out of 2 times.
1562 REQUIRE(fctx->state == fetchstate_done);
1563
1564 FCTXTRACE("sendevents");
1565
1566 LOCK(&fctx->lock);
1567
Give prefetches a free pass through the quota so that the cache
entries for popular zones could be updated successfully even if the
quota for is already reached.
Give prefetches a free pass through the quota so that the cache entry
for a popular zone could be updated successfully even if the quota for
it is already reached.
Although the nanual page of malloc_usable_size says:
Although the excess bytes can be over‐written by the application
without ill effects, this is not good programming practice: the
number of excess bytes in an allocation depends on the underlying
implementation.
it looks like the premise is broken with _FORTIFY_SOURCE=3 on newer
systems and it might return a value that causes program to stop with
"buffer overflow" detected from the _FORTIFY_SOURCE. As we do have own
implementation that tracks the allocation size that we can use to track
the allocation size, we can stop relying on this introspection function.
Also the newer manual page for malloc_usable_size changed the NOTES to:
The value returned by malloc_usable_size() may be greater than the
requested size of the allocation because of various internal
implementation details, none of which the programmer should rely on.
This function is intended to only be used for diagnostics and
statistics; writing to the excess memory without first calling
realloc(3) to resize the allocation is not supported. The returned
value is only valid at the time of the call.
Remove usage of both malloc_usable_size() and malloc_size() to be on the
safe size and only use the internal size tracking mechanism when
jemalloc is not available.
This limits the maximum number of received incremental zone
transfer differences for a secondary server. Upon reaching the
confgiured limit, the secondary aborts IXFR and initiates a full
zone transfer (AXFR).
If there is an algorithm rollover and two keys of different algorithm
share the same keytags, then there is a possibility that if we check
that a key matches a specific state, we are checking against the wrong
key.
Fix this by not only checking for matching key id but also key
algorithm.
Some things we no longer want to do when we are in offline-ksk mode.
1. Don't check for inactive and private keys if the key is a KSK.
2. Don't update the TTL of DNSKEY, CDS and CDNSKEY RRset, these come
from the SKR.
With offline-ksk enabled, we don't run the keymgr because the key
timings are determined by the SKR. We do update the key states but
we derive them from the timing metadata.
Then, we can skip a other tasks in offline-ksk mode, like DS checking
at the parent and CDS synchronization, because the CDS and CDNSKEY
RRsets also come from the SKR.
This added source code stores SKR data. It is loosely based on:
https://www.iana.org/dnssec/archive/files/draft-icann-dnssec-keymgmt-01.txt
A SKR contains a list of signed DNSKEY RRsets. Each change in data
should be stored in a separate bundle. So if the RRSIG is refreshed that
means it is stored in the next bundle. Likewise, if there is a new ZSK
pre-published, it is in the next bundle.
In addition (not mentioned in the draft), each bundle may contain
signed CDS and CDNSKEY RRsets.
Each bundle has an inception time. These will determine when we need
to re-sign or re-key the zone.
Add a new configuration option to enable Offline KSK key management.
Offline KSK cannot work with CSK because it splits how keys with the
KSK and ZSK role operate. Therefore, one key cannot have both roles.
Add a configuration check to ensure this.
There are few places where we attach/detach from the dns_xfrin object
while running on a different thread than the zone's assigned thread -
xfrin_xmlrender() in the statschannel and dns_zone_stopxfr() to name the
two places where it happens now. In the rare case, when the incoming
transfer completes (or shuts down) in the brief period between the other
thread attaches and detaches from the dns_xfrin, the isc_timer_destroy()
calls would be called by the last thread calling the xfrin_detach().
In the worst case, it would be this other thread causing assertion
failure. Move the isc_timer_destroy() call to xfrin_end() function
which is always called on the right thread and to match this move
isc_timer_create() to xfrin_start() - although this other change makes
no difference.
The new
isc_log_createandusechannel() function combines following calls:
isc_log_createchannel()
isc_log_usechannel()
calls into a single call that cannot fail and therefore can be used in
places where we know this cannot fail thus simplifying the error
handling.
Remove the complicated mechanism that could be (in theory) used by
external libraries to register new categories and modules with
statically defined lists in <isc/log.h>. This is similar to what we
have done for <isc/result.h> result codes. All the libraries are now
internal to BIND 9, so we don't need to provide a mechanism to register
extra categories and modules.
The isc_log_write1() and isc_log_vwrite1() functions were meant to
de-duplicate the messages sent to the isc_log subsystem. However, they
were never used in an entire code base and the whole mechanism around it
was complicated and very inefficient. Just remove those, there are
better ways to deduplicate syslog messages inside syslog daemons now.
Add isc_logconfig_get() function to get the current logconfig and use
the getter to replace most of the little dancing around setting up
logging in the tools. Thus:
isc_log_create(mctx, &lctx, &logconfig);
isc_log_setcontext(lctx);
dns_log_setcontext(lctx);
...
...use lcfg...
...
isc_log_destroy();
is now only:
logconfig = isc_logconfig_get(lctx);
...use lcfg...
For thread-safety, isc_logconfig_get() should be surrounded by RCU read
lock, but since we never use isc_logconfig_get() in threaded context,
the only place where it is actually used (but not really needed) is
named_log_init().
Instead of juggling different logging context, use one single logging
context that gets initialized in the libisc constructor and destroyed in
the libisc destructor.
The application is still responsible for creating the logging
configuration before using the isc_log API.
This patch is first in the series in a way that it is transparent for
the users of the isc_log API as the isc_log_create() and
isc_log_destroy() are now thin shims that emulate the previous
functionality, but it isc_log_create() will always return internal
isc__lctx pointer and isc_log_destroy() will actually not destroy the
internal isc__lctx context.
Signed-off-by: Ondřej Surý <ondrej@isc.org>
Instead of directly using the result of dirfd() in the unlinkat() call,
check whether the returned file descriptor is actually valid. That
doesn't really change the logic as the unlinkat() would fail with
invalid descriptor anyway, but this is cleaner and will report the right
error returned directly by dirfd() instead of EBADF from unlinkat().
The clang-scan 19 has reported that we are ignoring errno after the call
to rewind(). As we don't really care about the result, just silence the
error, the whole code will be removed in the development version anyway
as it is not needed.
The contexpr introduced in C23 standard makes perfect sense to be used
instead of preprocessor macros - the symbols are kept, etc. Define
ISC_CONSTEXPR to be `constexpr` for C23 and `static const` for the older
C standards. Use the newly introduced macro for the NS_PER_SEC and
friends time constants.
New version of clang (19) has introduced a stricter checks when mixing
integer (and float types) with enums. In this case, we used enum {}
as C17 doesn't have constexpr yet. Change the time conversion constants
to be static const unsigned int instead of enum values.
Check if 'lctx->logconfig' is NULL before using it in isc_log_doit(),
because it's possible that isc_log_destroy() was already called, e.g.
when a 'call_rcu' function wants to log a message during shutdown.
When iterating through the old internal hashmap table, skip all the
nodes that have been already migrated to the new table. We know that
all positions with index less than .hiter are NULL.
When the round robin hashing reorders the map entries on deletion, we
were adjusting the iterator table size only when the reordering was
happening at the internal table boundary. The iterator table size had
to be reduced by one to prevent seeing the entry that resized on
position [0] twice because it migrated to [iter->size - 1] position.
However, the same thing could happen when the same entry migrates a
second time from [iter->size - 1] to [iter->size - 2] position (and so
on) because the check that we are manipulating the entry just in the [0]
position was insufficient. Instead of checking the position [pos == 0],
we now check that the [pos % iter->size == 0], thus ignoring all the
entries that might have moved back to the end of the internal table.
As we now setup the logging very early, parsing the default config would
always print warnings about experimental (and possibly deprecated)
options in the default config. This would even mess with commands like
`named -V` and it is also wrong to warn users about using experimental
options in the default config, because they can't do anything about
this. Add CFG_PCTX_NODEPRECATED and CFG_PCTX_NOEXPERIMENTAL options
that we can pass to cfg parser and silence the early warnings caused by
using experimental options in the default config.
OpenSSL has added support for deterministic ECDSA (RFC 6979) with
version 3.2.
Use it by default as derandomization doesn't pose a risk for DNS
usecases and is allowed by FIPS 186-5.
The fcount_incr() was not increasing counter->count when force was set
to true, but fcount_decr() would try to decrease the counter leading to
underflow and assertion failure. Swap the order of the arguments in the
condition, so the !force is evaluated after incrementing the .count.
Since the enable_fips_mode() now resides inside the isc_tls unit, BIND 9
would fail to compile when FIPS mode was enabled as the DST subsystem
logging functions were missing.
Move the crypto library logging functions from the openssl_link unit to
isc_tls unit and enhance it, so it can now be used from both places
keeping the old dst__openssl_toresult* macros alive.
implement, document, and test the 'max-query-restarts' option
which specifies the query restart limit - the number of times
we can follow CNAMEs before terminating resolution.
MAX_RESTARTS is no longer hard-coded; ns_server_setmaxrestarts()
and dns_client_setmaxrestarts() can now be used to modify the
max-restarts value at runtime. in both cases, the default is 11.
the number of steps that can be followed in a CNAME chain
before terminating the lookup has been reduced from 16 to 11.
(this is a hard-coded value, but will be made configurable later.)
there were cases in resolver.c when queries for NS records were
started without passing a pointer to the parent fetch's query counter;
as a result, the max-recursion-queries quota for those queries started
counting from zero, instead of sharing the limit for the parent fetch,
making the quota ineffective in some cases.
Instead of calling dst_lib_init() and dst_lib_destroy() explicitly by
all the programs, create a separate memory context for the DST subsystem
and use the library constructor and destructor to initialize the DST
internals.
When the SSL object was destroyed, it would invalidate all SSL_SESSION
objects including the cached, but not yet used, TLS session objects.
Properly disassociate the SSL object from the SSL_SESSION before we
store it in the TLS session cache, so we can later destroy it without
invalidating the cached TLS sessions.
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Co-authored-by: Artem Boldariev <artem@isc.org>
Co-authored-by: Aram Sargsyan <aram@isc.org>
When TLS connection (TLSstream) connection was accepted, the children
listening socket was not attached to sock->server and thus it could have
been freed before all the accepted connections were actually closed.
In turn, this would cause us to call isc_tls_free() too soon - causing
cascade errors in pending SSL_read_ex() in the accepted connections.
Properly attach and detach the children listening socket when accepting
and closing the server connections.
Since the support for OpenSSL Engines has been removed, we can now also
remove the checks for OPENSSL_API_LEVEL; The OpenSSL 3.x APIs will be
used when compiling with OpenSSL 3.x, and OpenSSL 1.1.xx APIs will be
used only when OpenSSL 1.1.x is used.
The OpenSSL 1.x Engines support has been deprecated in the OpenSSL 3.x
and is going to be removed. Remove the OpenSSL Engine support in favor
of OpenSSL Providers.
When adding glue to the header, we add header to the wait-free stack to
be cleaned up later which sets wfc_node->next to non-NULL value. When
the actual cleaning happens we would only cleanup the .glue_list, but
since the database isn't locked for the time being, the headers could be
reused while cleaning the existing glue entries, which creates a data
race between database versions.
Revert the code back to use per-database-version hashtable where keys
are the node pointers. This allows each database version to have
independent glue cache table that doesn't affect nodes or headers that
could already "belong" to the future database version.
when searching the cache for a node so that we can delete an
rdataset, it is not necessary to set the 'create' flag. if the
node doesn't exist yet, we then we won't be able to delete
anything from it anyway.
dns_difftuple_create() could only return success, so change
its type to void and clean up all the calls to it.
other functions that only returned a result value because of it
have been cleaned up in the same way.
when a priming query is complete, it's currently logged at
level ISC_LOG_DEBUG(1), regardless of success or failure. we
are now raising it to ISC_LOG_NOTICE in the case of failure.
There isn't a realistic reason to ever use e = 4294967297. Fortunately
its codepath wasn't reachable to users and can be safetly removed.
Keep in mind the `dns_key_generate` header comment was outdated. e = 3
hasn't been used since 2006 so there isn't a reason to panic. The
toggle was the public exponents between 65537 and 4294967297.
The previous work in this area was led by the belief that we might be
calling call_rcu() from within call_rcu() callbacks. After carefully
checking all the current callback, it became evident that this is not
the case and the problem isn't enough rcu_barrier() calls, but something
entirely else.
Call the rcu_barrier() just once as that's enough and the multiple
rcu_barrier() calls will not hide the real problem anymore, so we can
find it.
Since the minimal OpenSSL version is now OpenSSL 1.1.1, remove all kind
of OpenSSL shims and checks for functions that are now always present in
the OpenSSL libraries.
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Co-authored-by: Aydın Mercan <aydin@isc.org>
When putting the 48-bit number into a fixed-size buffer that's exactly 6
bytes, the assertion failure would occur as the 48-bit number is
internally represented as 64-bit number and the code was checking if
there is enough space for `sizeof(val)`. This causes assertion failure
when otherwise valid TSIG signature has a bad timing information.
Specify the size of the argument explicitly, so the 48-bit number
doesn't require 8-byte long buffer.
The fcount_incr() was incorrectly skipping the accounting for the
fetches-per-zone if the force argument was set to true. We want to skip
the accounting only when the fetches-per-zone is completely disabled,
but for individual names we need to do the accounting even if we are
forcing the result to be success.
The PTHREAD_MUTEX_ADAPTIVE_NP and PTHREAD_MUTEX_ERRORCHECK_NP are
usually not defines, but enum values, so simple preprocessor check
doesn't work.
Check for PTHREAD_MUTEX_ADAPTIVE_NP from the autoconf AS_COMPILE_IFELSE
block and define HAVE_PTHREAD_MUTEX_ADAPTIVE_NP. This should enable
adaptive mutex on Linux and FreeBSD.
As PTHREAD_MUTEX_ERRORCHECK actually comes from POSIX and Linux glibc
does define it when compatibility macros are being set, we can just use
PTHREAD_MUTEX_ERRORCHECK instead of PTHREAD_MUTEX_ERRORCHECK_NP.
When automatic-interface-scan is disabled, the route socket was still
being opened. Add new API to connect / disconnect from the route socket
only as needed.
Additionally, move the block that disables periodic interface rescans to
a place where it actually have access to the configuration values.
Previously, the values were being checked before the configuration was
loaded.
Decrementing optlen immediately before calling continue is unneccesary
and inconsistent with the rest of dns_message_pseudosectiontoyaml
and dns_message_pseudosectiontotext. Coverity was also reporting
an impossible false positive overflow of optlen (CID 499061).
4176 } else if (optcode == DNS_OPT_CLIENT_TAG) {
4177 uint16_t id;
4178 ADD_STRING(target, "; CLIENT-TAG:");
4179 if (optlen == 2U) {
4180 id = isc_buffer_getuint16(&optbuf);
4181 snprintf(buf, sizeof(buf), " %u\n", id);
4182 ADD_STRING(target, buf);
CID 499061: (#1 of 1): Overflowed constant (INTEGER_OVERFLOW)
overflow_const: Expression optlen, which is equal to 65534, underflows
the type that receives it, an unsigned integer 16 bits wide.
4183 optlen -= 2;
4184 POST(optlen);
4185 continue;
4186 }
4187 } else if (optcode == DNS_OPT_SERVER_TAG) {
There are use cases for which shorter timeout values make sense.
For example if there is a load balancer which sets RD=1 and
forwards queries to a BIND resolver which is then configured to
talk to backend servers which are not visible in the public NS set.
WIth a shorter timeout value the frontend can give back SERVFAIL
early when backends are not available and the ultimate client will
not penalize the BIND-frontend for non-response.
The period between the most significant nibble of the IPv4 address
and the 2.0.0.2.IP6.ARPA suffix was missing resulting in the wrong
name being checked.
In yaml mode we emit a string for each question and record. Certain
names and data could result in invalid yaml being produced. Use single
quote string for all questions and records. This requires that single
quotes get converted to two quotes within the string.
ALPN are defined as 1*255OCTET in RFC 9460. commatxt_fromtext was not
rejecting invalid inputs produces by missing a level of escaping
which where later caught be dns_rdata_fromwire on reception.
These inputs should have been rejected
svcb in svcb 1 1.svcb alpn=\,abc
svcb1 in svcb 1 1.svcb alpn=a\,\,abc
and generated 00 03 61 62 63 and 01 61 00 02 61 62 63 respectively.
The correct inputs to include commas in the alpn requires double
escaping.
svcb in svcb 1 1.svcb alpn=\\,abc
svcb1 in svcb 1 1.svcb alpn=a\\,\\,abc
and generate 04 2C 61 62 63 and 06 61 2C 2C 61 62 63 respectively.
Due to the maximum query restart limitation a long CNAME chain
it is cut after 16 queries but named still returns NOERROR.
Return SERVFAIL instead and the partial answer.
On a 32 bit machine casting to size_t can still lead to an overflow.
Cast to uint64_t. Also detect all possible negative values for
pages and pagesize to silence warning about possible negative value.
39#if defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
1. tainted_data_return: Called function sysconf(_SC_PHYS_PAGES),
and a possible return value may be less than zero.
2. assign: Assigning: pages = sysconf(_SC_PHYS_PAGES).
40 long pages = sysconf(_SC_PHYS_PAGES);
41 long pagesize = sysconf(_SC_PAGESIZE);
42
3. Condition pages == -1, taking false branch.
4. Condition pagesize == -1, taking false branch.
43 if (pages == -1 || pagesize == -1) {
44 return (0);
45 }
46
5. overflow: The expression (size_t)pages * pagesize might be negative,
but is used in a context that treats it as unsigned.
CID 498034: (#1 of 1): Overflowed return value (INTEGER_OVERFLOW)
6. return_overflow: (size_t)pages * pagesize, which might have underflowed,
is returned from the function.
47 return ((size_t)pages * pagesize);
48#endif /* if defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE) */
The key lifetime should no longer be adjusted if the key is being
retired earlier, for example because a manual rollover was started.
This would falsely be seen as a dnssec-policy lifetime reconfiguration,
and would adjust the retire/removed time again.
This also means we should update the status output, and the next
rollover scheduled is now calculated using (retire-active) instead of
key lifetime.
If dnssec-policy is reconfigured and the key lifetime has changed,
update existing keys with the new lifetime and adjust the retire
and removed timing metadata accordingly.
If the key has no lifetime yet, just initialize the lifetime. It
may be that the retire/removed timing metadata has already been set.
Skip keys which goal is not set to omnipresent. These keys are already
in the progress of retiring, or still unused.
This commit ensures that we are not attempting to accept an expired
TCP connection as we are not interested in any data that could have
been accumulated in its internal buffers. Now we just drop them for
good.
When sending fails, the ns__client_request() would not reset the
connection and continue as nothing is happening. This comes from the
model that we don't care about failed UDP sends because datagrams are
unreliable anyway, but it greatly affects TCP connections with
keep-alive.
The worst case scenario is as follows:
1. the 3-way TCP handshake gets completed
2. the libuv calls the "uv_connection_cb" callback
3. the TCP connection gets queue because of the tcp-clients quota
4. the TCP client sends as many DNS messages as the buffers allow
5. the TCP connection gets dropped by the client due to the timeout
6. the TCP connection gets accepted by the server
7. the data already sent by the client gets read
8. all sending fails immediately because the TCP connection is dead
9. we consume all the data in the buffer in a very tight loop
As it doesn't make sense to trying to process more data on the TCP
connection when the sending is failing, drop the connection immediately
on the first sending error.
As ns_query_init() cannot fail now, remove the error paths, especially
in ns__client_setup() where we now don't have to care what to do with
the connection if setting up the client could fail. It couldn't fail
even before, but now it's formal.
Be more aggressive when throttling the reading - when we can't send the
outgoing TCP synchronously with uv_try_write(), we start throttling the
reading immediately instead of waiting for the send buffers to fill up.
This should not affect behaved clients that read the data from the TCP
on the other end.
Instead of outright refusing to add new RR types to the cache, be a bit
smarter:
1. If the new header type is in our priority list, we always add either
positive or negative entry at the beginning of the list.
2. If the new header type is negative entry, and we are over the limit,
we mark it as ancient immediately, so it gets evicted from the cache
as soon as possible.
3. Otherwise add the new header after the priority headers (or at the
head of the list).
4. If we are over the limit, evict the last entry on the normal header
list.
Add HTTPS, SVCB, SRV, PTR, NAPTR, DNSKEY and TXT records to the list of
the priority types that are put at the beginning of the slabheader list
for faster access and to avoid eviction when there are more types than
the max-types-per-name limit.
Due to omission it was possible to un-throttle a TCP connection
previously throttled due to the peer not reading back data we are
sending.
In particular, that affected DoH code, but it could also affect other
transports (the current or future ones) that pause/resume reading
according to its internal state.
Clear qctx->zversion when clearing qctx->zrdataset et al in
lib/ns/query.c:qctx_freedata. The uncleared pointer could lead to
an assertion failure if zone data needed to be re-saved which could
happen with stale data support enabled.
A different solution in the future might be adopted depending
on feedback and other new information, so it makes sense to mark
these options as EXPERIMENTAL until we have more data.
View matching on an incoming query checks the query's signature,
which can be a CPU-heavy task for a SIG(0)-signed message. Implement
an asynchronous mode of the view matching function which uses the
offloaded signature checking facilities, and use it for the incoming
queries.
Add support for using the offload threadpool to perform message
signature verifications. This should allow check SIG(0)-signed
messages without affecting the worker threads.
This is a tiny helper function which is used only once and can be
replaced with two function calls instead. Removing this makes
supporting asynchronous signature checking less complicated.
In order to protect from a malicious DNS client that sends many
queries with a SIG(0)-signed message, add a quota of simultaneously
running SIG(0) checks.
This protection can only help when named is using more than one worker
threads. For example, if named is running with the '-n 4' option, and
'sig0checks-quota 2;' is used, then named will make sure to not use
more than 2 workers for the SIG(0) signature checks in parallel, thus
leaving the other workers to serve the remaining clients which do not
use SIG(0)-signed messages.
That limitation is going to change when SIG(0) signature checks are
offloaded to "slow" threads in a future commit.
The 'sig0checks-quota-exempt' ACL option can be used to exempt certain
clients from the quota requirements using their IP or network addresses.
The 'sig0checks-quota-maxwait-ms' option is used to define a maximum
amount of time for named to wait for a quota to appear. If during that
time no new quota becomes available, named will answer to the client
with DNS_R_REFUSED.
By default we log a rekey failure on debug level. We should probably
change the log level to error. We make an exception for when the zone
is not loaded yet, it often happens at startup that a rekey is
run before the zone is fully loaded.
when signatures were not added because of too many types already
existing at a node, the diff was not being cleaned up; this led to
a memory leak being reported at shutdown.
Previously, the number of RR types for a single owner name was limited
only by the maximum number of the types (64k). As the data structure
that holds the RR types for the database node is just a linked list, and
there are places where we just walk through the whole list (again and
again), adding a large number of RR types for a single owner named with
would slow down processing of such name (database node).
Add a configurable limit to cap the number of the RR types for a single
owner. This is enforced at the database (rbtdb, qpzone, qpcache) level
and configured with new max-types-per-name configuration option that
can be configured globally, per-view and per-zone.
Previously, the number of RRs in the RRSets were internally unlimited.
As the data structure that holds the RRs is just a linked list, and
there are places where we just walk through all of the RRs, adding an
RRSet with huge number of RRs inside would slow down processing of said
RRSets.
Add a configurable limit to cap the number of the RRs in a single RRSet.
This is enforced at the database (rbtdb, qpzone, qpcache) level and
configured with new max-records-per-type configuration option that can
be configured globally, per-view and per-zone.
The changes in this MR prevent the memory used for sending the outgoing
TCP requests to spike so much. That strictly remove the extra need for
own memory context, and thus since we generally prefer simplicity,
remove the extra memory context with own jemalloc arenas just for the
outgoing send buffers.
The single TCP read can create as much as 64k divided by the minimum
size of the DNS message. This can clog the processing thread and trash
the memory allocator because we need to do as much as ~20k allocations in
a single UV loop tick.
Limit the number of the DNS messages processed in a single UV loop tick
to just single DNS message and limit the number of the outstanding DNS
messages back to 23. This effectively limits the number of pipelined
DNS messages to that number (this is the limit we already had before).
As a single thread can process only one TCP send at the time, we don't
really need a memory pool for the TCP buffers, but it's enough to have
a single per-loop (client manager) static buffer that's being used to
assemble the DNS message and then it gets copied into own sending
buffer.
In the future, this should get optimized by exposing the uv_try API
from the network manager, and first try to send the message directly
and allocate the sending buffer only if we need to send the data
asynchronously.
Constantly allocating, reallocating and deallocating 64K TCP send
buffers by 'ns_client' instances takes too much CPU time.
There is an existing mechanism to reuse the ns_clent_t structure
associated with the handle using 'isc_nmhandle_getdata/_setdata'
(see ns_client_request()), but it doesn't work with TCP, because
every time ns_client_request() is called it gets a new handle even
for the same TCP connection, see the comments in
streamdns_on_complete_dnsmessage().
To solve the problem, we introduce an array of available (unused)
TCP buffers stored in ns_clientmgr_t structure so that a 'client'
working via TCP can have a chance to reuse one (if there is one)
instead of allocating a new one every time.
When TCP client would not read the DNS message sent to them, the TCP
sends inside named would accumulate and cause degradation of the
service. Throttle the reading from the TCP socket when we accumulate
enough DNS data to be sent. Currently this is limited in a way that a
single largest possible DNS message can fit into the buffer.
This commit ensures that an HTTP endpoints set reference is stored in
a socket object associated with an HTTP/2 stream instead of
referencing the global set stored inside a listener.
This helps to prevent an issue like follows:
1. BIND is configured to serve DoH clients;
2. A client is connected and one or more HTTP/2 stream is
created. Internal pointers are now pointing to the data on the
associated HTTP endpoints set;
3. BIND is reconfigured - the new endpoints set object is created and
promoted to all listeners;
4. The old pointers to the HTTP endpoints set data are now invalid.
Instead referencing a global object that is updated on
re-configurations we now store a local reference which prevents the
endpoints set objects to go out of scope prematurely.
It was reported that HTTP/2 session might get closed or even deleted
before all async. processing has been completed.
This commit addresses that: now we are avoiding using the object when
we do not need it or specifically check if the pointers used are not
'NULL' and by ensuring that there is at least one reference to the
session object while we are doing incoming data processing.
This commit makes the code more resilient to such issues in the
future.
Replace the ISC_LIST based deadnodes implementation with isc_queue which
is wait-free and we don't have to acquire neither the tree nor node lock
to append nodes to the queue and the cleaning process can also
copy (splice) the list into a local copy without acquiring the list.
Currently, there's little benefit to this as we need to hold those
locks anyway, but in the future as we move to RCU based implementation,
this will be ready.
To align the cleaning with our event loop based model, remove the
hardcoded count for the node locks and use the number of the event loops
instead. This way, each event loop can have its own cleaning as part of
the process. Use uniform random numbers to spread the nodes evenly
between the buckets (instead of hashing the domain name).
Add an isc_queue implementation that hides the gory details of cds_wfcq
into more neat API. The same caveats as with cds_wfcq.
TODO: Add documentation to the API.
When the cache's memory context was in over memory state when the
cache was flushed it resulted in LRU cleaning removing newly entered
data in the new cache straight away until the old cache had been
destroyed enough to take it out of over memory state. When flushing
the cache create a new memory context for the new db to prevent this.
The `axfr_makedb()` didn't set the loop on the newly created database,
effectively killing delayed cleaning on such database. Move the
database creation into dns_zone API that knows all the gory details of
creating new database suitable for the zone.
The rdataslab.c was full of code like this:
length = raw[0] * 256 + raw[1];
and
count2 = *current2++ * 256;
count2 += *current2++;
Refactor code like this into peek_uint16() and get_uint16 macros
to prevent code repetition and possible mistakes when copy and
pasting the same code over and over.
As a side note for an entertainment of a careful reader of the commit
messages: The byte manipulation was changed from multiplication and
addition to shift with or.
The difference in the assembly looks like this:
MUL and ADD:
movzx eax, BYTE PTR [rdi]
movzx edi, BYTE PTR [rdi+1]
sal eax, 8
or edi, eax
SHIFT and OR:
movzx edi, WORD PTR [rdi]
rol di, 8
movzx edi, di
If the result and/or buffer is then being used after the macro call,
there's more differences in favor of the SHIFT+OR solution.
The detach function declaration in `ISC__REFCOUNT_TRACE_DECL` had an
returned an accidental implicit int. While not allowed since C99, it
became an error by default in GCC 14.
`ISC_REFCOUNT_TRACE_IMPL` and `ISC_REFCOUNT_STATIC_TRACE_IMPL` expanded
into the wrong macros, trying to declare it again with the wrong number
of parameters.
- duplicated question
- duplicated answer
- qtype as an answer
- two question types
- question names
- nsec3 bad owner name
- short record
- short question
- mismatching question class
- bad record owner name
- mismatched class in record
- mismatched KEY class
- OPT wrong owner name
- invalid RRSIG "covers" type
- UPDATE malformed delete type
- TSIG wrong class
- TSIG not the last record
there were TSAN error reports because of conflicting uses of
node->dirty and node->nsec, which were in the same qword.
this could be resolved by separating them, but we could also
make them into atomic values and remove some node locking.
The fix_iterator() function had a lot of bugs in it and while fixing
them, the number of corner cases and the complexity of the function
got out of hand. Rewrite the function with the following modifications:
The function now requires that the iterator is pointing to a leaf node.
This removes the cases we have to deal when the iterator was left on a
dead branch.
From the leaf node, pop up the iterator stack until we encounter the
branch where the offset point is before the point where the search key
differs. This will bring us to the right branch, or at the first
unmatched node, in which case we pop up to the parent branch. From
there it is easier to retrieve the predecessor.
Once we are at the right branch, all we have to do is find the right
twig (which is either the twig for the character at the position where
the search key differs, or the previous twig) and walk down from there
to the greatest leaf or, in case there is no good twig, get the
previous twig from the successor and get the greatest leaf from there.
If there is no previous twig to select in this branch, because every
leaf from this branch node is greater than the one we wanted, we need
to pop up the stack again and resume at the parent branch. This is
achieved by calling prevleaf().
Move the fix_iterator out of the loop and only call it when we found
a leaf node. This leaf node may be the wrong leaf node, but fix_iterator
should correct that.
Also, when we don't need to set the iterator, just get any leaf. We
only need to have a leaf for the qpkey_compare and the end result does
not matter if compare was against an ancestor leaf or any leaf below
that point.
An assertion failure would be triggered when sending the TCP data ends
after the TCP reading gets closed. Implement proper reference counting
for the isc_httpd object.
When searching for a requested name in dns_qp_lookup(), we may add
a leaf node to the QP chain, then subsequently determine that the
branch we were on was a dead end. When that happens, the chain can be
left holding a pointer to a node that is *not* an ancestor of the
requested name.
We correct for this by unwinding any chain links with an offset
value greater or equal to that of the node we found.
The code below the if/else construction could only be run if the 'if'
code path was taken. Move the code into the 'if' code block so that
it is more easier to read.
The high-water allows administrators to better tune the recursive
clients limit without having to to poll the statistics channel in high
rates to get this number.
Returning the value allows for better high-water tracking without
running into edge cases like the following:
0. The counter is at value X
1. Increment the value (X+1)
2. The value is decreased multiple times in another threads (X+1-Y)
3. Get the value (X+1-Y)
4. Update-if-greater misses the X+1 value which should have been the
high-water
When parsing a zonefile named-checkzone (and others) could loop
infinitely if a directory was $INCLUDED. Record the error and treat
as EOF when looking for multiple errors.
This was found by Eric Sesterhenn from X41.
In nibble mode if the value to be converted was negative the parser
would loop forever. Process the value as an unsigned int instead
of as an int to prevent sign extension when shifting.
This was found by Eric Sesterhenn from X41.
An RPZ response's SOA record TTL is set to 1 instead of the SOA TTL,
a boolean value is passed on to query_addsoa, which is supposed to be
a TTL value. I don't see what value is appropriate to be used for
overriding, so we will pass UINT32_MAX.
The expireheader() call in the expire_ttl_headers() function
is erroneous as it passes the 'nlocktypep' and 'tlocktypep'
arguments in a wrong order, which then causes an assertion
failure.
Fix the order of the arguments so it corresponds to the function's
prototype.
in QP keys, characters that are not common in DNS names are
encoded as two-octet sequences. this caused a glitch in iterator
positioning when some lookups failed.
consider the case where we're searching for "\009" (represented
in a QP key as {0x03, 0x0c}) and a branch exists for "\000"
(represented as {0x03, 0x03}). we match on the 0x03, and continue
to search down. at the point where we find we have no match,
we need to pop back up to the branch before the 0x03 - which may
be multiple levels up the stack - before we position the iterator.
there were some structure names used in qpcache.c and qpzone.c that
were too similar to each other and could be confusing when debugging.
they have been changed as follows:
in qcache.c:
- changed_t was unused, and has been removed
- search_t -> qpc_search_t
- qpdb_rdatasetiter_t -> qpc_rditer_t
- qpdb_dbiterator_t -> qpc_dbiter_t
in qpzone.c:
- qpdb_changed_t -> qpz_changed_t
- qpdb_changedlist_t -> qpz_changedlist_t
- qpdb_version_t -> qpz_version_t
- qpdb_versionlist_t -> qpz_versionlist_t
- qpdb_search_t -> qpz_search_t
- qpdb_load_t -> qpz_search_t
when calling dns_qp_lookup() from qpcache, instead of passing
'foundname' so that a name would be constructed from the QP key,
we now just use the name field in the node data. this makes
dns_qp_lookup() run faster.
the same optimization has also been added to qpzone.
the documentation for dns_qp_lookup() has been updated to
discuss this performance consideration.
when the cache is over memory, we purge from the LRU list until
we've freed the approximate amount of memory to be added. this
approximation could fail because the memory allocated for nodenames
wasn't being counted.
add a dns_name_size() function so we can look up the size of nodenames,
then add that to the purgesize calculation.
in a cache database, unlike zones, NSEC3 records are stored in
the main tree. it is not necessary to maintain a separate 'nsec3'
tree, nor to have code in the dbiterator implementation to traverse
from one tree to another.
(if we ever implement synth-from-dnssec using NSEC3 records, we'll
need to revert this change. in the meantime, simpler code is better.)
the QP database doesn't support relative names as the RBTDB did, so
there's no need for a 'new_origin' flag or to handle `DNS_R_NEWORIGIN`
result codes.
- remove unneeded struct members and misleading comments.
- remove unused parameters for static functions.
- rename 'find_callback' to 'delegating' for consistency with qpzone;
the find callback mechanism is not used in QP databases.
- change dns_qpdata_t to qpcnode_t (QP cache node), and dns_qpdb_t to
qpcache_t, as these types are only accessed locally.
- also change qpdata_t in qpzone.c to qpznode_t (QP zone node), for
consistency.
- make the refcount declarations for qpcnode_t and qpznode_t static,
using the new ISC_REFCOUNT_STATIC macros.
In qpcache (and rbtdb), there are some functions that acquire
neither the tree lock nor the node lock when calling newref().
In theory, this could lead to a race in which a new reference
is added to a node that was about to be deleted.
We now detect this condition by passing the current tree and node
lock status to newref(). If the node was previously unreferenced
and we don't hold at least one read lock, we will assert.
Because some tests don't have a legtimate handle, provide a temporary
return early that should be fixed and removed before squashing. This
short circuiting is still correct until DoQ/DoH3 support is introduced.
GCC might fail to compile because it expects a return after UNREACHABLE.
It should ideally just work anyway since UNREACHABLE is either a
noreturn or UB (__builtin_unreachable / C23 unreachable).
Either way, it should be optimized almost always so the fallback is
free or basically free anyway when it isn't optimized out.
the previous commit introduced a possible race in getsigningtime()
where the rdataset header could change between being found on the
heap and being bound.
getsigningtime() now looks at the first element of the heap, gathers the
locknum, locks the respective lock, and retrieves the header from the
heap again. If the locknum has changed, it will rinse and repeat.
Theoretically, this could spin forever, but practically, it almost never
will as the heap changes on the zone are very rare.
we simplify matters further by changing the dns_db_getsigningtime()
API call. instead of passing back a bound rdataset, we pass back the
information the caller actually needed: the resigning time, owner name
and type of the rdataset that was first on the heap.
in RBTDB, the heap was used by zone databases for resigning, and
by the cache for TTL-based cache cleaning. the cache use case required
very frequent updates, so there was a separate heap for each of the
node lock buckets.
qpzone is for zones only, so it doesn't need to support the cache
use case; the heap will only be touched when the zone is updated or
incrementally signed. we can simplify the code by using only a single
heap.
fix_iterator() and related functions are quite difficult to read.
perhaps it would be a little clearer if we didn't assign values
to variables that won't subsequently be used, or unnecessarily
pop the stack and then push the same value back onto it.
also, in dns_qp_lookup() we previously called fix_iterator(),
removed the leaf from the top of the iterator stack, and then
added it back on. this would be clearer if we just push the leaf
onto the stack when we need to, but leave the stack alone when
it's already complete.
under some circumstances it was possible for the iterator to
be set to the first leaf in a set of twigs, when it should have
been set to the last.
a unit test has been added to test this scenario. if there is a
a tree containing the following values: {".", "abb.", "abc."}, and
we query for "acb.", previously the iterator would be positioned at
"abb." instead of "abc.".
the tree structure is:
branch (offset 1, ".")
branch (offset 3, ".ab")
leaf (".abb")
leaf (".abc")
we find the branch with offset 3 (indicating that its twigs differ
from each other in the third position of the label, "abB" vs "abC").
but the search key differs from the found keys at position 2
("aC" vs "aB"). we look up the bit value in position 3 of the
search key ("B"), and incorrectly follow it onto the wrong twig
("abB").
to correct for this, we need to check for the case where the search
key is greater than the found key in a position earlier than the
branch offset. if it is, then we need to pop from the current leaf
to its parent, and get the greatest leaf from there.
a further change is needed to ensure that we don't do this twice;
when we've moved to a new leaf and the point of difference between
it and the search key even earlier than before, then we're definitely
at a predecessor node and there's no need to continue the loop.
The previous value of 30 minutes used to cache the ADB names and entries
was quite long. Change the value to 60 seconds for faster recovery
after cached intermittent failure of the remote nameservers.
The algorithm from the previous commit[1] is now used to calculate all
the expiration values through the code (ncache results, cname/dname
targets).
1. ISC_MIN(cur, ISC_MAX(now + ADB_ENTRY_WINDOW, now + rdataset->ttl))
Correct the logic to set the expiration period of expire_{v4,v6} as
follows:
1. If the trust is ultimate (local entry), immediately set the entry as
expired, so the changes to the local zones have immediate effect.
3. If the expiration is already set and smaller than the new value, then
leave the expiration value as it is.
2. Otherwise pick larger of `now + ADB_ENTRY_WINDOW` and `now + TTL` as
the new expiration value.
When ADB entry was created it was set to never expire. If we never
called any of the functions that adjust the expiration, it could linger
in the ADB forever.
Set the expiration (.expires) to now + ADB_ENTRY_WINDOW when creating
the new ADB entry to ensure the ADB entry will always expire.
Previously, only a single controlconf message would be processed from a
single TCP read even if the TCP read buffer contained multiple messages.
Refactor the isccc_ccmsg unit to store the extra buffer in the internal
buffer and use the already read data first before reading from the
network again.
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Co-authored-by: Dominik Thalhammer <dominik@thalhammer.it>
Prepare the statistics channel data in the offloaded worker thread, so
the networking thread is not blocked by the process gathering data from
various data structures. Only the netmgr send is then run on the
networkin thread when all the data is already there.
Protect the access to the trust byte in the ncache data with relaxed
atomic operation to mimick the current behaviour. This will teach
TSAN that the concurrent access is fine.
This prevents TSAN errors with the ncache code where the trust byte
access needs to be protected by a lock. The old code copied the
entire region before determining where the name ended. We now
determine where the name ends then copy just that data and in doing
so avoid reading the trust byte.
The source address field of 'newnotify' was not updated from the
default (0.0.0.0) when the destination address was an IPv6 address.
This resulted in the messages failing to be sent. Set the source
address to :: when the destination address is an IPv6 address.
Since the dns_validator_destroy() function doesn't guarantee that
it destroys the validator, rename it to dns_validator_shutdown()
and require explicit dns_validator_detach() to follow.
Enforce the documented function requirement that the validator must
be completed when the function is called.
Make sure to set val->name to NULL when the function is called,
so that the owner of the validator may destroy the name, even if
the validator is not destroyed immediately. This should be safe,
because the name can be used further only for logging by the
offloaded work callbacks when they detect that the validator is
already canceled/complete, and the logging function has a condition
to use the name only when it is non-NULL.
If val->result is not ISC_R_SUCCESS, a similar message is logged
further down in the function. Remove the redundant log message.
Also remove an unnecessary code comment line.
isc_loop() can now take its place.
This also requires changes to the test harness - instead of running the
setup and teardown outside of th main loop, we now schedule the setup
and teardown to run on the loop (via isc_loop_setup() and
isc_loop_teardown()) - this is needed because the new the isc_loop()
call has to be run on the active event loop, but previously the
isc_loop_current() (and the variants like isc_loop_main()) would work
even outside of the loop because it needed just isc_tid() to work, but
not the full loop (which was mainly true for the main thread).
if we had a method to get the running loop, similar to how
isc_tid() gets the current thread ID, we can simplify loop
and loopmgr initialization.
remove most uses of isc_loop_current() in favor of isc_loop().
in some places where that was the only reason to pass loopmgr,
remove loopmgr from the function parameters.
an assertion could be triggered in the QPDB cache if a DNAME
was found above a queried NS, because the 'foundname' value was
not correctly updated to point to the zone cut.
the same mistake existed in qpzone and has been fixed there as well.
If there are no more previous leaves, it means the queried name
precedes the entire range of names in the database, so we should just
move the iterator one step back and return, instead of continuing our
search for the predecessor.
This is similar to an earlier bug fixed in an earlier commit:
ea9a8cb392
every node of a QP database contains a copy of the nodename,
which is used as the key for the QP-trie. previously, the name
was stored as a dns_fixedname object, which has room for up to
255 characters. we can reduce the space consumed by dynamically
allocating a dns_name object that's just long enough for the name
to be stored.
The dns_qpiter_next() was called without checking the return value. If
we cannot move the iterator forward, there is no use in calling the
step() function.
/lib/dns/qpzone.c: 2804 in activeempty()
2798 * of the name we were searching for. Step the iterator
2799 * forward, then step() will continue forward until it
2800 * finds a node with active data. If that node is a
2801 * subdomain of the one we were looking for, then we're
2802 * at an active empty nonterminal node.
2803 */
>>> CID 487882: Error handling issues (CHECKED_RETURN)
>>> Calling "dns_qpiter_next" without checking return value (as is done elsewhere 26 out of 27 times).
2804 dns_qpiter_next(it, NULL, NULL, NULL);
2805 return (step(search, it, FORWARD, next) &&
2806 dns_name_issubdomain(next, current));
2807 }
Be stricter in durations that are accepted. Basically we accept ISO 8601
formats, but fail to detect garbage after the integers in such strings.
For example, 'P7.5D' will be treated as 7 days. Pass 'endptr' to
'strtoll' and check if the endptr is at the correct suffix.
dns_db_addrdataset() enforces a requirement that version can only
be NULL for a cache database. code that checks for zone semantics
and version == NULL can never be reached.
qpzone does not support cache semantics, so dns_db_addrdataset(),
_deleterdataset() and _subtractrdataset() can't be run with
version == NULL; there's no need to check for it.
we can also clean up free_qpdb() a bit since current_version
is always non-NULL.
The Depends relation refers to types of rollovers in which a certain
record type is going to be swapped. Specifically, the Depends relation
says there should be no dependency on the predecessor key (the set
Dep(x, T) must be empty).
But if the key is phased out (all its states are in HIDDEN), there is
no longer a dependency. Since the relationship is still maintained
(Predecessor and Successor metadata), the keymgr_dep function still
returned true. In other words, the set Dep(x, T) is not considered
empty.
This slows down key rollovers, only retiring keys when the successor
key has been fully propagated.
When there is a secure chain of trust with a KSK that is not actively
signing the DNSKEY RRset, the code for validating the DNSKEY RRset
against the DS RRset could potentially skip DS records, thinking the
chain of trust is broken while there is a valid DS with corresponding
DNSKEY record present.
This is because we pass the result ISC_R_NOMORE on when we are done
checking for signatures, but then treat it as "no more DS records".
Chaning the return value to something else (DNS_R_NOVALIDSIG seems the
most appropriate here) fixes the issue.
the code in qpdb.c was previously shared by qp-cachedb.c and
qp-zonedb.c. since qp-zonedb.c no longer exists, it's not necessary
to keep these separate any longer. the two files have been merged,
and functions that were previously globally accessible have been
changed to static and renamed.
now that "qpzone" databases are available for use in zones, we no
longer need to retain the zone semantics in the "qp" database.
all zone-specific code has been removed from QPDB, and "configure
--with-zonedb" once again takes two values, rbt and qp.
some database API methods that are never used with a cache have
been removed from qpdb.c and qp-cachedb.c; these include newversion,
closeversion, subtractrdataset, and nodefullname.
because dns_qpmulti_commit() can be time consuming, it's inefficient
to open and commit a qpmulti transaction for each rdataset being loaded
into a database. we can improve load time by opening a qpmulti
transaction before adding a group of rdatasets and then committing it
afterward.
this commit adds 'setup' and 'commit' functions to dns_rdatacallbacks_t,
which can be called before and after the loops in which 'add' is
called in dns_master_load() and axfr_apply().
when copying the non-dnssec records in receive_secure_db(),
use DNS_DB_NONSEC3 so we don't accidentally create nodes in
the main tree for NSEC3 records. this was a long-standing error
in the code, but was harmless in the RBTDB.
QP database node data is not reference counted the same way RBT nodes
were: in the RBT, node->references could be zero if the node was in the
tree but was not in use by any caller, whereas in the QP trie, the
database itself uses reference counting of nodes internally.
this caused some subtle errors. in RBTDB, when the newref() function is
called and the node reference count was zero, the node lock reference
counter would also be incremented. in the QP trie, this can never
happen - because as long as the node is in the database its reference
count cannot be zero - and so the node lock reference counter was never
incremented.
this has been addressed by maintaining a separate "erefs" counter for
external references to the node. this is the same approach used in the
"qpdb-lite" database in commit e91fbd8dea.
while troubleshooting this issue, some compile errors were discovered
when building with DNS_DB_NODETRACE; those have also been fixed.
use the dns_qpmulti-based "qpzone" by default throughout BIND,
instead of the existing dns_qp-based "qp", when creating zone
databases. (cache databases still use "qp".)
the "--with-zonedb" option has been updated in configure.ac to permit
the use of both "qp" and "qpzone" databases.
in zone.c there was a test that prevented any database type other than
"qp" from hosting an RPZ. this was outdated, and has been removed.
previously, an RCU critical section was held open for the duration
of a snapshot. this should not be necessary, as the snapshot makes
local copies of QP trie metadata, and it causes problems when a
DB iterator is held open between two loop events. we now call
rcu_read_unlock() after setting up the snapshot.
finish importing the database API methods from RBTDB to qpzone:
issecure, nodecount, getnsec3parameters, findnsec3node, setsigningtime,
getsigningtime, getsize, setgluecachestats, locknode, unlocknode, and
addglue.
add database API methods needed to apply updates to an existing zone
database (newversion, addrdataset, subtractrdataset and deleterdataset).
it is now possible to apply journals to zone databases after loading, so
named-checkzone -J works correctly.
add database API method implementations needed to iterate and dump
a qpzone database to a file (createiterator, allrdatasets and
attachversion, plus dbiterator and rdatasetiter methods).
named-checkzone -D can now dump the contents of most zones,
but zone cuts are not correctly detected.
add database API methods needed for loading rdatasets into memory
(currentversion, beginload, endload), plus the methods used by
zone_postload() for zone consistency checks (getoriginnode, find,
findnode, findrdataset, attachnode, detachnode, deletedata).
the QP trie doesn't support the find callback mechanism available
in dns_rbt_findnode() which allows examination of intermediate nodes
while searching, so the detection of wildcard and delegation nodes
is now done by scanning QP chains after calling dns_qp_lookup().
Note that the lookup in previous_closest_nsec() cannot return
ISC_R_NOTFOUND. In RBTDB, we checked for this return value and
ovewrote the result with ISC_R_NOMORE if it occurred. In the
qpzone implementation, we insist that this return value cannot happen.
dns_qp_lookup() would only return ISC_R_NOTFOUND if we asked for a
name outside the zone's authoritative domain, and we never do that
when looking up a predecessor NSEC record.
named-checkzone is now able to load a zone and check it for errors,
but cannot dump it.
The dns_cache_flush() drops the old database and creates a new one, but
it forgets to pass the loop that runs the node pruning and cleaning
the rbtdb when flushing it next time. This causes the cleaning to skip
cleaning the parent nodes (with .down == NULL) leading to increased
memory usage over time until the database is unable to keep up and just
stays overmem all the time.
Reconstruct the variant of the prune_tree() parent cleaning to consider
all elibible parents in a single loop as we were doing before all the
changes that led to this commit.
Update code comments so that they more precisely describe what the
relevant bits of code actually do.
by default, QPDB is the database used by named and all tools and
unit tests. the old default of RBTDB can now be restored by using
"configure --with-zonedb=rbt --with-cachedb=rbt".
some tests have been fixed so they will work correctly with either
database.
CHANGES and release notes have been updated to reflect this change.
When running resolver benchmark pipeline, a crash occurred:
https://gitlab.isc.org/isc-projects/bind9-shotgun-ci/-/pipelines/163946
In the code we are doing a lookup, it fails (meaning there is no node
with lookup name), we create the node and insert it and it fails.
But dns_qp_insert can only return ISC_R_SUCCESS or ISC_R_EXISTS.
So it must have been inserted in between. This is a race condition bug.
The first lookup only requires a write lock and if the lookup failed
the lock gets upgraded to a write lock and we insert the missing data.
To fix the race condition bug, we need to do a lookup again after we
have upgraded the lock to make sure it wasn't inserted in the mean
time.
the dyndb test requires a mechanism to retrieve the name associated
with a database node, and since the database no longer uses RBT for
its underlying storage, dns_rbt_fullnamefromnode() doesn't work.
addressed this by adding dns_db_nodefullname() to the database API.
If the iterator is paused, the tree is unlocked and may change.
In an RBT tree it's always possible to resume iteration as long
as a valid node pointer was still held, but now that the underlying
database structure is a QP trie, the iterator needs to be initialized
based on the existing structure of the trie or it will return
inconsistent results. We now call dns_qp_lookup() to reinitialize
the QP iterator whenever dbiterator_next() or dbiterator_prev() is
called on a paused iterator.
QP database node data is not reference counted the same way RBT nodes
were: in the RBT, node->references could be zero if the node was in the
tree but was not in use by any caller, whereas in the QP trie, the
database itself uses reference counting of nodes internally.
this caused some subtle errors. in RBTDB, when the newref() function is
called and the node reference count was zero, the node lock reference
counter would also be incremented. in the QP trie, this can never
happen - because as long as the node is in the database its reference
count cannot be zero - and so the node lock reference counter was never
incremented.
reference counting will probably need to be refactored in more detail
later; the node lock reference count may not be needed at all. but
for now, as a temporary measure, we add a third reference counter,
'erefs' (external references), to the dns_qpdata structure. this is
counted separately from the main reference counter, and should match
the node reference count as it would have been in RBTDB.
this change revealed a number of places where the node reference counter
was being incremented on behalf of a caller without newref() being
called; those were cleaned up as well.
This is an adaptation of commit 3dd686261d2c4bcd15a96ebfea10baffa277732b
Fix reference counting: unreference nodes that are succesfully inserted
in the tree, detach created nodes, and cleanup the interior data in
dns_qpdata_destroy().
The name will be stored inside the node now so we can just copy it.
These are leftovers, most of the namefromnode code has been replaced
already in previous commits.
The qp approach pulled apart the chain and iterator into two separate
things. Replace the rbtnodechain with qpchain and qpiter. Most of the
times we are interested in the iterator only, the rbtnodechain was
mainly used as an an iterator to get the previous and next name in the
DNS canonical order.
Since dns_qpiter_prev() and dns_qpiter_next() store the name, origin,
and node in the provided parameters, often there is no need to call
a current() function anymore.
Getting the first or last item from the iterator is done by
re-initializing the iterator and then call dns_qpiter_next() or
dns_qpiter_prev() respectively.
The dbiterator no longer needs to maintain a chain, only an iterator.
All dns_qp_lookup() calls assume it is okay to find empty data, so
we don't need to do anything special for the DNS_RBTFIND_EMPTYDATA.
You can pass a callback function to dns_rbt_findnode(), something that
qp does not support. Instead, call the function afterwards. This has
the drawback that we do more lookup work if there was a zonecut.
With dns_qp_lookup() we also don't pass any options. In this case,
when DNS_RBTFIND_NOEXACT was set, we adapt the result after the lookup.
Replace dns_rbt_deletenode calls with dns_qp_deletename. For removing
the name from the nsec tree, we no longer first have to find it: we can
just remove the key (retrieved by name).
Replace dns_rbt_addnode calls with dns_qp_insert. With QP, it sometimes
makes more sense to first lookup the name and see if there is an
existing node (rather than create new data, insert, find out a node
already exists, and destroy the data again). This is done with
dns_qp_getname(), which is more lightweight than dns_qp_lookup(),
and we are only interested in if there is already a leaf node for this
name or not.
replace the string "rbt" throughout BIND with "qp" so that
qpdb databases will be used by default instead of rbtdb.
rbtdb databases can still be used by specifying "database rbt;"
in a zone statement.
- Copy rbtdb.c, rbt-zonedb.c and rbt-cachedb.c to qp-*.
- Added qpmethods.
- Added a new structure dns_qpdata that will replace dns_rbtnode.
- Replaced normal, nsec, and nsec3 dns_rbt trees with dns_qp tries.
- Replaced dns_rbt_create() calls with dns_qp_create().
- Replaced the dns_rbt_destroy() call with dns_qp_destroy().
- Create a dns_qpdata struct and create/destroy methods.
This commit will not build.
[GL #3709] reordered the dns_rdataset_disassociate call to after
the dns_resolver_createfetch call resulting in qctx->nsrrset still
being associated when dns_resolver_createfetch is called in
resume_dslookup (7e4e125e). Revert that part of the change and add
comments as to why the multiple dns_rdataset_disassociate calls are
where they are.
The TCP dispatch connected callbacks could be called synchronously which
in turn could destroy xfrin before we return from dns_xfrin_create().
Delay the calling the callback called from tcp_dispatch_connect() by
calling it always asynchronously.
Instead of getting the loop from the zone every time, attach the xfrin
directly to the loop. This also allows to remove the extra safety tid
checks from the dns_xfrin unit.
It was discovered that the TTL-based cleaning could build up
a significant backlog of the rdataset headers during the periods where
the top of the TTL heap isn't expired yet. Make the TTL-based cleaning
more aggressive by cleaning more headers from the heap when we are
adding new header into the RBTDB.
It was discovered that an expired header could sit on top of the heap
a little longer than desireable. Remove expired headers (headers with
rdh_ttl set to 0) from the heap completely, so they don't block the next
TTL-based cleaning.
Instead of juggling with node locks in a cycle, cleanup the node we are
just pruning and send any the parent that's also subject to the pruning
to the prune tree via normal way (e.g. enqueue pruning on the parent).
This simplifies the code and also spreads the pruning load across more
event loop ticks which is better for lock contention as less things run
in a tight loop.
The log message for commit 24381cc36d
explained:
In some older BIND 9 branches, the extra queuing overhead eliminated by
this change could be remotely exploited to cause excessive memory use.
Due to architectural shift, this branch is not vulnerable to that issue,
but applying the fix to the latter is nevertheless deemed prudent for
consistency and to make the code future-proof.
However, it turned out that having a single queue for the nodes to be
pruned increased lock contention to a level where cleaning up nodes from
the RBTDB took too long, causing the amount of memory used by the cache
to grow indefinitely over time.
This commit reverts the change to the pruning mechanism introduced by
commit 24381cc36d as BIND branches newer
than 9.16 were not affected by the excessive event queueing overhead
issue mentioned in the log message for the above commit.
dns__cacherbt_expireheader can unlink / free header_prev underneath
it. Use ISC_LIST_TAIL after calling dns__cacherbt_expireheader
instead to get the next pointer to be processed.
This commit ensures that worker threads are not sleeping (by using
select()) when '-T transferslowly/transferstuck' test options are
used. This commit converts synchronous implementation of the code into
an asynchronous one based on timers.
The resolver's code was not ready to failures when trying to establish
a connection via TCP-based transports (e.g. when creating TLS contexts
before establishing a TLS connection).
This commit fixes that.
The warnings (see below) seem to be false-positives. Address them
by adding runtime checks.
resolver.c:1627:10: warning: Access to field 'tid' results in a dereference of a null pointer (loaded from variable 'fctx') [core.NullDereference]
1627 | REQUIRE(fctx->tid == isc_tid());
| ^~~~~~~~~
../../lib/isc/include/isc/util.h:332:34: note: expanded from macro 'REQUIRE'
332 | #define REQUIRE(e) ISC_REQUIRE(e)
| ^
../../lib/isc/include/isc/assertions.h:45:11: note: expanded from macro 'ISC_REQUIRE'
45 | ((void)((cond) || \
| ^~~~
resolver.c:10335:6: warning: Access to field 'depth' results in a dereference of a null pointer (loaded from variable 'fctx') [core.NullDereference]
10335 | if (fctx->depth > depth) {
| ^~~~~~~~~~~
2 warnings generated.
- the DNS_DB_NSEC3ONLY and DNS_DB_NONSEC3 flags are mutually
exclusive; it never made sense to set both at the same time.
to enforce this, it is now a fatal error to do so. the
dbiterator implementation has been cleaned up to remove
code that treated the two as independent: if nonsec3 is
true, we can be certain nsec3only is false, and vice versa.
- previously, iterating a database backwards omitted
NSEC3 records even if DNS_DB_NONSEC3 had not been set. this
has been corrected.
- when an iterator reaches the origin node of the NSEC3 tree, we
need to skip over it and go to the next node in the sequence.
the NSEC3 origin node is there for housekeeping purposes and
never contains data.
- the dbiterator_test unit test has been expanded, several
incorrect expectations have been fixed. (for example, the
expected number of iterations has been reduced by one; we were
previously counting the NSEC3 origin node and we should not
have been doing so.)
- create_node() in rbt.c cannot fail
- the dns_rbt_*name() functions, which are wrappers around
dns_rbt_[add|find|delete]node(), were never used except in tests.
this change isn't really necessary since RBT is likely to go away
eventually anyway. but keeping the API as simple as possible while it
persists is a good thing, and may reduce confusion while QPDB is being
developed from RBTDB code.
these values pertain to whether a node is in the main, nsec, or nsec3
tree of an RBTDB. they need to be moved to a more generic location so
they can also be used by QPDB.
(this is in db.h rather than db_p.h because rbt.c needs access to it.
technically, that's a layer violation, but it's a long-existing one;
refactoring to get rid of it would be a large hassle, and eventually
we expect to remove rbt.c anyway.)
when the QPDB is implemented, we will need to have both qpdb_p.h and
rbtdb_p.h. in order to prevent name collisions or code duplication,
this commit adds a generic private header file, db_p.h, containing
structures and macros that will be used by both databases.
some functions and structs have been renamed to more specifically refer
to the RBT database, in order to avoid namespace collision with similar
things that will be needed by the QPDB later.
refactor the wildcard matching code to make it a bit easier to
understand, in hopes that it will reduce the difficulty of converting
from RBTDB to QPDB later.
there are also some minor optimizations: previously, after stepping
backward to find the predecessor, we stepped back foward *from* the
predecessor to find the successor. we now reset the rbtnode chain to
its original starting point before stepping forward; this eliminates
some unnecessary processing. and, if neither predecessor nor successor
is found, we return early rather than carrying on with an unnecessary
effort to match labels.
Coverity detected that address->type.sa was too small when copying
a struct sockaddr_sin6, use the alterative union element
address->type.sin6 instead.