there was another edge case in which an iterator could be positioned at
the wrong node after dns_qp_lookup(). when searching for a key, it's
possible to reach a dead-end branch that doesn't match, because the
branch offset point is *after* the point where the search key differs
from the branch's contents.
for example, if searching for the key "mop", we could reach a branch
containing "moon" and "moor". the branch offset point - i.e., the
point after which the branch's leaves differ from each other - is the
fourth character ("n" or "r"). however, both leaves differ from the
search key at position *three* ("o" or "p"). the old code failed to
detect this condition, and would have incorrectly left the iterator
pointing at some lower value and not at "moor".
this has been fixed and the unit test now includes this scenario.
in some cases it was possible for the iterator to be positioned in the
wrong place by dns_qp_lookup(). previously, when a leaf node was found
which matched the search key at its parent branch's offset point, but
did not match after that point, the code incorrectly assumed the leaf
it had found was a successor to the searched-for name, and stepped the
iterator back to find a predecessor. however, it was possible for the
non-matching leaf to be the predecessor, in which case stepping the
iterator back was wrong.
(for example: a branch contains "aba" and "abcd", and we are searching
for "abcde". we step down to the twig matching the letter "c" in
position 3. "abcd" is the predecessor of "abcde", so the iterator is
already correctly positioned, but because the twig was an exact match,
we would have moved it back one step to "aba".)
this previously went unnoticed due to a mistake in the qp_test unit
test, which had the wrong expected result for the test case that should
have detected the error. both the code and the test have been fixed.
the 'predecessor' argument to dns_qp_lookup() turns out not to
be sufficient for our needs: the predecessor node in a QP database
could have become "empty" (for the current version) because of an
update or because cache data expired, and in that case the caller
would have to iterate more than one step back to find the predecessor
node that it needs.
it may also be necessary for a caller to iterate forward, in
order to determine whether a node has any children.
for both of these reasons, we now replace the 'predecessor'
argument with an 'iter' argument. if set, this points to memory
with enough space for a dns_qpiter object.
when an exact match is found by the lookup, the iterator will be
pointing to the matching node. if not, it will be pointing to the
lexical predecessor of the nae that was searched for.
a dns_qpiter_current() method has been added for examining
the current value of the iterator without moving it in either
direction.
During initialisation or reconfiguration, it is possible that multiple
threads are trying to create a TLS context and associated data (like
TLS certs store) concurrently. In some cases, a thread might be too
late to add newly created data to the TLS contexts cache, in which
case it needs to be discarded. In the code that handles that case, it
was not taken into account that, in some cases, the TLS certs store
could not have been created or should not be deleted, as it is being
managed by the TLS contexts cache already. Deleting the store in such
cases might lead to crashes.
This commit fixes the issue.
The main intention of PROXY protocol is to pass endpoints information
to a back-end server (in our case - BIND). That means that it is a
valid way to spoof endpoints information, as the addresses and ports
extracted from PROXYv2 headers, from the point of view of BIND, are
used instead of the real connection addresses.
Of course, an ability to easily spoof endpoints information can be
considered a security issue when used uncontrollably. To resolve that,
we introduce 'allow-proxy' and 'allow-proxy-on' ACL options. These are
the only ACL options in BIND that work with real PROXY connections
addresses, allowing a DNS server operator to specify from what clients
and on which interfaces he or she is willing to accept PROXY
headers. By default, for security reasons we do not allow to accept
them.
BIND 9 will now treat the response as insecure when processing NSEC3
records with iterations larger than 50.
Earlier, we limited the number of iterations to 150 (in #2445).
RFC 9276 says: Because there has been a large growth of open (public)
DNSSEC validating resolvers that are subject to compute resource
constraints when handling requests from anonymous clients, this
document recommends that validating resolvers reduce their iteration
count limits over time. Specifically, validating resolver operators and
validating resolver software implementers are encouraged to continue
evaluating NSEC3 iteration count deployment trends and lower their
acceptable iteration limits over time.
After evaluation, we decided that the next major BIND release should
lower the maximum allowed NSEC3 iterations to 50, which should be
fine for 99,87% of the domain names.
With shared name memory pools (f5af981831)
the message needs to be destroyed before the view is detached which
in turn detaches the resolver which checks that all resources have
been returned.
Previously, there were two methods of working with the overmem
condition:
1. hi/lo water callback - when the overmem condition was reached
for the first time, the water callback was called with HIWATER
mark and .is_overmem boolean was set internally. Similarly,
when the used memory went below the lo water mark, the water
callback would be called with LOWATER mark and .is_overmem
was reset. This check would be called **every** time memory
was allocated or freed.
2. isc_mem_isovermem() - a simple getter for the internal
.is_overmem flag
This commit refactors removes the first method and move the hi/lo water
checks to the isc_mem_isovermem() function, thus we now have only a
single method of checking overmem condition and the check for hi/lo
water is removed from the hot path for memory contexts that doesn't use
overmem checks.
when transferring in a non-inline-signing secondary for the first time,
we previously never set the value of zone->loadtime, so it remained
zero. this caused a test failure in the statschannel system test,
and that test case was temporarily disabled. the value is now set
correctly and the test case has been reinstated.
The maximum DNS message size is 65535 octets. Check that the buffer
being passed to dns_message_renderbegin does not exceed this as the
compression code assumes that all offsets are no bigger than this.
Please see the 998765fea5 commit for
the description of the original issue. The commit had fixed the
logic error, but it was reintroduced again later with the
a1afa31a5a commit, where the check of
the 'db_registered' flag was removed in dns__catz_update_cb(). The
check was removed, because the registration function was made
idempotent, so double registration is not an issue, but the check
also prevented from unneeded registration, on which the original
fix relied.
This commit just removes the update callback registration code from
the dns__catz_update_cb() function instead of bringing back the check,
because after code flow analysis, it is now clear that it's not required
at all. The "call onupdate() artificially" comment (which was mentioned
by the removed code) is speaking about the dns_catz_dbupdate_callback()
function, which is called by server.c on (re)configuration, and that
function already takes care of update callback's registration since the
998765fea5 commit was applied, so there
is no need to do that here again.
The 'dns_tsigkeyring_t' structure has a read/write lock to protect
its 'keys' member, which is a 'isc_hashmap_t' pointer and needs to
be protected.
The dns_tsigkeyring_dump() function, however, doesn't use the lock,
which can introduce a race with another thread, if the other thread
tries to modify the hashmap.
Add a read lock around the code, which iterates over the hashmap.
EXPERIMENT: when DNS_DB_GLUEOK is set, dns_view_find() will now return
glue if it is found it a local zone database, without checking to see
if a better answer has been cached previously.
If a child zone is served by the same servers as a parent zone and
a NS query is made for the zone name then the addresses of the
nameservers are returned in the additional section are tagged as
trust additional.
Commit 4db150437e incorrectly removed the
call to isc_mem_setwater() from dns_cache_setcachesize(). The water()
function is a no-op, but we still need to set high- and low-water marks
in the memory context, otherwise overmem conditions will not be
detected.
Increasing the initial and freemax sizes for dns_message memory pools
restores the root zone performance. The former sizes were suited for
per-dns_message memory pools and we need to bump the sizes up for
per-thread memory pools.
The `.reader` member of dns_qpmulti_t was accessed without RCU
protection; reader_open() calls rcu_dereference() on it, and this
call needs to be inside an RCU critical section.
A similar problem was identified in the dns_qpmulti_snapshot() - the
RCU critical section was completely missing.
These are relicts of the isc_qsbr - in the QSBR mode the rcu_read_lock()
and rcu_read_unlock() are no-ops and whole event loop is a critical section.
Do a light refactoring and cleanups that replaces common list walking
patterns with ISC_LIST_FOREACH macros and split some nested loops into
separate static functions to reduce the nesting depth.
There was a lot of internal code looking like this:
INSIST(dns_rdataset_isassociated(rdataset));
dns_rdataset_disassociated(rdataset)
isc_mempool_put(msg->rdspool, rdataset);
Deduplicate the code into local dns__message_puttemprdataset() routine,
and drop the INSIST() which is checked in dns_rdataset_disassociate().
The current dispatch code could reuse the TCP connection when
dns_dispatch_gettcp() would be used first. This is problematic as the
dns_resolver doesn't use TCP connection sharing, but dns_request could
get the TCP stream that was created outside of the dns_request.
Add new DNS_DISPATCHOPT_UNSHARED option to dns_dispatch_createtcp() that
would prevent the TCP stream to be reused. Use that option in the
dns_resolver call to dns_dispatch_createtcp() to prevent dns_request
from reusing the TCP connections created by dns_resolver.
Additionally, the dns_xfrin unit added TCP connection sharing for
incoming transfers. While interleaving *xfr streams on a TCP connection
should work this should be a deliberate change and be property of the
server that can be controlled. Additionally some level of parallel TCP
streams is desirable. Revert to the old behaviour by removing the
dns_dispatch_gettcp() calls from dns_xfrin and use the new option to
prevent from sharing the transfer streams with dns_request.
The xfrin_recv_done() was accessing xfr->result where we stored the
result of the offloaded work from a thread that could receive data while
processing the transfer on the offloaded thread.
Completely remove the offloaded result from the dns_xfrin_t structure
and keep it local for *xfr_apply() and *xfr_apply_done() as the failure
is already recorded in .shutdown_result and we now that the processing
has failed because .shuttingdown has been already set.
The dns__catz_update_cb() does not expect that 'catzs->zones'
can become NULL during shutdown.
Add similar checks in the dns__catz_update_cb() and dns_catz_zone_get()
functions to protect from such a case. Also add an INSIST in the
dns_catz_zone_add() function to explicitly state that such a case
is not expected there, because that function is called only during a
reconfiguration.
To reduce the amount of log spam when root servers change their
addresses keep a table of upcoming changes by expected date and time
and suppress reporting differences for them until then.
Add initial entry for B.ROOT-SERVERS.NET, Nov 27, 2023.
Instead of processing received data synchronously, store the incoming
differences in the list and process them asynchronously when we need to
commit the data into the database and/or journal.
Instead of locking the struct dns_xfrin members that get accessed from
the statistics, convert those into atomic types and use atomic accesses
to prevent ThreadSanitizer from blowing up.
In fact, even the atomic operations are not really needed here, because
all writes are done from a single thread and we don't really require
consistency from the statistics. It's easier to use atomics here, but
it is slightly confusing as it suggests there might be multithreaded
accesses to those variables while in fact, the only off-thread access
happens when collecting the statistics.
The ixfr_putdata() and axfr_putdata() had a logic to apply dns_diff when
the number of pending tuples went over 100. Since we are going to
offload the XFR data processing, we don't need to do that anymore.
Drop timeout before resending a UDP request from 15 seconds to 5
seconds and add 1 second to the total time to allow for the reply
to the third request to arrive. This will speed up the time it
takes for named to recover from a lost packet when refreshing a
zone and for it to determine that a primary is down.
Update the function 'set_resigntime()' so that raw versions of
inline-signing zones are not scheduled to be resigned.
Also update the check in the same function for zone is dynamic, there
exists a function 'dns_zone_isdynamic()' that does a similar thing
and is more complete.
Also in 'zone_postload()' check whether the zone is not the raw
version of an inline-signing zone, preventing calculating the next
resign time.
The dns_aclenv_t contains two dns_acl_t - localhost and localnets that
can be swapped with a different ACLs as we configure BIND 9. Instead of
protecting those two pointers with heavyweight read-write lock, use RCU
mechanism to dereference and swap the pointers.
In units that support detailed reference tracing via ISC_REFCOUNT
macros, we were doing:
/* Define to 1 for detailed reference tracing */
#undef <unit>_TRACE
This would prevent using -D<unit>_TRACE=1 in the CFLAGS.
Convert the above mentioned snippet with just a comment how to enable
the detailed reference tracing:
/* Add -D<unit>_TRACE=1 to CFLAGS for detailed reference tracing */
It was possible to reach add_link() without visiting an
intermediate node first, and the check for a duplicate entry
could then cause a crash.
Credit to OSS-Fuzz for discovering this error.
There was a microoptimization for smoothing srtt with bitshifts. Revert
the code to use * 98 / 100, it doesn't really make that difference on
modern CPUs, for comparison here:
muldiv:
imul eax, edi, 98
imul rax, rax, 1374389535
shr rax, 37
ret
shift:
mov eax, edi
sal eax, 9
sub eax, edi
shr eax, 9
ret