If we are in the process of looking for the A records as part of
dns64 processing and the server-stale timeout triggers, redo the
dns64 changes that had been made to the orignal qctx.
(cherry picked from commit 1fcc483df1)
The wrong result value was being saved for resumption with
nxdomain-redirect when performing the fetch. This lead to an assert
when checking that RFC 1918 reverse queries where not leaking to
the global internet.
(cherry picked from commit 9d0fa07c5e)
When serve-stale is enabled and recursive resolution fails, the fallback
to lookup stale data always happens in the cache database. Any
authoritative data is ignored, and only information learned through
recursive resolution is examined.
If there is data in the cache that could lead to an answer, and this can
be just the root delegation, the resolver will iterate further, getting
closer to the answer that can be found by recursing down the root, and
eventually puts the final response in the cache.
Change the fallback to serve-stale to use 'query_getdb()', that finds
out the best matching database for the given query.
(cherry picked from commit 2322425016)
when synthesizing a new CNAME, we now check whether the target
matches the query already being processed. if so, we do not
restart the query; this prevents a waste of resources.
(cherry picked from commit 0ae8b2e056)
The 'refresh_rrset' variable is used to determine if we can detach from
the client. This can cause a hang on shutdown. To fix this, move setting
of the 'nodetach' variable up to where 'refresh_rrset' is set (in
query_lookup(), and thus not in ns_query_done()), and set it to false
when actually refreshing the RRset, so that when this lookup is
completed, the client will be detached.
When a query was aborted because of the recursion quota being exceeded,
but triggered a stale answer response and a stale data refresh query,
it could cause named to loop back where we are iterating and following
a delegation. Having no good answer in cache, we would fall back to
using serve-stale again, use the stale data, try to refresh the RRset,
and loop back again, without ever terminating until crashing due to
stack overflow.
This happens because in the functions 'query_notfound()' and
'query_delegation_recurse()', we check whether we can fall back to
serving stale data. We shouldn't do so if we are already refreshing
an RRset due to having prioritized stale data in cache.
In other words, we need to add an extra check to 'query_usestale()' to
disallow serving stale data if we are currently refreshing a stale
RRset.
As an additional mitigation to prevent looping, we now use the result
code ISC_R_ALREADYRUNNING rather than ISC_R_FAILURE when a recursion
loop is encountered, and we check for that condition in
'query_usestale()' as well.
We recently fixed a bug where in some cases (when following an
expired CNAME for example), named could return SERVFAIL if the target
record is still valid (see isc-projects/bind9#3678, and
isc-projects/bind9!7096). We fixed this by considering non-stale
RRsets as well during the stale lookup.
However, this triggered a new bug because despite the answer from
cache not being stale, the lookup may be triggered by serve-stale.
If the answer from database is not stale, the fix in
isc-projects/bind9!7096 erroneously skips the serve-stale logic.
Add 'answer_found' checks to the serve-stale logic to fix this issue.
(cherry picked from commit bbd163acf6)
Although 'dns_fetch_t' fetch can have two associated events, one for
each of 'DNS_EVENT_FETCHDONE' and 'DNS_EVENT_TRYSTALE' types, the
dns_resolver_cancelfetch() function is designed in a way that it
expects only one existing event, which it must cancel, and when it
happens so that 'stale-answer-client-timeout' is enabled and there
are two events, only one of them is canceled, and it results in an
assertion in dns_resolver_destroyfetch(), when it finds a dangling
event.
Change the logic of dns_resolver_cancelfetch() function so that it
cancels both the events (if they exist), and in the right order.
(cherry picked from commit ec2098ca35)
dns_db_findext() asserts if RRSIG is passed to it and
query_lookup_stale() failed to map RRSIG to ANY to prevent this. To
avoid cases like this in the future, move the mapping of SIG and RRSIG
to ANY for qctx->type to qctx_init().
(cherry picked from commit 56eae06418)
check allow-update, update-policy, and allow-update-forwarding before
consuming quota slots, so that unauthorized clients can't fill the
quota.
(this moves the access check before the prerequisite check, which
violates the precise wording of RFC 2136. however, RFC co-author Paul
Vixie has stated that the RFC is mistaken on this point; it should have
said that access checking must happen *no later than* the completion of
prerequisite checks, not that it must happen exactly then.)
(cherry picked from commit 964f559edb)
limit the number of simultaneous DNS UPDATE events that can be
processed by adding a quota for update and update forwarding.
this quota currently, arbitrarily, defaults to 100.
also add a statistics counter to record when the update quota
has been exceeded.
(cherry picked from commit 7c47254a14)
Commit 9ffb4a7ba1 causes Clang Static
Analyzer to flag a potential NULL dereference in query_nxdomain():
query.c:9394:26: warning: Dereference of null pointer [core.NullDereference]
if (!qctx->nxrewrite || qctx->rpz_st->m.rpz->addsoa) {
^~~~~~~~~~~~~~~~~~~
1 warning generated.
The warning above is for qctx->rpz_st potentially being a NULL pointer
when query_nxdomain() is called from query_resume(). This is a false
positive because none of the database lookup result codes currently
causing query_nxdomain() to be called (DNS_R_EMPTYWILD, DNS_R_NXDOMAIN)
can be returned by a database lookup following a recursive resolution
attempt. Add a NULL check nevertheless in order to future-proof the
code and silence Clang Static Analyzer.
(cherry picked from commit 07592d1315)
(cherry picked from commit a4547a1093)
With 'stale-answer-enable yes;' and 'stale-answer-client-timeout off;',
consider the following situation:
A CNAME record and its target record are in the cache, then the CNAME
record expires, but the target record is still valid.
When a new query for the CNAME record arrives, and the query fails,
the stale record is used, and then the query "restarts" to follow
the CNAME target. The problem is that the query's multiple stale
options (like DNS_DBFIND_STALEOK) are not reset, so 'query_lookup()'
treats the restarted query as a lookup following a failed lookup,
and returns a SERVFAIL answer when there is no stale data found in the
cache, even if there is valid non-stale data there available.
With this change, query_lookup() now considers non-stale data in the
cache in the first place, and returns it if it is available.
(cherry picked from commit 91a1a8efc5)
Add an options parameter to control what rdatasets are returned when
iteratating over the node. Specific modes will be added later.
(cherry picked from commit 7695c36a5d)
Send the ns_query_cancel() on the recursing clients when we initiate the
named shutdown for faster shutdown.
When we are shutting down the resolver, we cancel all the outstanding
fetches, and the ISC_R_CANCEL events doesn't propagate to the ns_client
callback.
In the future, the better solution how to fix this would be to look at
the shutdown paths and let them all propagate from bottom (loopmgr) to
top (f.e. ns_client).
(cherry picked from commit d861d403bb9a7912e29a06aba6caf6d502839f1b)
Commit 9ffb4a7ba1 causes Clang Static
Analyzer to flag a potential NULL dereference in query_nxdomain():
query.c:9394:26: warning: Dereference of null pointer [core.NullDereference]
if (!qctx->nxrewrite || qctx->rpz_st->m.rpz->addsoa) {
^~~~~~~~~~~~~~~~~~~
1 warning generated.
The warning above is for qctx->rpz_st potentially being a NULL pointer
when query_nxdomain() is called from query_resume(). This is a false
positive because none of the database lookup result codes currently
causing query_nxdomain() to be called (DNS_R_EMPTYWILD, DNS_R_NXDOMAIN)
can be returned by a database lookup following a recursive resolution
attempt. Add a NULL check nevertheless in order to future-proof the
code and silence Clang Static Analyzer.
(cherry picked from commit 07592d1315)
With 'stale-answer-enable yes;' and 'stale-answer-client-timeout off;',
consider the following situation:
A CNAME record and its target record are in the cache, then the CNAME
record expires, but the target record is still valid.
When a new query for the CNAME record arrives, and the query fails,
the stale record is used, and then the query "restarts" to follow
the CNAME target. The problem is that the query's multiple stale
options (like DNS_DBFIND_STALEOK) are not reset, so 'query_lookup()'
treats the restarted query as a lookup following a failed lookup,
and returns a SERVFAIL answer when there is no stale data found in the
cache, even if there is valid non-stale data there available.
With this change, query_lookup() now considers non-stale data in the
cache in the first place, and returns it if it is available.
(cherry picked from commit 86a80e723f)
RPZ rewrites called dns_db_findext() without passing through the
client database options; as as result, if the client set CD=1,
DNS_DBFIND_PENDINGOK was not used as it should have been, and
cache lookups failed, resulting in failure of the rewrite.
(cherry picked from commit 305a50dbe1)
The incrementing and decrementing of 'ns_statscounter_recursclients'
were not properly balanced: for example, it would be incremented for
a prefetch query but not decremented if the query failed.
This commit ensures that the recursion quota and the recursive clients
counter are always in sync with each other.
(cherry picked from commit 82991451b4)
Don't attempt to resolve DNS responses for intermediate results. This
may create multiple refreshes and can cause a crash.
One scenario is where for the query there is a CNAME and canonical
answer in cache that are both stale. This will trigger a refresh of
the RRsets because we encountered stale data and we prioritized it over
the lookup. It will trigger a refresh of both RRsets. When we start
recursing, it will detect a recursion loop because the recursion
parameters will eventually be the same. In 'dns_resolver_destroyfetch'
the sanity check fails, one of the callers did not get its event back
before trying to destroy the fetch.
Move the call to 'query_refresh_rrset' to 'ns_query_done', so that it
is only called once per client request.
Another scenario is where for the query there is a stale CNAME in the
cache that points to a record that is also in cache but not stale. This
will trigger a refresh of the RRset (because we encountered stale data
and we prioritized it over the lookup).
We mark RRsets that we add to the message with
DNS_RDATASETATTR_STALE_ADDED to prevent adding a duplicate RRset when
a stale lookup and a normal lookup conflict with each other. However,
the other non-stale RRset when following a CNAME chain will be added to
the message without setting that attribute, because it is not stale.
This is a variant of the bug in #2594. The fix covered the same crash
but for stale-answer-client-timeout > 0.
Fix this by clearing all RRsets from the message before refreshing.
This requires the refresh to happen after the query is send back to
the client.
(cherry picked from commit d939d2ecde)
It is possible to bypass Response Rate Limiting (RRL)
`responses-per-second` limitation using specially crafted wildcard
names, because the current implementation, when encountering a found
DNS name generated from a wildcard record, just strips the leftmost
label of the name before making a key for the bucket.
While that technique helps with limiting random requests like
<random>.example.com (because all those requests will be accounted
as belonging to a bucket constructed from "example.com" name), it does
not help with random names like subdomain.<random>.example.com.
The best solution would have been to strip not just the leftmost
label, but as many labels as necessary until reaching the suffix part
of the wildcard record from which the found name is generated, however,
we do not have that information readily available in the context of RRL
processing code.
Fix the issue by interpreting all valid wildcard domain names as
the zone's origin name concatenated to the "*" name, so they all will
be put into the same bucket.
(cherry picked from commit baa9698c9d)
previously, if ns_clientmgr_create() failed, the interface was not
cleaned up correctly and an assertion or segmentation fault could
follow. this has been fixed.
When checking if we should enable serve-stale, add an early out case
when the result is an error signalling a duplicate query or a query
that would be dropped.
(cherry picked from commit 059a4c2f4d9d3cff371842f43208d021509314fa)
The BUFSIZ value varies between platforms, it could be 8K on Linux and
512 bytes on mingw. Make sure the buffers are always big enough for the
output data to prevent truncation of the output by appropriately
enlarging or sizing the buffers.
(cherry picked from commit b19d932262e84608174cb89eeed32ae0212f8a87)
messages indicating the reason for a fallback to AXFR (i.e, because
the requested serial number is not present in the journal, or because
the size of the IXFR response would exceeed "max-ixfr-ratio") are now
logged at level info instead of debug(4).
(cherry picked from commit df1d81cf96)
The current logic for determining the address of the socket to which a
client sent its query is:
1. Get the address:port tuple from the netmgr handle using
isc_nmhandle_localaddr() or from the ns_interface_t structure.
2. Convert the address:port tuple from step 1 into an isc_netaddr_t
using isc_netaddr_fromsockaddr().
3. Convert the address from step 2 back into a socket address with the
port set to 0 using isc_sockaddr_fromnetaddr().
Note that the port number (readily available in the netmgr handle or in
the ns_interface_t structure) is needlessly lost in the process,
preventing it from being recorded in dnstap captures of client traffic
produced by named.
Fix by first storing the address:port tuple in client->destsockaddr and
then creating an isc_netaddr_t from that structure. This allows the
port number to be retained in client->destsockaddr, which is what
subsequently gets passed to dns_dt_send().
Remove an outdated code comment.
(cherry picked from commit 2f945703f2)
The ns_statscounter_recursclients counter was previously only
incremented or decremented if client->recursionquota was non-NULL.
This was harmless, because that value should always be non-NULL if
recursion is enabled, but it made the code slightly confusing.
(cherry picked from commit 0201eab655)
Ensure the update zone name is mentioned in the NOTAUTH error message
in the server log, so that it is easier to track down problematic
update clients. There are two cases: either the update zone is
unrelated to any of the server's zones (previously no zone was
mentioned); or the update zone is a subdomain of one or more of the
server's zones (previously the name of the irrelevant parent zone was
misleadingly logged).
Closes#3209
(cherry picked from commit 84c4eb02e7)
In couple places, we have missed INSIST(0) or ISC_UNREACHABLE()
replacement on some branches with UNREACHABLE(). Replace all
ISC_UNREACHABLE() or INSIST(0) calls with UNREACHABLE().
Historically, the inline keyword was a strong suggestion to the compiler
that it should inline the function marked inline. As compilers became
better at optimising, this functionality has receded, and using inline
as a suggestion to inline a function is obsolete. The compiler will
happily ignore it and inline something else entirely if it finds that's
a better optimisation.
Therefore, remove all the occurences of the inline keyword with static
functions inside single compilation unit and leave the decision whether
to inline a function or not entirely on the compiler
NOTE: We keep the usage the inline keyword when the purpose is to change
the linkage behaviour.
(cherry picked from commit 20f0936cf2)
Previously, the unreachable code paths would have to be tagged with:
INSIST(0);
ISC_UNREACHABLE();
There was also older parts of the code that used comment annotation:
/* NOTREACHED */
Unify the handling of unreachable code paths to just use:
UNREACHABLE();
The UNREACHABLE() macro now asserts when reached and also uses
__builtin_unreachable(); when such builtin is available in the compiler.
(cherry picked from commit 584f0d7a7e)
Gcc 7+ and Clang 10+ have implemented __attribute__((fallthrough)) which
is explicit version of the /* FALLTHROUGH */ comment we are currently
using.
Add and apply FALLTHROUGH macro that uses the attribute if available,
but does nothing on older compilers.
In one case (lib/dns/zone.c), using the macro revealed that we were
using the /* FALLTHROUGH */ comment in wrong place, remove that comment.
(cherry picked from commit fe7ce629f4)
Upcoming LLVM/Clang 15 has marked the ATOMIC_VAR_INIT() as deprecated
breaking the build. In the previous commit, we have removed the use of
ATOMIC_VAR_INIT(), but as that was a prerequisite to using the
--enable-mutexatomic debugging mode, we have to remove the debugging
mode.
The C17 standard deprecated ATOMIC_VAR_INIT() macro (see [1]). Follow
the suite and remove the ATOMIC_VAR_INIT() usage in favor of simple
assignment of the value as this is what all supported stdatomic.h
implementations do anyway:
* MacOSX.plaform: #define ATOMIC_VAR_INIT(__v) {__v}
* Gcc stdatomic.h: #define ATOMIC_VAR_INIT(VALUE) (VALUE)
1. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1138r0.pdf
(cherry picked from commit f251d69eba)
When max-transfer-*-out timeouts were reintroduced, the log message
about starting the timer was errorneously left as ISC_LOG_ERROR.
Change the log level of said message to ISC_LOG_DEBUG(1).
(cherry picked from commit 8f6e4dfa15)