Commit graph

203 commits

Author SHA1 Message Date
Michał Kępień
ba1306bfb4 Check for NULL before dereferencing qctx->rpz_st
Commit 9ffb4a7ba1 causes Clang Static
Analyzer to flag a potential NULL dereference in query_nxdomain():

    query.c:9394:26: warning: Dereference of null pointer [core.NullDereference]
            if (!qctx->nxrewrite || qctx->rpz_st->m.rpz->addsoa) {
                                    ^~~~~~~~~~~~~~~~~~~
    1 warning generated.

The warning above is for qctx->rpz_st potentially being a NULL pointer
when query_nxdomain() is called from query_resume().  This is a false
positive because none of the database lookup result codes currently
causing query_nxdomain() to be called (DNS_R_EMPTYWILD, DNS_R_NXDOMAIN)
can be returned by a database lookup following a recursive resolution
attempt.  Add a NULL check nevertheless in order to future-proof the
code and silence Clang Static Analyzer.

(cherry picked from commit 07592d1315)
(cherry picked from commit a4547a1093)
2023-01-09 13:57:44 +01:00
Matthijs Mekking
2696267b1f Consider non-stale data when in serve-stale mode
With 'stale-answer-enable yes;' and 'stale-answer-client-timeout off;',
consider the following situation:

A CNAME record and its target record are in the cache, then the CNAME
record expires, but the target record is still valid.

When a new query for the CNAME record arrives, and the query fails,
the stale record is used, and then the query "restarts" to follow
the CNAME target. The problem is that the query's multiple stale
options (like DNS_DBFIND_STALEOK) are not reset, so 'query_lookup()'
treats the restarted query as a lookup following a failed lookup,
and returns a SERVFAIL answer when there is no stale data found in the
cache, even if there is valid non-stale data there available.

With this change, query_lookup() now considers non-stale data in the
cache in the first place, and returns it if it is available.

(cherry picked from commit 91a1a8efc5)
2023-01-09 13:57:43 +01:00
Tom Krizek
da42fa7622
Revert "Merge branch '3678-serve-stale-servfailing-unexpectedly-v9_16' into 'v9_16'"
This reverts commit b2a4447af8, reversing
changes made to 8924f92956.

It also removes release note 6038, since the fix is reverted.
2022-12-08 10:23:40 +01:00
Mark Andrews
4f3327cd41 Extend dns_db_allrdatasets to control interation results
Add an options parameter to control what rdatasets are returned when
iteratating over the node.  Specific modes will be added later.

(cherry picked from commit 7695c36a5d)
2022-12-08 11:20:35 +11:00
Michał Kępień
148608c7b2 Check for NULL before dereferencing qctx->rpz_st
Commit 9ffb4a7ba1 causes Clang Static
Analyzer to flag a potential NULL dereference in query_nxdomain():

    query.c:9394:26: warning: Dereference of null pointer [core.NullDereference]
            if (!qctx->nxrewrite || qctx->rpz_st->m.rpz->addsoa) {
                                    ^~~~~~~~~~~~~~~~~~~
    1 warning generated.

The warning above is for qctx->rpz_st potentially being a NULL pointer
when query_nxdomain() is called from query_resume().  This is a false
positive because none of the database lookup result codes currently
causing query_nxdomain() to be called (DNS_R_EMPTYWILD, DNS_R_NXDOMAIN)
can be returned by a database lookup following a recursive resolution
attempt.  Add a NULL check nevertheless in order to future-proof the
code and silence Clang Static Analyzer.

(cherry picked from commit 07592d1315)
2022-12-06 13:51:30 +00:00
Matthijs Mekking
e6e13c3e62 Consider non-stale data when in serve-stale mode
With 'stale-answer-enable yes;' and 'stale-answer-client-timeout off;',
consider the following situation:

A CNAME record and its target record are in the cache, then the CNAME
record expires, but the target record is still valid.

When a new query for the CNAME record arrives, and the query fails,
the stale record is used, and then the query "restarts" to follow
the CNAME target. The problem is that the query's multiple stale
options (like DNS_DBFIND_STALEOK) are not reset, so 'query_lookup()'
treats the restarted query as a lookup following a failed lookup,
and returns a SERVFAIL answer when there is no stale data found in the
cache, even if there is valid non-stale data there available.

With this change, query_lookup() now considers non-stale data in the
cache in the first place, and returns it if it is available.

(cherry picked from commit 86a80e723f)
2022-12-06 13:51:07 +00:00
Michal Nowak
771fed4a14
Update sources to Clang 15 formatting 2022-11-29 10:30:34 +01:00
Evan Hunt
8e4a1f3483 ensure RPZ lookups handle CD=1 correctly
RPZ rewrites called dns_db_findext() without passing through the
client database options; as as result, if the client set CD=1,
DNS_DBFIND_PENDINGOK was not used as it should have been, and
cache lookups failed, resulting in failure of the rewrite.

(cherry picked from commit 305a50dbe1)
2022-10-19 13:16:51 -07:00
Aram Sargsyan
b6aeccf697 Fix ns_statscounter_recursclients counting bug
The incrementing and decrementing of 'ns_statscounter_recursclients'
were not properly balanced: for example, it would be incremented for
a prefetch query but not decremented if the query failed.

This commit ensures that the recursion quota and the recursive clients
counter are always in sync with each other.

(cherry picked from commit 82991451b4)
2022-10-18 10:38:04 +00:00
Matthijs Mekking
3f68e2ad83 Only refresh RRset once
Don't attempt to resolve DNS responses for intermediate results. This
may create multiple refreshes and can cause a crash.

One scenario is where for the query there is a CNAME and canonical
answer in cache that are both stale. This will trigger a refresh of
the RRsets because we encountered stale data and we prioritized it over
the lookup. It will trigger a refresh of both RRsets. When we start
recursing, it will detect a recursion loop because the recursion
parameters will eventually be the same. In 'dns_resolver_destroyfetch'
the sanity check fails, one of the callers did not get its event back
before trying to destroy the fetch.

Move the call to 'query_refresh_rrset' to 'ns_query_done', so that it
is only called once per client request.

Another scenario is where for the query there is a stale CNAME in the
cache that points to a record that is also in cache but not stale. This
will trigger a refresh of the RRset (because we encountered stale data
and we prioritized it over the lookup).

We mark RRsets that we add to the message with
DNS_RDATASETATTR_STALE_ADDED to prevent adding a duplicate RRset when
a stale lookup and a normal lookup conflict with each other. However,
the other non-stale RRset when following a CNAME chain will be added to
the message without setting that attribute, because it is not stale.

This is a variant of the bug in #2594. The fix covered the same crash
but for stale-answer-client-timeout > 0.

Fix this by clearing all RRsets from the message before refreshing.
This requires the refresh to happen after the query is send back to
the client.

(cherry picked from commit d939d2ecde)
2022-09-08 12:08:28 +02:00
Aram Sargsyan
3ad0f165ab Fix RRL responses-per-second bypass using wildcard names
It is possible to bypass Response Rate Limiting (RRL)
`responses-per-second` limitation using specially crafted wildcard
names, because the current implementation, when encountering a found
DNS name generated from a wildcard record, just strips the leftmost
label of the name before making a key for the bucket.

While that technique helps with limiting random requests like
<random>.example.com (because all those requests will be accounted
as belonging to a bucket constructed from "example.com" name), it does
not help with random names like subdomain.<random>.example.com.

The best solution would have been to strip not just the leftmost
label, but as many labels as necessary until reaching the suffix part
of the wildcard record from which the found name is generated, however,
we do not have that information readily available in the context of RRL
processing code.

Fix the issue by interpreting all valid wildcard domain names as
the zone's origin name concatenated to the "*" name, so they all will
be put into the same bucket.

(cherry picked from commit baa9698c9d)
2022-09-08 09:41:15 +02:00
Matthijs Mekking
dd7dde5743 Don't enable serve-stale on duplicate queries
When checking if we should enable serve-stale, add an early out case
when the result is an error signalling a duplicate query or a query
that would be dropped.

(cherry picked from commit 059a4c2f4d9d3cff371842f43208d021509314fa)
2022-08-09 09:37:49 +02:00
Evan Hunt
82c197d93b Cleanup: always count ns_statscounter_recursclients
The ns_statscounter_recursclients counter was previously only
incremented or decremented if client->recursionquota was non-NULL.
This was harmless, because that value should always be non-NULL if
recursion is enabled, but it made the code slightly confusing.

(cherry picked from commit 0201eab655)
2022-05-14 00:58:26 -07:00
Mark Andrews
8f23d56fba Check the cache as well when glue NS are returned processing RPZ
(cherry picked from commit 8fb72012e3)
2022-05-04 23:53:21 +10:00
Mark Andrews
8c2ede6edc Process learned records as well as glue
(cherry picked from commit 07c828531c)
2022-05-04 23:53:21 +10:00
Mark Andrews
13129872eb Process the delegating NS RRset when checking rpz rules
(cherry picked from commit cf97c61f48)
2022-05-04 23:53:21 +10:00
Ondřej Surý
79b7804ce8 Consistenly use UNREACHABLE() instead of ISC_UNREACHABLE()
In couple places, we have missed INSIST(0) or ISC_UNREACHABLE()
replacement on some branches with UNREACHABLE().  Replace all
ISC_UNREACHABLE() or INSIST(0) calls with UNREACHABLE().
2022-03-28 23:28:05 +02:00
Ondřej Surý
b624be2544 Remove use of the inline keyword used as suggestion to compiler
Historically, the inline keyword was a strong suggestion to the compiler
that it should inline the function marked inline.  As compilers became
better at optimising, this functionality has receded, and using inline
as a suggestion to inline a function is obsolete.  The compiler will
happily ignore it and inline something else entirely if it finds that's
a better optimisation.

Therefore, remove all the occurences of the inline keyword with static
functions inside single compilation unit and leave the decision whether
to inline a function or not entirely on the compiler

NOTE: We keep the usage the inline keyword when the purpose is to change
the linkage behaviour.

(cherry picked from commit 20f0936cf2)
2022-03-25 09:37:18 +01:00
Ondřej Surý
75f9dd8e82 Simplify way we tag unreachable code with only ISC_UNREACHABLE()
Previously, the unreachable code paths would have to be tagged with:

    INSIST(0);
    ISC_UNREACHABLE();

There was also older parts of the code that used comment annotation:

    /* NOTREACHED */

Unify the handling of unreachable code paths to just use:

    UNREACHABLE();

The UNREACHABLE() macro now asserts when reached and also uses
__builtin_unreachable(); when such builtin is available in the compiler.

(cherry picked from commit 584f0d7a7e)
2022-03-25 09:33:51 +01:00
Ondřej Surý
673e53f81d Add FALLTHROUGH macro for __attribute__((fallthrough))
Gcc 7+ and Clang 10+ have implemented __attribute__((fallthrough)) which
is explicit version of the /* FALLTHROUGH */ comment we are currently
using.

Add and apply FALLTHROUGH macro that uses the attribute if available,
but does nothing on older compilers.

In one case (lib/dns/zone.c), using the macro revealed that we were
using the /* FALLTHROUGH */ comment in wrong place, remove that comment.

(cherry picked from commit fe7ce629f4)
2022-03-25 09:30:16 +01:00
Ondřej Surý
2c86bd4ed9 Remove debugging implementation of stdatomic using mutexes
Upcoming LLVM/Clang 15 has marked the ATOMIC_VAR_INIT() as deprecated
breaking the build.  In the previous commit, we have removed the use of
ATOMIC_VAR_INIT(), but as that was a prerequisite to using the
--enable-mutexatomic debugging mode, we have to remove the debugging
mode.
2022-03-17 21:44:04 +01:00
Michał Kępień
60e82835ec Fix more ns_statscounter_recursclients underflows
Commit aab691d512 did not fix all possible
scenarios in which the ns_statscounter_recursclients counter underflows.
The solution implemented therein can be ineffective e.g. when CNAME
chaining happens with prefetching enabled.

Here is an example recursive resolution scenario in which the
ns_statscounter_recursclients counter can underflow with the current
logic in effect:

 1. Query processing starts, the answer is not found in the cache, so
    recursion is started.  The NS_CLIENTATTR_RECURSING attribute is set.
    ns_statscounter_recursclients is incremented (Δ = +1).

 2. Recursion completes, returning a CNAME.  client->recursionquota is
    non-NULL, so the NS_CLIENTATTR_RECURSING attribute remains set.
    ns_statscounter_recursclients is decremented (Δ = 0).

 3. Query processing restarts.

 4. The current QNAME (the target of the CNAME from step 2) is found in
    the cache, with a TTL low enough to trigger a prefetch.

 5. query_prefetch() attaches to client->recursionquota.
    ns_statscounter_recursclients is not incremented because
    query_prefetch() does not do that (Δ = 0).

 6. Query processing restarts.

 7. The current QNAME (the target of the CNAME from step 4) is not found
    in the cache, so recursion is started.  client->recursionquota is
    already attached to (since step 5) and the NS_CLIENTATTR_RECURSING
    attribute is set (since step 1), so ns_statscounter_recursclients is
    not incremented (Δ = 0).

 8. The prefetch from step 5 completes.  client->recursionquota is
    detached from in prefetch_done().  ns_statscounter_recursclients is
    not decremented because prefetch_done() does not do that (Δ = 0).

 9. Recursion for the current QNAME completes.  client->recursionquota
    is already detached from, i.e. set to NULL (since step 8), and the
    NS_CLIENTATTR_RECURSING attribute is set (since step 1), so
    ns_statscounter_recursclients is decremented (Δ = -1).

Another possible scenario is that after step 7, recursion for the target
of the CNAME from step 4 completes before the prefetch for the CNAME
itself.  fetch_callback() then notices that client->recursionquota is
non-NULL and decrements ns_statscounter_recursclients, even though
client->recursionquota was attached to by query_prefetch() and therefore
not accompanied by an incrementation of ns_statscounter_recursclients.
The net result is also an underflow.

Instead of trying to properly handle all possible orderings of events
set into motion by normal recursion and prefetch-triggered recursion,
adjust ns_statscounter_recursclients whenever the recursive clients
quota is successfully attached to or detached from.  Remove the
NS_CLIENTATTR_RECURSING attribute altogether as its only purpose is made
obsolete by this change.

(cherry picked from commit f7482b68b9)
2022-02-23 14:45:06 +01:00
Evan Hunt
558b060de5 allow dns_clientinfo to store client ECS data
this brings DNS_CLIENTINFO_VERSION into line with the subscription
branch so that fixes applied to clientinfo processing can also be
applied to the main branch without diverging.

(cherry picked from commit 737e658602)
2022-01-27 16:08:57 -08:00
Ondřej Surý
2bf7921c7e Update the copyright information in all files in the repository
This commit converts the license handling to adhere to the REUSE
specification.  It specifically:

1. Adds used licnses to LICENSES/ directory

2. Add "isc" template for adding the copyright boilerplate

3. Changes all source files to include copyright and SPDX license
   header, this includes all the C sources, documentation, zone files,
   configuration files.  There are notes in the doc/dev/copyrights file
   on how to add correct headers to the new files.

4. Handle the rest that can't be modified via .reuse/dep5 file.  The
   binary (or otherwise unmodifiable) files could have license places
   next to them in <foo>.license file, but this would lead to cluttered
   repository and most of the files handled in the .reuse/dep5 file are
   system test files.

(cherry picked from commit 58bd26b6cf)
2022-01-11 12:22:09 +01:00
Evan Hunt
ddc677ae64 rename dns_zone_master and dns_zone_slave
dns_zone_master and dns_zone_slave are renamed as dns_zone_primary
and dns_zone_secondary.

(cherry picked from commit 916760ae46)
2021-08-30 11:58:29 -07:00
Mark Andrews
24e5e3ffd6 Make whether to follow additional data records generic
Adds dns_rdatatype_followadditional() and
DNS_RDATATYPEATTR_FOLLOWADDITIONAL

(cherry picked from commit f0265b8fa6)
2021-08-18 14:59:20 +10:00
Michał Kępień
f81c8e3e73 Tweak query_addds() comments to avoid confusion
It has been noticed that commit f88c90f47f
did not only fix NSEC record handling in signed, insecure delegations
prepared using both wildcard expansion and CNAME chaining - it also
inadvertently fixed DS record handling in signed, secure delegations
of that flavor.  This is because the 'rdataset' variable in the relevant
location in query_addds() can be either a DS RRset or an NSEC RRset.
Update a code comment in query_addds() to avoid confusion.

Update the comments describing the purpose of query_addds() so that they
also mention NSEC(3) records.

(cherry picked from commit 29d8d35869)
2021-07-16 07:24:34 +02:00
Michał Kępień
f88c90f47f Fix "no DS" proofs for wildcard+CNAME delegations
When answering a query requires wildcard expansion, the AUTHORITY
section of the response needs to include NSEC(3) record(s) proving that
the QNAME does not exist.

When a response to a query is an insecure delegation, the AUTHORITY
section needs to include an NSEC(3) proof that no DS record exists at
the parent side of the zone cut.

These two conditions combined trip up the NSEC part of the logic
contained in query_addds(), which expects the NS RRset to be owned by
the first name found in the AUTHORITY section of a delegation response.
This may not always be true, for example if wildcard expansion causes an
NSEC record proving QNAME nonexistence to be added to the AUTHORITY
section before the delegation is added to the response.  In such a case,
named incorrectly omits the NSEC record proving nonexistence of QNAME
from the AUTHORITY section.

The same block of code is affected by another flaw: if the same NSEC
record proves nonexistence of both the QNAME and the DS record at the
parent side of the zone cut, this NSEC record will be added to the
AUTHORITY section twice.

Fix by looking for the NS RRset in the entire AUTHORITY section and
adding the NSEC record to the delegation using query_addrrset() (which
handles duplicate RRset detection).

(cherry picked from commit 7a87bf468b)
2021-06-10 10:26:51 +02:00
Kevin Chen
3924f78748 Several serve-stale improvements
Commit a83c8cb0af updated masterdump so
that stale records in "rndc dumpdb" output no longer shows 0 TTLs.  In
this commit we change the name of the `rdataset->stale_ttl` field to
`rdataset->expired` to make its purpose clearer, and set it to zero in
cases where it's unused.

Add 'rbtdb->serve_stale_ttl' to various checks so that stale records
are not purged from the cache when they've been stale for RBTDB_VIRTUAL
(300) seconds.

Increment 'ns_statscounter_usedstale' when a stale answer is used.

Note: There was a question of whether 'overmem_purge' should be
purging ancient records, instead of stale ones.  It is left as purging
stale records, since stale records could take up the majority of the
cache.

This submission is copyrighted Akamai Technologies, Inc. and provided
under an MPL 2.0 license.

This commit was originally authored by Kevin Chen, and was updated by
Matthijs Mekking to match recent serve-stale developments.

(cherry picked from commit 0cdf85d204)
2021-05-30 12:30:36 -07:00
Matthijs Mekking
a2ec3a1e4c Reset DNS_FETCHOPT_TRYSTALE_ONTIMEOUT on resume
Once we resume a query, we should clear DNS_FETCHOPT_TRYSTALE_ONTIMEOUT
from the options to prevent triggering the stale-answer-client-timeout
on subsequent fetches.

If we don't this may cause a crash when for example when prefetch is
triggered after a query restart.

(cherry picked from commit c0dc5937c7)
2021-05-30 00:33:42 -07:00
Evan Hunt
85ca29e5e2 clean up query correctly if already answered by serve-stale
when a serve-stale answer has been sent, the client continues waiting
for a proper answer. if a final completion event for the client does
arrive, it can just be cleaned up without sending a response, similar
to a canceled fetch.

(cherry picked from commit 8bd8e995f1)
2021-05-27 12:09:43 -07:00
Ondřej Surý
ac25fb9439 Use dns_name_copynf() with dns_message_gettempname() when needed
dns_message_gettempname() returns an initialized name with a dedicated
buffer, associated with a dns_fixedname object.  Using dns_name_copynf()
to write a name into this object will actually copy the name data
from a source name. dns_name_clone() merely points target->ndata to
source->ndata, so it is faster, but it can lead to a use-after-free if
the source is freed before the target object is released via
dns_message_puttempname().

In a few places, clone was being used where copynf should have been;
this is now fixed.

As a side note, no memory was lost, because the ndata buffer used in
the dns_fixedname_t is internal to the structure, and is freed when
the dns_fixedname_t is freed regardless of the .ndata contents.

(cherry picked from commit ce3e1abc1d)
2021-05-22 07:17:01 +02:00
Evan Hunt
dccdb492ef use a fixedname buffer in dns_message_gettempname()
dns_message_gettempname() now returns a pointer to an initialized
name associated with a dns_fixedname_t object. it is no longer
necessary to allocate a buffer for temporary names associated with
the message object.

(cherry picked from commit e31cc1eeb4)
2021-05-22 07:13:57 +02:00
Matthijs Mekking
72deed194d Use isdigit instead of checking character range
When looking for key files, we could use isdigit rather than checking
if the character is within the range [0-9].

Use (unsigned char) cast to ensure the value is representable in the
unsigned char type (as suggested by the isdigit manpage).

Change " & 0xff" occurrences to the recommended (unsigned char) type
cast.

(cherry picked from commit 1998ad6c776a9c17c27788b17765dee90d9e25df)
2021-05-05 18:23:53 +02:00
Mark Andrews
ea443fa9ba Handle DNAME lookup via itself
When answering a query, named should never attempt to add the same RRset
to the ANSWER section more than once.  However, such a situation may
arise when chasing DNAME records: one of the DNAME records placed in the
ANSWER section may turn out to be the final answer to a client query,
but there is no way to know that in advance.  Tweak the relevant INSIST
assertion in query_respond() so that it handles this case properly.
qctx->rdataset is freed later anyway, so there is no need to clean it up
in query_respond().
2021-04-29 11:12:38 +02:00
Matthijs Mekking
4615cbb597 Serve-stale nit fixes
While working on the serve-stale backports, I noticed the following
oddities:

1. In the serve-stale system test, in one case we keep track of the
   time how long it took for dig to complete. In commit
   aaed7f9d8c, the code removed the
   exception to check for result == ISC_R_SUCCESS on stale found
   answers, and adjusted the test accordingly. This failed to update
   the time tracking accordingly. Move the t1/t2 time track variables
   back around the two dig commands to ensure the lookups resolved
   faster than the resolver-query-timeout.

2. We can remove the setting of NS_QUERYATTR_STALEOK and
   DNS_RDATASETATTR_STALE_ADDED on the "else if (stale_timeout)"
   code path, because they are added later when we know we have
   actually found a stale answer on a stale timeout lookup.

3. We should clear the NS_QUERYATTR_STALEOK flag from the client
   query attributes instead of DNS_RDATASETATTR_STALE_ADDED (that
   flag is set on the rdataset attributes).

4. In 'bin/named/config.c' we should set the configuration options
   in alpabetical order.

5. In the ARM, in the backports we have added "(stale)" between
   "cached" and "RRset" to make more clear a stale RRset may be
   returned in this scenario.

(cherry picked from commit 104b676235)
2021-04-28 13:53:52 +02:00
Matthijs Mekking
194a72b3f1 If RPZ config'd, bail stale-answer-client-timeout
When we are recursing, RPZ processing is not allowed. But when we are
performing a lookup due to "stale-answer-client-timeout", we are still
recursing. This effectively means that RPZ processing is disabled on
such a lookup.

In this case, bail the "stale-answer-client-timeout" lookup and wait
for recursion to complete, as we we can't perform the RPZ rewrite
rules reliably.

(cherry picked from commit 3d3a6415f7)
2021-04-02 13:29:27 +02:00
Matthijs Mekking
29bcd113ea Rename "staleonly"
The dboption DNS_DBFIND_STALEONLY caused confusion because it implies
we are looking for stale data **only** and ignore any active RRsets in
the cache. Rename it to DNS_DBFIND_STALETIMEOUT as it is more clear
the option is related to a lookup due to "stale-answer-client-timeout".

Rename other usages of "staleonly", instead use "lookup due to...".
Also rename related function and variable names.

(cherry picked from commit 839df94190)
2021-04-02 13:29:17 +02:00
Matthijs Mekking
34dd6521b1 Restore the RECURSIONOK attribute after staleonly
When doing a staleonly lookup we don't want to fallback to recursion.
After all, there are obviously problems with recursion, otherwise we
wouldn't do a staleonly lookup.

When resuming from recursion however, we should restore the
RECURSIONOK flag, allowing future required lookups for this client
to recurse.

(cherry picked from commit 3f81d79ffb)
2021-04-02 13:29:09 +02:00
Matthijs Mekking
114dc7888a Remove result exception on staleonly lookup
When implementing "stale-answer-client-timeout", we decided that
we should only return positive answers prematurely to clients. A
negative response is not useful, and in that case it is better to
wait for the recursion to complete.

To do so, we check the result and if it is not ISC_R_SUCCESS, we
decide that it is not good enough. However, there are more return
codes that could lead to a positive answer (e.g. CNAME chains).

This commit removes the exception and now uses the same logic that
other stale lookups use to determine if we found a useful stale
answer (stale_found == true).

This means we can simplify two test cases in the serve-stale system
test: nodata.example is no longer treated differently than data.example.

(cherry picked from commit aaed7f9d8c)
2021-04-02 13:28:59 +02:00
Matthijs Mekking
06823aa255 Remove INSIST on NS_QUERYATTR_ANSWERED
The NS_QUERYATTR_ANSWERED attribute is to prevent sending a response
twice. Without the attribute, this may happen if a staleonly lookup
found a useful answer and sends a response to the client, and later
recursion ends and also tries to send a response.

The attribute was also used to mask adding a duplicate RRset. This is
considered harmful. When we created a response to the client with a
stale only lookup (regardless if we actually have send the response),
we should clear the rdatasets that were added during that lookup.

Mark such rdatasets with the a new attribute,
DNS_RDATASETATTR_STALE_ADDED. Set a query attribute
NS_QUERYATTR_STALEOK if we may have added rdatasets during a stale
only lookup. Before creating a response on a normal lookup, check if
we can expect rdatasets to have been added during a staleonly lookup.
If so, clear the rdatasets from the message with the attribute
DNS_RDATASETATTR_STALE_ADDED set.

(cherry picked from commit 3d5429f61f)
2021-04-02 13:28:08 +02:00
Matthijs Mekking
33d61b9651 Simplify when to detach the client
With stale-answer-client-timeout, we may send a response to the client,
but we may want to hold on to the network manager handle, because
recursion is going on in the background, or we need to refresh a
stale RRset.

Simplify the setting of 'nodetach':
* During a staleonly lookup we should not detach the nmhandle, so just
  set it prior to 'query_lookup()'.
* During a staleonly "stalefirst" lookup set the 'nodetach' to true
  if we are going to refresh the RRset.

Now there is no longer the need to clear the 'nodetach' if we go
through the "dbfind_stale", "stale_refresh_window", or "stale_only"
paths.

(cherry picked from commit 48b0dc159b)
2021-04-02 13:28:01 +02:00
Matthijs Mekking
b1496d19d5 Refactor stale lookups, ignore active RRsets
When doing a staleonly lookup, ignore active RRsets from cache. If we
don't, we may add a duplicate RRset to the message, and hit an
assertion failure in query.c because adding the duplicate RRset to the
ANSWER section failed.

This can happen on a race condition. When a client query is received,
the recursion is started. When 'stale-answer-client-timeout' triggers
around the same time the recursion completes, the following sequence
of events may happen:
1. Queue the "try stale" fetch_callback() event to the client task.
2. Add the RRsets from the authoritative response to the cache.
3. Queue the "fetch complete" fetch_callback() event to the client task.
4. Execute the "try stale" fetch_callback(), which retrieves the
   just-inserted RRset from the database.
5. In "ns_query_done()" we are still recursing, but the "staleonly"
   query attribute has already been cleared. In other words, the
   query will resume when recursion ends (it already has ended but is
   still on the task queue).
6. Execute the "fetch complete" fetch_callback(). It finds the answer
   from recursion in the cache again and tries to add the duplicate to
   the answer section.

This commit changes the logic for finding stale answers in the cache,
such that on "stale_only" lookups actually only stale RRsets are
considered. It refactors the code so that code paths for "dbfind_stale",
"stale_refresh_window", and "stale_only" are more clear.

First we call some generic code that applies in all three cases,
formatting the domain name for logging purposes, increment the
trystale stats, and check if we actually found stale data that we can
use.

The "dbfind_stale" lookup will return SERVFAIL if we didn't found a
usable answer, otherwise we will continue with the lookup
(query_gotanswer()). This is no different as before the introduction of
"stale-answer-client-timeout" and "stale-refresh-time".

The "stale_refresh_window" lookup is similar to the "dbfind_stale"
lookup: return SERVFAIL if we didn't found a usable answer, otherwise
continue with the lookup (query_gotanswer()).

Finally the "stale_only" lookup.

If the "stale_only" lookup was triggered because of an actual client
timeout (stale-answer-client-timeout > 0), and if database lookup
returned a stale usable RRset, trigger a response to the client.
Otherwise return and wait until the recursion completes (or the
resolver query times out).

If the "stale_only" lookup is a "stale-anwer-client-timeout 0" lookup,
preferring stale data over a lookup. In this case if there was no stale
data, or the data was not a positive answer, retry the lookup with the
stale options cleared, a.k.a. a normal lookup. Otherwise, continue
with the lookup (query_gotanswer()) and refresh the stale RRset. This
will trigger a response to the client, but will not detach the handle
because a fetch will be created to refresh the RRset.

(cherry picked from commit 92f7a67892)
2021-04-02 13:27:52 +02:00
Matthijs Mekking
fcf8fb4f39 Keep track of allow client detach
The stale-answer-client-timeout feature introduced a dependancy on
when a client may be detached from the handle. The dboption
DNS_DBFIND_STALEONLY was reused to track this attribute. This overloads
the meaning of this database option, and actually introduced a bug
because the option was checked in other places. In particular, in
'ns_query_done()' there is a check for 'RECURSING(qctx->client) &&
(!QUERY_STALEONLY(&qctx->client->query) || ...' and the condition is
satisfied because recursion has not completed yet and
DNS_DBFIND_STALEONLY is already cleared by that time (in
query_lookup()), because we found a useful answer and we should detach
the client from the handle after sending the response.

Add a new boolean to the client structure to keep track of client
detach from handle is allowed or not. It is only disallowed if we are
in a staleonly lookup and we didn't found a useful answer.

(cherry picked from commit fee164243f)
2021-04-02 13:27:43 +02:00
Matthijs Mekking
96953fc293
Fix servestale fetchlimits crash
When we query the resolver for a domain name that is in the same zone
for which is already one or more fetches outstanding, we could
potentially hit the fetch limits. If so, recursion fails immediately
for the incoming query and if serve-stale is enabled, we may try to
return a stale answer.

If the resolver is also is authoritative for the parent zone (for
example the root zone), first a delegation is found, but we first
check the cache for a better response.

Nothing is found in the cache, so we try to recurse to find the
answer to the query.

Because of fetch-limits 'dns_resolver_createfetch()' returns an error,
which 'ns_query_recurse()' propagates to the caller,
'query_delegation_recurse()'.

Because serve-stale is enabled, 'query_usestale()' is called,
setting 'qctx->db' to the cache db, but leaving 'qctx->version'
untouched. Now 'query_lookup()' is called to search for stale data
in the cache database with a non-NULL 'qctx->version'
(which is set to a zone db version), and thus we hit an assertion
in rbtdb.

This crash was introduced in 'v9_16' by commit
2afaff75ed.

(cherry picked from commit 87591de6f7)
2021-03-11 13:47:20 +01:00
Ondřej Surý
f92b77ff0d Change the isc_thread_self() return type to uintptr_t
The pthread_self(), thrd_current() or GetCurrentThreadId() could
actually be a pointer, so we should rather convert the value into
uintptr_t instead of unsigned long.

(cherry picked from commit a0181056a8)
2021-02-26 21:14:17 +01:00
Matthijs Mekking
acc95d4e1d Don't servfail on staleonly lookups
When a staleonly lookup doesn't find a satisfying answer, it should
not try to respond to the client.

This is not true when the initial lookup is staleonly (that is when
'stale-answer-client-timeout' is set to 0), because no resolver fetch
has been created at this point. In this case continue with the lookup
normally.

(cherry picked from commit f8b7b597e9)
2021-02-25 12:07:34 +01:00
Matthijs Mekking
84deb57bc3 Don't allow recursion on staleonly lookups
Fix a crash that can happen in the following scenario:

A client request is received. There is no data for it in the cache,
(not even stale data). A resolver fetch is created as part of
recursion.

Some time later, the fetch still hasn't completed, and
stale-answer-client-timeout is triggered. A staleonly lookup is
started. It will also find no data in the cache.

So 'query_lookup()' will call 'query_gotanswer()' with ISC_R_NOTFOUND,
so this will call 'query_notfound()' and this will start recursion.

We will eventually end up in 'ns_query_recurse()' and that requires
the client query fetch to be NULL:

    REQUIRE(client->query.fetch == NULL);

If the previously started fetch is still running this assertion
fails.

The crash is easily prevented by not requiring recursion for
staleonly lookups.

Also remove a redundant setting of the staleonly flag at the end of
'query_lookup_staleonly()' before destroying the query context.

Add a system test to catch this case.

(cherry picked from commit 9e061faaae)
2021-02-25 12:07:27 +01:00
Matthijs Mekking
2afaff75ed Use stale on error also when unable to recurse
The 'query_usestale()' function was only called when in
'query_gotanswer()' and an unexpected error occurred. This may have
been "quota reached", and thus we were in some cases returning
stale data on fetch-limits (and if serve-stale enabled of course).

But we can also hit fetch-limits when recursing because we are
following a referral (in 'query_notfound()' and
'query_delegation_recurse()'). Here we should also check for using
stale data in case an error occurred.

Specifically don't check for using stale data when refetching a
zero TTL RRset from cache.

Move the setting of DNS_DBFIND_STALESTART into the 'query_usestale()'
function to avoid code duplication.

(cherry picked from commit 8bcd7fe69e)
2021-02-08 16:10:03 +01:00
Matthijs Mekking
dbf5428629 Only start stale refresh window when resuming
If we did not attempt a fetch due to fetch-limits, we should not start
the stale-refresh-time window.

Introduce a new flag DNS_DBFIND_STALESTART to differentiate between
a resolver failure and unexpected error. If we are resuming, this
indicates a resolver failure, then start the stale-refresh-time window,
otherwise don't start the stale-refresh-time window, but still fall
back to using stale data.

(This commit also wraps some docstrings to 80 characters width)

(cherry picked from commit aabdedeae3)
2021-02-08 16:07:43 +01:00