Add this test scenario for a bug fixed a while ago. When a third key is
introduced while the previous rollover hasn't finished yet, the keymgr
could decide to remove the first two keys, because it was not checking
for an indirect dependency on the keys.
In other words, the previous bug behavior was that the first two keys
were removed from the zone too soon.
This test case checks that all three keys stay in the zone, and no keys
are removed premature after another new key has been introduced.
(cherry picked from commit 9c40cf0566)
In the kasp script, if one expected key is not found, continue checking
the other key ids, even if there is no match for the first one. This
provides a bit more information which keys mismatch and makes for
easier debugging test failures.
(cherry picked from commit 674249f66a)
Make the cds/setup.sh compatible with the workaround which relies on
testing the TSAN_OPTIONS variable which may not be set.
(cherry picked from commit 76d9873ef6)
Surround the variables which are checked whether they're executable in
double quotes. Without them, empty paths won't be properly interpreted
as not executable.
(manually picked from commit 06056c44a7)
Since delv can occasionally hang in system tests when running with TSAN
(see GL#4119), disable these tests as a workaround. Otherwise, the hung
delv process will just waste CI resources and prevent any meaningful
output from the rest of the test suite.
(cherry picked from commit fbcf37f914)
Previously, the first check silently failed, as 450 is apparently (in
the CI) the minimum output size for the dnstap output, rather than
470 which the test was expecting. Effectively, the check served as a 5
second sleep rather than waiting for the proper file size.
Additionally, check the expected file sizes and fail if expectations
aren't met.
(manually picked from commit 5f809e50b6)
On main, the minimum file size seems to 454 bytes, while on some
platforms in our CI setup for the 9.16 branch, it appears to be 450
instead.
The log message is supposed to contain the zone name which was
erroneously omitted, but didn't pop up during tests, since return code
was silently ignored.
Now it actually waits for the proper log message rather than being an
equivalent of 3 second sleep (which was also sufficient to make the test
pass, thus we detected no failure).
(cherry picked from commit 1dd4c2b9e2)
util/parse_tsan.py builds tables of mutexes, threads, and pointers it
finds in the TSAN report provided to it as a command-line argument and
then replaces all mentions of each of these entities so that they are
numbered sequentially in the processed report. For example, this line:
Cycle in lock order graph: M0 (...) => M5 (...) => M9 (...) => M0
is expected to become:
Cycle in lock order graph: M1 (...) => M2 (...) => M3 (...) => M1
Problems arise when the gaps between mutex/thread identifiers present on
a single line are smaller than the total number of mutexes/threads found
by the script so far. For example, the following line:
Cycle in lock order graph: M0 (...) => M1 (...) => M2 (...) => M0
first gets turned into:
Cycle in lock order graph: M1 (...) => M1 (...) => M2 (...) => M1
and then into:
Cycle in lock order graph: M2 (...) => M2 (...) => M2 (...) => M2
In other words, lines like this become garbled due to information loss.
The problem stems from the fact that the numbering scheme the script
uses for identifying mutexes and threads is exactly the same as the one
used by TSAN itself. Update util/parse_tsan.py so that it uses
zero-padded numbers instead, making the "overlapping" demonstrated above
impossible.
(cherry picked from commit 7f0790c82f)
This is just a slight tweak for the respdiff CI test. The new dataset
has a different set of queries and it results in a slightly more
SERVFAILs rather than timeouts in the respdiff-long-third-party test.
In our comparison script, timeouts are not counted towards the
threshold. While the total number of differences remains roughly the
same, the different distributions of them (among SERVFAIL vs timeout)
warrants a slight bump in the threshold in order to avoid test failures.
Related isc-private/bind-qa!65
The purpose of the check is to verify the server has survived the
previous barrage of queries. This is done by sending a query and
checking we get a NOERROR response back.
Previously, that query could've been affected by a servfail cache - the
server would return a SERVFAIL answer, thus failing the check, despite
being up and running. Use version.bind txt ch query to avoid the
interference of servfail cache.
(cherry picked from commit dd7bcd2855)
The 'refresh_rrset' variable is used to determine if we can detach from
the client. This can cause a hang on shutdown. To fix this, move setting
of the 'nodetach' variable up to where 'refresh_rrset' is set (in
query_lookup(), and thus not in ns_query_done()), and set it to false
when actually refreshing the RRset, so that when this lookup is
completed, the client will be detached.
When a query was aborted because of the recursion quota being exceeded,
but triggered a stale answer response and a stale data refresh query,
it could cause named to loop back where we are iterating and following
a delegation. Having no good answer in cache, we would fall back to
using serve-stale again, use the stale data, try to refresh the RRset,
and loop back again, without ever terminating until crashing due to
stack overflow.
This happens because in the functions 'query_notfound()' and
'query_delegation_recurse()', we check whether we can fall back to
serving stale data. We shouldn't do so if we are already refreshing
an RRset due to having prioritized stale data in cache.
In other words, we need to add an extra check to 'query_usestale()' to
disallow serving stale data if we are currently refreshing a stale
RRset.
As an additional mitigation to prevent looping, we now use the result
code ISC_R_ALREADYRUNNING rather than ISC_R_FAILURE when a recursion
loop is encountered, and we check for that condition in
'query_usestale()' as well.
When cache memory usage is over the configured cache size (overmem) and
we are cleaning unused entries, it might not be enough to clean just two
entries if the entries to be expired are smaller than the newly added
rdata. This could be abused by an attacker to cause a remote Denial of
Service by possibly running out of the operating system memory.
Currently, the addrdataset() tries to do a single TTL-based cleaning
considering the serve-stale TTL and then optionally moves to overmem
cleaning if we are in that condition. Then the overmem_purge() tries to
do another single TTL based cleaning from the TTL heap and then continue
with LRU-based cleaning up to 2 entries cleaned.
Squash the TTL-cleaning mechanism into single call from addrdataset(),
but ignore the serve-stale TTL if we are currently overmem.
Then instead of having a fixed number of entries to clean, pass the size
of newly added rdatasetheader to the overmem_purge() function and
cleanup at least the size of the newly added data. This prevents the
cache going over the configured memory limit (`max-cache-size`).
Additionally, refactor the overmem_purge() function to reduce for-loop
nesting for readability.
The $t1 value equals $t2 due to the time elapsed between "rndc
managed-keys status" calls being equal to the normal active refresh
period (as calculated per rules listed in RFC 5011 section 2.3) minus an
"hour" (as set using -T mkeytimers). This value equality is expected to
happen on really slow machines. On our Windows CI runner, it happens
very often.
Fedora 38 and Debian "bullseye" images were "forked" to images used only
for TSAN CI jobs. The new images contain TSAN-aware liburcu that does
not fit well with ASAN CI jobs for which original images were also used.
liburcu is not used in this branch, but images are shared among
branches, and their use needs to be consistent in all maintained
branches.
(cherry picked from commit 04dda8661f)
We recently fixed a bug where in some cases (when following an
expired CNAME for example), named could return SERVFAIL if the target
record is still valid (see isc-projects/bind9#3678, and
isc-projects/bind9!7096). We fixed this by considering non-stale
RRsets as well during the stale lookup.
However, this triggered a new bug because despite the answer from
cache not being stale, the lookup may be triggered by serve-stale.
If the answer from database is not stale, the fix in
isc-projects/bind9!7096 erroneously skips the serve-stale logic.
Add 'answer_found' checks to the serve-stale logic to fix this issue.
(cherry picked from commit bbd163acf6)