Commit graph

128 commits

Author SHA1 Message Date
Aram Sargsyan
04ed44e7d7 Test another 'stale-answer-client-timeout 0' scenario
Add a test to check serve-stale with the 'stale-answer-client-timeout 0'
configuration option and with a delegation which is a CNAME to a auth
zone.
2025-09-02 08:07:15 +00:00
Matthijs Mekking
dc649735ad Add reproducer as test case
The issue provided a reproducer that can be easily converted into a
test case.
2025-07-23 07:18:48 +00:00
Nicki Křížek
4c487c811d Use pytest.mark.flaky as the flaky marker
It's possible to use pytest.mark.flaky, which achieves the exact same
thing as our custom-defined isctest.mark.flaky -- attempts to rerun the
test on failure, but only is flaky package is available.
2025-07-07 13:29:15 +02:00
Nicki Křížek
1e0df480c7 Mark the serve_stale system test as flaky
The serve_stale test has some inherent instabilities affecting many
different checks. While the failure rate isn't too high (about four
failures in past three weeks of nightlies), it gets ignored, because the
test has been unstable for a very long time.
2025-07-07 13:29:15 +02:00
Aram Sargsyan
441b7d53f4 Test 'stale-answer-client-timeout 0' with a delegation
Add a new test which gets an answer for a delegated zone, then
checks whether the 'stale-answer-client-timeout 0' mode (i.e. the
'stalefirst' mode) works for it.
2025-04-23 11:46:16 +00:00
Mark Andrews
de519cd1c9 Don't leak the original QTYPE to parent zone
When performing QNAME minimization, named now sends an NS
query for the original QNAME, to prevent the parent zone from
receiving the QTYPE.

For example, when looking up example.com/A, we now send NS queries
for both com and example.com before sending the A query to the
servers for example.com.  Previously, an A query for example.com
would have been sent to the servers for com.

Several system tests needed to be adjusted for the new query pattern:

- Some queries in the serve-stale test were sent to the wrong server.
- The synthfromdnssec test could fail due to timing issues; this
  has been addressed by adding a 1-second delay.
- The cookie test could fail due to the a change in the count of
  TSIG records received in the "check that missing COOKIE with a
  valid TSIG signed response does not trigger TCP fallback" test case.
- The GL #4652 regression test case in the chain system test depends
  on a particular query order, which no longer occurs when QNAME
  minimization is active. We now disable qname-minimization
  for that test.
2025-03-14 01:01:26 +00:00
Mark Andrews
0680eb6f64 Fix gratuitious DNS protocol errors in the ANS servers
The ANS servers were not to written to handle NS queries at the
QNAME resulting in gratuitious protocol errors that will break tests
when NS requests are made for the QNAME.
2025-02-04 12:49:50 +11:00
Ondřej Surý
355fc48472
Print the expiration time of the stale records (not ancient)
In #1870, the expiration time of ANCIENT records were printed, but
actually the ancient records are very short lived, and the information
carries a little value.

Instead of printing the expiration of ANCIENT records, print the
expiration time of STALE records.
2025-02-03 15:47:06 +01:00
Ondřej Surý
1bbb57f81b
In cache, set rdataset TTL to 0 when the header is not active
When the header has been marked as ANCIENT, but the ttl hasn't been
reset (this happens in couple of places), the rdataset TTL would be
set to the header timestamp instead to a reasonable TTL value.

Since this header has been already expired (ANCIENT is set), set the
rdataset TTL to 0 and don't reuse this field to print the expiration
time when dumping the cache.  Instead of printing the time, we now
just print 'expired (awaiting cleanup'.
2025-02-03 14:39:06 +01:00
Colin Vidal
27f3b8950a update serve-stale test to support EDE 22
When EDE 3 (stale answer) was added the serve-stale tests were checking
for those exclusively, i.e. grepping for no "EDE" in the dig output when
no stale answer was expected.

However, some stale tests disable stale answers and make the
authoritative server unresponsive, effectively triggering a timed out
request thus an EDE 22. Update those tests so they still tests the
absence of EDE 3 error, but also the presence of EDE 22.
2025-01-27 11:49:44 +01:00
Nicki Křížek
f2cb2e5723 Remove invocations and mentions of clean.sh 2024-11-08 10:54:24 +01:00
Nicki Křížek
7c259fe254 Replace clean.sh files with extra_artifacts mark
The artifact lists in clean.sh and extra_artifacts might be slightly
different. The list was updated for each test to reflect the current
state.
2024-11-08 10:54:24 +01:00
Evan Hunt
c3d3d12911 change allow-transfer default to "none"
Changed the default value for 'allow-transfer' to 'none'; zone
transfers now require explicit authorization.

Updated all system tests to specify an allow-transfer ACL when needed.

Revised the ARM to specify that the default is 'none'.
2024-06-05 10:50:06 -07:00
Aram Sargsyan
bd7463914f Disallow stale-answer-client-timeout non-zero values
Remove all the code and tests which support non-zero
stale-answer-client-timeout values, and adjust the
documentation.
2024-02-16 08:41:52 +00:00
Tom Krizek
339fa5690a
Use a single local port for ditch.pl
The ditch.pl script is used to generate burst traffic without waiting
for the responses. When running other tests in parallel, this can result
in a ephemeral port clash, since the ditch.pl process closes the socket
immediately. In rare occasions when the message ID also clashes with
other tests' queries, it might result in an UnexpectedSource error from
dnspython.

Use a dedicated port EXTRAPORT8 which is reserved for each test as a
source port for the burst traffic.
2024-02-08 13:41:23 +01:00
Mark Andrews
4351076d48
Handle dig timing out gracefully in serve-stale 2024-01-08 17:03:32 +01:00
Mark Andrews
15a433cb9d Stop sending queries to the internet's root servers
Disable automatic dnssec validation.
2023-12-18 23:46:03 +00:00
Matthijs Mekking
e196ba6168 Test case for issue #4355
Add a test case where serve-stale is enabled on a server that also
servers a local authoritative zone.

The particular case tests a lame delegation and checks if falling
back to serving stale data does not attempt to retrieve the query
by recursing from the root down.
2023-10-30 20:07:01 +01:00
Tom Krizek
4cb8b13987
Reformat shell scripts with shfmt
All changes in this commit were automated using the command:

  shfmt -w -i 2 -ci -bn . $(find . -name "*.sh.in")

By default, only *.sh and files without extension are checked, so
*.sh.in files have to be added additionally. (See mvdan/sh#944)
2023-10-26 10:23:50 +02:00
Tom Krizek
c3abedc0a2
Use prereq.sh for serve-stale system test 2023-09-19 14:47:48 +02:00
Tom Krizek
168dba163c
Rename system test directory with common files to _common
The old name "common" clashes with the convention of system test
directory naming. It appears as a system test directory, but it only
contains helper files.

To reduce confusion and to allow automatic detection of issues with
possibly missing test files, rename the helper directory to "_common".
The leading underscore indicates the directory is different and the its
name can no longer be confused with regular system test directories.
2023-09-19 13:29:27 +02:00
Matthijs Mekking
0f593fd70a Add serve-stale test settings after flush
Add a test case to ensure that after 'rndc flush', the serve-stale
settings are not reset.
2023-08-31 11:07:35 +02:00
Tom Krizek
f617512d37
Reproducer for CVE-2023-2911
The conditions that trigger the crash:
- a stale record is in cache
- stale-answer-client-timeout is 0
- multiple clients query for the stale record, enough of them to exceed
  the recursive-clients quota
- the response from the authoritative is sufficiently delayed so that
  recursive-clients quota is exceeded first

The reproducer attempts to simulate this situation. However, it hasn't
proven to be 100 % reproducible, especially in CI. When reproducing
locally, the priming query also seems to sometimes interfere and prevent
the crash. When the reproducer is ran twice, it appears to be more
reliable in reproducing the issue.
2023-07-25 09:23:24 +02:00
Tom Krizek
05baf7206b
Use $(...) notation for subshells in system tests
The changes were mostly done with sed:

find . -name '*.sh' | xargs sed -i 's/`\([^`]*\)`/$(\1)/g'

There have been a few manual changes where the regex wasn't sufficient
(e.g. backslashes inside the `...`) or wrong (`...` referring to docs or
in comments).
2023-07-14 15:49:18 +02:00
Tom Krizek
4e8802a22d
Handle non-zero return codes in serve-stale test 2023-07-14 15:49:17 +02:00
Tom Krizek
1436025e20
Use arithmetic expansion in system tests (followup)
These are manual edits in addition of the automated changes from the
previous commit.
2023-07-14 15:49:15 +02:00
Tom Krizek
01bc805f89
Run system tests with set -e
Ensure all shell system tests are executed with the errexit option set.
This prevents unchecked return codes from commands in the test from
interfering with the tests, since any failures need to be handled
explicitly.
2023-07-14 15:07:25 +02:00
Evan Hunt
0b09ee8cdc explicitly set dnssec-validation in system tests
the default value of dnssec-validation is 'auto', which causes
a server to send a key refresh query to the root zone when starting
up. this is undesirable behavior in system tests, so this commit
sets dnssec-validation to either 'yes' or 'no' in all tests where
it had not previously been set.

this change had the mostly-harmless side effect of changing the cached
trust level of unvalidated answer data from 'answer' to 'authanswer',
which caused a few test cases in which dumped cache data was examined in
the serve-stale system test to fail. those test cases have now been
updated to expect 'authanswer'.
2023-06-26 13:41:56 -07:00
Tom Krizek
dd7bcd2855
Avoid false positive in serve-stale system test check
The purpose of the check is to verify the server has survived the
previous barrage of queries. This is done by sending a query and
checking we get a NOERROR response back.

Previously, that query could've been affected by a servfail cache - the
server would return a SERVFAIL answer, thus failing the check, despite
being up and running. Use version.bind txt ch query to avoid the
interference of servfail cache.
2023-06-13 10:52:01 +02:00
Matthijs Mekking
c3d4fd3449 Add serve-stale test case for GL #3950
Add a test case where when priming the cache with a slow authoritative
resolver, the stale-answer-client-timeout option should not return
a delegation to the client (it should wait until an applicable answer
is found, if no entry is found in the cache).
2023-05-30 11:58:19 +02:00
Tom Krizek
2f5bf6d971
Add pytest functions for shell system tests
In order to run the shell system tests, the pytest runner has to pick
them up somehow. Adding an extra python file with a single function
for the shell tests for each system test proved to be the most
compatible way of running the shell tests across older pytest/xdist
versions.

Modify the legacy run.sh script to ignore these pytest-runner specific
glue files when executing tests written in pytest.
2023-05-22 14:11:39 +02:00
Matthijs Mekking
0bf36da305 Update serve-stale system test
The serve-stale system test was intermittently failing due to a timing
issue:

    I:serve-stale:check stale data.example TXT was refreshed...
    I:serve-stale:failed

The RRset is refreshed, however, it first checks for an expected log
line, prior checking that the stale data.example TXT was refreshed
(using dig). This log line is there to ensure the record is actually
refreshed before we start querying again. Alternatively we could just
retry_quiet 10 <wait for dig output matches expectations>. It would
lower the chances for intermittent test failures, since there is no
longer a "check for log line, sleep one second if check fails, check
for log line, ...", prior to the check.
2023-03-08 17:14:59 +01:00
Mark Andrews
add40273df
Test RRSIG queries with serve-stale enabled
Make RRSIG queries where the existing tests trigger a DNS_EVENT_TRYSTALE
event.
2023-02-22 13:22:02 +01:00
Aram Sargsyan
4b52b0b4a9
Add tests for CVE-2022-3924
Reproduce the assertion by configuring a 'named' resolver with
'recursive-clients 10;' configuration option and running 20
queries is parallel.

Also tweak the 'ans2/ans.pl' to simulate a 50ms network latency
when qname starts with "latency". This makes sure that queries
running in parallel don't get served immediately, thus allowing
the configured recursive clients quota limitation to be activated.
2023-02-22 10:39:06 +01:00
Aram Sargsyan
537187bf2f Add serve-stale CNAME check with stale-answer-client-timeout off
Prime the cache with the following records:

    shortttl.cname.example.	1	IN	CNAME	longttl.target.example.
    longttl.target.example.	600	IN	A	10.53.0.2

Wait for the CNAME record to expire, disable the authoritative server,
and query 'shortttl.cname.example' again, expecting a stale answer.
2023-01-09 10:44:01 +01:00
Tom Krizek
b8616e457f
Remove ans.pl system test files from gitignore
The ans*.pl scripts are part of system tests and should be part of the
repository. The gitignore entires for these files have been removed.
2022-12-23 13:44:18 +01:00
Tom Krizek
ba1607747c
Revert "Merge branch '3678-serve-stale-servfailing-unexpectedly' into 'main'"
This reverts commit 629f66ea8e, reversing
changes made to 84a7be327e.

It also removes release note 6038, since the fix is reverted.
2022-12-08 10:30:44 +01:00
Aram Sargsyan
21faf44ef7 Add serve-stale CNAME check with stale-answer-client-timeout off
Prime the cache with the following records:

    shortttl.cname.example.	1	IN	CNAME	longttl.target.example.
    longttl.target.example.	600	IN	A	10.53.0.2

Wait for the CNAME record to expire, disable the authoritative server,
and query 'shortttl.cname.example' again, expecting a stale answer.
2022-12-06 13:26:53 +00:00
Mark Andrews
bce1cf6c62 Log type with stale answer log messages
Add more information about which query type is dealing with serve-stale.
Update the expected log messages in the serve-stale system test.
2022-11-30 14:32:58 +01:00
Matthijs Mekking
45f7a15785 Update serve-stale test messages to include RRtype 2022-11-30 14:28:38 +01:00
Tom Krizek
c100308b7d
Simplify start/stop helper func in system tests
The system test should never attempt to start or stop any other server
than those that belong to that system test. Therefore, it is not
necessary to specify the system test name in function calls.

Additionally, this makes it possible to run the test inside a
differently named directory, as its name is automatically detected with
the $SYSTESTDIR variable. This enables running the system tests inside a
temporary directory.

Direct use of stop.pl was replaced with a more systematic approach to
use stop_servers helper function.
2022-11-25 09:27:33 +01:00
Michal Nowak
9e68997cbb Add shell interpreter line where missing
The checkbashisms script reports errors like this one:

    script util/check-line-length.sh does not appear to have a #! interpreter line;
    you may get strange results
2022-11-14 19:54:42 +00:00
Tom Krizek
6295572b05
Remove misleading comment from serve-stale test
The stale-answer-client-timeout option is not set to 0 in the config
neither is it the default value. This was probably caused by a
copy-paste error.
2022-10-24 14:23:27 +02:00
Tom Krizek
a4d72a57f9
Test serve stale cache with timeout 0 and CNAME
Add a couple of tests that verify the serve-stale behavior when
stale-answer-client-timeout is set to 0 and a (stale) CNAME record is
queried.

Related #3517
2022-10-24 14:23:26 +02:00
Matthijs Mekking
0681b15225 If refresh stale RRset times out, start stale-refresh-time
The previous commit failed some tests because we expect that if a
fetch fails and we have stale candidates in cache, the
stale-refresh-time window is started. This means that if we hit a stale
entry in cache and answering stale data is allowed, we don't bother
resolving it again for as long we are within the stale-refresh-time
window.

This is useful for two reasons:
- If we failed to fetch the RRset that we are looking for, we are not
  hammering the authoritative servers.

- Successor clients don't need to wait for stale-answer-client-timeout
  to get their DNS response, only the first one to query will take
  the latency penalty.

The latter is not useful when stale-answer-client-timeout is 0 though.

So this exception code only to make sure we don't try to refresh the
RRset again if it failed to do so recently.
2022-10-05 08:20:48 +02:00
Aram Sargsyan
8611aa759f DiG: use the same retry and fail-over logic for different failure types
DiG implements different logic in the `recv_done()` callback function
when processing a failure:

1. For a timed-out query it applies the "retries" logic first, then,
   when it fails, fail-overs to the next server.

2. For an EOF (end-of-file, or unexpected disconnect) error it tries to
   make a single retry attempt (even if the user has requested more
   retries), then, when it fails, fail-overs to the next server.

3. For other types of failures, DiG does not apply the "retries" logic,
   and tries to fail-over to the next servers (again, even if the user
   has requested to make retries).

Simplify the logic and apply the same logic (1) of first retries, and
then fail-over, for different types of failures in `recv_done()`.
2022-07-22 08:35:35 +00:00
Mark Andrews
ce324ae8ba Use DEFAULT_HMAC for rndc 2022-07-07 10:11:42 +10:00
Evan Hunt
f1485ca145 don't keep stale NXDOMAIN cache entries
when serve-stale is enabled, NXDOMAIN cache entries are no longer
preserved after the normal negative cache TTL, in order to reduce
unnecessary cache memory consumption.
2022-06-13 12:53:37 -07:00
Matthijs Mekking
f764cee136 Tweak timings in serve-stale system test
Give a little bit more time if we wait on a time out from the
authoritative (aka resolver failure), and give up after one try
(because the second attempt will likely result in a different EDE).
2022-05-23 14:23:07 +02:00
Matthijs Mekking
c66b9abc0b Add stale answer extended errors
Add DNS extended errors 3 (Stale Answer) and 19 (Stale NXDOMAIN Answer)
to responses. Add extra text with the reason why the stale answer was
returned.

To test, we need to change the configuration such that for the first
set of tests the stale-refresh-time window does not interfer with the
expected extended errors.
2022-04-28 09:58:25 +02:00