Stop the cname_and_other_data processing if we already know that the
result is true. Also, we know that CNAME will be placed in the priority
headers, so we can stop looking for CNAME if we haven't found CNAME and
we are past the priority headers.
Currently we test the incoming zone transfers data in the statistics
channel by retransfering the zones in slow mode and capturing the XML
and JSON outputs in the meantime to check their validity. Add a new
transfer to the test, and check that the XML and JSON files correctly
indicate that we have 3 retransfers and 1 new (first time) transfer.
Explicitly use an empty 'trust-anchors' statement in the system
tests where it was used implicitly before.
In resolver/ns5/named.conf.in use the trust anchor in 'trusted.conf',
which was supposed to be used there.
Add checks into the 'checkconf' system test to make sure that the
'dnssec-validation yes' option fails without configured trusted
anchors, and succeeds with configured non-empty, as well as empty
trusted anchors.
the 'low', 'high' and 'discount' parameters to 'fetch-quota-param'
are meant to be ratios with values between zero and one, but higher
values can be assigned. this could potentially lead to an assertion
in maybe_adjust_quota().
In the second test we are looking for key files and extract the key
id numbers. Because keys can be in different directories, we needed
to change the maxdepth when searching for keys.
For the second kasp system test, check that 'dnssec-keygen -k' (default
policy) creates valid files, the 'get_keyids' returned more than one
keytag, namely the ones that are inside the keys/ directory, that were
created for the predecessor test, check that 'dnssec-keygen -k'
(configuredd policy) creates valid files.
This caused the system test to spew out errors that key files were
missing (we were looking for key files in the current directory, but
when looking for key id numbers we included the keys/ directory). It
could also cause the next test to fail, check that 'dnssec-settime' by
default does not edit key state file, because the STATE_FILE environment
variable was overwritten with the key file path of one of the keys that
were created with the configured policy.
We fix this by adjusting the maxdepth for the test in question. Other
tests don't need adjusting because they use unique zone names.
The name "uri" was considered to be too generic and could potentially
clash with a future URI configuration option. Renamed to "pkcs11-uri".
Note that this option name was also preferred over "pkcs11uri", the
dash is considered to be the more clearer form.
Add test cases for zones in different views that are using PKCS#11
tokens to store its keys.
If it is using the same DNSSEC policy, only one PKCS#11 token should be
created and the same key should be used for the zone in both views.
If it is using a different DNSSEC policy, multiple PKCS#11 token should
be created and each view should use their respective key.
This log may still occur if there is a DNSKEY in the unsigned zone.
This may happen in a multi-signer setup for example.
Ideally this should not log a warning, but that requires looking up
keys a different way (by searching for key files only). However, that
requires adapting a bunch of system tests, and is out of scope for now.
- Shell function body should be in between curly braces.
- Some erroneous '|| return 1' are replaced with '|| ret=1'.
- Fix a variable name (was 'ret', should be '_ret').
- Clean up when setting up a new test.
The bullseye and bookworm images are not set up with pkcs11-provider,
so we need to add an additional prerequisite for running the
pkcs11engine test. Check the path of OPENSSL_CONF.
Move dns_dnssec_findzonekeys from the dnssec.{c,h} source code to
zone.{c,h} (the header file already commented that this should be done
inside dns_zone_t).
Alter the function in such a way, that keys are searched for in the
key stores if a 'dnssec-policy' (kasp) is attached to the zone,
otherwise keep using the zone's key-directory.
When using dnssec-policy with dnssec-keygen in combination with setting
the key-directory on the command line, the commandline argument takes
priority over the key-directory from the default named.conf.
Add a test case where dnssec-policy uses key stores with a directory
other than the zone's key-directory.
This requires changing the kasp shell script to take into account that
keys can be in different directories. When looking for keys, the
'find' command now takes a maxdepth of 3 to also look for keys in
subdirectories. Note this maxdepth value is arbitrary, the added
'keystore.kasp' test only requires a maxdepth of 2.
Because of this change, the dnssec-keygen tests no longer work because
they are for the same zone (although different directories). Change
the test to use a different zone ('kasp2' instead of 'kasp').
Add cases for each algorithm to test the interaction between
dnssec-policy and engine_pkcs11. Ensure that named creates keys on
startup.
Also test dnssec-keygen when using a dnssec-policy with a PKCS#11
based key-store.
The check for printing zone list failed because of these additional
lines in the output:
good.conf:22: dnssec-policy: key algorithm 13 has predefined length; \
ignoring length value 256
I am not sure why this failure hasn't happened before already.
Similar to key-directory, check for zones in different views and
different key and signing policies. Zones must be using different key
directories to store key files on disk.
Now that a key directory can be linked with a dnssec-policy key, the
'keydirexist' checking needs to be reshuffled.
Add tests for bad configuration examples, named-checkconf should catch
those. Also add test cases for a mix of key-directory and key-store
directory.
Similar to key-directory, check if the key-store directory exists and
if it is an actual directory.
This commit fixes an accidental test bug in checkconf where if
the "warn key-dir" test failed, the result was ignored.
Add checkconf check to ensure that the used key-store in the keys
section exists. Error if that is not the case. We also don't allow
the special keyword 'key-directory' as that is internally used to
signal that the zone's key-directory should be used.
Add new configuration for setting key stores. The new 'key-store'
statement allows users to configure key store backends. These can be
of type 'file' (that works the same as 'key-directory') or of type
'pkcs11'. In the latter case, keys should be stored in a HSM that is
accessible through a PKCS#11 interface.
Keys configured within 'dnssec-policy' can now also use the 'key-store'
option to set a specific key store.
Update the checkconf test to accomodate for the new configuration.
These tests have ns1 configured as a mock root server. Make sure it is
used in all config files of those tests, otherwise some queries could
leak to root nameservers.
Some tests don't have a mock root server configured, because they don't
need one. However, these tests might still leak queries to actual name
servers. Add a shared root hints file which can serve as a blackhole for
these queries.
As dnsrps and native test cases have been properly split up, the
ckdnsrps.sh script is no longer used anywhere, as the logic for
selecting these test cases is handled by pytest.
Previously, dnsrps test was executed as an optional part of the rpz and
rpzrecurse system tests. This was conceptually problematic, as the test
took the responsibility of running parts of the test framework -
cleaning files and setting up servers again.
Instead, allow these tests to execute either the native variant, or the
dnsrps one. To ensure the same test coverage, trigger both of these
variants as separate test cases from pytest.
We need to skip some portions the system test in FIPS mode as some of
the algorithms used in the test are not available when using the FIPS
mode (e.g. TLS_CHACHA20_POLY1305_SHA256)
The older version of the code was reporting that listeners are going
to be of the same type after reconfiguration when switching from DoT
to HTTPS listener, making BIND abort its executions.
That was happening due to the flaw in logic due to which the code
could consider a current listener and a configuration for the new one
to be of the same type (DoT) even when the new listener entry is
explicitly marked as HTTP.
The checks for PROXY in between the configuration were masking that
behaviour, but when porting it to 9.18 (when there is no PROXY
support), the behaviour was exposed.
Now the code mirrors the logic in 'interface_setup()' closely (as it
was meant to).
This commit adds a system test that helps to verify that changing a
listener transport by editing "listen-on" statements before
reconfiguration works as expected.
This commit adds a new system test which verifies that using the
'cipher-suites' option actually works as expected (as well as adds
first TLSv1.3 specific tests).
The "exceeded time limit waiting for literal 'too many DNS UPDATEs
queued' in ns1/named.run" is prone to fail due to a timing issue.
Despite out efforts to stabilize it, the check still often fails on
FreeBSD in our CI. Allow the test to be re-run on this platform.
This file was missing explicit dnssec-validation. Seems like it was
missed in our previous efforts, probably because of the different
filename / extension. Rename it to end with *.in to reflect that it is a
template file used by copy_setports.
If the dnskey-ttl in the dnssec-policy doesn't match the DNSKEY's
ttl then the DNSKEY, CDNSKEY and CDS rrset should be updated by
named to reflect the expressed policy. Check that named does this
by creating a zone with a TTL that does not match the policy's TTL
and check that it is correctly updated.
In Net::DNS 1.42 $ns->main_loop no longer loops. Use current methods
for starting the server, wait for SIGTERM then cleanup child processes
using $ns->stop_server(), then remove the pid file.
Reconfiguring named using RNDC is a common action in BIND 9 system
tests. It involves sending the "reconfig" RNDC command to a named
instance and waiting until it is fully processed. Add a reconfigure()
method to the NamedInstance class in order to simplify and standardize
named reconfiguration using RNDC in Python-based system tests.
TODO:
- full reconfiguration support (w/templating *.in files)
- add an "rndc null" before every reconfiguration to show which file
is used (NamedInstance.add_mark_to_log() as it may be generically
useful?)
The "checkds" system test contains a lot of duplicated code despite
carrying out the same set of actions for every tested scenario
(zone_check() → wait for logs to appear → keystate_check()). Extract
the parts of the code shared between all tests into a new function,
test_checkds(), and use pytest's test parametrization capabilities to
pass distinct sets of test parameters to this new function, in an
attempt to cleanly separate the fixed parts of this system test from the
variable ones. Replace format() calls with f-strings.
The "checkds" system test only uses dns.resolver.Resolver objects to
access their 'nameservers' and 'port' attributes. Instances of the
NamedInstance class also expose that information via their attributes,
so only pass NamedInstance objects around instead of needlessly
depending on dns.resolver.Resolver.
Make log file watching in Python-based system tests consistent by
employing the helper Python classes designed for that purpose. Drop the
custom code currently used.
Waiting for a specific log line to appear in a named.run file is a
common action in BIND 9 system tests. Implement a set of Python classes
which intend to simplify and standardize this task in Python-based
system tests.
Co-authored-by: Štěpán Balážik <stepan@isc.org>
The "addzone" and "shutdown" system tests currently invoke rndc using
test-specific helper code. Rework the relevant bits of those tests so
that they use the helper classes from bin/tests/system/isctest.py.
Controlling named instances using RNDC is a common action in BIND 9
system tests. However, there is currently no standardized way of doing
that from Python-based system tests, which leads to code duplication.
Add a set of Python classes and pytest fixtures which intend to simplify
and standardize use of RNDC in Python-based system tests.
For now, RNDC commands are sent to servers by invoking the rndc binary.
However, a switch to a native Python module able to send RNDC commands
without executing external binaries is expected to happen soon. Even
when that happens, though, having the capability to invoke the rndc
binary (in order to test it) will remain useful. Define a common Python
interface that such "RNDC executors" should implement (RNDCExecutor), in
order to make switching between them convenient.
Co-authored-by: Štěpán Balážik <stepan@isc.org>
When changing the NSEC3 chain, the new NSEC3 chain must be built before
the old NSEC3PARAM is removed. Check each delta in the conversion to
ensure this ordering is met.
When transitioning from NSEC3 to NSEC the NSEC3 must be built before
the NSEC3PARAM is removed. Check each delta in the conversion to
ensure this ordering is met.
When transitioning from NSEC3 to NSEC the added records where not
being signed because the wrong time was being used to determine if
a key should be used or not. Check that these records are actually
signed.
The check_loaded() function compares the zone's loadtime value and
an expected loadtime value, which is based on the zone file's mtime
extracted from the filesystem.
For the secondary zones there may be cases, when the zone file isn't
ready yet before the zone transfer is complete and the zone file is
dumped to the disk, so a so zero value mtime is retrieved.
In such cases wait one second and retry until timeout. Also modify
the affected check to allow a possible difference of the same amount
of seconds as the chosen timeout value.
certain dig options which were deprecated and became nonoperational
several releases ago still had documentation in the dig man page and
warnings printed when they were used: these included +mapped,
+sigchase, +topdown, +unexpected, +trusted-key, and the -i and -n
options. these are now all fatal errors.
another option was described as deprecated in the man page, but
the code to print a warning was never added. it has been added now.
This commit extends the 'doth' system tests with additional secondary
NS instance that reuses the same 'tls' entry for connecting the the
primary to download zones. This configurations were known to crash
secondaries in some cases.
This commit adds a system test suite for PROXYv2. The idea on which it
is based is simple:
1. Firstly we check that 'allow-proxy' and 'allow-proxy-on' (whatever
is using the new 'isc_nmhandle_real_localaddr/peeraddr()') do what
they intended to do.
2. Anything else that needs an interface or peer address (ACL
functionality, for example) is using the old
'isc_nmhandle_localaddr/peeraddr()' - which are now returning
addresses received via PROXY (if any) instead of the real connection
addresses. The beauty of it that we DO NOT need to verify every bit of
the code relying on these functions: whatever works in one place will
work everywhere else, as these were the only functions that allowed
any higher level code to get peer and interface addresses.
This way it is relatively easy to see if PROXYv2 works as intended.
The system tests need to be updated because non-zero iterations are no
longer accepted.
The autosign system test changes its iterations from 1 to 0 in one
test case. This requires the hash to be updated.
The checkconf system test needs to change the iterations in the good
configuration files to 0, and in the bad ones to 1 (any non-zero value
would suffice, but we test the corner case here). Also, the expected
failure message is change, so needs to be adjusted.
The nsec3 system test also needs iteration configuration adjustments.
In addition, the test script no longer needs the ITERATIONS environment
variable.
In the process of updating the system tests, I noticed an error
in the dnssec-policy "nsec3-other", where the salt length in one
configuration file is different than in the other (they need to be
the same). Furthermore, the 'rndc signing -nsec3param' test case
is operated on the zone 'nsec-change.kasp', so is moved so that the
tests on the same zone are grouped together.
Create a utility package for code shared by the python tests. The
utility functions should use reasonable defaults and be split up into
modules according to their functionality.
Ensure assert rewriting is enabled for the modules to get the most
useful output from pytest.
By default, the useful assertion message rewrite is used by pytest for
test modules only. Since another module is imported with shared
functionality, ensure it has pytest's assertion message rewriting
enabled to obtain more debug information in case it fails.
This file is executed outside of pytest with pure python, which doesn't
do any AssertionError message rewriting like pytest. Ensure the assert
messages in this file provide a useful debug message.
Rewrite and reorganize the test documentation to focus on the pytest
runner, omit any mentions of the legacy runner which are no longer
relevant, and mention a few pytest tricks.
The ns2/named-alt3.conf.in config file was removed in
f8e264ba6d. From then on, system test
reports:
sed: can't read ns2/named-alt3.conf.in: No such file or directory"
Drop the last remnant of ns2/named-alt3.conf.in.
when transferring in a non-inline-signing secondary for the first time,
we previously never set the value of zone->loadtime, so it remained
zero. this caused a test failure in the statschannel system test,
and that test case was temporarily disabled. the value is now set
correctly and the test case has been reinstated.
The AES algorithm for DNS cookies was being kept for legacy reasons, and
it can be safely removed in the next major release. Remove both the AES
usage for DNS cookies and the AES implementation itself.
Not every element tagged `skipped` in the JUnitXML tree has to contain
the `type` attribute. An example of that is a test that results in
xpass.
This has been verified with pytest version 7.4.2 and prior.
Add a test case where serve-stale is enabled on a server that also
servers a local authoritative zone.
The particular case tests a lame delegation and checks if falling
back to serving stale data does not attempt to retrieve the query
by recursing from the root down.
The lock-file configuration (both from configuration file and -X
argument to named) has better alternatives nowadays. Modern process
supervisor should be used to ensure that a single named process is
running on a given configuration.
Alternatively, it's possible to wrap the named with flock(1).
All changes in this commit were automated using the command:
shfmt -w -i 2 -ci -bn . $(find . -name "*.sh.in")
By default, only *.sh and files without extension are checked, so
*.sh.in files have to be added additionally. (See mvdan/sh#944)
When named fails to starts due to not being able to obtain
a lock on the lock file that lock file should remain. Check
that the lock file exists before and after the attempt to
start a second instance of named.
We test EDNS requests returning FORMERR where named is expected
to retry without EDNS.
We test EDNS requests returning NOTIMP where named is expected
to fail the transfer as the remote end is not protocol compliant.
The server is expected to retry the transfer using SOA and if
the returned serial is greater than the current serial AXFR.
Check the log that IXFR is request.
Now that inline-signing is ignored when there is no dnssec-policy,
add 'dnssec-policy default;' to the zones when attempting to add them
via 'rndc addzone'.
The 'dynamic-signed-inline-signing.kasp' zone was set up with
the environment variable 'ksktimes', but that should be 'csktimes'
which is set one line above. Since the values are currently the same
the behavior is identical, but of course it should use the correct
variable.
The 'step4.enable-dnssec.autosign' zone was set up twice. This is
unnecessary.
Add a test scenario for a dynamic zone that uses inline-signing which
accidentally has signed the raw version of the zone.
This should not trigger resign scheduling on the raw version of the
zone.
The usage of the newline in the replacement part of the 'sed' call
works in GNU systems, but not in OpenBSD. Use 'awk' instead.
Also use the extended syntax of regular expressions for 'grep', which
is similarly more portable across the supported systems.
'rndc thaw' initiates asynchrous loading of all the zones
similar to 'rndc load'. Wait for the test zone's load to
complete before testing that it is updatable again.
Instead of creating new memory pools for each new dns_message, change
dns_message_create() method to optionally accept externally created
dns_fixedname_t and dns_rdataset_t memory pools. This allows us to
preallocate the memory pools in ns_client and dns_resolver units for the
lifetime of dns_resolver_t and ns_clientmgr_t.
The XFRST_INITIALSOA state in the xfrin module is named like that,
because the first RR in a zone transfer must be SOA. However, the
name of the state is a bit confusing (especially when exposed to
the users with statistics channel), because it can be mistaken with
the refresh SOA request step, which takes place before the zone
transfer starts.
Rename the state to XFRST_ZONEXFRREQUEST (i.e. Zone Transfer Request).
During that step the state machine performs several operations -
establishing a connection, sending a request, and receiving/parsing
the first RR in the answer.
Add two more secondary zones to ns3 to be transferred from ns1,
using its IPv6 address for which the 'tcp-only' is set to 'yes'.
Check the statistics channel's incoming zone transfers information
to confirm that the expected transports were used for each of the
SOA query cases (UDP, TCP, TLS), and also for zone transfers (TCP,
TLS).
This allows the statistics channel to be viewed in a browser while
the transfer is in progress. Also set the transfer format to
one-answer to extend the amount of time the re-transfer takes.
When running the statschannel test on its own, use
<http://10.53.0.3:5304/xml/v3/xfrins> to see the output.
Note: the port is subject to future change.
Use the named -T transferslowly test options to slow down a zone
transfer from the primary server, and test that it's correctly
exposed in the statistics channel of the secondary server, while
it's in-progress.
Apply the semantic patch to catch all the places where we pass 'char' to
the <ctype.h> family of functions (isalpha() and friends, toupper(),
tolower()).
The Unix Domain Sockets support in BIND 9 has been completely disabled
since BIND 9.18 and it has been a fatal error since then. Cleanup the
code and the documentation that suggest that Unix Domain Sockets are
supported.
The previous symlink name convention was prone to name collisions If a
system test contained both a shell test and a pytest module of the same
name (e.g. dnstap test has both tests.sh and tests_dnstap.py), then
these would have the same convenience symlink, which could cause test
setup issues as well as confusion when examining test artifacts.
Update the naming convention to include the full pytest module name.
This results in a slightly more verbose names for shell tests (e.g.
dnstap_sh_dnstap instead of the previous dnstap_dnstap), but it removes
the chance of a collision.
Reorganize individual port fixtures and re-use the ports fixture to
obtain their number. Store it as integer and only cast it to string when
setting it as environment variable.
Remove code fork for legacy runner, reorganize imports and move a
pylint-silencing snippet to the top of the file. The rest of the code
was just unindented.
In order to python system tests, pytest (runner) has to be used
directly. This makes it possible to simplify the pytest runner and make
its behavior simpler and easier to extend.
The legacy runner can still be used to run shell system tests.
The legacy runner no longer uses make check. Ensure the legacy runner
script doesn't interact with that automake target in any way. The legacy
runner script remains available to execute the legacy runner, but there
is no out-of-the box support for running tests in parallel. Other tools
such as xargs can be utilized for that.
Pytest provides JUnit output and uses different exit codes from
Automake. Use the conversion script to interpret the JUnit test results
from python rather than relying on the status code.
It's important to parse the JUnit result file rather than relying on the
exit code from pytest, which has a different meaning. Include a .trs test
result for each test case and set an exit code which is most appropriate
as the aggregate result (e.g. it will be set to 77 (SKIP) if there's at
least one test case that was skipped).
Dependencies for these tests are already checked in prereq.sh - if the
dependencies are missing, these tests will be skipped. The extra
dependency check in Makefile.am is extraneous and only applied for the
legacy test runner.
To allow concurrent invocations of pytest, it is necessary to assign
ports properly to avoid conflicts. In order to do that, pytest needs to
know a complete list of all test modules.
When pytest is invoked from run.sh, the current working directory is the
system test directory. To properly detect other tests, the conftest.py
has to look in the bin/tests/system directory, rather than the current
working directory.
To conform with the expected naming convention, the pytest glue file for
the `allow-query` test should use underscore as the word separator in
the python file name: allow-query/tests_sh_allow_query.py
The _common directory is a special case directory which contains shared
files for other system test directories. Make sure it's tracked in git
and not deleted during temporary directory cleanup.
The old name "common" clashes with the convention of system test
directory naming. It appears as a system test directory, but it only
contains helper files.
To reduce confusion and to allow automatic detection of issues with
possibly missing test files, rename the helper directory to "_common".
The leading underscore indicates the directory is different and the its
name can no longer be confused with regular system test directories.
Reusing TCP connections with dns_dispatch_gettcp() used linear linked
list to lookup existing outgoing TCP connections that could be reused.
Replace the linked list with per-loop cds_lfht hashtable to speedup the
lookups. We use cds_lfht because it allows non-unique node insertion
that we need to check for dispatches in different connection states.
Test resolver-use-dns64 by simulating a connection to an IPv4-only
server through a NAT64.
This test uses EXTRAPORT1 rather than PORT for DNS traffic exchanged
between ns3 and ns4. Both servers also listen on PORT on their IPv4
addresses to support server startup testing in start.pl.
The dnssec-must-be-secure feature was added in the early days of BIND 9
and DNSSEC and it makes sense only as a debugging feature.
Remove the feature to simplify the code.
Use the new isc_mem_c*() calloc-like API for allocations that are
zeroed.
In turn, this also fixes couple of incorrect usage of the ISC_MEM_ZERO
for structures that need to be zeroed explicitly.
There are few places where isc_mem_cput() is used on structures with a
flexible member (or similar).
This allow for the EDNS options EXPIRE and NSID to be sent when
when making requests. The existing controls controlling whether
EDNS is used and whether EXPIRE or NSID are sent are honoured.
Adjust the expected byte counts in the xfer system test to reflect
the EDNS overhead. Adjust the dig call to match named's behavior
(don't set +expire as we are talking to a secondary).
*** CID 464884: Null pointer dereferences (REVERSE_INULL)
/bin/tests/system/dyndb/driver/db.c: 644 in create_db()
638
639 *dbp = (dns_db_t *)sampledb;
640
641 return (ISC_R_SUCCESS);
642
643 cleanup:
CID 464884: Null pointer dereferences (REVERSE_INULL)
Null-checking "sampledb" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
644 if (sampledb != NULL) {
645 if (dns_name_dynamic(&sampledb->common.origin)) {
646 dns_name_free(&sampledb->common.origin, mctx);
647 }
648
649 isc_mem_putanddetach(&sampledb->common.mctx, sampledb,
- Simplify configuration management by deducing SoftHSM module path
from openssl config
- Determine the engine flag (-E) value from openssl config
- Drop unused/unneeded environment variables
- Run pkcs11-provider tests on Debian "sid" ossl3 flavor
The rrl system test has been unstable and producing false positive
results for years (GL #172). Allow the test to be re-run (once) to
reduce the noise it causes.
The reclimit system test has been unstable and producing false positive
results for years (GL #1587). Allow the test to be re-run (once) to
reduce the noise it causes.
The qmin test is inherently unstable. It fails quite often with failure
modes described in GL #904. Allow the pytest runner to re-run the test
up to 3 times to only detect a more persistent and reproducible failures
rather than random noise caused by the nature of the test.
It is better to disable the specific check that causes the test to fail
rather than mark the entire test as xfail, which can mask other issues
which the test is capable of detecting.
Using check_PROGRAMS would postpone compiling the binaries needed by
system tests until `make check` would be called. Since it's preferable
to invoke pytest directly to run the system test suite, compile these
binaries without installing them during `make all` instead by using
noinst_PROGRAMS.
This removes the need to use TESTS= make -e check hack invoked from
pytest to work around this issue.
The command finds all directories in bin/tests/system which contain an
underscore. Underscore indicates either a temporary directory (_tmp_), a
symlink to test artifacts (TESTNAME_MODULENAME), or a python-related
cache. Using underscore for a system test name is invalid and a hyphen
must be used instead.
While it'd be fairly easy to split the function up into smaller ones,
the readability wouldn't be improved in this case. Silence the
suggestions instead.
While temporary directories are useful for test execution to keep
everything clean, they are difficult to work with manually. Create a
symlink for each test artifact directory with a stable and predictable
path. The symlink always either points to the latest artifacts, or is
missing in case the last run succeeded.
Ensure these symlinked directories aren't detected as test suites by the
pytest runner.
In some cases, BIND is not fast enough to fill the send buffer and
manages to answer all queries, contrary to what the test expects.
Repeat the check up to 3 times to limit this test instability.
If the flaky plugin for pytest is available, use its decorator to
support re-running unstable tests. In case the package is missing,
execute the test as usual without attempts to re-run it in case of
failure.
This is mostly intended to increase the test stability in CI. Using a
custom decorator enables us to keep the flaky package as an optional
dependency.
The following files were reported in CI by the legacy system test runner
and prevented job to pass. They should be removed.
$ if git rev-parse > /dev/null 2>&1; then ( ! grep "^I:.*:file.*not removed$" *.log ); fi
autosign.log:I:autosign:file autosign/ns3/kskonly.example.db.jbk not removed
autosign.log:I:autosign:file autosign/ns3/optout.example.db.jbk not removed
autosign.log:I:autosign:file autosign/ns3/reconf.example.db.jbk not removed
masterformat.log:I:masterformat:file masterformat/ns1/signed.db.raw.jbk not removed
masterformat.log:I:masterformat:file masterformat/ns1/signed.db.raw.signed not removed
masterformat.log:I:masterformat:file masterformat/ns1/signed.db.raw.signed.jnl not removed
Don't print an error when the ns*/inactive directory is not
present:
rmdir: ns*/inactive: No such file or directory
Remove nsupdate.out.test file instead of nsupdate.out, as the latter
does not exist.
The dns_dispatchmgr object was only set in the dns_view object making it
prone to use-after-free in the dns_xfrin unit when shutting down named.
Remove dns_view_setdispatchmgr() and optionally pass the dispatchmgr
directly to dns_view_create() when it is attached and not just assigned,
so the dns_dispatchmgr doesn't cease to exist too early.
The dns_view_getdnsdispatchmgr() is now protected by the RCU lock, the
dispatchmgr reference is incremented, so the caller needs to detach from
it, and the function can return NULL in case the dns_view has been
already shut down.
The mkeys system test could fail because root zone was resigned
within the same second as it was previously signed causing reloads
to fail. Add delays to the test to prevent this.
'addr', 'ckresult' and 'drop' should return 0 rather than 1 after
calling 'setret' as the error has been logged and these functions
are not expect to fail.
The setup.pl script has been replaced with static BIND configurations,
and in the course of this change, the unused ns1 server was removed.
This enhancement has greatly improved the overall test's readability.
The shell version of the test was completed only after all DNS zone
updates were sent, even if the BIND server crashed while processing
them, leading to prolonged execution and potential hang in the CI
environment. The Python rewrite of the test ensures that DNS update
tasks finish within five minutes of starting, irrespective of a BIND
crash possibility or DNS zone updates not finishing in time.
Lower the size requirement for the dnstap output file produced during
the "dnstap" system test from 454 to 450 bytes; while files of that size
are not generated in any GitLab CI job, they are in other environments
where the test passes.
The fstrm_capture utility is started in the background during the
"dnstap" system test. Consequently, "rndc dnstap-reopen" and similar
commands may be executed before fstrm_capture starts listening on the
Unix domain socket it is configured to receive dnstap data on. This
results in the dnstap data sent to that socket in the meantime to be
lost; while the fstrm writer thread is able to recover from such a
scenario within a couple of seconds (by reopening the configured dnstap
destination itself), only one write attempt is made for data
successfully queued to the writer thread, so dnstap frames can still be
lost in the process. This may happen during the "dnstap" system test,
leading to the dnstap output file being empty, which in turn causes the
test to fail.
Fix by waiting until fstrm_capture starts listening on the Unix domain
socket it is configured to use before asking named to reopen the
configured dnstap destination. Since various fstrm_capture versions log
different messages when the listening socket is set up, wait for a
common string that works for all fstrm_capture versions released to
date. Add a few extra debug messages indicating test progress and make
the test fail if the expected fstrm_capture log message is not generated
within 10 seconds.
The fstrm_capture.out file is overwritten when the fstrm_capture utility
is restarted during the "dnstap" system test. Use a separate output
file for each fstrm_capture instance to ensure all output produced by
that tool during the "dnstap" system test is preserved for forensic
purposes.
Errors getting transfer statistics from named.run where not detected
as ret was not set to one if there hadn't been a success after looping
for a while.
these options concentrate zone maintenance actions into
bursts for the benefit of servers with intermittent connections.
that's no longer something we really need to optimize.
'HOME=value command' should only change HOME for command but on
some platforms this occasionally sets HOME for the rest of the
test. Explicitly isolate the enviroment change using a sub shell.
When using automated DNSSEC management, it is required that the zone
is dynamic, or that inline-signing is enabled (or both). Update the
checkconf code to also allow inline-signing to be enabled within
dnssec-policy.
Add an option to enable/disable inline-signing inside the
dnssec-policy clause. The existing inline-signing option that is
set in the zone clause takes priority, but if it is omitted, then the
value that is set in dnssec-policy is taken.
The built-in policies use inline-signing.
This means that if you want to use the default policy without
inline-signing you either have to set it explicitly in the zone
clause:
zone "example" {
...
dnssec-policy default;
inline-signing no;
};
Or create a new policy, only overriding the inline-signing option:
dnssec-policy "default-dynamic" {
inline-signing no;
};
zone "example" {
...
dnssec-policy default-dynamic;
};
This also means that if you are going insecure with a dynamic zone,
the built-in "insecure" policy needs to be accompanied with
"inline-signing no;".
When dns_request was canceled via dns_requestmgr_shutdown() the cancel
event would be propagated on different loop (loop 0) than the loop where
request was created on. In turn this would propagate down to isc_netmgr
where we require all the events to be called from the matching isc_loop.
Pin the dns_requests to the loops and ensure that all the events are
called on the associated loop. This in turn allows us to remove the
hashed locks on the requests and change the single .requests list to be
a per-loop list for the request accounting.
Additionally, do some extra cleanup because some race condititions are
now not possible as all events on the dns_request are serialized.
The conditions that trigger the crash:
- a stale record is in cache
- stale-answer-client-timeout is 0
- multiple clients query for the stale record, enough of them to exceed
the recursive-clients quota
- the response from the authoritative is sufficiently delayed so that
recursive-clients quota is exceeded first
The reproducer attempts to simulate this situation. However, it hasn't
proven to be 100 % reproducible, especially in CI. When reproducing
locally, the priming query also seems to sometimes interfere and prevent
the crash. When the reproducer is ran twice, it appears to be more
reliable in reproducing the issue.
The keys directory should be cleaned up in clean.sh. Doing that in the
test itself isn't reliable which may lead to failing mkdir which causes
the test to fail with set -e.
After commit f4eb3ba4, that is part of removing 'auto-dnssec', the
inline system test started to fail in FIPS CI jobs. This is because
the 'nsec3-loop' zone started to use a RSASHA256 key size of 1024 and
this is not FIPS compliant.
This commit changes the key size from 1024 to 4096, in order to
become FIPS compliant again.
The catz module has a fail-safe code to recreate a member zone
that was expected to exist but was not found.
Improve a test case where the fail-safe code is expected to execute
to check that the log message exists.
Add a test case where the fail-safe code is not expected to execute
to check that the log message does not exist.
1. Change the _new, _add and _copy functions to return the new object
instead of returning 'void' (or always ISC_R_SUCCESS)
2. Cleanup the isc_ht_find() + isc_ht_add() usage - the code is always
locked with catzs->lock (mutex), so when isc_ht_find() returns
ISC_R_NOTFOUND, the isc_ht_add() must always succeed.
3. Instead of returning direct iterator for the catalog zone entries,
add dns_catz_zone_for_each_entry2() function that calls callback
for each catalog zone entry and passes two extra arguments to the
callback. This will allow changing the internal storage for the
catalog zone entries.
4. Cleanup the naming - dns_catz_<fn>_<obj> -> dns_catz_<obj>_<fn>, as an
example dns_catz_new_zone() gets renamed to dns_catz_zone_new().
When checking for the number of logs related to DNSKEY key maintenance
events, don't include CDNSKEY is published lines.
Also consider RSASHA1: If not supported, the key maintenance for
the nsec-only zone are not logged.
These two configuration options worked in conjunction with 'auto-dnssec'
to determine KSK usage, and thus are now obsoleted.
However, in the code we keep KSK processing so that when a zone is
reconfigured from using 'dnssec-policy' immediately to 'none' (without
going through 'insecure'), the zone is not immediately made bogus.
Add one more test case for going straight to none, now with a dynamic
zone (no inline-signing).
Change test configuration to make use of 'dnssec-policy' instead of
'auto-dnssec'.
Because we now use 'dnssec-policy', there is no need to create an
explicit key in the final test that adds multiple inline zones
followed by a reconfig.
Change test configuration to make use of 'dnssec-policy' instead of
'auto-dnssec'.
Because we now add a DNSKEY with dynamic update, the sign statistics
change. When adding signatures triggered by dynamic update, the
dnssec-refresh stats are not incremented (this is only incremented
when signing is triggered by resign in lib/dns/zone.c).
The mkeys system test configured 'auto-dnssec' on the root zone to do
smart signing and simulate root key changes that should be picked up
by the automated trust anchor management of BIND.
This does not require 'auto-dnssec' or 'dnssec-policy', so change the
tests to use manual smart signing with 'dnssec-signzone'.
This test uses key timing metadata to do rollovers, this is no longer
applicable with 'dnssec-policy'. Note that with 'dnssec-policy' key
timing metadata is still written, but it is not used for determining
what and when to do key rollovers.
The inline system test tests 'auto-dnssec' in conjunction with
'inline-signing'. Change the tests to make use of 'dnssec-policy'.
Remove some tests that no longer make sense:
- The 'retransfer3.' zone tests changing the parameters with
'rndc signing -nsec3param'. This command is going away and NSEC3
parameters now need to be configured with nsec3param within
'dnssec-policy'.
- The 'inactivezsk.' and 'inactiveksk.' zones test whether the ZSK take
over signing if the KSK is inactive, or vice versa. This fallback
mode longer makes sense when using a DNSSEC policy.
Some tests need to be adapted more than just changing 'auto-dnssec'
to 'dnssec-policy':
- The 'delayedkeys.' zone first needs to be configured as insecure,
then we can change it to start signing. Previously, no existing
keys means that you cannot sign the zone, with 'dnssec-policy'
new keys will be created.
- The 'updated.' zone needs to have key states in a specific state
so that the minimal journal check still works (otherwise CDS/
CDNSKEY and related records will be in the journal too).
- External keys are now added to the unsigned zone and no longer
are maintained with key files. Adjust the 'externalkey.' zone
accordingly.
- The 'nsec3-loop.' zone requires three signing keys. Since
'dnssec-policy' will ignore duplicates in the 'keys' section,
create RSASHA256 keys with different role and/or key length.
Finally, the 'externalkey.' zone checks for an expected number of
DNSKEY and RRSIG records in the response. This used to be 3 DNSKEY
and 2 RRSIG records. Due to logic behavior changes (key timing
metadata is no longer authoritative, these expected values are
changed to 4 DNSKEY records (two signing keys and two external keys
per algorithm) and 1 RRSIG record (one active KSK per signing
algorithm).
The dnssec system test has some tests that use auto-dnssec. Update
these tests to make use of dnssec-policy.
Remove any 'rndc signing -nsec3param' commands because with
dnssec-policy you set the NSEC3 parameters in the configuration.
Remove now duplicate tests that checked if CDS and CDNSKEY RRsets
are signed with KSK only (the dnssec-dnskey-kskonly option worked
in combination with auto-dnssec).
Also remove the publish-inactive.example test case because such
use cases are no longer supported (only with manual signing).
The auto-nsec and auto-nsec3 zones need to use an alternative
algorithm because duplicate lines in dnssec-policy/keys are ignored.
The autosign system test mainly tests the auto-dnssec configuration
option. Since this option is going to be removed, update the system
test so that it uses dnssec-policy.
We could remove the complete system test, but keeping an altered
version of the system test may still be useful to detect unexpected
behavior after code changes.
Change the ns1 (test root server) to use manual signing. This zone
has some weird corner cases that do not fit the dnssec-policy model
very well.
The ns2 bar zone also needs to use manual signing, because it revokes
its key, and RFC 5011 key revocation is not supported with
dnssec-policy.
There are also a couple of weird corner test cases that can be removed:
- Inactive KSK or ZSK. With dnssec-policy there is no such thing as
ZSK taking over the role of a KSK when the KSK is deleted, or vice
versa.
- The CDS and CDNSKEY DELETE records are now automated with
dnssec-policy and so the tests for persistence are no longer required.
In tests.sh, bump the expected number of root DNSKEY records to 11,
because with manual signing the activation before publication is
actually honored.
Also remove any 'rndc signing -nsec3param' commands because with
dnssec-policy you set the NSEC3 parameters in the configuration.
Remove any check interval tests, these "next key event" times are
now calculated and tested in the kasp system test.
The "uname -o" command is harmful on OpenBSD because this platform does
not know about the "-o" option. It is a permanent failure since system
tests are started with "set -e".
We removed DNSSEC management via dynamic update (see issue #3686),
this means we also should no longer add signing records (of private
type) for DNSKEY records added via dynamic update.
to reduce the amount of common code that will need to be shared
between the separated cache and zone database implementations,
clean up unused portions of dns_db.
the methods dns_db_dump(), dns_db_isdnssec(), dns_db_printnode(),
dns_db_resigned(), dns_db_expirenode() and dns_db_overmem() were
either never called or were only implemented as nonoperational stub
functions: they have now been removed.
dns_db_nodefullname() was only used in one place, which turned out
to be unnecessary, so it has also been removed.
dns_db_ispersistent() and dns_db_transfernode() are used, but only
the default implementation in db.c was ever actually called. since
they were never overridden by database methods, there's no need to
retain methods for them.
in rbtdb.c, beginload() and endload() methods are no longer defined for
the cache database, because that was never used (except in a few unit
tests which can easily be modified to use the zone implementation
instead). issecure() is also no longer defined for the cache database,
as the cache is always insecure and the default implementation of
dns_db_issecure() returns false.
for similar reasons, hashsize() is no longer defined for zone databases.
implementation functions that are shared between zone and cache are now
prepended with 'dns__rbtdb_' so they can become nonstatic.
serve_stale_ttl is now a common member of dns_db.
To improve the compatibility of the inline test with the `set -e`
option, ensure all commands which are expected to pass are explicitly
checked for return code and non-zero return codes are handled.
The changes were mostly done with sed:
find . -name '*.sh' | xargs sed -i 's/`\([^`]*\)`/$(\1)/g'
There have been a few manual changes where the regex wasn't sufficient
(e.g. backslashes inside the `...`) or wrong (`...` referring to docs or
in comments).
Change the way arithmetic operations are performed in system test shell
scripts from using `expr` to `$(())`. This ensures that updating the
variable won't end up with a non-zero exit code, which would case the
script to exit prematurely when `set -e` is in effect.
The following replacements were performed using sed in all text files
(git grep -Il '' | xargs sed -i):
s/status=`expr $status + $ret`/status=$((status + ret))/g
s/n=`expr $n + 1`/n=$((n + 1))/g
s/t=`expr $t + 1`/t=$((t + 1))/g
s/status=`expr $status + 1`/status=$((status + 1))/g
s/try=`expr $try + 1`/try=$((try + 1))/g
Ensure all shell system tests are executed with the errexit option set.
This prevents unchecked return codes from commands in the test from
interfering with the tests, since any failures need to be handled
explicitly.
Add this test scenario for a bug fixed a while ago. When a third key is
introduced while the previous rollover hasn't finished yet, the keymgr
could decide to remove the first two keys, because it was not checking
for an indirect dependency on the keys.
In other words, the previous bug behavior was that the first two keys
were removed from the zone too soon.
This test case checks that all three keys stay in the zone, and no keys
are removed premature after another new key has been introduced.
In the kasp script, if one expected key is not found, continue checking
the other key ids, even if there is no match for the first one. This
provides a bit more information which keys mismatch and makes for
easier debugging test failures.
Pass 5 second timeout to the rndc status command(s) to avoid hitting the
hard 10 second timeout from subprocess.call, which would result in an
unwanted exception that would only mask the real issue: if the rndc
status times out in this test, it is likely due to the server not
stopping as it should.
The shutdown test attempts to shut down the server using two different
methods - rndc and sigterm. Use pytest.mark.parametrize to run these as
separate test cases for easier identification of failures.
Surround the variables which are checked whether they're executable in
double quotes. Without them, empty paths won't be properly interpreted
as not executable.
Since delv can occasionally hang in system tests when running with TSAN
(see GL#4119), disable these tests as a workaround. Otherwise, the hung
delv process will just waste CI resources and prevent any meaningful
output from the rest of the test suite.
tsig-keygen is now used to generate key files for TSIG. These have
a different format to those that were generated by dnssec-keygen.
Test that dig can still read these files.
tsig-keygen generates key files that are different to those that
where generated by dnssec-keygen. Check that nsupdate can still
read those old format files.
The ability to read legacy HMAC-MD5 K* keyfile pairs using algorithm
number 157 was accidentally lost when the algorithm numbers were
consolidated into a single block, in commit
09f7e0607a.
The assumption was that these algorithm numbers were only known
internally, but they were also used in key files. But since HMAC-MD5
got renumbered from 157 to 160, legacy HMAC-MD5 key files no longer
work.
Move HMAC-MD5 back to 157 and GSSAPI back to 160. Add exception for
GSSAPI to list_hmac_algorithms.
the default value of dnssec-validation is 'auto', which causes
a server to send a key refresh query to the root zone when starting
up. this is undesirable behavior in system tests, so this commit
sets dnssec-validation to either 'yes' or 'no' in all tests where
it had not previously been set.
this change had the mostly-harmless side effect of changing the cached
trust level of unvalidated answer data from 'answer' to 'authanswer',
which caused a few test cases in which dumped cache data was examined in
the serve-stale system test to fail. those test cases have now been
updated to expect 'authanswer'.
Previously, the first check silently failed, as 454 is apparently (in my
local setup) the minimum output size for the dnstap output, rather than
470 which the test was expecting. Effectively, the check served as a 5
second sleep rather than waiting for the proper file size.
Additionally, check the expected file sizes and fail if expectations
aren't met.
The log message is supposed to contain the zone name which was
erroneously omitted, but didn't pop up during tests, since return code
was silently ignored.
Now it actually waits for the proper log message rather than being an
equivalent of 3 second sleep (which was also sufficient to make the test
pass, thus we detected no failure).
In HTTP/1.0 and HTTP/1.1, RFC 9112 section 9.6 says the last response
in a connection should include a `Connection: close` header, but the
statschannel server omitted it.
In an HTTP/1.0 response, the statschannel server can sometimes send a
`Connection: keep-alive` header when it is about to close the
connection. There are two ways:
If the first request on a connection is keep-alive and the second
request is not, then _both_ responses have `Connection: keep-alive`
but the connection is (correctly) closed after the second response.
If a single request contains
Connection: close
Connection: keep-alive
then RFC 9112 section 9.3 says the keep-alive header is ignored, but
the statschannel sends a spurious keep-alive in its response, though
it correctly closes the connection.
To fix these bugs, make it more clear that the `httpd->flags` are part
of the per-request-response state. The Connection: flags are now
described in terms of the effect they have instead of what causes them
to be set.
The "dns_dnssec_findzonekeys2" log message is a leftover from when that
was the name of the function. Rename to match the current name of the
function.
When we add DNSKEY records via dynamic update, this should no longer
trigger signing the zone with these keys. This currently happens when
'find_zone_keys()' looks up the keys by inspecting the DNSKEY RRset,
then attempting to read the corresponding key files.
Add checks that inspect the logs whether an attempt to read the key
files for the newly added keys was done (and failed because these files
are not available).
The purpose of the check is to verify the server has survived the
previous barrage of queries. This is done by sending a query and
checking we get a NOERROR response back.
Previously, that query could've been affected by a servfail cache - the
server would return a SERVFAIL answer, thus failing the check, despite
being up and running. Use version.bind txt ch query to avoid the
interference of servfail cache.
Prepare the fetchlimit system test for adding a clients-per-query
check. Change some functions and commands to accept a destination
NS IP address instead of using the hardcoded 10.53.0.3.
The get_core_dumps.sh script couldn't find and process core files of
out-of-tree configurations because it looked for them in the source
instead of the build directory.
Add a test case where when priming the cache with a slow authoritative
resolver, the stale-answer-client-timeout option should not return
a delegation to the client (it should wait until an applicable answer
is found, if no entry is found in the cache).
The check for 'would limit' log message is triggered by sending at least
three messages within one second. However, in extremely slow conditions
(currently when running with clang+TSAN in CI), the individual queries
might take too much time to send enough of them within one second.
Since this is a pretty rare condition, let's just silently skip this
test in environments where a single query takes more than 500 ms, since
there's no way to perform the check under such conditions.
Closes#4082
The 'update-nsec3.example' requires to be DNSSEC maintained via
dynamic update. Commit 03b22983cd20cec51ad8b9f25f2e7d0e472dc79c adds
checks to make sure the raw zone is not signed. So the test case neesd
to be updated to allow for DNSSEC maintenance.
A zone in multisigner model 2 should also be possible to remove
previously added DNSKEY, CDS and CDNSKEY records from the zone operated
by the other provider.
Add a test case where updates are being made against a hidden primary
and two bump in the wire signers (the providers in the multisigner
model) serve the zone.
The test covers the same cases as for two primary providers that is:
- Add DNSKEY
- Remove (previously added) DNSKEY
- Add CDNSKEY
- Remove (previously added) CDNSKEY
- Add CDS
- Remove (previously added) CDS
A zone in multisigner model 2 should also be possible to publish the
CDS and CDNSKEY records from their KSK into the zone operated by the
other provider.
Add a new system test to test multisigner model use cases. This
initial test just tests a small part of the model 2, and uses two
providers for the same zone, ns3 and ns4, each with their own unique
key set. This commit tests that each provider can import their ZSK
of the other provider into their DNSKEY RRset, using dynamic update.
Both providers use dnssec-policy, ns3 applies the DNSSEC records
directly, while ns4 uses inline-signing.
The check which attempts to forward dynamic update to a dead primary may
trigger a timing issue #4080. For some reason, this has manifested under
the pytest runner, while the test still passes with the legacy runner.
Move the dead primary check closer to the end of the test to avoid
hitting this issue before we have a proper fix.
The module-level logger has a handler that writes into a temporary
directory. Ensure the logging output is flushed and the handler is
closed before attempting to remove this temporary directory.
Previously, run.sh tried to use pytest's -k option for test selection.
The downside was that this filter expression matched any test case with
the given substring, rather than executing a system test suite with the
given name.
The run.sh has been rewritten to invoke pytest from a system test
directory instead. This behaves more consistently with the run.sh from
legacy system test framework.
run.sh is now also a shell script to avoid confusion regarding its
file extension.
EL8+ systems declare "which" function using environment variables in the
/etc/profile.d/which2.sh file. Because of our suboptimal environment
variable detection, which is required in order to support the legacy
runner, these variables are picked up by the pytest runner.
If subprocesses are spawned with these environment variables set, it
will cause the following issue when they spawn yet another subprocess:
/bin/sh: which: line 1: syntax error: unexpected end of file
/bin/sh: error importing function definition for `which'
Instantiate a new logger that is used during pytest initialization /
configuration. This logging isn't handled by pytest itself, since it
happens outside of any tests or fixtures.
Root logger can't be reused for this purpose, because that would
duplicate the logs. Instead, create a conftest-specific logger for this
purpose.
Unfortunately, this introduces another log file,
pytest.conftest.log.txt, which contains only the logging from pytest
initialization. However, unless one is debugging the runner /
environment, there should be no need to investigate this file.
In order to take the most advantage of parallel execution of tests,
ensure certain long running tests are scheduled first.
The list of tests considered long-running was created empirically. In
addition to the test run time, its position in the default
(alphabetical) ordering was also taken into account.
The logger fixture is provided as a test-level logging facility which
can be easily passed to tests to enable capturing and/or displaying
messages from tests written in Python.
While this works optimally with the pytest runner, messages on INFO
level or above will also be visible when using the legacy runner.
The test_zone_timers_secondary_json() and
test_zone_timers_secondary_xml() tests are affected by issue #3983. Due
to the way tests are run, they are only affected when executing them
with the pytest runner.
Strict mode is set for pytest runner, as it always fails there. The
strict mode ensures we'll catch the change when the it starts passing
once the underlying issue is fixed. It can't be set for the legacy
runner, since the test (incorrectly) passes there.
Related #3983
If a test fails with an assertion failure or exception, its content
along with traceback is displayed in pytest output. This information
should be preserved in the test-specific logger for a given system test
to make it easier to debug test failures.
In order to avoid issues with decoding/encoding env variables due to
different encodings on different systems, deal with the environment
variables directly as bytes.
The loadscope setting is required for parallel execution of our system
tests using pytest. The option ensure that all tests within a single
(module) scope will be assigned to the same worker.
This is neccessary because the worker sets up the nameservers for all
the tests within a module scope. If tests from the same module would be
assigned to different workers, then the setup could happen multiple
times, causing a race condition. This happens because each module uses
deterministic port numbers for the nameservers.
Utilize developers' muscle memory to incentivize using the pytest runner
instead of the legacy one. The script also serves as basic examples of
how to run the pyest command to achieve the same results as the legacy
runner.
Invoking pytest directly should be the end goal, since it offers many
potentially useful options (refer to pytest --help).
In order to run the shell system tests, the pytest runner has to pick
them up somehow. Adding an extra python file with a single function
for the shell tests for each system test proved to be the most
compatible way of running the shell tests across older pytest/xdist
versions.
Modify the legacy run.sh script to ignore these pytest-runner specific
glue files when executing tests written in pytest.
Special care needs to be taken to support older pytest / xdist versions.
The target versions are what is available in EL8, since that seems to
have the oldest versions that can be reasonably supported.
When an issue occurs inside a fixture (e.g. servers fail to start/stop),
the test result won't be detected as failed, but rather an error will be
thrown.
To ensure the tempdir is kept even if the test itself passes but the
system_test() fixture throws an error, a different mechanism is needed.
At the start of the critical test setup section, note that the fixture
hasn't finished yet. When this is detected in the system_test_dir()
fixture, it is recognized as error in test setup/teardown and the temp
directory is kept.
This may seem cumbersome, because it is. It's basically a workaround for
the way pytest handles fixtures and test errors in general.
The temporary directory contains artifacts for the pytest module. That
module may contain multiple individual tests which were executed
sequentially. The artifacts should be kept if even one of these tests
failed.
Since pytest doesn't have any facility to expose test results to
fixtures, customize the pytest_runtest_makereport() hook to enable that.
It stores the test results into a session scope variable which is
available in all fixtures.
When deciding whether to remove the temporary directory, find the
relevant test results for this module and don't remove the tmpdir if any
one the tests failed.
For every pytest module, create a copy of its system test directory and
run the test from that directory. This makes it easier to clean up
afterwards and makes it less error-prone when re-running (failed) tests.
Configure the logger to capture the module's log output into a separate
file available from the temporary directory. This is quite convenient
for exploring failures.
Cases where temporary directory should be kept are handled in a
follow-up commits.
This is basically the pytest re-implementation of the run.sh script.
The fixture is applied to every module and ensures complete test
setup/teardown, such as starting/stopping servers, detecting coredumps
etc.
Note that the fixture system_test_dir is not defined yet. It is omitted
now for review readability and it's added in follow-up commits.
Add fixtures for deriving system test name from the directory name and
for a module-specific logger.
Add fixtures to execute shell and perl scripts. Similar to how run.sh
calls commands, this functionality is also needed in pytest. The
fixtures take care of switching to a proper directory, logging
everything and handling errors.
Note that the fixture system_test_dir is not defined yet. It is omitted
now for review readability and it's added in follow-up commits.
This is basically a pytest re-implementation of the get_ports.sh script.
The main difference is that ports are assigned on a module basis, rather
than a directory basis. Module is the new atomic unit for parallel
execution, therefore it needs to have unique ports to avoid collisions.
Each module gets its ports through the env fixture which is updated with
ports and other module-specific variables.
Some system tests require extra programs and/or dependencies to be
compiled first. This is done via `make check` with the automake
framework when using check_* variables such as check_PROGRAMS.
To avoid running any tests via the automake framework, set the TESTS
env variable to empty string and utilize `make -e check` to override
default Makefile variables with environment ones. This ensures automake
will only compile the needed dependencies without running any tests.
Additional consideration needs to be taken for xdist. The compilation
command should be called just once before any tests are executed. To
achieve that, use the pytest_configure() hook and check that the
PYTEST_XDIST_WORKER env variable isn't set -- if it is, it indicates
we're in the spawned xdist worker and the compilation was already done
by the main pytest process that spawned the workers.
This is mostly done to have on-par functionality with legacy test
framework. In the future, we should get rid of the need to run "empty"
make -e check and perhaps compile test-stuff by default.
The commands executed by pytest during a system test need to have the
same environment variables set as if they were executed by the run.sh
shell script.
It was decided that for the moment, legacy way of executing system tests
with run.sh should be kept, which complicates things a bit. In order to
avoid duplicating the required variables in both conf.sh and pytest, it
was decided to use the existing conf.sh as the only authoritative
place for the variables.
It is necessary to process the environment variables from conf.sh right
when conftest.py is loaded, since they might be needed right away (e.g.
to test for feature support during test collection).
This solution is a bit hacky and is only meant to be used during the
transitory phase when both pytest and the legacy run.sh are both
supported. In the future, a superior pytest-only solution should be
used.
For discussion of other options, refer to
https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/6809#note_318889
Ensure pytest picks up our python test modules, since we're using
tests_*.py convention, which is different from the default.
Configure pytest logging to display the output from all tests (even the
ones that passed). This ensures we have sufficient amount of information
to debug test post-mortem just from the artifacts.
The legacy system test framework uses pytest to execute some tests.
Since it'd be quite difficult to convince pytest to decide whether to
include conftest.py (or which ones to include when launching from
subdir), it makes more sense to have a shared conftest.py which is used
by both the legacy test runner invocations of pytest and the new pytest
system test runner. It is ugly, but once we drop support for the legacy
runner, we'll get rid of it.
Properly scope the *port fixtures in order to ensure they'll work as
expected with the new pytest runner. Instead of using "session" (which
means the fixture is only evaluated once for the entire execution of
pytest), use "module" scope, which is evaluated separately for each
module. The legacy runner invoked pytest for each system test
separately, while the new pytest runner is invoked once for all system
tests -- therefore it requires the more fine-grained "module" scope to
for the fixtures to work properly.
Remove python shebang, as conftest.py isn't supposed to be an executable
script.
The line summarising TSAN reports was misplaced in the ASAN territory
and thus never used.
I also made core dumps, assertion failures, and TSAN reports detection
independent of each other.
When the FIPS provider is available, RSASHA1 signing keys for zone
"example.com." are ignored if the zone is attempted to be signed with
the dnssec-signzone "-F" (FIPS mode) option:
"fatal: No signing keys specified or found"
The upforwd test for forwarding updates to a dead primary can continue
running a little bit past its end, causing update replies to be
recorded during a subsequent test case. Correct this by only looking
for update requests and replies for the specific domain name being
tested at any given time.
After the RCU changes were merged, the `upforwd` test started
consistenly failing when run under thread sanitizer. After some
investigation, it turned out that retry attempts were continuing after
the "update forwarding to dead primary" test. This caused mismatches
in the DNSTAP message counts for the subsequent tests, because they
were also counting retries.
Fix this problem by `wait`ing for the `nsupdate` processes to exit.
While investigating the bug, I replaced several fixed 15 second delays
with `wait_for_log`, so the test runs faster.
All the places the qp-trie code was using `call_rcu()` needed
`__tsan_release()` and `__tsan_acquire()` annotations, so
add a couple of wrappers to encapsulate this pattern.
With these wrappers, the tests run almost clean under thread
sanitizer. The remaining problems are due to `rcu_barrier()`
which can be suppressed using `.tsan-suppress`. It does not
suppress the whole of `liburcu`, because we would like thread
sanitizer to detect problems in `call_rcu()` callbacks, which
are called from `liburcu`.
The CI jobs have been updated to use `.tsan-suppress` by
default, except for a special-case job that needs the
additional suppressions in `.tsan-suppress-extra`.
We might be able to get rid of some of this after liburcu gains
support for thread sanitizer.
Note: the `rcu_barrier()` suppression is not entirely effective:
tsan sometimes reports races that originate inside `rcu_barrier()`
but tsan has discarded the stack so it does not have the
information required to suppress the report. These "races" can
be made much easier to reproduce by adding `atexit_sleep_ms=1000`
to `TSAN_OPTIONS`. The problem with tsan's short memory can be
addressed by increasing `history_size`: when it is large enough
(6 or 7) the `rcu_barrier()` stack usually survives long enough
for suppression to work.
Previously, if an exception would happen inside the `with` block, the
error handler would wait indefinitely for the process to end. That would
never happen, since the termination signal was never sent to named and
the test would get stuck.
Using the try-finally block ensures that the named process is always
killed and any exception or errors will be handled gracefully.
Improve code readability by splitting the test into more functions. Some
could be re-used later on for more general-purpose subprocess handling
or named checks.
Add more tests to the dnstap system test to roll with different values.
Touch some files to make sure the number of existing files exceed the
number that we want to keep.
Add a test to the logfileconfig system test for the increment suffix.
When dns_request_create() failed in notify_send_toaddr(), sending the
notify would silently fail. When notify_done() failed, the error would
be logged on the DEBUG(2) level.
This commit remedies the situation by:
* Promoting several messages related to notifies to INFO level and add
a "success" log message at the INFO level
* Adding a TCP fallback - when sending the notify over UDP fails, named
will retry sending notify over TCP and log the information on the
NOTICE level
* When sending the notify over TCP fails, it will be logged on the
WARNING level
Closes: #4001, #4002
There is no 'ret' in this test, and it is obvious that 'ret=1'
should be 'tmp=1' for the check to work correctly, if the string
is not found in the log file.
Add a test case to cover #3679 where a user migrates from a KSK/ZSK
split using auto-dnssec maintain, to the default dnssec-policy (CSK).
The test actually does not use the default dnssec-policy, but it does
use one that has the same keys clause. For testing convenience, we use
the same propagation time values as other test cases that migrate to
dnssec-policy with mismatching existing key set.
At the time of test number (19), there were 10 "sending packet to
10.53.0.7" lines in the "legacy/ns1/named.run" file; usually, only seven
are present:
I:legacy:checking recursive lookup to edns 512 + no tcp server does not cause query loops (19)
I:legacy:ns1 sent 10 queries to ns7, expected less than 10
I:legacy:failed
Those three can be attributed to tests "8", "10", and "18", where the
dig of "resolution_fails()" retried after a timeout to succeed with
"status: SERVFAIL" subsequently, as seen in each of
dig.out.test{8,10,18} files.
;; communications error to 10.53.0.1#13093: timed out
; <<>> DiG 9.19.12-dev <<>> -p 13093 +tcp @10.53.0.1 edns512-notcp. TXT
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5368
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
This retry is unnecessary because "resolution_fails()" considers timeout
a positive result.
This change makes the zone table lock-free for reads. Previously, the
zone table used a red-black tree, which is not thread safe, so the hot
read path acquired both the per-view mutex and the per-zonetable
rwlock. (The double locking was to fix to cleanup races on shutdown.)
One visible difference is that zones are not necessarily shut down
promptly: it depends on when the qp-trie garbage collector cleans up
the zone table. The `catz` system test checks several times that zones
have been deleted; the test now checks for zones to be removed from
the server configuration, instead of being fully shut down. The catz
test does not churn through enough zones to trigger a gc, so the zones
are not fully detached until the server exits.
After this change, it is still possible to improve the way we handle
changes to the zone table, for instance, batching changes, or better
compaction heuristics.
The dnspython.Resolve.resolve() requires at least dnspython >= 2.0.0,
this wasn't enforced in the shutdown system test leading to infinite
loop waiting for the server start due to failing resolve() call.
We don't need a separate module/file for every test. Both the rpz tests
could live in the same file.
The setup/teardown of servers if performed separately for each module --
unless there is a need to do that, it's better to avoid it.
This adds rudimentary test for response-policy zones in multiple
views. Different combinations are tested:
- two views with response-policy inherited from options {};
- two views view explicit response-policy using same RPZ zone name
- two views view explicit response-policy using secondary RPZ zone
* nsupdate should take 12 seconds (one try and three retries with
3 second timeout for each), UDP mode
* nsupdate -u 4 -r 1 should take 8 seconds (one try and one retry with
4 second timeout for each), UDP mode
* nsupdate -u 0 -t 8 -r 1 should also take 8 seconds, UDP mode
* nsupdate -u 4 -t 30 -r 1 should also take 8 seconds, as -u takes
precedence over -t, UDP mode
* nsupdate -t 8 -v should also take 8 seconds, TCP mode
The checkds system test could fail if some parent secondary servers did
not yet loaded all the zones before ns9 started sending DS queries. This
leads to SERVFAIL responses, while the test case expects good DS
responses. In order to mitigate against this issue, call 'rndc loadkeys'
to quickly restart the checkds procedure again.
Also refactor the checkds system test, to get rid of the many zone
name duplications. Update the functions 'zone_check' and
'keystate_check' to make the zone name an FQDN so we can just pass
the 'zone' variable into the function.
If the 'checkds' option is not explicitly set, check if there are
'parental-agents' for the zone configured. If so, default to "explicit",
otherwise default to "yes".
Add two new checkds test servers, that are hidden secondaries (hidden
as in not published in the NS RRset), that can be used specifically
for testing explicitly configured parental-agents.
Implement the new feature, automatic parental-agents. This is enabled
with 'checkds yes'.
When set to 'yes', instead of querying the explicit configured
parental agents, look up the parental agents by resolving the parent
NS records. The found parent NS RRset is considered to be the list
of parental agents that should be queried during a KSK rollover,
looking up the DS RRset corresponding to the key signing keys.
For each NS record, look up the addresses in the ADB. These addresses
will be used to send the DS requests. Count the number of servers and
keep track of how many good DS responses were seen.
The previous test cases already test the more complex case where there
are empty non-terminals between the child apex and the parent domain.
Add a test case where this is not the case, to execute the other code
path.
Add test cases for when checkds is disabled. Copy the test cases that
would have resulted in a DSPublish or DSRemoved and make sure that
with 'checkds no' the metadata is not set.
Add the test cases for automatic parental-agents, i.e. when 'checkds'
is set to 'yes'. Split out the special cases that use a reference
or a resolver as parental-agent so that the common use cases can be
tested with the same function.
Make the checkds system test more structured with the many more test
cases to come. Add a README for clarity.
Update the 'has_signed_apex_nsec' helper function so it can take any
domain name regardless of the number of labels.
Change the DNS tree structure such that we have different TLD names
for the various test scenarios, because we need servers that respond
differently to DS queries. Note that this isn't applicable to the
existing "checkds explicit" test cases, but is preparation work for
testing "checkds yes" (automatic parental agents).
Add a trust-anchor to the server that will be querying for parent
NS records.
Add a new configuration option to set how the checkds method should
work. Acceptable values are 'yes', 'no', and 'explicit'.
When set to 'yes', the checkds method is to lookup the parental agents
by querying the NS records of the parent zone.
When set to 'no', no checkds method is enabled. Users should run
the 'rndc checkds' command to signal that DS records are published and
withdrawn.
When set to 'explicit', the parental agents are explicitly configured
with the 'parental-agents' configuration option.
Cleanup the remnants of MS Compiler bits from <isc/refcount.h>, printing
the information in named/main.c, and cleanup some comments about Windows
that no longer apply.
The bits in picohttpparser.{h,c} were left out, because it's not our
code.
hypothesis prior to 4.41.2 uses hashlib.md5 which is not FIPS
compliant causing the wildcard system test to fail. Check if
we are running if FIPS mode and if so make the minimum version
of hypothesis we will accept to be 4.41.2.
The existing set of kerberos credential used deprecated algorithms
which are not supported by some implementations in FIPS mode.
Regenerate the saved credentials using more modern algorithms.
Added tsiggss/krb/setup.sh which sets up a test KDC with the required
principals for the system test to work. The tsiggss system test
needs to be run once with this active and KRB5_CONFIG appropriately.
set. See tsiggss/tests.sh for an example of how to do this.
OPENSSL_CONF="" is treated differently to no OPENSSL_CONF in
the environment by OpenSSL. OPENSSL_CONF="" lead to crypto
failure being reported in FIPS mode.
Call dst_lib_init to set FIPS mode if it was turned on at configure
time.
Check that named-checkconf report that dnssec policies that wont
work in FIPS mode are reported if named would be running in FIPS
mode.
Diffie-Hellman key echange doesn't appear to work in FIPS mode for
OpenSSL 1.x.x. Add feature test (--have-fips-dh) to identify builds
where DH key exchanges work (non FIPS builds and OpenSSL 3.0.0+) and
exclude test that would otherwise fail.
- RSASHA1 (5) and NSEC3RSASHA1 (7) are not accepted in FIPS mode
- minimum RSA key size is set to 2048 bit
adjust kasp and checkconf system tests to ensure non FIPS
compliant configurations are not used in FIPS mode
when testing the DNSRPS API, instead of linking to an installed
librpz.so from fastrpz, we now link to the test library. code that
ran dnsrpzd and checked the fastrpz license is now unnecessary and
has been removed.
two dnsrps-specific test cases in rpz (qname_as_ns and ip_as_ns) have
been removed, because they were only supported by fastrpz and do not
work in the test library. in rpzrecurse, nsip-wait-recurse and
nsdname-wait-recurse are now only tested in native mode, due to those
tests being specific to the native implementation.
These options and zone type were created to address the
SiteFinder controversy, in which certain TLD's redirected queries
rather than returning NXDOMAIN. since TLD's are now DNSSEC-signed,
this is no longer likely to be a problem.
The deprecation message for 'type delegation-only' is issued from
the configuration checker rather than the parser. therefore,
isccfg_check_namedconf() has been modified to take a 'nodeprecate'
parameter to suppress the warning when named-checkconf is used with
the command-line option to ignore warnings on deprecated options (-i).
Previously, an AXFR request would be issued every second while waiting
for the zone to be signed. This might've been the cause of issues in CI
where many tests are running in parallel and any extra load may increase
test instability.
Instead, check for the last NSEC record to have a signature before
commencing the AXFR request to check the zone has been fully signed.
Also increase the time for the zone signing to a total of 60+10 seconds
up from the previous 30.
Ensure messages from dupsigs system test end up in its log rather than
stdout. Previously, the output was hard to debug when running the tests
in parallel and messages wouldn't end up in the dupsigs.log.
stop and restart the server in the 'tsiggss' test, in order
to confirm that GSS negotiated TSIG keys are saved and restored
when named loads.
added logging to dns_tsigkey_createfromkey() to indicate whether
a key has been statically configured, generated via GSS negotiation,
or restored from a file.
If the zone already has existing NSEC/NSEC3 chains then zone_sign
needs to continue to use them. If there are no chains then use
kasp setting otherwise generate an NSEC chain.
The dnstap system test fails intermittently, and it appears to be
a timing issue - adding a short delay after running 'fstrm_capture',
and before running 'dnstap -reopen' improves the situation from
50% failures (5 out of 10 times) to 0% failures (0 out of 20 times),
tested locally.
The reason is that 'fstrm_capture' is executed in the background,
and due to OS scheduling and other factors, the listener socket
may not be ready when the following command runs and tells 'named'
to (re)open it.
Dumping of the freshly transferred zone file can take some time.
Retry 5 times before failing.
The log excerpt below shows such a case, when dumping lasted more than
two seconds.
06-Mar-2023 09:32:09.973 zone example6/IN: Transfer started.
06-Mar-2023 09:32:10.301 zone example6/IN: zone transfer finished: success
06-Mar-2023 09:32:10.301 zone_dump: zone example6/IN: enter
06-Mar-2023 09:32:11.789 client @0x7fe9ab435d68 10.53.0.10#44113 (example6): AXFR request
06-Mar-2023 09:32:11.801 client @0x7fe9ab435d68 10.53.0.10#44113 (example6): transfer of 'example6/IN': AXFR ended: 5 messages, 2676 records, 55815 bytes, 0.011 secs (5074090 bytes/sec) (serial 1397051952)
06-Mar-2023 09:32:12.409 zone_gotwritehandle: zone example6/IN: enter
06-Mar-2023 09:32:12.421 dump_done: zone example6/IN: enter
06-Mar-2023 09:32:12.421 zone_journal_compact: zone example6/IN: target journal size 53044
There can be comments in dig output for a zone transfer only in case
of an error, so we should print those errors not when wait_for_tls_xfer
succeeds, but when it fails.
Also, there is no point in printing those comments when a failure was
indeed expected.
The serve-stale system test was intermittently failing due to a timing
issue:
I:serve-stale:check stale data.example TXT was refreshed...
I:serve-stale:failed
The RRset is refreshed, however, it first checks for an expected log
line, prior checking that the stale data.example TXT was refreshed
(using dig). This log line is there to ensure the record is actually
refreshed before we start querying again. Alternatively we could just
retry_quiet 10 <wait for dig output matches expectations>. It would
lower the chances for intermittent test failures, since there is no
longer a "check for log line, sleep one second if check fails, check
for log line, ...", prior to the check.
Completely remove the TKEY Mode 2 (Diffie-Hellman Exchanged Keying) from
BIND 9 (from named, named.conf and all the tools). The TKEY usage is
fringe at best and in all known cases, GSSAPI is being used as it should.
The draft-eastlake-dnsop-rfc2930bis-tkey specifies that:
4.2 Diffie-Hellman Exchanged Keying (Deprecated)
The use of this mode (#2) is NOT RECOMMENDED for the following two
reasons but the specification is still included in Appendix A in case
an implementation is needed for compatibility with old TKEY
implementations. See Section 4.6 on ECDH Exchanged Keying.
The mixing function used does not meet current cryptographic
standards because it uses MD5 [RFC6151].
RSA keys must be excessively long to achieve levels of security
required by current standards.
We might optionally implement Elliptic Curve Diffie-Hellman (ECDH) key
exchange mode 6 if the draft ever reaches the RFC status. Meanwhile the
insecure DH mode needs to be removed.
The trick is to configure a duplicate zone, which comes after the
catalog zone, where the duplicate zone is an existing member zone.
In that scenario, all the zones which come before the "faulty" zone
in the configuration file will fail to be reverted to the previous
version of the view after a reconfiguration error, and in this
particular case that will result in an assertion failure when the
catalog zone update is initiated, because it will be still tied to
the new version of the view, which was dismissed.
Building the bin/tests/system/rpz/dnsrps helper binary is currently not
possible at all as the necessary compiler and linker flag definitions
are missing from bin/tests/system/Makefile.am. Add these as a basis for
addressing the problem.
Unfortunately, this is where the "mostly" bit mentioned in this commit's
subject line comes into play. The dlopen() parts of DNSRPS code have
not yet been reworked to use libuv's dlopen() API (uv_dlopen() etc.)
(See commit 37b9511ce1 for prior work in
this area.) While it is certainly possible to do that, implementing
such a change without testing it in practice against a usable librpz.so
(i.e. a DNSRPS provider library) is bound to cause more trouble and
confusion than keeping the code the way it is right now. However,
making that code buildable as-is requires linking against a C standard
library that exports the dlopen(), dlsym(), and dlclose() symbols used
by the DNSRPS dynamic loading code. glibc 2.34+ satisfies that
requirement, but older glibc versions do not (these come with a separate
libdl shared library that would need to be linked in as well). (Other
C standard library implementations have not been examined.) Since the
long-term plan is to rely on libuv's dlopen() API exclusively and
detecting the shared object containing dlopen() & friends would only
pull in build system complexity for no good reason, assume for now that
the target system provides the dlopen() API in its C standard library.
This change enables the system test suite to be run for a BIND 9 build
prepared using --enable-dnsrps --enable-dnsrps-dl (on systems satisfying
the requirement explained above). However, it is important to note that
this change by itself does NOT enable actual testing of the DNSRPS
feature as doing that requires a DNSRPS provider library to be present
on the test host.
This implements node reference tracing that passes all the internal
layers from dns_db API (and friends) to increment_reference() and
decrement_reference().
It can be enabled by #defining DNS_DB_NODETRACE in <dns/trace.h> header.
The output then looks like this:
incr:node:check_address_records:rootns.c:409:0x7f67f5a55a40->references = 1
decr:node:check_address_records:rootns.c:449:0x7f67f5a55a40->references = 0
incr:nodelock:check_address_records:rootns.c:409:0x7f67f5a55a40:0x7f68304d7040->references = 1
decr:nodelock:check_address_records:rootns.c:449:0x7f67f5a55a40:0x7f68304d7040->references = 0
There's associated python script to find the missing detach located at:
https://gitlab.isc.org/isc-projects/bind9/-/snippets/1038
Change the commandline option -G to take a string that determines what
sync records should be published. It is a comma-separated string with
each element being either "cdnskey", or "cds:<algorithm>", where
<algorithm> is a valid digest type. Duplicates are suppressed.
Change one of the test cases to use a different digest type (4). The
system tests and kasp script need to be updated to take into account
the new algorithm (instead of the hard coded 2).
The test was setting a minimum count for recursive clients which
was not always being met (e.g. 91 instead of 100) producing a false
positive. Lower the lower bound on recursive clients for this
test to 1.
Add the 'ixfr-from-differences yes;' option to trigger a failed
zone postload operation when a zone is updated but the serial
number is not updated, then issue two successive 'rndc reload'
commands to trigger the bug, which causes an assertion failure.
the dns_xfrin module was still using the network manager directly to
manage TCP connections and send and receive messages. this commit
changes it to use the dispatch manager instead.
the 'dispatchmgr' member of the resolver object is used by both
the dns_resolver and dns_request modules, and may in the future
be used by others such as dns_xfrin. it doesn't make sense for it
to live in the resolver object; this commit moves it into dns_view.
the parser could crash when "include" specified an empty string in place
of the filename. this has been fixed by returning ISC_R_FILENOTFOUND
when the string length is 0.
Reproduce the assertion by configuring a 'named' resolver with
'recursive-clients 10;' configuration option and running 20
queries is parallel.
Also tweak the 'ans2/ans.pl' to simulate a 50ms network latency
when qname starts with "latency". This makes sure that queries
running in parallel don't get served immediately, thus allowing
the configured recursive clients quota limitation to be activated.
move database attach/detach functions to db.c, instead of
requiring them to be implemented for every database type.
instead, they must implement a 'destroy' function that is
called when references go to zero.
this enables us to use ISC_REFCOUNT_IMPL for databases,
with detailed tracing enabled by setting DNS_DB_TRACE to 1.
initialize dns_dbmethods, dns_sdbmethods and dns_rdatasetmethods
using explicit struct member names, so we don't have to keep track
of NULLs for unimplemented functions any longer.
some dns_db functions would have crashed if the DB implementation failed
to implement them, requiring the implementations to add functions that
did nothing but return ISC_R_NOTIMPLEMENTED or some obvious default
value. we can just have the dns_db wrapper functions themselves return
those values, and clean up the implementations accordingly.
as there is no further use of isc_task in BIND, this commit removes
it, along with isc_taskmgr, isc_event, and all other related types.
functions that accepted taskmgr as a parameter have been cleaned up.
as a result of this change, some functions can no longer fail, so
they've been changed to type void, and their callers have been
updated accordingly.
the tasks table has been removed from the statistics channel and
the stats version has been updated. dns_dyndbctx has been changed
to reference the loopmgr instead of taskmgr, and DNS_DYNDB_VERSION
has been udpated as well.
change functions using isc_taskmgr_beginexclusive() to use
isc_loopmgr_pause() instead.
also, removed an unnecessary use of exclusive mode in
named_server_tcptimeouts().
most functions that were implemented as task events because they needed
to be running in a task to use exclusive mode have now been changed
into loop callbacks instead. (the exception is catz, which is being
changed in a separate commit because it's a particularly complex change.)
dns_request_create() and _createraw() now take a 'loop' parameter
and run the callback event on the specified loop.
as the task manager is no longer used, it has been removed from
the dns_requestmgr structure. the dns_resolver_taskmgr() function
is also no longer used and has been removed.
Include MD5 feature detection in featuretest tool and use it in some
places. When RHEL distribution or Fedora ELN is in FIPS mode, then MD5
algorithm is unavailable completely and even hmac-md5 algorithm usage
will always fail. Work that around by checking MD5 works and if not,
skipping its usage.
Those changes were dragged as downstream patch bind-9.11-fips-tests.patch
in Fedora and RHEL.
Tests using diff to compare outputs of dig +short shall ignore lines
starting with ";". In dig +short output, such lines should only be
present for errors such as network issues. Since we utilize dig's
default timeout/retry mechanisms, these transitory issues should be
ignored and only the final output should be considered during the diff
comparison.
This adds an island of trust that is reachable from the root
where the trust anchors are added to island.conf.
This add an island of trust that is not reachable from the root
where the trust anchors are added to private.conf.
Occasionally, the allotted 10 seconds for the "running" line to appear
in log after named is started proved insufficient in CI, especially
during increased load. Give named up to 60 seconds to start up to
mitigate this issue.
isc_bind9 was a global bool used to indicate whether the library
was being used internally by BIND or by an external caller. external
use is no longer supported, but the variable was retained for use
by dyndb, which needed it only when being built without libtool.
building without libtool is *also* no longer supported, so the variable
can go away.
Send the test message from ns3 to ns2 instead of ns2 to ns3 as ns2
is started first and therefore the test doesn't have to wait on the
resend of the the NOTIFY message to be successful.
the nsupdate system test was intermittently failing due to the update
quota not being exceeded when it should have been. this is most likely
a timing issue: the client is sending updates too slowly, or the server
is processing them too quickly, for the quota to fill. this commit
attempts to make that the failure less likely by increasing the number
of update transactions from 10 to 20.
Following deleting the root trust anchor and reconfiguring the
server it takes some time to for trust anchor to appear in 'rndc
managed-keys status' output. Retry several times.
check in the log files of receiving servers that the originating
ports for notify and SOA query messages were set correctly from
configured notify-source and transfer-source options.
* rbt node chains were sized to allow for bitstring labels, so they
had 256 levels; but in the absence of bistrings, 128 is enough.
* dns_byaddr_createptrname() had a redundant options argument,
and a very outdated doc comment.
* A number of comments referred to bitstring labels in a way that is
no longer helpful. (A few informative comments remain.)
Set the DS state after issuing 'rndc dnssec -checkds'. If the DS
was published, it should go in RUMOURED state, regardless whether it
is already safe to do so according to the state machine.
Leaving it in HIDDEN (or if it was magically already in OMNIPRESENT or
UNRETENTIVE) would allow for easy shoot in the foot situations.
Similar, if the DS was withdrawn, the state should be set to
UNRETENTIVE. Leaving it in OMNIPRESENT (or RUMOURED/HIDDEN)
would also allow for easy shoot in the foot situations.
The malloced and maxmalloced memory counters were mostly useless since
we removed the internal allocator blocks - it would only differ from
inuse by the memory context size itself.
The validity default days value of 1 was used for debugging and
left as such accidentally.
Use 10950 days, as used elsewhere (for example, in doth test CA).
This does not affect anything, the value will be effective when
generating new test certificates in the future.
Change the 'forward' system test to enable DoT on ns2 server,
and test that forwarding from ns4 to the DoT-enabled ns2 works.
In order to test different scenarios, create a test CA (based on
similar CAs for 'doth' and 'nsupdate' system tests), and test
both insecure (no certificate validation) and secure (also with
mutual TLS) TLS configurations, as well as a configuration with an
expired certificate.
A 'tls' statement can be specified both for individual addresses
and for the whole list (as a default value when an individual
address doesn't have its own 'tls' set), just as it was done
before for the 'port' value.
Create a new function 'print_rawqstring()' to print a string residing
in a 'isc_textregion_t' type parameter.
Create a new function 'copy_string()' to copy a string from a
'cfg_obj_t' object into a 'isc_textregion_t'.
Add a test case for a server that uses a resolver as an parental-agent.
We need two root servers, ns1 and ns10, one that delegates to the
'checkds' tld with the DS published (ns2), and one that delegates to
the 'checkds' tld with the DS removed (ns5). Both root zones are
being setup in the 'ns1/setup.sh' script.
We also need two resolvers, ns3 and ns8, that use different root hints
(one uses ns1 address as a hint, the other uses ns10).
Then add the checks to test_checkds.py is similar to the existing tests.
Update 'types' because for zones that have the DS withdrawn (or to be
withdrawn), the CDS and CDNSKEY records should not be published and
thus should not be in the NSEC bitmap.
Add 'port' token to deprecated.conf. Also add options
'use-v4-udp-ports', 'use-v6-udp-ports', 'avoid-v4-udp-ports',
and 'avoid-v6-udp-ports'.
All of these should trigger warnings (except when deprecation warnings
are being ignored).
Deprecate the use of "port" when configuring query-source(-v6),
transfer-source(-v6), notify-source(-v6), parental-source(-v6),
etc. Also deprecate use-{v4,v6}-udp-ports and avoid-{v4,v6}udp-ports.
The condition was accidentally reversed during refactoring in
9730ac4c56 . It would result in skipped
tests on builds with proper support and false negatives on builds
without proper feature support.
Credit for reporting the issue and the fix goes to Stanislav Levin.
Instead of using the current working directory to find the ifconfig.sh
script, look for the ifconfig.sh.in template in the directory where the
testsock.pl script is located. This enables the testsock.pl script to be
called from any working directory.
Using the ifconfig.sh.in template is sufficient, since it contains
the necessary information to be extracted: the max= value (which is
hard-coded in the template).
Move the core dump detection functionality for system test runs into a
separate script. This enables reuse by the pytest runner. The
functionality remains the same.
Avoid creating any temporary files in the current workdir.
Additional/changing files in the bin/tests/system directory are
problematic for pytest/xdist collection phase, which assumes the list of
files doesn't change between the collection phase of the main pytest
thread and the subsequent collection phase of the xdist worker threads.
Since the testcrypto.sh is also called during pytest initialization
through conf.sh.common (to detect feature support), this could
occasionally cause a race condition when the list of files would be
different for the main pytest thread and the xdist worker.
verify that updates are refused when the client is disallowed by
allow-query, and update forwarding is refused when the client is
is disallowed by update-forwarding.
verify that "too many DNS UPDATEs" appears in the log file when too
many simultaneous updates are processing.
DSCP has not been fully working since the network manager was
introduced in 9.16, and has been completely broken since 9.18.
This seems to have caused very few difficulties for anyone,
so we have now marked it as obsolete and removed the
implementation.
To ensure that old config files don't fail, the code to parse
dscp key-value pairs is still present, but a warning is logged
that the feature is obsolete and should not be used. Nothing is
done with configured values, and there is no longer any
range checking.
Prime the cache with the following records:
shortttl.cname.example. 1 IN CNAME longttl.target.example.
longttl.target.example. 600 IN A 10.53.0.2
Wait for the CNAME record to expire, disable the authoritative server,
and query 'shortttl.cname.example' again, expecting a stale answer.
This bug was masked in the tests because the `catz` test script did an
`rndc addzone` before an `rndc delzone`. The `addzone` autovivified
the NZF config, so `delzone` worked OK.
This commit swaps the order of two sections of the `catz` test script
so that it uses `delzone` before `addzone`, which provokes a crash
when `delzone` requires a non-NULL NZF config.
To fix the crash, we now try to remove the zone from the NZF config
only if it was dynamically added but not by a catalog zone.
Remove parsing the configuration options 'alt-transfer-source',
'alt-transfer-source-v6', and 'use-alt-transfer-source', and remove
the corresponding code that implements the feature.
Use the configured 'source' and 'source-v6' when initiating a zone
transfer, sending a notify, or when checking for the DS. Remove the
special code for using alternate transfer sources.
Update some system tests to use the new configuration and make sure
the tests still work.
Add a new way to configure the preferred source address when talking to
remote servers such as primaries and parental-agents. This will
eventually deprecate options such as 'parental-source',
'parental-source-v6', 'transfer-source', etc.
Example of the new configuration:
parental-agents "parents" port 5353 \
source 10.10.10.10 port 5354 dscp 54 \
source-v6 2001:db8::10 port 5355 dscp 55 {
10.10.10.11;
2001:db8::11;
};
The pre-defined test cases use named.$TESTCASE.conf naming convention,
where TESTCASE is a human readable name contaning actual word(s). The
autogenerated test cases' names always start with a number from 1 to 6.
bin/tests/system/rrsetorder/dig.out* files match a gitignore expression
present in bin/tests/system/.gitignore. Since these are meant to be
reference files that are compared to the files generated when the
"rrsetorder" system test is run, rename them to avoid listing tracked
files in .gitignore files.
The test expects a "connection timed out" message from DiG when it
experiences a timeout, while the current version of DiG prints just
a "timed out" message, like below:
;; communications error to 10.53.0.1#11314: timed out
;; communications error to 10.53.0.1#11314: timed out
;; communications error to 10.53.0.1#11314: timed out
; <<>> DiG 9.19.9-dev <<>> -p 11314 +tries +time +tcp +tries +time @10.53.0.1 dropedns. TXT
; (1 server found)
;; global options: +cmd
;; no servers could be reached
Change the expected string to match the current DiG output.
Use the '-F' switch for "grep" for matching a fixed string.
In order to have a common naming convention for system tests, rename the
only outlier "engine_pkcs11" to "enginepkcs11", which was the only
system test using an underscore in its name.
The only allowed word separators for system test names are either dash
or no separator.
It is better to use consistent file names to avoid issue with sorting
etc.
Using underscore in filenames as opposed to dash was chosen because it
seems more common in pytest/python to use underscore for filenames.
Also rename the bin/tests/system/timeouts/tests-tcp.py file to
bin/tests/system/timeouts/tests_tcp_timeouts.py to avoid pytest name
collision (there can't be two files named tests_tcp.py).
This introduces a Python dependency for running system tests. It is
needed in order to:
- write new test control scripts in Python
- gradually rewrite old Perl scripts into Python if needed
- eventually introduce pytest as the new test runner framework
This commit is not intended to be backported to 9.16.
Nothing from conf.sh.common is required to set these values. On the
contrary, a Python interpreter needs to be set in order to randomize the
algorithm set (which happens in conf.sh.common).
When testcrypto.sh is used as a standalone script, always use quiet mode
to avoid using undefined commands (such as echo_i) which require
inclusion of the entire conf.sh machinery.
The dispatches are not thread-bound, and used freely between various
threads (see the dns_resolver and dns_request units for details).
This refactoring make sure that all non-const dns_dispatch_t and
dns_dispentry_t members are accessed under a lock, and both object now
track their internal state (NONE, CONNECTING, CONNECTED, CANCELED)
instead of guessing the state from the state of various struct members.
During the refactoring, the artificial limit DNS_DISPATCH_SOCKSQUOTA on
UDP sockets per dispatch was removed as the limiting needs to happen and
happens on in dns_resolver and limiting the number of UDP sockets
artificially in dispatch could lead to unpredictable behaviour in case
one dispatch has the limit exhausted by others are idle.
The TCP artificial limit of DNS_DISPATCH_MAXREQUESTS makes even less
sense as the TCP connections are only reused in the dns_request API
that's not a heavy user of the outgoing connections.
As a side note, the fact that UDP and TCP dispatch pretends to be same
thing, but in fact the connected UDP is handled from dns_dispentry_t and
dns_dispatch_t acts as a broker, but connected TCP is handled from
dns_dispatch_t and dns_dispatchmgr_t acts as a broker doesn't really
help the clarity of this unit.
This refactoring kept to API almost same - only dns_dispatch_cancel()
and dns_dispatch_done() were merged into dns_dispatch_done() as we need
to cancel active netmgr handles in any case to not leave dangling
connections around. The functions handling UDP and TCP have been mostly
split to their matching counterparts and the dns_dispatch_<function>
functions are now thing wrappers that call <udp|tcp>_dispatch_<function>
based on the socket type.
More debugging-level logging was added to the unit to accomodate for
this fact.
When the resolver was refactored, the statistics system test had to be
adjusted in c6b4d82557. Unfortunately,
this change had to be done because of an error in the resolver
refactoring where timeout would not retry next server, but keep trying
the same server. As we have now fixed this bug, revert the change to
the test back to the previous state.
Using a restored catalog zone excercised a use-after-free bug.
The test checks that the use-after-free bug is gone and is just
a reasonable behaviour check in its own right.
Prime the cache with the following records:
shortttl.cname.example. 1 IN CNAME longttl.target.example.
longttl.target.example. 600 IN A 10.53.0.2
Wait for the CNAME record to expire, disable the authoritative server,
and query 'shortttl.cname.example' again, expecting a stale answer.
While some of these tests are for DoT which doesn't require nghttp2,
the server configs won't allow the server to start without nghttp2
support during compile time.
It might be possible to split these tests into DoT and DoH and only
require nghttp2 for DoH tests, but since almost all of our CI jobs are
compiled with nghttp2, we wouldn't gain a lot of coverage, so it's
probably not worth the effort.
Avoid using the environment variables for feature detection and use the
feature-test utility instead.
Remove the obsolete environment variables from conf.sh, since they're no
longer used anywhere.
Previously, there were two different ways to detect feature support.
Either through an environment variable set by configure in conf.sh, or
using the feature-test utility.
It is more simple and consistent to have only one way of detecting the
feature support. Using the feature-test utility seems superior the the
environment variables set by configure.
The mechanism for associating a worker task to a database now
uses loops rather than tasks.
For this reason, the parameters to dns_cache_create() have been
updated to take a loop manager rather than a task manager.
The dns_adb unit has been refactored to be much simpler. Following
changes have been made:
1. Simplify the ADB to always allow GLUE and hints
There were only two places where dns_adb_createfind() was used - in
the dns_resolver unit where hints and GLUE addresses were ok, and in
the dns_zone where dns_adb_createfind() would be called without
DNS_ADBFIND_HINTOK and DNS_ADBFIND_GLUEOK set.
Simplify the logic by allowing hint and GLUE addresses when looking
up the nameserver addresses to notify. The difference is negligible
and would cause a difference in the notified addresses only when
there's mismatch between the parent and child addresses and we
haven't cached the child addresses yet.
2. Drop the namebuckets and entrybuckets
Formerly, the namebuckets and entrybuckets were used to reduced the
lock contention when accessing the double-linked lists stored in each
bucket. In the previous refactoring, the custom hashtable for the
buckets has been replaced with isc_ht/isc_hashmap, so only a single
item (mostly, see below) would end up in each bucket.
Removing the entrybuckets has been straightforward, the only matching
was done on the isc_sockaddr_t member of the dns_adbentry.
Removing the zonebuckets required GLUEOK and HINTOK bits to be
removed because the find could match entries with-or-without the bits
set, and creating a custom key that stores the
DNS_ADBFIND_STARTATZONE in the first byte of the key, so we can do a
straightforward lookup into the hashtable without traversing a list
that contains items with different flags.
3. Remove unassociated entries from ADB database
Previously, the adbentries could live in the ADB database even after
unlinking them from dns_adbnames. Such entries would show up as
"Unassociated entries" in the ADB dump. The benefit of keeping such
entries is little - the chance that we link such entry to a adbname
is small, and it's simpler to evict unlinked entries from the ADB
cache (and the hashtable) than create second LRU cleaning mechanism.
Unlinked ADB entries are now directly deleted from the hash
table (hashmap) upon destruction.
4. Cleanup expired entries from the hash table
When buckets were still in place, the code would keep the buckets
always allocated and never shrink the hash table (hashmap). With
proper reference counting in place, we can delete the adbnames from
the hash table and the LRU list.
5. Stop purging the names early when we hit the time limit
Because the LRU list is now time ordered, we can stop purging the
names when we find a first entry that doesn't fullfil our time-based
eviction criteria because no further entry on the LRU list will meet
the criteria.
Future work:
1. Lock contention
In this commit, the focus was on correctness of the data structure,
but in the future, the lock contention in the ADB database needs to
be addressed. Currently, we use simple mutex to lock the hash
tables, because we almost always need to use a write lock for
properly purging the hashtables. The ADB database needs to be
sharded (similar to the effect that buckets had in the past). Each
shard would contain own hashmap and own LRU list.
2. Time-based purging
The ADB names and entries stay intact when there are no lookups.
When we add separate shards, a timer needs to be added for time-based
cleaning in case there's no traffic hashing to the inactive shard.
3. Revisit the 30 minutes limit
The ADB cache is capped at 30 minutes. This needs to be revisited,
and at least the limit should be configurable (in both directions).
The system test should never attempt to start or stop any other server
than those that belong to that system test. Therefore, it is not
necessary to specify the system test name in function calls.
Additionally, this makes it possible to run the test inside a
differently named directory, as its name is automatically detected with
the $SYSTESTDIR variable. This enables running the system tests inside a
temporary directory.
Direct use of stop.pl was replaced with a more systematic approach to
use stop_servers helper function.
Remove test cases that rely upon key and denial of existence
management operations triggered by dynamic updates.
The autosign system test needed a bit more care than just removing
because the test cases are dependent on each other, so there are some
additional tweaks such as setting the NSEC3PARAM via rndc signing,
and renaming zone input files. In the process, some additional
debug output files have been added, and a 'ret' fail case overwrite
was fixed.
Create a zone that triggers DNAME owner name checks in a zone that
is only reachable using a dual stack server. The answer contains
a name that is higher in the tree than the query name.
e.g.
foo.v4only.net. CNAME v4only.net.
v4only.net. A 10.0.0.1
ns4 is serving the test zone (ipv4-only)
ns6 is the root server for this test (dual stacked)
ns7 is acting as the dual stack server (dual stacked)
ns9 is the server under test (ipv6-only)
checkbashisms warns about possible reliance on HOSTNAME environmental
variable which Bash sets to the name of the current host, and some
commands may leverage it:
possible bashism in builtin/tests.sh line 199 ($HOST(TYPE|NAME)):
grep "^\"$HOSTNAME\"$" dig.out.ns1.$n > /dev/null || ret=1
possible bashism in builtin/tests.sh line 221 ($HOST(TYPE|NAME)):
grep "^\"$HOSTNAME\"$" dig.out.ns2.$n > /dev/null || ret=1
possible bashism in builtin/tests.sh line 228 ($HOST(TYPE|NAME)):
grep "^; NSID: .* (\"$HOSTNAME\")$" dig.out.ns2.$n > /dev/null || ret=1
We don't use the variable this way but rename it to HOST_NAME to silence
the tool.
"next_key_event_threshold" is assigned with
"next_key_event_threshold+i", but "i" is empty (never set, nor used
afterwards).
posh, the Policy-compliant Ordinary SHell, failed on this assignment
with:
tests.sh:253: : unexpected `end of expression'
checkbashisms gets confused by the rndc command being on two lines:
possible bashism in bin/tests/system/nzd2nzf/tests.sh line 37 (type):
rndccmd 10.53.0.1 addzone "added.example { type primary; file \"added.db\";
checkbashisms reports Bash-style ("==") string comparisons inside test/[
command:
possible bashism in bin/tests/system/checkconf/tests.sh line 105 (should be 'b = a'):
if [ $? == 0 ]; then echo_i "failed"; ret=1; fi
possible bashism in bin/tests/system/keyfromlabel/tests.sh line 62 (should be 'b = a'):
test $ret == 0 || continue
possible bashism in bin/tests/system/keyfromlabel/tests.sh line 79 (should be 'b = a'):
test $ret == 0 || continue
The checkbashisms script reports errors like this one:
script util/check-line-length.sh does not appear to have a #! interpreter line;
you may get strange results
It was possible to set operating system limits (RLIMIT_DATA,
RLIMIT_STACK, RLIMIT_CORE and RLIMIT_NOFILE) from named.conf. It's
better to leave these untouched as setting these is responsibility of
the operating system and/or supervisor.
Deprecate the configuration options and remove them in future BIND 9
release.
The retry 3 times when checking signatures did not make sense because
at this point the input file does not change.
Raise the number of retries when checking the apex DNSKEY response to
reduce the number of intermittent failures due to unexpected delays.
If 'set -x' is in effect file.prev gets populated with debugging output.
To prevent this open descriptor 3 and redirect stderr from the awk
command to descriptor 3. Debugging output will stay directed to stderr.
The changes in the code have the side effect that the CDNSKEY and CDS
records in the secure version of the zone are not reusable and thus
are thrashed from the zone. Remove the apex checks for this use case.
We only care about that the zone is not immediately goes bogus, but
a user really should use the built-in "insecure" policy when unsigning
a zone.
Add one more case that tests reconfiguring a zone to turn off
inline-signing. It should still be a valid DNSSEC zone and the NSEC3
parameters should not change.
Add another test to ensure that you cannot update the zone with a
NSEC3 record.
We no longer accept copying DNSSEC records from the raw zone to
the secure zone, so update the kasp system test that relies on this
accordingly.
Also add more debugging and store the dnssec-verify results in a file.
Add a kasp system test that reconfigures a dnssec-policy zone from
maintaining DNSSEC records directly to the zone to using inline-signing.
Add a similar test case to the nsec3 system test, testing the same
thing but now with NSEC3 in use.
the dupsigs test is prone to failing on slow CI machines
because the first test can occur before the zone is fully
signed.
instead of just waiting ten seconds arbitrarily, we now
check every second, and allow up to 30 seconds before giving
up.
Checks that malformed _dns SVCB records are rejected unless
check-svcb is set to no, in which case they are accepted. Both
missing ALPN and missing DOHPATH are checked for.
Add a new upforwd system test that checks if update forwarding still
works if the first primary is badly configured.
We cannot reuse the 'example.' zone for this test because that
checks if update forwarding works for DoT. What transport is used
in the new test is of no relevance.
Update the system test to use different known good file names for
the different zones that are being tested.
Use the ALGORITHM_SET option to use randomly selected default algorithm
in this test. Make sure the test works by using variables instead of
hard-coding values.
Use the get_algorithms.py script to detect supported algorithms and
select random algorithms to use for the tests.
Make sure to load common.conf.sh after KEYGEN env var is exported.
Multiple algorithm sets can be defined in this script. These can be
selected via the ALGORITHM_SET environment variable. For compatibility
reasons, "stable" set contains the currently used algorithms, since our
system tests need some changes before being compatible with randomly
selected algorithms.
The script operation is similar to the get_ports.py - environment
variables are created and then printed out as `export NAME=VALUE`
commands, to be interpreted by shell. Once we support pytest runner for
system tests, this should be a fixture instead.
Certain variables have to be exported in order for the system tests to
work. It makes little sense to export the variables in one place/script
while they're defined in another place.
Since it makes no harm, export all the variables to make the behaviour
more predictable and consistent. Previously, some variables were
exported as environment variables, while others were just shell
variables which could be used once the configuration was sourced from
another script. However, they wouldn't be exposed to spawned processes.
For simplicity sake (and for the upcoming effort to run system tests
with pytest), export all variables that are used. TESTS, PARALLEL_UNIX
and SUBDIRS variables are automake-specific, aren't used anywhere else
and thus not exported.
The only variable really needed for the script to work is the path to
the $KEYGEN binary. Allow setting this via an environment variable to
avoid loading conf.sh (and causing a chicken-egg problem). Also make
testcrypto.sh executable to allow its use from conf.sh.
Add a test case that if the first primary fails, the fallback of a
second primary on plain DNS works. This is mainly to test that the port
configuration inheritance works correctly.
Add a couple of tests that verify the serve-stale behavior when
stale-answer-client-timeout is set to 0 and a (stale) CNAME record is
queried.
Related #3517
The test triggers a prefetch, but fails to check if it acutally
happened, which prevented it from catching a bug when the record's
TTL value matches the configured prefetch eligibility value.
Check that prefetch happened by comparing the TTL values.
For tests where the TCP connection might get interrupted abruptly,
replace the nc with curl as the data sent from server to client might
get lost because of abrupt TCP connection. This happens when the TCP
connection gets closed during sending the large request to the server.
As we already require curl for other system tests, replace the nc usage
in the statschannel test with curl that actually understands the
HTTP/1.1 protocol, so the same connection is reused for sending the
consequtive requests, but without client-side "pipelining".
For the record, the server doesn't support parallel processing of the
pipelined request, so it's a bit misnomer here, because what we are
actually testing is that we process all requests received in a single
TCP read callback.
The 5 seconds requirement to finish the 'pipelined with truncated
stream' was causing spurious failures in the CI because the job runners
might be very busy and sending 128k of data might simply take some time.
Remove the time requirement altogether, there's actually no reason why
the test SHOULD or even MUST finish under 5 seconds.
Correctly source conf.sh in dupsigs test scripts (fix issue introduced
by 093af1c00a).
Update dupsigs test for dnssec-dnskey-kskonly default. Since v9.17.20,
the dnssec-dnskey-kskonly is set to yes. Update the test to not expect
the additional RRSIG with ZSK for DNSKEY.
Speed up the test from 20 minutes to 2.5 minutes and make it part of the
default test suite executed in CI.
- decrease number of records to sign from 2000 to 500
- decrease the signing interval by a factor of 6
- shorten the final part of the test after last signing (since nothing
new happens there)
Finally, clarify misleading comments about (in)sufficient time for zone
re-signing. The time used in the test is in fact sufficient for the
re-signing to happen. If it wasn't, the previous ZSK would end up being
deleted while its signatures would still be present, which is a
situation where duplicate signatures can still happen.
Ensure the port numbers are dynamically filled in with copy_setports.
Clarify test fail condition.
Make the stress test part of the default test suite since it doesn't
seem to run too long or interfere with other tests any more (the
original note claiming so is more than 20 years old).
Related !6883
Properly template the port number in config files with copy_setports.
The test takes two minutes on my machine which doesn't seem like a
proper justification to exclude it from the test suite, especially
considering we run these tests in parallel nowadays. The resource usage
doesn't seems significantly increased so it shouldn't interfere with
other system tests.
There also exists a precedent for longer running system tests that are
already part of the default system test suite (e.g. serve-stale takes
almost three minutes on the same machine).
When a target server is unreachable, the varying network conditions may
cause different ICMP message (or no message). The host unreachable
message was discovered when attempting to run the test locally while
connected to a VPN network which handles all traffic.
Extend the dig output check with "host unreachable" message to avoid a
false negative test result in certain network environments.
Add a test ensuring that the amount of work fctx_getaddresses() performs
for any encountered delegation is limited: delegate example.net to a set
of 1,000 name servers in the redirect.com zone, the names of which all
resolve to IP addresses that nothing listens on, and query for a name in
the example.net domain, checking the number of times the findname()
function gets executed in the process; fail if that count is excessively
large.
Since the size of the referral response sent by ans3 is about 20 kB, it
cannot be sent back over UDP (EMSGSIZE) on some operating systems in
their default configuration (e.g. FreeBSD - see the
net.inet.udp.maxdgram sysctl). To enable reliable reproduction of
CVE-2022-2795 (retry patterns vary across BIND 9 versions) and avoid
false positives at the same time (thread scheduling - and therefore the
number of fetch context restarts - vary across operating systems and
across test runs), extend bin/tests/system/resolver/ans3/ans.pl so that
it also listens on TCP and make "ns1" in the "resolver" system test
always use TCP when communicating with "ans3".
Also add a test (foo.bar.sub.tld1/TXT) that ensures the new limitations
imposed on the resolution process by the mitigation for CVE-2022-2795 do
not prevent valid, glueless delegation chains from working properly.
add a test to compare the Content-Length of successive compressed
messages on a single HTTP connection that should contain the same
data; fail if the size grows by more than 100 bytes from one query
to the next.
I.e. print the name of the function in BIND that called the system
function that returned an error. Since it was useful for pthreads
code, it seems worthwhile doing so everywhere.
There are multiple reasons to remove this test as obsolete:
- The test may not possibly work for over 2.5 years, since
98b3b93791 removed the rndc.py python
tool on which this test relies.
- It isn't part of the test suite either in CI or locally unless it is
explicitly enabled. As a result, there are many issues which prevent
the test from being executed caused by various refactoring efforts
accumulated over time.
- Even if the test could be executed, it has no clear failure condition.
If the python script(s) fail, the test still passes.
Now that the artificial limit on the recv buffer has been removed, the
current system test always fails because it tests if the truncation has
happened.
Add test that sending more than 10 headers makes the connection to
closed; and add test that sending huge HTTP request makes the connection
to be closed.
Sometimes doth test could intermittently fail shortly after start due
to inability to complete a zone transfer in time. As it turned out, it
could happen due to transfers-in/out limits. Initially the defaults
were fine, but over time, especially when adding Strict/Mutual TLS, we
added more than 10 zones so it became possible to hit the limits.
This commit takes care of that by bumping the limits.
This commit reduces the size of HTTP listener quota from 300 (default)
to 100 so that it would make hitting any global limits in case of
running multiple tests in parallel in multiple containers unlikely.
This way the need in opening many file descriptors of different
kinds (e.g. client side connections and pipes) gets significantly
reduced while the required code paths are still verified.
The bin/tests/system/start.pl script waits until a "running" message is
logged by a given name server instance before attempting to send a
version.bind/CH/TXT query to it. The idea behind this was to make the
script wait until named loads all the zones it is configured to serve
before telling the system test framework that a given server is ready to
use; this prevents the need to add boilerplate code that waits for a
specific zone to be loaded to each test expecting that.
The problem is that when it looks for "running" messages, the
bin/tests/system/start.pl script assumes that the existence of any such
message in the named.run file indicates that a given named instance has
already finished loading all zones. Meanwhile, some system tests
restart all the named instances they use throughout their lifetime (some
even do that a few times), for example to run Python-based tests. The
bin/tests/system/start.pl script handles such a scenario incorrectly: as
soon as it finds any "running" message in the named.run file it inspects
and it gets a response to a version.bind/CH/TXT query, it tells the
system test framework that a given server is ready to use, which might
not be true - it is possible that only the "version.bind" zone is loaded
at that point and the "running" message found was logged by a
previously-shutdown named instance. This triggers intermittent failures
for Python-based tests.
Fix by improving the logic that the bin/tests/system/start.pl script
uses to detect server startup: check how many "running" lines are
present in a given named.run file before attempting to start a named
instance and only proceed with version.bind/CH/TXT queries when the
number of "running" lines found in that named.run file increases after
the server is started.
In the "rrsetorder" system test, the ns2 named instance is restarted
without passing the --restart option to bin/tests/system/start.pl. This
causes the log file for that named instance to be needlessly truncated.
Prevent this from happening by restarting the affected named instance
in the same way as all the other named instances used in system tests.
The previous commit failed some tests because we expect that if a
fetch fails and we have stale candidates in cache, the
stale-refresh-time window is started. This means that if we hit a stale
entry in cache and answering stale data is allowed, we don't bother
resolving it again for as long we are within the stale-refresh-time
window.
This is useful for two reasons:
- If we failed to fetch the RRset that we are looking for, we are not
hammering the authoritative servers.
- Successor clients don't need to wait for stale-answer-client-timeout
to get their DNS response, only the first one to query will take
the latency penalty.
The latter is not useful when stale-answer-client-timeout is 0 though.
So this exception code only to make sure we don't try to refresh the
RRset again if it failed to do so recently.
dohpath is specfied in draft-ietf-add-svcb-dns and has a value
of 7. It must be a relative path (start with a /), be encoded
as UTF8 and contain the variable dns ({?dns}).
When the TCP test is run on the busy server, the server might take a
while to wind the server down because it might still be processing all
that 300k invalid XFR requests.
Increate the rncd wait time to 120 seconds, the SIGTERM time to 300
seconds, and reduce the time to wait for ans servers from 1200 second
to just 120 seconds.
Add several test cases in the 'upforwd' system test to make sure
that different scenarios of Dynamic DNS update forwarding are
tested, in particular when both the original and forwarded requests
are over Do53, or DoT, or they use different transports.
The HMACs and GSSAPI are just using unallocated values.
Moving them around shouldn't cause issues.
Only the dnssec system test knew the internal number in use for hmacmd5.
Switch the primary to require 'next_key' for zone transfers then
update the catalog zone to say to use 'next_key'. Next update the
zones contents then check that those changes are seen on the
secondary.
Add a simple test PKI based on the existing one in the doth test.
Check ephemeral, forward-secrecy, and forward-secrecy-mutual-tls
TLS configurations with different scenarios.
The comments in CA.cfg file serve as a good tutorial for setting up
a simple PKI for a system test. There is a typo in one of the presented
commands, which results in openssl not exiting with an error message
instead of generating a certificate.
Fix the typo.
If an NS RRset at the parent side of a delegation point only contains
in-bailiwick NS records, at least one glue record should be included in
every referral response sent for such a delegation point or else clients
will need to send follow-up queries in order to determine name server
addresses. In certain edge cases (when the total size of a referral
response without glue records was just below to the UDP packet size
limit), named failed to adhere to that rule by sending non-truncated,
glueless referral responses.
Add tests attempting to trigger that bug in several different scenarios,
covering all possible combinations of the following factors:
- type of zone (signed, unsigned),
- glue record type (A, AAAA, both).
Bring the "glue" system test up to speed with other system tests: add
check numbering, ensure test artifacts are preserved upon failure,
improve error reporting, make the test fail upon unexpected errors,
address ShellCheck warnings.
Instead of expecting that telemetry check has already finished,
wait for it for maximum of three seconds, because named is run with
-tat=3, so the telemetry check must happen with 3 second window.
Co-authored-by: Evan Hunt <each@isc.org>
This change prepares ground for sending DNS requests using DoT,
which, in particular, will be used for forwarding dynamic updates
to TLS-enabled primaries.
The tcp Pytest on OpenBSD fairly reliably fails when receive_tcp()
on a socket is attempted:
> (response, rtime) = dns.query.receive_tcp(sock, timeout())
tests-tcp.py:50:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.9/site-packages/dns/query.py:659: in receive_tcp
ldata = _net_read(sock, 2, expiration)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sock = <socket.socket [closed] fd=-1, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6>
count = 2, expiration = 1662719959.8106785
def _net_read(sock, count, expiration):
"""Read the specified number of bytes from sock. Keep trying until we
either get the desired amount, or we hit EOF.
A Timeout exception will be raised if the operation is not completed
by the expiration time.
"""
s = b''
while count > 0:
try:
> n = sock.recv(count)
E socket.timeout: timed out
This is because the socket is already closed.
Bump the socket connection timeout to 10 seconds.
If there was a collision of key id across algorithms it was not
possible to determine where counter applies to which algorithm for
xml statistics while for json only one of the values was emitted.
The key names are now "<algorithm-number>+<id>" (e.g. "8+54274").
the 'resolve' binary was added for testing dns_client as part of
the export library. the export libraries are no longer supported,
and tests using 'delv' provide the same coverage, so 'resolve' can
be removed now.
dns_request_create() was a front-end to dns_request_createvia() that
was only used by test binaries. dns_request_createvia() has been
renamed to dns_request_create(), and the test programs that formerly
used dns_request_create() have been updated to use the new parameters.
The test expected `xn--ah-` to be treated as a syntax error (punycode
requires letters after the last hyphen) but libidn2 on buster
converted the label to `ah` instead. To avoid this bug, change the
invalid label to `xn--0000h` which translates to an out-of-range
unicode codepoint (beyond the maximum value) which is corectly
trated as invalid in older libidn2.
The change to `testsock.pl` in commit 258a896a broke the system
tests in out-of-tree builds because `ifconfig.sh.in` is not
copied to the worktree. Use `ifconfig.sh` instead.
Reduce the number of places that know about the number of IP addresses
required by the system tests, by changing `testsock.pl` to read the
`max` from `ifconfig.sh.in`. This should make the test runner fail
early with a clear message when the interfaces have been set up by an
obsolete script.
Add comments to cross-reference `ifconfig.sh.in`, `testsock.pl`, and
`org.isc.bind.system` to make it easier to remember what needs
updating when an IP address is added.
If there are any problems with IDN processing, DiG will now quietly
handle the name as if IDN were disabled. This means that international
query names are rendered verbatim on the wire, and ACE names are
printed raw without conversion to UTF8.
If you want to check the syntax of international domain names,
use the `idn2` utility.
It is possible to bypass Response Rate Limiting (RRL)
`responses-per-second` limitation using specially crafted wildcard
names, because the current implementation, when encountering a found
DNS name generated from a wildcard record, just strips the leftmost
label of the name before making a key for the bucket.
While that technique helps with limiting random requests like
<random>.example.com (because all those requests will be accounted
as belonging to a bucket constructed from "example.com" name), it does
not help with random names like subdomain.<random>.example.com.
The best solution would have been to strip not just the leftmost
label, but as many labels as necessary until reaching the suffix part
of the wildcard record from which the found name is generated, however,
we do not have that information readily available in the context of RRL
processing code.
Fix the issue by interpreting all valid wildcard domain names as
the zone's origin name concatenated to the "*" name, so they all will
be put into the same bucket.
The zone 'retransfer3.' tests whether zones that 'rndc signing
-nsec3param' requests are queued even if the zone is not loaded.
The test assumes that if 'rndc signing -list' shows that the zone is
done signing with two keys, and there are no NSEC3 chains pending, the
zone is done handling the '-nsec3param' queued requests. However, it
is possible that the 'rndc signing -list' command is received before
the corresponding privatetype records are added to the zone (the records
that are used to retrieve the signing status with 'rndc signing').
This is what happens in test failure
https://gitlab.isc.org/isc-projects/bind9/-/jobs/2722752.
The 'rndc signing -list retransfer3' is thus an unreliable check.
It is simpler to just remove the check and wait for a certain amount
of time and check whether ns3 has re-signed the zone using NSEC3.
Check the new configuration option's syntax using the 'checkconf' system
test.
Check if the new option works by parsing DiG's output in the 'rpz'
system test.
Previously:
* applications were using isc_app as the base unit for running the
application and signal handling.
* networking was handled in the netmgr layer, which would start a
number of threads, each with a uv_loop event loop.
* task/event handling was done in the isc_task unit, which used
netmgr event loops to run the isc_event calls.
In this refactoring:
* the network manager now uses isc_loop instead of maintaining its
own worker threads and event loops.
* the taskmgr that manages isc_task instances now also uses isc_loopmgr,
and every isc_task runs on a specific isc_loop bound to the specific
thread.
* applications have been updated as necessary to use the new API.
* new ISC_LOOP_TEST macros have been added to enable unit tests to
run isc_loop event loops. unit tests have been updated to use this
where needed.
The wait_for_zone_is_signed function was never called, which could lead
to test failures due to timing issues (where a zone was not fully signed
yet, but the test was trying to verify the zone).
Also add two missing set_nsec3param calls to ensure the ITERATIONS
value is set for these test cases.
Add two scenarios where we change the dnssec-policy from using RSASHA1
to something with NSEC3.
The first case should work, as the DS is still in hidden state and we
can basically do anything with DNSSEC.
The second case should fail, because the DS of the predecessor is
published and we can't immediately remove the predecessor DNSKEY. So
in this case we should keep the NSEC chain for a bit longer.
Add two more scenarios where we change the dnssec-policy from using
NSEC3 to something NSEC only. Both should work because there are no
restrictions on using NSEC when it comes to algorithms, but in the
cases where the DS is published we can't bluntly remove the predecessor.
Extend the nsec3 system test by also checking the DNSKEY RRset for the
expected DNSKEY records. This requires some "kasp system"-style setup
for each test (setting key properties and key states). Also move the
dnssec-verify check inside the check_nsec/check_nsec3 functions because
we will have to do that every time.
Before the commit some checks in the system test would try to verify
that different HTTP methods can be used and are functional. However,
until recently, it was not possible to tell from the output which
method was in fact used, so it turned out that +http-plain-get option
is broken.
This commit add the additional checks to prevent that from happening
in the future.
Moves tests from being RSASHA1 based to RSASHA256 based where possible
and split out the remaining RSASHA1 based tests so that they are not
run on OS's that don't support RSASHA1.
migrate-nomatch-alglen: switched to RSASHA256 instead of RSASHA1
and the key size now changes from 2048 bits to 3072 bits instead
of 1024 bits to 2048 bits.
migrate-nomatch-algnum: switched to RSASHA256 instead of RSASHA1
as initial algorithm and adjusted mininum key size to 2048 bits.
rsasha256: adjusted minimum key size to 2048 bits.
The nsec-only.example zone was not converted as we use it to
test nsec-only DNSSEC algorithms to nsec3 conversion failure.
The subtest is skipped in fips mode.
Update "checking revoked key with duplicate key ID" test
to use FIPS compatible algorithm.
Clean up dns_rdatalist_tordataset() and dns_rdatalist_fromrdataset()
functions by making them return void, because they cannot fail.
Clean up other functions that subsequently cannot fail.
In the CI dig sometimes produces warning/error comments when
communicating with the server, which produces problems when comparing
the outputs.
Here is an example of a dig output with a warning message which
is benign, because dig, after a retry, managed to query the server.
;; communications error to 10.53.0.3#7529: timed out
1.2.3.1
1.2.3.2
1.2.3.3
1.2.3.4
When comparing this to the expected output, which doesn't contain
the comment line (starting with double ';'), the outputs don't match.
Use grep inverse logic to strip the comments from the dig outputs.
There are existing tests for simulating timeouts, read errors, and
refused connecion errors. Implement also "network unreachable"
simulation.
Use "fixed" string search mode `-F` for `grep` in more places where
it is appropriate to do so.
DiG implements different logic in the `recv_done()` callback function
when processing a failure:
1. For a timed-out query it applies the "retries" logic first, then,
when it fails, fail-overs to the next server.
2. For an EOF (end-of-file, or unexpected disconnect) error it tries to
make a single retry attempt (even if the user has requested more
retries), then, when it fails, fail-overs to the next server.
3. For other types of failures, DiG does not apply the "retries" logic,
and tries to fail-over to the next servers (again, even if the user
has requested to make retries).
Simplify the logic and apply the same logic (1) of first retries, and
then fail-over, for different types of failures in `recv_done()`.
The "max-zone-ttl" option should now be configured as part of
"dnssec-policy". The option with the same name in "zone" and
"options" is hereby flagged as deprecated, and its functionality
will be removed in a future release.
The BUFSIZ value varies between platforms, it could be 8K on Linux and
512 bytes on mingw. Make sure the buffers are always big enough for the
output data to prevent truncation of the output by appropriately
enlarging or sizing the buffers.
The statistics system test makes a query to foo.info to check for the
pending connections because the ans4 doesn't respond to the query.
This might or might not (depending on exact timing) increment the failed
TCP connection counter when the query is retried over TCP because ans4
doesn't listen on the TCP.
Wait for the 'connection refused' in the ns3 log file to be able to
count the exactly 1 failed TCP connection.
There should be 2 keys with the same key id after the numerically
lower one is revoked (serial space arithmetic). The DS points
at the non-revoked key so validation should still succeed.
Fix a comment, ensuring the right parameters are used (zone is
parameter $3, not $2) and add view and policy parameters to the comment.
Fix the view tests and test the correct view (example3 instead of
example2).
Fix placement of "n=$((n+1)" for two test cases.
"rndc fetchlimit" now also prints a list of domain names that are
currently rate-limited by "fetches-per-zone".
The "fetchlimit" system test has been updated to use this feature
to check that domain limits are applied correctly.
this command runs dns_adb_dumpquota() to display all servers
in the ADB that are being actively fetchlimited by the
fetches-per-server controls (i.e, servers with a nonzero average
timeout ratio or with the quota having been reduced from the
default value).
the "fetchlimit" system test has been updated to use the
new command to check quota values instead of "rndc dumpdb".
* make it harder to get the interface numbers wrong by using 'max'
to specify the upper bound of the sequence of interfaces and use 'max'
when calculating the interface number
* extract the platform specific instruction into 'up' and 'down'
and call them from the inner loop so that the interface number is
calculated in one place.
* calculate the A and AAAA address in a single place rather than
in each command
* use /sbin/ipadm on Solaris 2.11 and greater
previously, when an iterative query returned FORMERR, resolution
would be stopped under the assumption that other servers for
the same domain would likely have the same capabilities. this
assumption is not correct; some domains have been reported for
which some but not all servers will return FORMERR to a given
query; retrying allows recursion to succeed.
The original sscanf processing allowed for a number of syntax errors
to be accepted. This included missing the closing brace in
${modifiers}
Look for both comma and right brace as intermediate seperators as
well as consuming the final right brace in the sscanf processing
for ${modifiers}. Check when we got right brace to determine if
the sscanf consumed more input than expected and if so behave as
if it had stopped at the first right brace.
$GENERATE uses 'int' for its computations and some constructions
can overflow values that can be represented by an 'int' resulting
in undefined behaviour. Detect these conditions and return a
range error.
On slow systems we have seen this take 9 seconds. Increased the
allowance from 3 seconds to 10 seconds to reduce the probabilty of
a false negative from the system test.
The previous test code could emit "D:cds:stderr did not match ''" rather
that just showing the contents of stderr. Moved the debug line inside
the if/else block.
Replaced backquotes with $() and $(()) as approriate.
We are grafting on an unsigned zone "example.internal" where the higher
zone (".") is signed and would otherwise cause named to synthesise a
NXDOMAIN for example.internal. We prime the cache by performing a
lookup for "internal" and then lookup "example.internal".
The "glue-cache" option was marked as deprecated by commit
5ae33351f2 (first released in BIND 9.17.6,
back in October 2020), so now obsolete that option, removing all code
and documentation related to it.
Note: this causes the glue cache feature to be permanently enabled, not
disabled.