Now that ns_query_init() cannot fail, remove the error paths, especially
in ns__client_setup(), where we no longer have to decide what to do with
the connection if setting up the client fails. It could not actually
fail before either, but now that is formalized.
View matching on an incoming query checks the query's signature, which
can be a CPU-heavy task for a SIG(0)-signed message. Implement an
asynchronous mode for the view matching function that uses the offloaded
signature-checking facilities, and use it for incoming queries.
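A minimal sketch of the offload pattern, using plain pthreads rather
than BIND's internal offload facilities (all names here are
illustrative): the CPU-heavy check runs on a detached worker thread, and
a completion callback resumes view matching with the result.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct offload_job {
        int (*check)(void *arg);         /* CPU-heavy SIG(0) verification */
        void (*done)(void *arg, int ok); /* resumes view matching */
        void *arg;
    } offload_job_t;

    static void *
    offload_run(void *v) {
        offload_job_t *job = v;
        job->done(job->arg, job->check(job->arg));
        free(job);
        return NULL;
    }

    /* Run the check off the networking thread; the caller returns
     * immediately and continues when 'done' fires. */
    static int
    offload_submit(const offload_job_t *tmpl) {
        offload_job_t *job = malloc(sizeof(*job));
        pthread_t thread;

        if (job == NULL) {
            return -1;
        }
        *job = *tmpl;
        if (pthread_create(&thread, NULL, offload_run, job) != 0) {
            free(job);
            return -1;
        }
        return pthread_detach(thread);
    }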
To protect against a malicious DNS client that sends many queries with
SIG(0)-signed messages, add a quota on simultaneously running SIG(0)
checks.
This protection can only help when named is using more than one worker
thread. For example, if named is running with the '-n 4' option and
'sig0checks-quota 2;' is used, then named makes sure that no more than
two workers run SIG(0) signature checks in parallel, leaving the other
workers free to serve the remaining clients that do not use
SIG(0)-signed messages.
That limitation is going to change when SIG(0) signature checks are
offloaded to "slow" threads in a future commit.
The 'sig0checks-quota-exempt' ACL option can be used to exempt certain
clients from the quota requirements using their IP or network addresses.
The 'sig0checks-quota-maxwait-ms' option defines the maximum time named
waits for a quota slot to become available. If no slot becomes available
during that time, named answers the client with DNS_R_REFUSED.
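The mechanism could be sketched with a plain POSIX counting semaphore
(this is not the isc_quota API; names and types are illustrative): a
worker waits at most 'sig0checks-quota-maxwait-ms' for one of the
'sig0checks-quota' slots, and the query is refused when the wait times
out.

    #include <errno.h>
    #include <semaphore.h>
    #include <time.h>

    /* initialized at startup: sem_init(&sig0_quota, 0, quota_size) */
    static sem_t sig0_quota;

    /* Returns 0 with a slot held (release it with sem_post()), or -1
     * if no slot became available within maxwait_ms: answer REFUSED. */
    static int
    sig0_quota_acquire(unsigned int maxwait_ms) {
        struct timespec ts;

        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += maxwait_ms / 1000;
        ts.tv_nsec += (long)(maxwait_ms % 1000) * 1000000L;
        if (ts.tv_nsec >= 1000000000L) {
            ts.tv_sec += 1;
            ts.tv_nsec -= 1000000000L;
        }
        while (sem_timedwait(&sig0_quota, &ts) != 0) {
            if (errno != EINTR) {
                return -1; /* ETIMEDOUT: no slot, refuse the query */
            }
        }
        return 0;
    }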
The changes in this MR prevent the memory used for sending outgoing TCP
requests from spiking as much. That removes the need for a dedicated
memory context, and since we generally prefer simplicity, remove the
extra memory context with its own jemalloc arenas that existed just for
the outgoing send buffers.
As a single thread can process only one TCP send at a time, we don't
really need a memory pool for the TCP buffers; it's enough to have a
single per-loop (client manager) static buffer that is used to assemble
the DNS message, which is then copied into its own sending buffer.
In the future, this could be optimized by exposing the uv_try API from
the network manager: first try to send the message directly, and
allocate a sending buffer only if the data needs to be sent
asynchronously.
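A sketch of that future fast path against the plain libuv API (the
wrapper names are hypothetical, not the network manager's):
uv_try_write() is attempted first, and a heap copy is made only for the
part that has to be sent asynchronously.

    #include <stdlib.h>
    #include <string.h>
    #include <uv.h>

    typedef struct send_req {
        uv_write_t req; /* must stay the first member */
        char *copy;     /* freed once the async write completes */
    } send_req_t;

    static void
    on_write(uv_write_t *req, int status) {
        send_req_t *sr = (send_req_t *)req;
        (void)status;
        free(sr->copy);
        free(sr);
    }

    static int
    send_message(uv_stream_t *stream, const char *msg, size_t len) {
        uv_buf_t buf = uv_buf_init((char *)msg, (unsigned int)len);
        int r = uv_try_write(stream, &buf, 1);
        size_t done;
        send_req_t *sr;

        if (r == (int)len) {
            return 0; /* fully written, nothing was allocated */
        }
        if (r < 0 && r != UV_EAGAIN) {
            return r; /* hard error */
        }
        /* Copy only the unsent remainder and finish asynchronously. */
        done = (r > 0) ? (size_t)r : 0;
        sr = malloc(sizeof(*sr));
        if (sr == NULL) {
            return UV_ENOMEM;
        }
        sr->copy = malloc(len - done);
        if (sr->copy == NULL) {
            free(sr);
            return UV_ENOMEM;
        }
        memcpy(sr->copy, msg + done, len - done);
        buf = uv_buf_init(sr->copy, (unsigned int)(len - done));
        return uv_write(&sr->req, stream, &buf, 1, on_write);
    }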
Constantly allocating, reallocating, and deallocating 64 KB TCP send
buffers in 'ns_client' instances takes too much CPU time.
There is an existing mechanism to reuse the ns_client_t structure
associated with the handle, using 'isc_nmhandle_getdata/_setdata' (see
ns_client_request()), but it doesn't work with TCP: every time
ns_client_request() is called, it gets a new handle, even for the same
TCP connection; see the comments in
streamdns_on_complete_dnsmessage().
To solve the problem, introduce an array of available (unused) TCP
buffers, stored in the ns_clientmgr_t structure, so that a client
working over TCP has a chance to reuse an existing buffer (if there is
one) instead of allocating a new one every time.
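A minimal sketch of such a buffer pool, assuming it is only ever touched
from the one loop the client manager is bound to (so no locking is
needed); the names and the pool size are illustrative.

    #include <stddef.h>
    #include <stdlib.h>

    #define SENDBUF_SIZE (64 * 1024)
    #define POOL_MAX 8 /* illustrative cap on cached buffers */

    typedef struct bufpool {
        unsigned char *bufs[POOL_MAX];
        size_t n;
    } bufpool_t;

    static unsigned char *
    bufpool_get(bufpool_t *pool) {
        /* reuse an idle buffer if one is available */
        return (pool->n > 0) ? pool->bufs[--pool->n]
                             : malloc(SENDBUF_SIZE);
    }

    static void
    bufpool_put(bufpool_t *pool, unsigned char *buf) {
        if (pool->n < POOL_MAX) {
            pool->bufs[pool->n++] = buf; /* keep for the next client */
        } else {
            free(buf);
        }
    }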
The high-water value allows administrators to better tune the
recursive-clients limit without having to poll the statistics channel at
a high rate to obtain this number.
Returning the value allows for better high-water tracking without
running into edge cases like the following (a sketch of the fixed
pattern follows the list):
0. The counter is at value X
1. Increment the value (X+1)
2. The value is decreased multiple times in other threads (X+1-Y)
3. Get the value (X+1-Y)
4. Update-if-greater misses the X+1 value, which should have been the
high-water mark
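In C11 atomics, the fixed pattern looks roughly like this (illustrative,
not the isc_stats code): the increment itself returns the new value, and
exactly that value feeds an update-if-greater loop, so a concurrent
decrement can no longer hide the true peak.

    #include <stdatomic.h>
    #include <stdint.h>

    static atomic_int_fast64_t counter;
    static atomic_int_fast64_t high_water;

    static void
    counter_increment(void) {
        int_fast64_t now = atomic_fetch_add(&counter, 1) + 1;
        int_fast64_t hw = atomic_load(&high_water);

        /* update-if-greater on the value *we* produced, not on a
         * later, possibly already decremented, read of the counter */
        while (now > hw &&
               !atomic_compare_exchange_weak(&high_water, &hw, now)) {
            /* 'hw' was refreshed by the failed CAS; retry */
        }
    }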
This commit ensures that listeners are recreated on reconfiguration
when their transport type changes (or when the PROXY protocol setting
changes).
Previously, if a "listen-on" statement was modified to use a different
transport (e.g. DoH -> DoT) for a given interface address and port
combination, BIND would not pick up the change on reconfiguration. This
commit fixes that by recreating the listener.
Initially, that worked for most of the new transports, as we recreated
the DoH and DoT listeners on each reconfiguration. At some point we
changed that so that listeners were not recreated, to avoid rebinding a
port: on some platforms only root can bind port numbers below 1024,
which makes binding certain ports possible only at startup. Instead, we
chose to asynchronously update listener socket settings (such as TLS
contexts and HTTP settings).
Now we both avoid recreating the sockets when it is unnecessary and
recreate listeners when the listener type changes.
This commit extends the "listen-on" statement with a "proxy" option
that allows one to enable PROXYv2 support on a given listener. It can
have the following values:
- "plain" to accept PROXYv2 headers sent unencrypted, even over
encrypted transports.
- "encrypted" to accept PROXYv2 headers sent encrypted, right after the
TLS handshake.
The AES algorithm for DNS cookies was being kept for legacy reasons, and
it can be safely removed in the next major release. Remove both the AES
usage for DNS cookies and the AES implementation itself.
Instead of creating new memory pools for each new dns_message, change
the dns_message_create() function to optionally accept externally
created dns_fixedname_t and dns_rdataset_t memory pools. This allows us
to preallocate the memory pools in the ns_client and dns_resolver units
for the lifetime of the dns_resolver_t and ns_clientmgr_t objects.
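The pattern might be sketched like this (generic C with illustrative
names, not the actual dns_message_create() signature): the constructor
borrows caller-owned pools when they are supplied, and only creates, and
later destroys, pools it allocated itself.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    typedef struct pool {
        size_t objsize;
    } pool_t;

    static pool_t *
    pool_create(size_t objsize) {
        pool_t *pool = malloc(sizeof(*pool));
        if (pool != NULL) {
            pool->objsize = objsize;
        }
        return pool;
    }

    typedef struct message {
        pool_t *namepool;
        pool_t *rdspool;
        bool own_pools; /* destroy only what we created ourselves */
    } message_t;

    static void
    message_init(message_t *msg, pool_t *namepool, pool_t *rdspool) {
        if (namepool != NULL && rdspool != NULL) {
            /* long-lived pools preallocated by the caller, e.g. for
             * the lifetime of a resolver or a client manager */
            msg->namepool = namepool;
            msg->rdspool = rdspool;
            msg->own_pools = false;
        } else {
            msg->namepool = pool_create(256);
            msg->rdspool = pool_create(512);
            msg->own_pools = true;
        }
    }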
Instead of a high number of dispatches (4 * named_g_udpdisp)[1], make
the dispatches bound to threads and make dns_dispatchset_t create a
dispatch for each thread (event loop), as sketched below.
This required a couple of other changes:
1. dns_dispatch_createudp() must be called on a loop, so that isc_tid()
is already initialized; changes to nsupdate and mdig were required.
2. dns_requestmgr had only a single dispatch each for v4 and v6.
Instead of using a single dispatch, use a dns_dispatchset_t for each
protocol; this is the same as in dns_resolver.
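The per-thread layout could be sketched as follows (illustrative types,
not the real dns_dispatchset_t): each loop indexes the array with its
own thread id, the role isc_tid() plays in BIND, so no locking or
round-robin counter is needed.

    #include <stddef.h>

    typedef struct dispatch dispatch_t; /* opaque placeholder */

    typedef struct dispatchset {
        size_t nloops;
        dispatch_t **dispatches; /* one dispatch per event loop */
    } dispatchset_t;

    /* set once when each event-loop thread starts */
    static _Thread_local size_t current_tid;

    static dispatch_t *
    dispatchset_get(dispatchset_t *set) {
        /* each loop only ever touches its own slot */
        return set->dispatches[current_tid];
    }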
isc_stats_create() can no longer return anything other than
ISC_R_SUCCESS. Refactor isc_stats_create() and its variants in libdns,
libns, and named to return void.
The server was previously tolerant of out-of-date or otherwise bad DNS
SERVER COOKIES that were well formed, unless require-cookie was set.
BADCOOKIE is now returned for these conditions.
This commit changes the send buffer allocation strategy for
stream-based transports. Before this change, we would allocate a dynamic
buffer sized at 64 KB even when we did not need that much, which could
lead to high memory usage on the server. Now we resize the send buffer
to match the size of the actual data, freeing the memory at the end of
the buffer so it can be reused later.
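A minimal sketch of the idea (illustrative names, not BIND's allocator):
assemble into a worst-case 64 KB allocation, then trim it to the actual
message length before queueing it for sending.

    #include <stdlib.h>

    #define MAX_DNS_MSG (64 * 1024)

    /* 'buf' was allocated at MAX_DNS_MSG; once the real size is
     * known, return the unused tail to the allocator. */
    static unsigned char *
    sendbuf_trim(unsigned char *buf, size_t used) {
        unsigned char *trimmed = realloc(buf, used);
        /* a failed shrink just keeps the original allocation */
        return (trimmed != NULL) ? trimmed : buf;
    }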
'-T transferinsecs' makes named interpret the max-transfer-time-out,
max-transfer-idle-out, max-transfer-time-in and max-transfer-idle-in
configuration options as seconds instead of minutes.
'-T transferslowly' makes named sleep for one second for every xfrout
message.
'-T transferstuck' makes named sleep for one minute for every xfrout
message.
In e185412872, the TCP accept quota code
became broken in a subtle way: the quota would be initialized on the
first accept for the server socket and then deleted from the server
socket, so it would never be applied again.
Properly fixing this required a larger refactoring of the isc_quota API
to make it much simpler. The new code decouples ownership of the quota
from acquiring/releasing the quota limit.
During the refactoring it became clearer that we need to use the
callback from the child side of the accepted connection, not the server
side.
This should have no functional effect.
The message size stats are specified by RSSAC002 so it's best not
to mess around with how they appear in the statschannel. But it's
worth changing the implementation to use general-purpose histograms,
to reduce code size and benefit from sharded counters.
add a public function ns_interface_create() allowing the caller
to set up a listening interface directly without having to set
up listen-on and scan network interfaces.
this function was just a front-end for gethostname(). it was needed
when we supported Windows, which has a different function for looking
up the hostname; it's not needed any longer.
as there is no further use of isc_task in BIND, this commit removes
it, along with isc_taskmgr, isc_event, and all other related types.
functions that accepted taskmgr as a parameter have been cleaned up.
as a result of this change, some functions can no longer fail, so
they've been changed to type void, and their callers have been
updated accordingly.
the tasks table has been removed from the statistics channel and the
stats version has been updated. dns_dyndbctx has been changed to
reference the loopmgr instead of the taskmgr, and DNS_DYNDB_VERSION has
been updated as well.
callback events from dns_resolver_createfetch() are now posted
using isc_async_run.
other modules which called the resolver and maintained task/taskmgr
objects for this purpose have been cleaned up.
removed some functions that are no longer used and unlikely to
be resurrected, and also some that were only used to support Windows
and can now be replaced with generic versions.
limit the number of simultaneous DNS UPDATE events that can be
processed by adding a quota for update and update forwarding.
this quota currently, arbitrarily, defaults to 100.
also add a statistics counter to record when the update quota
has been exceeded.
DSCP has not been fully working since the network manager was
introduced in 9.16, and has been completely broken since 9.18.
This seems to have caused very few difficulties for anyone,
so we have now marked it as obsolete and removed the
implementation.
To ensure that old config files don't fail, the code to parse
dscp key-value pairs is still present, but a warning is logged
that the feature is obsolete and should not be used. Nothing is
done with configured values, and there is no longer any
range checking.
Call ns_query_cancel() on the recursing clients when named shutdown is
initiated, for a faster shutdown.
When we are shutting down the resolver, we cancel all the outstanding
fetches, but the ISC_R_CANCELED events don't propagate to the ns_client
callback.
In the future, a better fix would be to look at the shutdown paths and
let cancellation propagate all the way from the bottom (loopmgr) to the
top (e.g. ns_client).
the 'nupdates' field was originally used to track whether a client
was ready to shut down, along with other similar counters nreads,
nrecvs, naccepts and nsends. this is now tracked differently, but
nupdates was overlooked when the other counters were removed.
Don't attempt to refresh stale RRsets for intermediate results; doing
so may trigger multiple refreshes and can cause a crash.
One scenario is where, for the query, there are a CNAME and the
canonical answer in the cache, both stale. Because we encountered stale
data and prioritized it over the lookup, a refresh of both RRsets is
triggered. When we start recursing, a recursion loop is detected
because the recursion parameters eventually become the same. The sanity
check in 'dns_resolver_destroyfetch' then fails: one of the callers did
not get its event back before trying to destroy the fetch.
Move the call to 'query_refresh_rrset' to 'ns_query_done', so that it
is only called once per client request.
Another scenario is where, for the query, there is a stale CNAME in the
cache that points to a record that is also in the cache but not stale.
This triggers a refresh of the RRset (because we encountered stale data
and prioritized it over the lookup).
We mark RRsets that we add to the message with
DNS_RDATASETATTR_STALE_ADDED to prevent adding a duplicate RRset when a
stale lookup and a normal lookup conflict with each other. However,
when following a CNAME chain, the other, non-stale RRset is added to
the message without that attribute set, because it is not stale.
This is a variant of the bug in #2594. That fix covered the same crash,
but only for stale-answer-client-timeout > 0.
Fix this by clearing all RRsets from the message before refreshing.
This requires the refresh to happen after the response has been sent
back to the client.
Previously:
* applications used isc_app as the base unit for running the
application and handling signals.
* networking was handled in the netmgr layer, which started a number
of threads, each with a uv_loop event loop.
* task/event handling was done in the isc_task unit, which used the
netmgr event loops to run the isc_event calls.
In this refactoring:
* the network manager now uses isc_loop instead of maintaining its
own worker threads and event loops.
* the taskmgr that manages isc_task instances now also uses
isc_loopmgr, and every isc_task runs on a specific isc_loop bound to a
specific thread.
* applications have been updated as necessary to use the new API.
* new ISC_LOOP_TEST macros have been added to enable unit tests to
run isc_loop event loops. unit tests have been updated to use this
where needed.
* isc_timer was rewritten on top of uv_timer, and isc_timermgr_t was
completely removed; isc_timer objects are now created directly on the
isc_loop event loops.
* the isc_timer API has been simplified. the "inactive" timer type has
been removed; timers are now stopped by calling isc_timer_stop()
instead of being reset to inactive (see the sketch after this list).
* isc_manager now creates a loop manager rather than a timer manager.
* modules and applications using isc_timer have been updated to use the
new API.
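For illustration, the pattern isc_timer now follows maps directly onto
the underlying libuv calls; a stopped timer is simply disarmed, with no
separate "inactive" type (a minimal standalone sketch, not BIND code):

    #include <uv.h>

    static void
    on_timer(uv_timer_t *timer) {
        static int ticks = 0;
        if (++ticks == 3) {
            /* replaces the old "reset to inactive" idiom */
            uv_timer_stop(timer);
        }
    }

    int
    main(void) {
        uv_loop_t *loop = uv_default_loop();
        uv_timer_t timer;

        uv_timer_init(loop, &timer); /* created directly on a loop */
        uv_timer_start(&timer, on_timer, 500, 1000); /* ms */
        return uv_run(loop, UV_RUN_DEFAULT);
    }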
This way, only the quota size is passed to the interface/listener
management code instead of a quota object. Thus, we can update the size
of the existing quota object instead of recreating it.
Similarly to how different code paths reused common client handle
pointers and fetch references despite being logically unrelated, they
also reuse client->recursionquota, the field in which a reference to the
recursion quota is stored. This unnecessarily forces all code using
that field to be aware of the fact that it is overloaded by different
features.
Overloading client->recursionquota also causes inconsistent behavior.
For example, if prefetch code triggers recursion and then delegation
handling code also triggers recursion, only one of these code paths will
be able to attach to the recursion quota, but both recursions will be
started anyway. In other words, each code path only checks whether the
recursion quota has not been exceeded if the quota has not yet been
attached to by another code path. This behavior theoretically allows
the configured recursion quota to be slightly exceeded; while it is not
expected to be a real-world operational issue, it is still confusing and
should therefore be fixed.
Extend the structures comprising the 'recursions' array with a new field
holding a pointer to the recursion quota that a given recursion process
attached to. Update all code paths using client->recursionquota so that
they use the appropriate slot in the 'recursions' array. Drop the
'recursionquota' field from ns_client_t.
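A hypothetical sketch of the resulting layout (the real BIND
definitions differ in detail; the placeholder types stand in for
isc_nmhandle_t, dns_fetch_t, and isc_quota_t):

    typedef struct nmhandle nmhandle_t; /* opaque placeholders */
    typedef struct fetch fetch_t;
    typedef struct quota quota_t;

    typedef enum {
        RECTYPE_NORMAL = 0,
        RECTYPE_PREFETCH,
        RECTYPE_HOOK,
        RECTYPE_COUNT,
    } rectype_t;

    typedef struct recursion {
        nmhandle_t *handle; /* client handle held for this recursion */
        fetch_t *fetch;     /* the fetch driving this recursion */
        quota_t *quota;     /* quota this recursion attached to */
    } recursion_t;

    typedef struct query {
        recursion_t recursions[RECTYPE_COUNT];
    } query_t;

    /* accessors in the spirit of HANDLE_RECTYPE_*()/FETCH_RECTYPE_*() */
    #define HANDLE_RECTYPE(q, type) ((q)->recursions[(type)].handle)
    #define FETCH_RECTYPE(q, type)  ((q)->recursions[(type)].fetch)
    #define QUOTA_RECTYPE(q, type)  ((q)->recursions[(type)].quota)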
Drop the 'fetchhandle' field from ns_client_t as all code using it has
been migrated to use the recursion-type-specific HANDLE_RECTYPE_*()
macros.
Drop the 'fetch' field from ns_query_t as all code using it has been
migrated to use the recursion-type-specific FETCH_RECTYPE_*() macros.
Async hooks are the last feature using the client->fetchhandle and
client->query.fetch pointers. Update ns_query_hookasync() and
query_hookresume() so that they use a dedicated slot in the 'recursions'
array. Note that async hooks are still not expected to initiate
recursion if one was already started by a prior ns_query_recurse() call,
so the REQUIRE assertion in ns_query_hookasync() needs to check the
RECTYPE_NORMAL slot rather than the RECTYPE_HOOK one.
With prefetch and RPZ code updated to use separate slots in the
'recursions' array, the code responsible for starting recursion in
ns_query_recurse() and resuming query handling in fetch_callback()
should follow suit, so that it does not need to explicitly cooperate
with other code paths that may initiate recursion.
Replace:
- client->fetchhandle with HANDLE_RECTYPE_NORMAL(client)
- client->query.fetch with FETCH_RECTYPE_NORMAL(client)
Also update other functions using client->fetchhandle and
client->query.fetch (ns_query_cancel(), query_usestale()) so that those
two fields can shortly be dropped altogether.
Replace:
- client->prefetchhandle with HANDLE_RECTYPE_PREFETCH(client)
- client->query.prefetch with FETCH_RECTYPE_PREFETCH(client)
This is preparatory work for separating prefetch code from RPZ code.