Currenly, quic_conn on the backend side may access their parent proxy
instance during their lifetime. In particular, this is the case for
counters update, with <prx_counters> field directly referencing a proxy
memory zone.
As such, this prevents safe backend removal. One solution would be to
check if the upper connection instance is still alive, as a proxy cannot
be removed if connection are still active. However, this would
completely prevent proxy counters update via
quic_conn_prx_cntrs_update(), as this is performed on quic_conn release.
Another solution would be to use refcount, or a dedicated counter on the
which account for QUIC connections on a backend instance. However,
refcount is currently only used by short-term references, and it could
also have a negative impact on performance.
Thus, the simplest solution for now is to disable a backend removal if a
QUIC server is/was used in it. This is considered acceptable for now as
QUIC on the backend side is experimental.
When a server is deleted via "del server", increment refcount of its
parent backend. This is necessary as the server is not referenced
anymore in the backend, but can still access it via its own <proxy>
member. Thus, backend removal must not happen until the complete purge
of the server.
The proxy refcount is released in srv_drop() if the flag SRV_F_DELETED
is set, which indicates that "del server" was used. This operation is
performed after the complete release of the server instance to ensure no
access will be performed on the proxy via itself. The refcount must not
be decremented if a server is freed without "del server" invokation.
Another solution could be for servers to always increment the refcount.
However, for now in haproxy refcount usage is limited, so the current
approach is preferred. It should also ensure that if the refcount is
still incremented, it may indicate that some servers are not completely
purged themselves.
Note that this patch may cause issues if "del backend" are used in
parallel with LUA scripts referencing servers. Currently, any servers
referenced by LUA must be released by its garbage collector to ensure it
can be finally freed. However, it appeas that in some case the gc does
not run for several minutes. At least this has been observed with Lua
version 5.4.8. In the end, this will result in indefinitely blocking of
"del backend" commands.
As seen with the last changes to counters allocation, the move of the
counters storage to the thread group as operated in commit 04a9f86a85
("MEDIUM: counters: add a dedicated storage for extra_counters in various
structs") causes some random errors when using ASAN, because the extra
counters are freed in srv_drop() after calling srv_free_params(), which
is responsible for freeing the per-thread group storage.
For the proxies however it's OK because free calls are made before the
call to deinit_proxy() which frees the per_tgrp area.
No backport is needed, this is purely 3.4-dev.
We'll need to permit any user to update its own tgroup's extra counters
instead of the global ones. For this we now store the per-tgroup step
between two consecutive data storages, for when they're stored in a
tgroup array. When shared (e.g. resolvers or listeners), we just store
zero to indicate that it doesn't scale with tgroups. For now only the
registration was handled, it's not used yet.
Servers, proxies, listeners and resolvers all use extra_counters. We'll
need to move the storage to per-tgroup for those where it matters. Now
we're relying on an external storage, and the data member of the struct
was replaced with a pointer to that pointer to data called datap. When
the counters are registered, these datap are set to point to relevant
locations. In the case of proxies and servers, it points to the first
tgrp's storage. For listeners and resolvers, it points to a local
storage. The rationale here is that listeners are limited to a single
group anyway, and that resolvers have a low enough load so that we do
not care about contention there.
Nothing should change for the user at this point.
It appears that in cli_parse_add_server(), we're calling srv_alloc_lb()
and stats_allocate_proxy_counters_internal() before srv_preinit() which
allocates the thread groups. LB algos can make use of the per_tgrp part
which is initialized by srv_preinit(). Fortunately for now no algo uses
both tgrp and ->server_init() so this explains why this remained
unnoticed to date. Also, extra counters will soon require per_tgrp to
already be initialized. So let's move these between srv_preinit() and
srv_postinit(). It's possible that other parts will have to be moved
in between.
This could be backported to recent versions for the sake of safety but
it looks like the current code cannot tell the difference.
Auto SNI configuration is configured during check config validity.
However, nothing was implemented for dynamic servers.
Fix this by implementing auto SNI configuration during "add server" CLI
handler. Auto SNI configuration code is moved in a dedicated function
srv_configure_auto_sni() called both for static and dynamic servers.
Along with this, allows the keyword "no-sni-auto" on dynamic servers, so
that this process can be deactivated if wanted. Note that "sni-auto"
remains unavailable as it only makes sense with default-servers which
are never used for dynamic server creation.
This must be backported up to 3.3.
An issue was introduced in 3.0 with commit faa8c3e024 ("MEDIUM: lb-chash:
Deterministic node hashes based on server address"): the new server_key
field and lb_nodes entries initialization were not updated for servers
added at run time with "add server": server_key remains zero and the key
used in lb_node remains the one depending only on the server's ID.
This will cause trouble when adding new servers with consistent hashing,
because the hash-key will be ignored until the server's weight changes
and the key difference is detected, leading to its recalculation.
This is essentially caused by the poorly placed lb_nodes initialization
that is specific to lb-chash and had to be replicated in the code dealing
with server addition.
This commit solves the problem by adding a new ->server_init() function
in the lbprm proxy struct, that is called by the server addition code.
This also allows to abandon the complex check for LB algos that was
placed there for that purpose. For now only lb-chash provides such a
function, and calls it as well during initial setup. This way newly
added servers always use the correct key now.
While it should also theoretically have had an impact on servers added
with the "random" algorithm, it's unlikely that the difference between
proper server keys and those based on their ID could have had any visible
effect.
This patch should be backported as far as 3.0. The backport may be eased
by a preliminary backport of previous commit "CLEANUP: lb-chash: free
lb_nodes from chash's deinit(), not global", though this is not strictly
necessary if context is manually adjusted.
There's an ambuity on the ownership of lb_nodes in chash, it's allocated
by chash but freed by the server code in srv_free_params() from srv_drop()
upon deinit. Let's move this free() call to a chash-specific function
which will own the responsibility for doing this instead. Note that
the .server_deinit() callback is properly called both on proxy being
taken down and on server deletion.
Move backend compatibility checks performed during 'add server' in a
dedicated function be_supports_dynamic_srv(). This should simplify
addition of future restriction.
This function will be reused when implementing backend creation at
runtime.
The code in srv_add_to_idle_list() has its roots in 2.0 with commit
9ea5d361ae ("MEDIUM: servers: Reorganize the way idle connections are
cleaned."). At this era we didn't yet have the current set of atomic
load/store operations and we used to perform loads using volatile casts
after a barrier. It turns out that this function has kept this schema
over the years, resulting in a big mfence stalling all the pipeline
in the function:
| static __inline void
| __ha_barrier_full(void)
| {
| __asm __volatile("mfence" ::: "memory");
27.08 | mfence
| if ((volatile void *)srv->idle_node.node.leaf_p == NULL) {
0.84 | cmpq $0x0,0x158(%r15)
0.74 | je 35f
| return 1;
Switching these for a pair of atomic loads got rid of this and brought
0.5 to 3% extra performance depending on the tests due to variations
elsewhere, but it has never been below 0.5%. Note that the second load
doesn't need to be atomic since it's protected by the lock, but it's
cleaner from an API and code review perspective. That's also why it's
relaxed.
This was the last user of _ha_barrier_full(), let's try not to
reintroduce it now!
Function proxy_preset_defaults() purpose has evolved over time.
Originally, it was only used to initialize defaults proxies instances.
Until today, it was extended so that all proxies use it. Its objective
is to initialize settings to common default values.
To remove the confusion, this function is now removed. Its content is
integrated directly into init_new_proxy().
There remained some cases (on error paths) were a server could be freed
while still attached on the parent proxy server list. In 3.3 this can be
problematic because new_server() automatically adds the server to the
parent proxy list.
The bug is insignificant because it is on errors paths during init and
often haproxy exits right after. But let's fix that to ensure no UAF or
undefined behavior occurs because of that.
This patch depends on ("MINOR: cli: use srv_drop() when server was created using new_server()")
It must be backported in 3.3 with the above mentioned patch.
Now that new_server() is becoming more and more complex, we need to
take care that servers created using new_server() must be released
using the corresponding release function srv_drop() which takes care
of properly de-initing the server and its members.
Contrarily to what was previously believed, there are corner cases where
the counters may not be allocated, and we may want to make them optional
at a later date, so we have to check if those counters are there.
However, just checking that shared.tg is non-NULL is enough, we can then
assume that shared.tg[tgid - 1] has properly been allocated too.
Also modify the various COUNTER_SHARED_* macros to make sure they check
for that too.
Before updating counters, a few tests are made to check if the counters
exits. but those counters should always exist at this point, so just
remmove them.
This commit should have no impact, but can easily be reverted with no
functional impact if various crashes appear.
Instead of statically allocating the per-thread group counters,
based on the max number of thread groups available, allocate
them dynamically, based on the number of thread groups actually
used. That way we can increase the maximum number of thread
groups without using an unreasonable amount of memory.
Extend QUIC server configuration so that congestion algorithm and
maximum window size can be set on the server line. This can be achieved
using quic-cc-algo keyword with a syntax similar to a bind line.
This should be backported up to 3.3 as this feature is considered as
necessary for full QUIC backend support. Note that this relies on the
serie of previous commits which should be picked first.
A recent patch has introduced free operation for QUIC tokens stored in a
server. These values are located in <per_thr> server array.
However, a server instance may be released prior to its full
initialization in case of a failure during "add server" CLI command. The
mentionned patch would cause a srv_drop() crash due to an invalid usage
of NULL <per_thr> member.
Fix this by adding a check on <per_thr> prior to dereference it in
srv_drop().
No need to backport.
A recent patch has implemented caching of QUIC token received from a
NEW_TOKEN frame into the server cache. This value is stored per thread
into a <quic_retry_token> field.
This field is an ist, first set to an empty string. Via
qc_try_store_new_token(), it is reallocated to fit the size of the newly
stored token. Prior to this patch, the field was never freed so this
causes a memory leak.
Fix this by using istfree() on <quic_retry_token> field during
srv_drop().
No need to backport.
The latest idle conns fix 9481cef948 ("BUG/MEDIUM: connection: do not
reinsert a purgeable conn in idle list") addresses a very hard-to-hit
case which manifests itself with an attempt to reuse a connection fails
because conn->mux is NULL:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000655410b8642c in conn_backend_get (reuse_mode=4, srv=srv@entry=0x6554378a7140,
sess=sess@entry=0x7cfe140948a0, is_safe=is_safe@entry=0,
hash=hash@entry=910818338996668161) at src/backend.c:1390
1390 if (conn->mux->takeover && conn->mux->takeover(conn, i, 0) == 0) {
However the condition that leads to this situation can be detected
earlier, by the presence of the connection in the toremove_list, whose
race window is much larger and easier to detect.
This patch adds a few BUG_ON_STRESS() at selected places that an detect
this condition. When built with -DDEBUG_STRESS and run under stress with
two distinct processes communicating over H2 over SSL, under a stress of
400-500k req/s, the front process usually crashes in the first 10-30s
triggering in _srv_add_idle() if the fix above is reverted (and it does
not crash with the fix).
This is mainly included to serve as an illustration of how to instrument
the code for seamless stress testing.
The target patch fixes a rare race condition which happen when a MUX IO
handler is working on a connection already moved into the purge list. In
this case, the handler will incorrectly moved back the connection into
the idle list.
To fix this, conn_delete_from_tree() was extended to remove flags along
with the connection from the idle list. This was performed when the
connection is moved into the purge list. However, it introduces another
issue related to the idle server connection accounting. Thus it is
necessary to revert it prior to the incoming newer fix.
This patch must be backported to every version where the original commit
is.
When a server is being disabled or deleted, in case it matches the
backend's ready_srv, this one is reset. However it's currently done in
a non-atomic way when the server goes down, and that could occasionally
reset the entry matching another server, but more importantly if in
parallel some requests are dequeued for that server, it may re-appear
there after having been removed, leading to a possible crash once it
is fully removed, as shown in issue #3177.
Let's make sure we reset the pointer when detaching the server from
the proxy, and use a CAS in both cases to only reset this server.
This fix needs to be backported to 3.2. There, srv_detach() is in
server.c instead of server.h. Thanks to Basha Mougamadou for the
detailed report and the useful backtraces.
Almost all callers of _srv_add_idle() lock the list then call the
function. It's not the most efficient and it requires some care from
the caller to take care of that lock. Let's change this a little bit by
having srv_add_idle() that takes the lock and calls _srv_add_idle() that
is now inlined. This way callers don't have to handle the lock themselves
anymore, and the lock is only taken around the sensitive parts, not the
function call+return.
Interestingly, perf tests show a small perf increase from 2.28-2.32M RPS
to 2.32-2.37M RPS on a 128-thread system.
There's currently a function conn_delete_from_tree() which is used to
detach an idle connection from the tree it's currently attached to so
that it is no longer found. This function is used in three circumstances:
- when picking a new connection that no longer has any avail stream
- when temporarily working on the connection from an I/O handler,
in which case it's re-added at the end
- when killing a connection
The 2nd case above is quite specific, as it requires to preserve the
CO_FL_LIST_MASK flags so that the connection can be re-inserted into
the proper tree when leaving the handler. However, there's a catch.
When killing a connection, we want to be certain it will not be
reinserted into the tree. The flags preservation is causing a tiny
race if an I/O happens while the connection is in the kill list,
because in this case the I/O handler will note the connection flags,
do its work, then reinsert the connection where it believed it was,
then the connection gets purged, and another user can find it in the
tree.
The issue is very difficult to reproduce. On a 128-thread machine it
happens in H2 around 500k req/s after around 50M requests. In H1 it
happens after around 1 billion requests.
The fix here consists in passing an extra argument to the function to
indicate if the removal is permanent or not. When it's permanent, the
function will clear the associated flags. The callers were adjusted
so that all those dequeuing a connection in order to kill it do it
permanently and all other ones do it only temporarily.
A slightly different approach could have worked: the function could
always remove all flags, and the callers would need to restore them.
But this would require trickier modifications of the various call
places, compared to only passing 0/1 to indicate the permanent status.
This will need to be backported to all stable versions. The issue was
at least reproduced since 3.1 (not tested before). The patch will need
to be adjusted for 3.2 and older, because a 2nd argument "thr" was
added in 3.3, so the patch will not apply to older versions as-is.
Also call srv_reset_path_parameters() when the server changed states,
and got up. It is not enough to do it when the server goes down, because
there's a small race condition, and a connection could get established
just after we did it, and could have set the path parameters.
This does not need to be backported.
Add a rwlock to control the server's path_parameter, to make sure
multiple threads don't set it at the same time, and it can't be seen in
an inconsistent state.
Also don't set the parameter every time, only set them if they have
changed, to prevent needless writes.
This does not need to be backported.
If a QUIC server is declared without ALPN, "h3" value is automatically
set during _srv_parse_finalize().
This patch adjusts this operation. Instead of relying on
ssl_sock_parse_alpn(), a plain strdup() is used. This is considered more
efficient as the ALPN string is constant in this case. This method is
already used for listeners on the frontend side.
Ensure that QUIC support is compiled into haproxy when a QUIC server is
configured. This check is performed during _srv_parse_finalize() so that
it is detected both on configuration parsing and when adding a dynamic
server via the CLI.
Note that this changes the behavior of srv_is_quic() utility function.
Previously, it always returned false when QUIC support wasn't compiled.
With this new check introduced, it is now guaranteed that a QUIC server
won't exist if compilation support is not active. Hence srv_is_quic()
does not rely anymore on USE_QUIC define.
Previously, QUIC servers were rejected if SSL was not explicitely
activated using 'ssl' configuration keyword.
Change this behavior : now SSL is automatically activated for QUIC
servers when the keyword is missing. A warning is displayed as it is
considered better to explicitely note that SSL is in use.
We normally taint the process when using experimental directives, but
a handful of places were missed so we don't always know that they are
in use. Let's fix these places (hint for future directives, just look
for places checking for "experimental_directives_allowed", and add
"mark_tainted(TAINTED_CONFIG_EXP_KW_DECLARED);").
On creation, schedule the server requeue once it's been created.
It is possible that when the server went up, it tried to queue itself
into the lb specific code, failed to do so, and expect the tasklet to
run to take care of that.
This should be backported to 3.2.
This is part of an attempt to fix github issue #3143.
This patch looks huge, but it has a very simple goal: protect all
accessed to shared stats pointers (either read or writes), because
we know consider that these pointers may be NULL.
The reason behind this is despite all precautions taken to ensure the
pointers shouldn't be NULL when not expected, there are still corner
cases (ie: frontends stats used on a backend which no FE cap and vice
versa) where we could try to access a memory area which is not
allocated. Willy stumbled on such cases while playing with the rings
servers upon connection error, which eventually led to process crashes
(since 3.3 when shared stats were implemented)
Also, we may decide later that shared stats are optional and should
be disabled on the proxy to save memory and CPU, and this patch is
a step further towards that goal.
So in essence, this patch ensures shared stats pointers are always
initialized (including NULL), and adds necessary guards before shared
stats pointers are de-referenced. Since we already had some checks
for backends and listeners stats, and the pointer address retrieval
should stay in cpu cache, let's hope that this patch doesn't impact
stats performance much.
It is possible on at least Linux and FreeBSD to set the congestion control
algorithm to be used with outgoing connections, among the list of supported
and permitted ones. Let's expose this setting with "cc". Unknown or
forbidden algorithms will be ignored and the default one will continue to
be used.
Previously the conn_hash_node was placed outside the connection due
to the big size of the eb64_node that could have negatively impacted
frontend connections. But having it outside also means that one
extra allocation is needed for each backend connection, and that one
memory indirection is needed for each lookup.
With the compact trees, the tree node is smaller (16 bytes vs 40) so
the overhead is much lower. By integrating it into the connection,
We're also eliminating one pointer from the connection to the hash
node and one pointer from the hash node to the connection (in addition
to the extra object bookkeeping). This results in saving at least 24
bytes per total backend connection, and only inflates connections by
16 bytes (from 240 to 256), which is a reasonable compromise.
Tests on a 64-core EPYC show a 2.4% increase in the request rate
(from 2.08 to 2.13 Mrps).
Idle connection trees currently require a 56-byte conn_hash_node per
connection, which can be reduced to 32 bytes by moving to ceb64. While
ceb64 is theoretically slower, in practice here we're essentially
dealing with trees that almost always contain a single key and many
duplicates. In this case, ceb64 insert and lookup functions become
faster than eb64 ones because all duplicates are a list accessed in
O(1) while it's a subtree for eb64. In tests it is impossible to tell
the difference between the two, so it's worth reducing the memory
usage.
This commit brings the following memory savings to conn_hash_node
(one per backend connection), and to srv_per_thread (one per thread
and per server):
struct before after delta
conn_hash_nodea 56 32 -24
srv_per_thread 96 72 -24
The delicate part is conn_delete_from_tree(), because we need to
know the tree root the connection is attached to. But thanks to
recent cleanups, it's now clear enough (i.e. idle/safe/avail vs
session are easy to distinguish).
We'll soon need to choose the server's root based on the connection's
flags, and for this we'll need the thread it's attached to, which is
not always the current one. This patch simply passes the thread number
from all callers. They know it because they just set the idle_conns
lock on it prior to calling the function.
Probably due to older code, there's a boolean variable used to set
another one which is then checked. Also the first check is made under
the lock, which is unnecessary. Let's simplify this and use a single
variable. This only makes the code clearer, it doesn't change the output
code.
We'll need to have access to the srv_per_thread element soon from this
function, and there's no particular reason for passing it list pointers
so let's pass the server and the thread so that it is autonomous. It
also makes the calling code simpler.
There were a few leftovers from an earlier version of the conn_hash_node
that was using ebmb nodes. A few calls to ebmb_first() and ebmb_entry()
were still present while acting on an eb64 tree. These are harmless as
one is just eb_first() and the other container_of(), but it's confusing
so let's clean them up.
The server ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct server for this node, plus 8 bytes in
struct proxy. The server struct is still 3904 bytes large (due to
alignment) and the proxy struct is 3072.
In srv_parse_id(), there's no point doing all the low-level work with
the tree functions to check for the existence of an ID, we already have
server_find_by_id() which does exactly this, so let's use it.
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated server_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
This is used to index the server name and it contains a copy of the
pointer to the server's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct server, which remains 3968
bytes large due to alignment. The proxy struct shrinks by 8 bytes to 3144.
It's worth noting that the current way duplicate names are handled remains
based on the previous mechanism where dups were permitted. Ideally we
should now reject them during insertion and use unique key trees instead.
This contains the text representation of the server's address, for use
with stick-tables with "srvkey addr". Switching them to a compact node
saves 24 more bytes from this structure. The key was moved to an external
pointer "addr_key" right after the node.
The server struct is now 3968 bytes (down from 4032) due to alignment, and
the proxy struct shrinks by 8 bytes to 3152.
Add a new field in struct server, path parameters. It will contain
connection informations for the server that are not expected to change.
For now, just store the ALPN negociated with the server. Each time an
handhskae is done, we'll update it, even though it is not supposed to
change. This will be useful when trying to send early data, that way
we'll know which mux to use.
Each time the server goes down or is disabled, those informations are
erased, as we can't be sure those parameters will be the same once the
server will be back up.
For HTTPS outgoing connections, the SNI is now automatically set using the
Host header value if no other value is already set (via the "sni" server
keyword). It is now the default behavior. It could be disabled with the
"no-sni-auto" server keyword. And eventually "sni-auto" server keyword may
be used to reset any previous "no-sni-auto" setting. This option can be
inherited from "default-server" settings. Finally, if no connection name is
set via "pool-conn-name" setting, the selected value is used.
The automatic selection of the SNI is enabled by default for all outgoing
connections. But it is concretely used for HTTPS connections only. The
expression used is "req.hdr(host),host_only".
This patch should paritally fix the issue #3081. It only covers the server
part. Another patch will add the feature for HTTP health-checks.
not all changes are concerned. But when the SSL is enabled or disabled for a
server, the healthcheck xprt must be eventually be updated too. This happens
when the healthcheck relies on the server settings.
In the same spirit, when the healthcheck address and port are updated, we
must fallback on the raw xprt if the SSL is not explicitly enabled for the
healthcheck with a "check-ssl" parameter.
This patch should be backported to all stable versions.