Introduce COUNTERS_UPDATE_MAX(), and use it instead of using
HA_ATOMIC_UPDATE_MAX() directly.
For now it just calls HA_ATOMIC_UPDATE_MAX(), but will later be modified
so that we can disable max calculation.
This can be backported up to 2.8 if the usage of COUNTERS_UPDATE_MAX()
generates too many conflicts.
In connect_server(), MUX initialization must be delayed if ALPN
negotiation is configured, unless ALPN can already be retrieved via the
server cache.
A readlock is used to consult the server cache. Prior to this patch, it
was always taken even if no ALPN is configured. The lock was thus used
for every new backend connection instantiation.
Rewrite the check so that now the lock is only used if ALPN is
configured. Thus, no lock access is done if SSL is not used or if ALPN
is not defined.
In practice, there will be no performance gain, as the read lock should
never block if ALPN is not configured. However, the code is cleaner as
it better reflect that only access to server nego_alpn requires the
path_params lock protection.
In connect_server(), when a new connection must be instantiated, MUX
initialization is delayed if an ALPN setting is present on the server
line configuration, as negotiation must be performed to select the
correct MUX. However, this is not the case if the ALPN can already be
retrieved on the server cache.
This check is performed too late however and may cause issue with the
QUIC stack. The problem can happen when the server ALPN is not yet set.
In the normal case, quic_conn layer is instantiated and MUX init is
delayed until the handshake completion. When the MUX is finally
instantiated, it reused without any issue app_ops from its quic_conn,
which is derived from the negotiated ALPN.
However, there is a race condition if another QUIC connection populates
the server ALPN cache. If this happens after the first quic_conn init
but prior to the MUX delay check, the MUX will thus immediately start in
connect_server(). When app_ops is retrieved from its quic_conn, a crash
occurs in qcc_install_app_ops() as the QUIC handshake is not yet
finalized :
#0 0x000055e242a66df4 in qcc_install_app_ops (qcc=0x7f127c39da90, app_ops=0x0) at src/mux_quic.c:1697
1697 if (app_ops->init && !app_ops->init(qcc)) {
[Current thread is 1 (Thread 0x7f12810f06c0 (LWP 25758))]
To fix this, MUX delay check is moved up in connect_server(). It is now
performed prior conn_prepare() which is responsible for the quic_conn
layer instantiation. Thus, it ensures consistency for the QUIC stack :
MUX init is always delayed if the quic_conn does not reuses itself the
SSL session and ALPN server cache (no quic_reuse_srv_params()).
This must be backported up to 3.3.
Add the ability to set connect, queue and tarpit timeouts from the
set-timeout action. This is especially useful when using set-dst to
dynamically connect to servers.
This patch also adds the relevant fe_/be_/cur_ sample fetches for these
timeouts.
This is a follow up to b6bdb2553 ("MEDIUM: backend: make "balance random"
consider req rate when loads are equal")
In the above patch, we used the global sess_per_sec metric to choose which
server we should be using. But the original intent was to use the per
thread group statistic.
No backport needed, the previous patch already improved the situation in
3.3, so let's not take the risk of breaking that.
Move backend compatibility checks performed during 'add server' in a
dedicated function be_supports_dynamic_srv(). This should simplify
addition of future restriction.
This function will be reused when implementing backend creation at
runtime.
As reported by Damien Claisse and Cédric Paillet, the "random" LB
algorithm can become particularly unfair with large numbers of servers
having few connections. It's indeed fairly common to see many servers
with zero connection in a thousand-server large farm, and in this case
the P2C algo consisting in checking the servers' loads doesn't help at
all and is basically similar to random(1). In this case, we only rely
on the distribution of server IDs in the random space to pick the best
server, but it's possible to observe huge discrepancies.
An attempt to model the problem clearly shows that with 1600 servers
with weight 10, for 1 million requests, the lowest loaded ones will
take 300 req while the most loaded ones will get 780, with most of
the values between 520 and 700.
In addition, only the first 28 lower bits of server IDs are used for
the key calculation, which means that node keys are more determinist.
Setting random keys in the lowest 28 bits only better packs values
with min around 530 and max around 710, with values mostly between
550 and 680.
This can only be compensated by increasing weights and draws without
being a perfect fix either. At 4 draws, the min is around 560 and the
max around 670, with most values bteween 590 and 650.
This patch takes another approach to this problem: when servers are on
tie regarding their loads, instead of arbitrarily taking the second one,
we now compare their current request rates, which is updated all the
time and smoothed over one second, and we pick the server with the
lowest request rate. Now with 2 draws, the curve is mostly flat, with
the min at 580 and the max at 628, and almost all values between 611
and 625. And 4 draws exclusively gives values from 614 to 624.
Other points will need to be addressed separately (bits of server ID,
maybe refine the hash algorithm), but these ones would affect how
caches are selected, and cannot be changed without an extra option.
For random however we can perform a change without impacting anyone.
This should be backported, probably only to 3.3 since it's where the
"random" algo became the default.
Contrarily to what was previously believed, there are corner cases where
the counters may not be allocated, and we may want to make them optional
at a later date, so we have to check if those counters are there.
However, just checking that shared.tg is non-NULL is enough, we can then
assume that shared.tg[tgid - 1] has properly been allocated too.
Also modify the various COUNTER_SHARED_* macros to make sure they check
for that too.
Before updating counters, a few tests are made to check if the counters
exits. but those counters should always exist at this point, so just
remmove them.
This commit should have no impact, but can easily be reverted with no
functional impact if various crashes appear.
Instead of statically allocating the per-thread group counters,
based on the max number of thread groups available, allocate
them dynamically, based on the number of thread groups actually
used. That way we can increase the maximum number of thread
groups without using an unreasonable amount of memory.
In 2.6, do_connect_server() was introduced by commit 0a4dcb65f ("MINOR:
stream-int/backend: Move si_connect() in the backend scope") and changed
the approach to work with a stream instead of a stream-interface. However
si_oc(si) was wrongly turned to &s->res instead of &s->req, which breaks
TFO by always inspecting the response channel to figure whether there are
data pending.
This fix can be backported to all versions till 2.6.
In 2.6, the retries counter on a stream was changed from retries left
to retries done via commit 731c8e6cf ("MINOR: stream: Simplify retries
counter calculation"). However, one comparison fell through the cracks
in order to detect whether or not we can use TFO (only first attempt),
resulting in TFO never working anymore.
This may be backported to all versions till 2.6.
Back in the mists of time, commit e91a526c8f decided that if we were trying
to stay on the same server than the previous request, and if there were
a connection available in the session, we'd remove its CO_FL_SESS_IDLE.
The reason for doing that has been long lost, probably it fixed a bug at some
point, but it was most probably not the right place to do that. And starting
with 3.3, this triggers a BUG_ON() because that flag is expected later on.
So just revert the commit, if the ancient bug shows up again, it will be
fixed another way.
This should be backported to 3.3. There is little reason to backport it
to previous versions, unless other patches depend on it.
The server's xprt is always defined and cannot be NULL. So there is no
reason to test it. It could lead to wrong assumptions later in the code.
This patch should fix a Coverity report from #3213.
The SNI of a new connection is now retrieved earlier, before the
initialization of the SSL context. So, concretely, it is now performed
before calling conn_prepare(). The SNI is then set just after.
When a SNI is set on a new connection, its hash is now saved in the
connection itself. To do so, a dedicated field was added into the connection
strucutre, called sni_hash. For now, this value is only used when the TLS
session is cached.
This reverts commit de29000e60.
The fix was in fact invalid. First it is not supprted by WolfSSL to call
SSL_set_tlsext_host_name with a hostname to NULL. Then, it is not specified
as supported by other SSL libraries.
But, by reviewing the root cause of this bug, it appears there is an issue
with the reuse of TLS sesisons. It must not be performed if the SNI does not
match. A TLS session created with a SNI must not be reused with another
SNI. The side effects are not clear but functionnaly speaking, it is
invalid.
So, for now, the commit above was reverted because it is invalid and it
crashes with WolfSSL. Then the init of the SSL connection must be reworked
to get the SNI earlier, to be able to reuse or not an existing TLS
session.
When a new SSL server connection is created, if no SNI is set, it is
possible to inherit from the one of the reused TLS session. The bug was
introduced by the commit 95ac5fe4a ("MEDIUM: ssl_sock: always use the SSL's
server name, not the one from the tid"). The mixup is possible between
regular connections but also with health-checks connections.
To fix the issue, when no SNI is set, for regular server connections and for
health-check connections, the SNI must explicitly be disabled by calling
ssl_sock_set_servername() with the hostname set to NULL.
Many thanks to Lukas for his detailed bug report.
This patch should fix the issue #3195. It must be backported as far as 3.0.
In assign_server_and_queue(), there's a rare case when the server was
full, so we created a pendconn, another server was considered but in the
meanwhile the pendconn was unqueued already, so we just left the
function. We did so, however, while still holding the queue lock, which
will ultimately lead to a deadlock, and ultimately the watchdog would
kill the process.
To fix that, just unlock the queue before leaving.
This should be backported to 3.2.
The target patch fixes a rare race condition which happen when a MUX IO
handler is working on a connection already moved into the purge list. In
this case, the handler will incorrectly moved back the connection into
the idle list.
To fix this, conn_delete_from_tree() was extended to remove flags along
with the connection from the idle list. This was performed when the
connection is moved into the purge list. However, it introduces another
issue related to the idle server connection accounting. Thus it is
necessary to revert it prior to the incoming newer fix.
This patch must be backported to every version where the original commit
is.
check-reuse-pool can only perform as expected if reuse policy on the
backend is set to aggressive or higher. Update the documentation to
reflect this and implement a server diag warning.
In connect_server(), defer the call to conn_xprt_start() until after we
had a chance to create the mux. The xprt can behave differently
depending on if a mux is or is not available at this point, as if it is,
it may want to wait until some data comes from the mux.
This does not need to be backported.
There's currently a function conn_delete_from_tree() which is used to
detach an idle connection from the tree it's currently attached to so
that it is no longer found. This function is used in three circumstances:
- when picking a new connection that no longer has any avail stream
- when temporarily working on the connection from an I/O handler,
in which case it's re-added at the end
- when killing a connection
The 2nd case above is quite specific, as it requires to preserve the
CO_FL_LIST_MASK flags so that the connection can be re-inserted into
the proper tree when leaving the handler. However, there's a catch.
When killing a connection, we want to be certain it will not be
reinserted into the tree. The flags preservation is causing a tiny
race if an I/O happens while the connection is in the kill list,
because in this case the I/O handler will note the connection flags,
do its work, then reinsert the connection where it believed it was,
then the connection gets purged, and another user can find it in the
tree.
The issue is very difficult to reproduce. On a 128-thread machine it
happens in H2 around 500k req/s after around 50M requests. In H1 it
happens after around 1 billion requests.
The fix here consists in passing an extra argument to the function to
indicate if the removal is permanent or not. When it's permanent, the
function will clear the associated flags. The callers were adjusted
so that all those dequeuing a connection in order to kill it do it
permanently and all other ones do it only temporarily.
A slightly different approach could have worked: the function could
always remove all flags, and the callers would need to restore them.
But this would require trickier modifications of the various call
places, compared to only passing 0/1 to indicate the permanent status.
This will need to be backported to all stable versions. The issue was
at least reproduced since 3.1 (not tested before). The patch will need
to be adjusted for 3.2 and older, because a 2nd argument "thr" was
added in 3.3, so the patch will not apply to older versions as-is.
Add a rwlock to control the server's path_parameter, to make sure
multiple threads don't set it at the same time, and it can't be seen in
an inconsistent state.
Also don't set the parameter every time, only set them if they have
changed, to prevent needless writes.
This does not need to be backported.
When trying to reuse a backend connection, a connection hash is
calculated to match an entry with similar parameters. Previously, this
operation was skipped if the stream content wasn't based on HTTP, as it
would have been incompatible with http-reuse.
With the introduction of SPOP backends, this condition was removed, so
that it can also benefit from connection reuse. However, this means that
now hash calcul is always performed when connecting to a server, even
for TCP or log backends. This is unnecessary as these proxies cannot
perform connection reuse.
Note also that reuse mode is resetted on postparsing for incompatible
backends. This at least guarantees that no tree lookup will be performed
via be_reuse_connection(). However, connection lookup is still performed
in the session via session_get_conn() which is another unnecessary
operation.
Thus, this patch restores the condition so that reuse operations are now
entirely skipped if a backend mode is incompatible. This is implemented
via a new utility function named be_supports_conn_reuse().
This could be backported up to 3.1, as this commit could be considered
as a performance regression for tcp/log backend modes.
In order to prepare for changing the way abortonclose works, let's
replace the direct flag check with a similarly named function
(proxy_abrt_close) which returns the on/off status of the directive
for the proxy. For now it simply reflects the flag's state.
In connect_server(), only avoid creating a mux when we're reusing a
connection, if that connection already has one. We can reuse a
connection with no mux, if we made a first attempt at connecting to the
server and it failed before we could create the mux (or during the mux
creation). The connection will then be reused when trying again.
This fixes a bug where a stream could stall if the first connection
attempt failed before the mux creation. It is easy to reproduce by
creating random memory allocation failure with -dmFail.
This was introduced by commit 4aaf0bfbce,
and thus does not need any backport as long as that commit is not
backported.
There is currently an srv_queue converter which is capable of taking the
output of a dynamic name and determining the queue length for a given
server. In addition there is a sample fetcher for whether a server is
currently up. This simply combines the two such that srv_is_up can be
used as a converter too.
Future work might extend this to other sample fetchers for servers, but
this is probably the most useful for acl routing.
In preparation of providing further server converters, split the code
for finding the server from the sample out.
Additionally, update the documentation for srv_queue converter to note
security concerns.
This patch looks huge, but it has a very simple goal: protect all
accessed to shared stats pointers (either read or writes), because
we know consider that these pointers may be NULL.
The reason behind this is despite all precautions taken to ensure the
pointers shouldn't be NULL when not expected, there are still corner
cases (ie: frontends stats used on a backend which no FE cap and vice
versa) where we could try to access a memory area which is not
allocated. Willy stumbled on such cases while playing with the rings
servers upon connection error, which eventually led to process crashes
(since 3.3 when shared stats were implemented)
Also, we may decide later that shared stats are optional and should
be disabled on the proxy to save memory and CPU, and this patch is
a step further towards that goal.
So in essence, this patch ensures shared stats pointers are always
initialized (including NULL), and adds necessary guards before shared
stats pointers are de-referenced. Since we already had some checks
for backends and listeners stats, and the pointer address retrieval
should stay in cpu cache, let's hope that this patch doesn't impact
stats performance much.
Previously the conn_hash_node was placed outside the connection due
to the big size of the eb64_node that could have negatively impacted
frontend connections. But having it outside also means that one
extra allocation is needed for each backend connection, and that one
memory indirection is needed for each lookup.
With the compact trees, the tree node is smaller (16 bytes vs 40) so
the overhead is much lower. By integrating it into the connection,
We're also eliminating one pointer from the connection to the hash
node and one pointer from the hash node to the connection (in addition
to the extra object bookkeeping). This results in saving at least 24
bytes per total backend connection, and only inflates connections by
16 bytes (from 240 to 256), which is a reasonable compromise.
Tests on a 64-core EPYC show a 2.4% increase in the request rate
(from 2.08 to 2.13 Mrps).
Idle connection trees currently require a 56-byte conn_hash_node per
connection, which can be reduced to 32 bytes by moving to ceb64. While
ceb64 is theoretically slower, in practice here we're essentially
dealing with trees that almost always contain a single key and many
duplicates. In this case, ceb64 insert and lookup functions become
faster than eb64 ones because all duplicates are a list accessed in
O(1) while it's a subtree for eb64. In tests it is impossible to tell
the difference between the two, so it's worth reducing the memory
usage.
This commit brings the following memory savings to conn_hash_node
(one per backend connection), and to srv_per_thread (one per thread
and per server):
struct before after delta
conn_hash_nodea 56 32 -24
srv_per_thread 96 72 -24
The delicate part is conn_delete_from_tree(), because we need to
know the tree root the connection is attached to. But thanks to
recent cleanups, it's now clear enough (i.e. idle/safe/avail vs
session are easy to distinguish).
We'll soon need to choose the server's root based on the connection's
flags, and for this we'll need the thread it's attached to, which is
not always the current one. This patch simply passes the thread number
from all callers. They know it because they just set the idle_conns
lock on it prior to calling the function.
The connection lookup loop is made of two identical blocks, one looking
in the idle or safe lists and the other one looking into the safe list
only. The second one is skipped if a connection was found or if the request
looks for a safe one (since already done). Also the two are slightly
different due to leftovers from earlier versions in that the second one
checks for safe connections and not the first one, and the second one
sets is_safe which is not used later.
Let's just rationalize all this by placing them in a loop which checks
first from the idle conns and second from the safe ones, or skips the
first step if the request wants a safe connection. This reduces the
code and shortens the time spent under the lock.
Now that which ALPN gets negociated for a given server, use that to
decide if we can create the mux right away in connect_server(), and use
it in conn_install_mux_be().
That way, we may create the mux soon enough for early data to be sent,
before the handshake has been completed.
This commit depends on several previous commits, and it has not been
deemed important enough to backport.
The conditions to use early data on output are super tricky and
detected later, so that it's difficult to figure how this works. This
patch splits the condition in two parts, the one that can be performed
early that is based on config/client/etc. It is used to clear a variable
that allows early data to be used in case any condition is not satisfied.
It was purposely split into multiple independent and reviewable tests.
The second part remains where it was at the end, and is used to temporarily
clear the handshake flags to let the data layer use early data. This one
being tricky, a large comment explaining the principle was added.
The logic was not changed at all, only the code was made more readable.
Since 3.0 we have HAVE_SSL_0RTT precisely to avoid checking horribly
complicated and unmaintainable conditions to detect support for 0RTT.
Let's just drop the complex condition and use the macro instead.
Instead of trying to switch from delayed start to instant start based
on a single condition, let's do the opposite and preset the condition
to instant start and detect what could cause it to be delayed, thus
falling back to the slow mode. The condition remains exactly the
inverted one and better matches the comment about ALPN being the only
cause of such a delay.
The init_mux variable is currently used in a way that's not super easy
to grasp. It's set a bit too late and requires to know a lot of info at
once. Let's first rename it to "may_start_mux_now" to clarify its role,
as the purpose is not to *force* the mux to be initialized now but to
permit it to do it.
There is no reason to set the SNI for non-ssl connections. It is not really
an issue because ssl_sock_set_servername() function will do nothing. But
there is no reason to uselessly evaluate an expression.
No backport needed, because there is no bug.
For many years, an unset load balancing algorithm would use "roundrobin".
It was shown several times that "random" with at least 2 draws (the
default) generally provides better performance and fairness in that
it will automatically adapt to the server's load and capacity. This
was further described with numbers in this discussion:
https://www.mail-archive.com/haproxy@formilux.org/msg46011.htmlhttps://github.com/orgs/haproxy/discussions/3042
BTW there were no objection and only support for the change.
The goal of this patch is to change the default algo when none is
specified, from "roundrobin" to "random". This way, users who don't
care and don't set the load balancing algorithm will benefit from a
better one in most cases, while those who have good reasons to prefer
roundrobin (for session affinity or for reproducible sequences like used
in regtests) can continue to specify it.
The vast majority of users should not notice a difference.
When strict maxconn is enforced on a server, it may be necessary to kill
an idle connection to never exceed the limit. To be able to delete a
connection from any thread, takeover is first used to migrate it on the
current thread prior to its deletion.
As takeover is performed to delete a connection instead of reusing it,
<release> argument can be set to true. This removes unnecessary
allocations of resources prior to connection deletion. As such, this
patch is a small optimization for strict maxconn implementation.
Note that this patch depends on the previous one which removes any
assumption in takeover implementation that thread isolation is active if
<release> is true.
session_add_conn() uses three argument : connection and session
instances, plus a void pointer labelled as target. Typically, it
represents the server, but can also be a backend instance (for example
on dispatch).
In fact, this argument is redundant as <target> is already a member of
the connection. This commit simplifies session_add_conn() by removing
it. A BUG_ON() on target is extended to ensure it is never NULL.
Following commit 75e480d10 ("MEDIUM: stats: avoid 1 indirection by storing
the shared stats directly in counters struct"), in order to minimize the
impact of the recent sharded counters work, we try to push things a bit
further in this patch by storing and using "fast" pointers at the session
and stream levels when available to avoid costly indirections and
systematic "tgid" resolution (which can not be cached by the CPU due to
its THREAD-local nature).
Indeed, we know that a session/stream is tied to a given CPU, thanks to
this we know that the tgid for a given session/stream will never change.
Given that, we are able to store sharded frontend and listener counters
pointer at the session level (namely sess->fe_tgcounters and
sess->li_tgcounters), and once the backend and the server are selected,
we are also able to store backend and server sharded counters
pointer at the stream level (namely s->be_tgcounters and s->sv_tgcounters)
Everywhere we rely on these counters and the stream or session context is
available, we use the fast pointers it instead of the indirect pointers
path to make the pointer resolution a bit faster.
This optimization proved to bring a few percents back, and together with
the previous 75e480d10 commit we now fixed the performance regression (we
are back to back with 3.2 stats performance)
Between 3.2 and 3.3-dev we noticed a noticeable performance regression
due to stats handling. After bisecting, Willy found out that recent
work to split stats computing accross multiple thread groups (stats
sharding) was responsible for that performance regression. We're looking
at roughly 20% performance loss.
More precisely, it is the added indirections, multiplied by the number
of statistics that are updated for each request, which in the end causes
a significant amount of time being spent resolving pointers.
We noticed that the fe_counters_shared and be_counters_shared structures
which are currently allocated in dedicated memory since a0dcab5c
("MAJOR: counters: add shared counters base infrastructure")
are no longer huge since 16eb0fab31 ("MAJOR: counters: dispatch counters
over thread groups") because they now essentially hold flags plus the
per-thread group id pointer mapping, not the counters themselves.
As such we decided to try merging fe_counters_shared and
be_counters_shared in their parent structures. The cost is slight memory
overhead for the parent structure, but it allows to get rid of one
pointer indirection. This patch alone yields visible performance gains
and almost restores 3.2 stats performance.
counters_fe_shared_get() was renamed to counters_fe_shared_prepare() and
now returns either failure or success instead of a pointer because we
don't need to retrieve a shared pointer anymore, the function takes care
of initializing existing pointer.
This function doesn't just look at the name but also the ID when the
argument starts with a '#'. So the name is not correct and explains
why this function is not always used when the name only is needed,
and why the list-based findserver() is used instead. So let's just
call the function "server_find()", and rename its generation-id based
cousin "server_find_unique()".
Since proxy and server struct already have an internal last_change
variable and we cannot merge it with the shared counter one, let's
rename the last_change counter to be more specific and prevent the
mixup between the two.
last_change counter is renamed to last_state_change, and unlike the
internal last_change, this one is a shared counter so it is expected
to be updated by other processes in our back.
However, when updating last_state_change counter, we use the value
of the server/proxy last_change as reference value.
Same motivation as previous commit, proxy last_change is "abused" because
it is used for 2 different purposes, one for stats, and the other one
for process-local internal use.
Let's add a separate proxy-only last_change variable for internal use,
and leave the last_change shared (and thread-grouped) counter for
statistics.
On backend side, multiplexer layer is initialized during
connect_server(). However, this step is not performed if ALPN is used,
as the negotiated protocol may be unknown. Multiplexer initialization is
delayed after TLS handshake completion.
There are still exceptions though that forces the MUX to be initialized
even if ALPN is used. One of them was if <mux_proto> server field was
already set at this stage, which is the case when an explicit proto is
selected on the server line configuration. Remove this condition so that
now MUX init is delayed with ALPN even if proto is forced.
The scope of this change should be minimal. In fact, the only impact
concerns server config with both proto and ALPN set, which is pretty
unlikely as it is contradictory.
The main objective of this patch is to prepare QUIC support on the
backend side. Indeed, QUIC proto will be forced on the server if a QUIC
address is used, similarly to bind configuration. However, we still want
to delay MUX initialization after QUIC handshake completion. This is
mandatory to know the selected application protocol, required during
QUIC MUX init.
From connect_server(), QUIC protocol could not be retreived by protocol_lookup()
because of the PROTO_TYPE_STREAM default passed as argument. In place to support
QUIC srv->addr_type.proto_type may be safely passed.