Commit graph

8097 commits

Author SHA1 Message Date
Christopher Faulet
c9fa78e747 MINOR: stream: Replace last_rule_file/line fields by a more generic field
The last evaluated rule is now saved in a generic structure, named
last_entity, with a type to identify it. The idea is to be able to store
other kind of entity that may interrupt a specific processing.

The type of the last evaluated rule is set to 1. It will be replace later by
an enum to be more explicit. In addition, the pointer to the rule itself is
saved instead of its location.

The sample fetch "last_entity" was added to retrieve the information about
it. In this case, it is the rule localtion, the config file containing the
rule followed by the line where the rule is defined, separated by a
colon. This sample fetch is not documented yet.
2024-10-31 16:36:39 +01:00
Amaury Denoyelle
dcf334168c MINOR: quic: move qc_send_mux() prototype into quic_tx.h
qc_send_mux() is defined in quic_tx.c. As such, its prototype is moved
from quic_conn.h to quic_tx.h.
2024-10-31 15:35:31 +01:00
Tristan
18582ede05 MEDIUM: socket: add zero-terminated ABNS alternative
When an abstract unix socket is bound by HAProxy (using "abns@" prefix),
NUL bytes are appended at the end of its path until sun_path is filled
(for a total of 108 characters).

Here we add an alternative to pass only the non-NUL length of that path
to connect/bind calls, such that the effective path of the socket's name
is as humanly written. This may be useful to interconnect with existing
softwares that implement abstract sockets with this logic instead of the
default haproxy one.

This is achieved by implementing the "abnsz" socket prefix (instead of
"abns"), which stands for "zero-terminated ABNS". "abnsz" prefix may be
used anywhere "abns" is. Internally, haproxy uses the custom socket
family (AF_CUST_ABNS vs AF_CUST_ABNSZ) to differentiate default abns
sockets from zero-terminated ones.

Documentation was updated and regtest was added.

Fixes GH issues #977 and #2479

Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
2024-10-29 12:15:24 +01:00
Aurelien DARRAGON
43861e3234 MEDIUM: sock_unix: use per-family addrcmp function
Thanks to previous commit, we may now use dedicated addrcmp functions for
each UNIX address family. This allows to simplify sock_unix_addrcmp()
function and avoid useless checks in order to try to guess the socket
type.

In this patch we implement sock_abns_addrcmp() and sock_abnsz_addrcmp()
functions, which are respectively used for ABNS and ABNSZ custom families

sock_unix_addrcmp() now only holds regular UNIX socket comparing logic.
2024-10-29 12:15:09 +01:00
Willy Tarreau
d24768ab44 MINOR: protocol: create abnsz socket address family
For now it's the same as abns. We'll need to modify sock_unix_addrcmp(),
and a few other ones to support effective path length when dealing with
the \0. Let's check with Tristan's patch for this (upcoming patch).

Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
2024-10-29 12:14:50 +01:00
Willy Tarreau
78ac312bbd MEDIUM: protocol: make abns a custom unix socket address family
This is a pre-requisite to adding the abnsz socket address family:

in this patch we make use of protocol API rework started by 732913f
("MINOR: protocol: properly assign the sock_domain and sock_family") in
order to implement a dedicated address family for ABNS sockets (based on
UNIX parent family).

Thanks to this, it will become trivial to implement a new ABNSZ (for abns
zero) family which is essentially the same as ABNS but with a slight
difference when it comes to path handling (ABNS uses the whole sun_path
length, while ABNSZ's path is zero terminated and evaluation stops at 0)

It was verified that this patch doesn't break reg-tests and behaves
properly (tests performed on the CLI with show sess and show fd).

Anywhere relevant, AF_CUST_ABNS is handled alongside AF_UNIX. If no
distinction needs to be made, real_family() is used to fetch the proper
real family type to handle it properly.

Both stream and dgram were converted, so no functional change should be
expected for this "internal" rework, except that proto will be displayed
as "abns_{stream,dgram}" instead of "unix_{stream,dgram}".

Before ("show sess" output):
  0x64c35528aab0: proto=unix_stream src=unix:1 fe=GLOBAL be=<NONE> srv=<none> ts=00 epoch=0 age=0s calls=1 rate=0 cpu=0 lat=0 rq[f=848000h,i=0,an=00h,ax=] rp[f=80008000h,i=0,an=00h,ax=] scf=[8,0h,fd=21,rex=10s,wex=] scb=[8,1h,fd=-1,rex=,wex=] exp=10s rc=0 c_exp=

After:
  0x619da7ad74c0: proto=abns_stream src=unix:1 fe=GLOBAL be=<NONE> srv=<none> ts=00 epoch=0 age=0s calls=1 rate=0 cpu=0 lat=0 rq[f=848000h,i=0,an=00h,ax=] rp[f=80008000h,i=0,an=00h,ax=] scf=[8,0h,fd=22,rex=10s,wex=] scb=[8,1h,fd=-1,rex=,wex=] exp=10s rc=0 c_exp=

Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
2024-10-29 12:14:25 +01:00
William Lallemand
596db3ef86 BUG/MINOR: trace: stop rewriting argv with -dt
When using trace with -dt, the trace_parse_cmd() function is doing a
strtok which write \0 into the argv string.

When using the mworker mode, and reloading, argv was modified and the
trace won't work anymore because the first : is replaced by a '\0'.

This patch fixes the issue by allocating a temporary string so we don't
modify the source string directly. It also replace strtok by its
reentrant version strtok_r.

Must be backported as far as 2.9.
2024-10-29 11:01:47 +01:00
Aurelien DARRAGON
24131dee30 MINOR: tools: add strnlen2() helper
strnlen2() is functionally equivalent to strnlen(). Goal is to provide
an alternative to strnlen() which is not portable since it requires
_POSIX_C_SOURCE >= 200809L
2024-10-28 14:59:35 +01:00
Willy Tarreau
fba48e1c40 MINOR: pools: export the pools variable
We want it to be accessible from debuggers for inspection and it's
currently unavailable. Let's start by exporting it as a first step.
2024-10-24 16:12:46 +02:00
Christopher Faulet
362de90f3e BUG/MINOR: stconn: Don't disable 0-copy FF if EOS was reported on consumer side
There is no reason to disable the 0-copy data forwarding if an end-of-stream
was reported on the consumer side. Indeed, the consumer will send data in
this case. So there is no reason to check the read side here.

This patch may be backported as far as 2.9.
2024-10-24 12:07:50 +02:00
Amaury Denoyelle
7a02fcaf20 BUG/MEDIUM: server: fix race on servers_list during server deletion
Each server is inserted in a global list named servers_list on
new_server(). This list is then only used to finalize servers
initialization after parsing.

On dynamic server creation, there is no issue as new_server() is under
thread isolation. However, when a server is deleted after its refcount
reached zero, srv_drop() removes it from servers_list without lock
protection. In the longterm, this can cause list corruption and crashes,
especially if multiple adjacent servers are removed in parallel.

To fix this, convert servers_list to a mt_list. This should not impact
performance as servers_list is not used during runtime outside of server
creation/deletion.

This should fix github issue #2733. Thanks to Chris Staite who first
found the issue here.

This must be backported up to 2.6.
2024-10-24 11:35:57 +02:00
Valentine Krasnobaeva
ddb829bb51 MINOR: mworker/cli: split mworker_cli_proxy_create
There are two parts in mworker_cli_proxy_create(): allocating and setting up
MASTER proxy and allocating and setting up servers on ipc_fd[0] of the
sockpairs shared with workers.

So, let's split mworker_cli_proxy_create() into two functions respectively.
Each of them takes **errmsg as an argument to write an error message, which may
be triggered by some subcalls. The content of this errmsg will allow to extend
the final alert message shown to user, if these new functions will fail.

The main goals of this split is to allow to move these two parts independantly
in future and makes the code of haproxy initialization in haproxy.c more
transparent.
2024-10-24 11:32:20 +02:00
Willy Tarreau
19e4ec43b9 MINOR: filters: add per-filter call counters
The idea here is to record how many times a filter is being called on a
stream. We're incrementing the same counter all along, regardless of the
type of event, since the purpose is essentially to detect one that might
be misbehaving. The number of calls is reported in "show sess all" next
to the filter name. It may also help detect suboptimal processing. For
example compressing 1GB shows 138k calls to the compression filter, which
is roughly two calls per buffer. Maybe we wake up with incomplete buffers
and compress less. That's left for a future analysis.
2024-10-22 20:13:00 +02:00
Willy Tarreau
37d5c6fe3a MINOR: stream: maintain per-stream counters of the number of passes on code
Process_stream() is a complex function and a few times some lopos were
either witnessed or suspected. Each time this happens it's extremely
difficult to figure why because it involves combinations of analysers,
filters, errors etc.

Let's at least maintain a set of 4 counters per stream that report the
number of times we've been through each of the 4 most important blocks
(stconn changes, request analysers, response analysers, and propagation
of changes down). These ones are stored in the stream and reported in
"show sess all", just like they will be reported in panic dumps.
2024-10-22 20:13:00 +02:00
Willy Tarreau
da66c42f65 MINOR: debug: add a new debug macro COUNT_IF()
This macro works exactly like BUG_ON() except that it never logs anything
nor crashes, it only implements an atomic counter that is incremented on
every call. This can be used to count a number of unlikely events that are
worth checking at run time on setups showing unusual and unreproducible
behaviors.
2024-10-21 19:14:07 +02:00
Willy Tarreau
776fd03509 MEDIUM: debug: add match counters for BUG_ON/WARN_ON/CHECK_IF
These macros do not always kill the process, and sometimes it would be
nice to know if some match or not, and how many times (especially for the
CHECK_IF one).

This commit adds a new section "dbg_cnt" made of structs that contain
function name, file name, line number, check type, condition and match
count. A newe macro __DBG_COUNT() adds one to the counter, and is placed
inside _BUG_ON() and _BUG_ON_ONCE(). It's worth noting that the exact
type of the check is not very precise but in practice we don't care,
as most checks will cause the process to die anyway unless they're of
type _BUG_ON_ONCE() (used by CHECK_IF by default).

All of this is limited to !defined(USE_OBSOLETE_LINKER) because we're
creating a section, thus we need a modern linker to be able to scan
this section later. Doing so adds ~50kB to the executable due to the
~1266 BUG_ON() and others placed there. That's not huge in comparison
to the visibility it can provide.
2024-10-21 19:14:07 +02:00
Willy Tarreau
8844ed2009 CLEANUP: debug: make the BUG_ON() macros check the condition in the outer one
The BUG_ON() macros are made of two levels so as to resolve the condition
to a string. However this doesn't offer much flexibility for performing
other operations when the condition is validated, so let's adjust them so
that the condition is checked in the outer macro and the operations are
performed in the inner one.
2024-10-21 18:17:25 +02:00
Willy Tarreau
278b9613a3 MEDIUM: debug: on panic, make the target thread automatically allocate its buf
One main problem with panic dumps is that they're filling the dumping
thread's trash, and that the global thread_dump_buffer is too small to
catch enough of them.

Here we're proceeding differently. When dumping threads for a panic, we're
passing the magic value 0x2 as the buffer, and it will instruct the target
thread to allocate its own buffer using get_trash_chunk() (which is signal
safe), so that each thread dumps into its own buffer. Then the thread will
wait for the buffer to be consumed, and will assign its own thread_dump_buffer
to it. This way we can simply dump all threads' buffers from gdb like this:

  (gdb) set $t=0
        while ($t < global.nbthread)
          printf "%s\n", ha_thread_ctx[$t].thread_dump_buffer.area
          set $t=$t+1
        end

For now we make it wait forever since it's only called on panic and we
want to make sure the thread doesn't leave and continues to use that trash
buffer or do other nasty stuff. That way the dumping thread will make all
of them die.

This would be useful to backport to the most recent branches to help
troubleshooting. It backports well to 2.9, except for some trivial
context in tinfo-t.h for an updated comment. 2.8 and older would also
require TAINTED_PANIC. The following previous patches are required:

   MINOR: debug: make mark_tainted() return the previous value
   MINOR: chunk: drop the global thread_dump_buffer
   MINOR: debug: split ha_thread_dump() in two parts
   MINOR: debug: slightly change the thread_dump_pointer signification
   MINOR: debug: make ha_thread_dump_done() take the pointer to be used
   MINOR: debug: replace ha_thread_dump() with its two components
2024-10-19 16:01:52 +02:00
Willy Tarreau
afeac4bc02 MINOR: debug: replace ha_thread_dump() with its two components
At the few places we were calling ha_thread_dump(), now we're
calling separately ha_thread_dump_fill() and ha_thread_dump_done()
once the data are consumed.
2024-10-19 15:42:34 +02:00
Willy Tarreau
8e048603d1 MINOR: debug: make mark_tainted() return the previous value
Since mark_tainted() uses atomic ops to update the tainted status, let's
make it return the prior value, which will allow the caller to detect
if it's the first one to set it or not.
2024-10-19 15:13:47 +02:00
Willy Tarreau
84340d108b OPTIM: buffers: avoid a useless wrapping check for ofs == 0
As mentioned in previous commit, b_peek_ofs() performs a wrapping check
but is often called with ofs == 0 as a constant. We can detect this case
with __builtin_const_p() so it makes sense to use it. A test shows a size
reduction of about 320 bytes, which is not much, but it happens in hot code
paths, and each 16 bytes reduction indicates an eliminated conditional
branch.

Some clear winners are ci_getblk_nc() (-48 bytes), h2c_dec_hdrs (-141B),
h1_copy_msg_data (-124B), tcpcheck_spop_expect_hello (-80B),
h1_parse_msg_data (-44B). These ones will definitely benefit from doing
less conditional jumps.
2024-10-18 18:42:47 +02:00
Willy Tarreau
8b5a1fd1fc BUILD: buffers: keep b_getblk_nc() and b_peek_varint() in buf.h
Some large functions were moved to buf.c by commit ac66df4e2 ("REORG:
buffers: move some of the heavy functions from buf.h to buf.c"). However,
as found by Amaury, haring doesn't build anymore. Upon close inspection,
b_getblk_nc() isn't that big since it's very much inlinable, and a part
of its apparently large size comes from the BUG_ON_HOT() that were
implemented. Regarding b_peek_varint(), it doesn't have any dependency
and is used only at 4 places in the DNS code, so its loop will not have
big impacts, and the rest around can be optimised away by the compiler
so it remains relevant to keep it inlined. Also it can serve as a base
to deduplicate the code in b_get_varint().

No backport needed.
2024-10-18 17:53:25 +02:00
Dragan Dosen
f33e9079a9 MINOR: arg: add an argument type for identifier
The ARGT_ID argument type may now be used to set a custom resolve
function in order to help resolve the argument string value. If the
custom resolve function is not set, the behavior is the same as of
type ARGT_STR.
2024-10-18 14:30:24 +02:00
Frederic Lecaille
b1af5dabf0 BUG/MEDIUM: quic: avoid freezing 0RTT connections
This issue came with this commit:

	f627b92 BUG/MEDIUM: quic: always validate sender address on 0-RTT

and could be easily reproduced with picoquic QUIC client with -Q option
which splits a big ClientHello TLS message into two Initial datagrams.
A second condition must be fulfilled to reprodue this issue: picoquic
must not send the token provided by haproxy (NEW_TOKEN). To do that,
haproxy must be patched to prevent it to send such tokens.

Under these conditions, if haproxy has enough time to reply to the first Initial
datagrams, when it receives the second Initial datagram it sends a Retry paquet.
Then the client ignores the Retry paquet as mentionned by RFC 9000:

 17.2.5.2. Handling a Retry Packet
    A client MUST accept and process at most one Retry packet for each connection
    attempt. After the client has received and processed an Initial or Retry packet
    from the server, it MUST discard any subsequent Retry packets that it receives.

On its side, haproxy has closed the connection. When it receives the second
Initial datagram, it open a new connection but with Initial packets it
cannot decrypt (wrong ODCID) leaving the client without response.

To fix this, as the aim of the token (NEW_TOKEN) sent by haproxy is to validate
the peer address, in place of closing the connection when no token was received
for a 0RTT connection, one leaves this validation to the handshake process.
Indeed, the peer adress is validated during the handshake when a valid handshake
packet is received by the listener. But as one does not want haproxy to process
0RTT data when no token was received, one does not accept the connection before
the successful handshake completion. In addition to this, the 0RTT packets
are not released after successful handshake completion when no token was received
to leave a chance to haproxy to process these 0RTT data in such case (see
quic_conn_io_cb()).

Must be backported as far as 2.9.
2024-10-17 15:04:06 +02:00
Christopher Faulet
52a3d807fc BUG/MAJOR: filters/htx: Add a flag to state the payload is altered by a filter
When a filter is registered on the data, it means it may change the payload
length by rewritting data. It means consumers of the message cannot trust the
expected length of payload as announced by the producer. The commit 8bd835b2d2
("MEDIUM: filters/htx: Don't rely on HTX extra field if payload is filtered")
was pushed to solve this issue. When the HTTP payload of a message is filtered,
the extra field is set to 0 to be sure it will never be used by error by any
consumer. However, it is not enough.

Indeed, the filters must be called before fowarding some data. They cannot be
by-passed. But if a consumer is unable to flush the HTX message, some outgoing
data can remain blocked in the channel's buffer. If some new data are then
pushed because there is some room in the channel's buffe, the producer will set
the HTX extra field. At this stage, if the consumer is unblocked and can send
again data, it is possible to call it to forward outgoing data blocked in the
channel's buffer before waking the stream up to filter new input data. It is the
purpose of the data fast-forwarding. In this case, the HTX extra field will be
seen by the consumer. It is unexpected and leads to undefined behavior.

One consequence of this bug is to perform a wrong chunking on compressed
messages, leading to processing errors at the end of the message, reported as
"ID--" in logs.

To fix the bug, a HTX flag is added to state the payload of the current HTX
message is altered. When this flag is set (HTX_FL_ALTERED_PAYLOAD), the HTX
extra field must not be trusted. And to keep things simple, when this flag is
set, the HTX extra field is automatically set to 0 when the HTX message is
loaded, in htxbuf() function.

It is probably the less intrusive way to fix the bug for now. But this part must
be reviewed to save meta-info of the HTX message outside of the message itself.

This commit should solve the issue #2741. It must be backported as far as 2.9.
2024-10-17 13:54:54 +02:00
Valentine Krasnobaeva
b73a278df4 MINOR: mworker/cli: add _send_status to support state transition
In the new master-worker architecture, when a worker process is forked and
successfully initialized it needs somehow to communicate its "READY" state to
the master, in order to terminate the previous worker and workers, that might
exceeded max_reloads counter.

So, let's implement for this a new master CLI _send_status command. A new
worker can send its status string "READY" to the master, when it's about
entering to the run poll loop, thus it can start to receive data.

In _send_status() in the master context we update the status of the new worker:
PROC_O_INIT flag is withdrawn.

When TERM signal is sent to a worker, worker terminates and this triggers the
mworker_catch_sigchld() handler in master. This handler deletes the exiting
process entry from the processes list.

In _send_status() we loop over the processes list twice. At the first time, in
order to stop workers that exceeded the max_reloads counter. At the second time,
in order to stop the worker forked before the last reload. In the corner case,
when max_reloads=1, we avoid to send SIGTERM twice to the same worker by
setting sigterm_sent flag during the first loop.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
2bb07b913d MINOR: startup: rename and adapt reexec_on_failure
Previously reexec_on_failure() was called in cases when the process has failed
after reload, while it was parsing its configuration or it was trying to apply
it. reexec_on_failure() has called mworker_reexec() and the master process has
been reexecuted.

With the new architecture in such cases there is no longer need to reexecute
the master process after its reload again. It simply keeps the previous worker,
forked before the reload, and it lets the new one to exit with an error. But we
still need the code, which increments the number of failed reloads and which
notifies systemd with new "Reload failed!" status. So, let's reuse and adapt
for this reexec_on_failure() and let's rename it to on_new_child_failure().
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
646299fc95 MINOR: mworker: add and set state PROC_O_INIT for new worker
Here, to distinguish between the new worker and the previous one let's add a
new process state PROC_O_INIT and let's set it, when the memory is allocated
for the new worker in the processes list.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
cc1a631beb MINOR: mworker/cli: rename and clean mworker_cli_sockpair_new
Let's rename mworker_cli_sockpair_new() to
mworker_cli_global_proxy_new_listener() to outline that this function creates
the GLOBAL proxy, allocates the listener with "master-socket" bind conf and
attaches this listener to this GLOBAL proxy. Listener is bound to ipc_fd[1] of
the sockpair inherited in master and in worker (master CLI sockpair).
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
0fbf1973ad MINOR: mworker/cli: rename mworker_cli_proxy_new_listener
This is the first commit in a series to add the support of 4 primary reload
use-cases for the new master-worker architecture:

1. Newly forked worker process dies before any reload, due to some errors in
   the configuration. Newly forked worker process crashes before any reload
   after sending its "READY" state to master.

2. Newly forked worker process dies due to some errors in the new
   configuration. This happens after reload, when this new configuration was
   supplied, so the previous worker process is still here.

3. Newly forked worker process crashes after sending its "READY" state to
   master due to some bugs. This happens after reload, so the previous worker
   process is still here.

4. Newly forked worker process has sent its "READY" state to master and starts
   to receive traffic. This happens after reload, the old worker hasn't
   terminated yet, as it is waiting on some idle connection and it crashes.

Let's rename in this commit mworker_cli_proxy_new_listener() to
mworker_cli_master_proxy_new_listener() to outline, that this function creates
"master-socket" bind conf and allocates a listener. This listener is attached
to the MASTER proxy and it's bound to the ipc_fd[0] of the sockpair,
inherited in master and in worker processes (master CLI sockpair).
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
f9123e2183 MEDIUM: cfgparse: add KWF_DISCOVERY keyword flag
This commit is a part of the series to add a support of discovery mode in the
configuration parser and in initialization sequence.

So, let's add here KWF_DISCOVERY flag to distinguish the keywords,
which should be parsed in "discovery" mode and which are needed for master
process, from all others. Keywords, that should be parsed in "discovery" mode
have its dedicated parser funtions. Let's tag these functions with
KWF_DISCOVERY flag in keywords list. Like this, only these keyword parsers
might be called during the first configuration read in discovery mode.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
6769745fe5 MINOR: global: add MODE_DISCOVERY flag
This is the first commit from a series to add a support of discovery mode
in the configuration parser and in initialization sequence.

Discovery mode is the mode, when we read the configuration at the first time
and we parse and set runtime modes: daemon, zero-warning, master-worker. In
this mode we also parse some parameters needed for the master process to start,
in case if we are in the master-worker mode. Like this the master process
doesn't allocate any additional resources, which it doesn't use and it quickly
finishes its initialization and enters to its polling loop. The worker process
after its fork reads the rest of the configuration.

So, let's add in this commit MODE_DISCOVERY flag to check it in
configuration parser functions.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
fe75c1e12d MEDIUM: startup: remove MODE_MWORKER_WAIT
MODE_MWORKER_WAIT becames redundant with MODE_MWORKER, due to moving
master-worker fork in init(). This change allows master no longer perform
reexec just after forking in order to free additional memory.

As after the fork in the master process we set 'master' variable, we can
replace now MODE_MWORKER_WAIT in some 'if' statements by simple check of this
'master' variable.

Let's also continue to get rid of HAPROXY_MWORKER_WAIT_ONLY environment
variable, as it's no longer needed as well.

In cfg_program_postparser(), which is used to check if cmdline is defined to
launch a program, we completely remove the check of mode for now, because
the master process does not parse the configuration for the moment. 'program'
section parsing will be reintroduced in master later in the next commits.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
fb7bef781d MINOR: defaults: update MASTER_MAXCONN description
This is a one of the commits to prepare the removal of MODE_MWORKER_WAIT
support, as it became redundant with MODE_MWORKER due to moving master-worker
fork in init().
2024-10-16 22:02:39 +02:00
Willy Tarreau
2c2dac77aa DEBUG: mux-h2/flags: add H2_CF_DEM_RXBUF & H2_SF_EXPECT_RXDATA for the decoder
Both flags were recently added but missing from the decoders flags, so
they appeared in hex in dev/flags/flags output. No backport needed.
2024-10-16 18:32:52 +02:00
Aurelien DARRAGON
85298189bf BUG/MEDIUM: server: server stuck in maintenance after FQDN change
Pierre Bonnat reported that SRV-based server-template recently stopped
to work properly.

After reviewing the changes, it was found that the regression was caused
by a4d04c6 ("BUG/MINOR: server: make sure the HMAINT state is part of MAINT")

Indeed, HMAINT is not a regular maintenance flag. It was implemented in
b418c122 a4d04c6 ("BUG/MINOR: server: make sure the HMAINT state is part
of MAINT"). This flag is only set (and never removed) when the server FQDN
is changed from its initial config-time value. This can happen with "set
server fqdn" command as well as SRV records updates from the DNS. This
flag should ideally belong to server flags.. but it was stored under
srv_admin enum because cur_admin is properly exported/imported via server
state-file while regular server's flags are not.

Due to a4d04c6, when a server FQDN changes, the server is considered in
maintenance, and since the HMAINT flag is never removed, the server is
stuck in maintenance.

To fix the issue, we partially revert a4d04c6. But this latter commit is
right on one point: HMAINT flag was way too confusing and mixed-up between
regular MAINT flags, thus there's nothing to blame about a4d04c6 as it was
error-prone anyway.. To prevent such kind of bugs from happening again,
let's rename HMAINT to something more explicit (SRV_ADMF_FQDN_CHANGED) and
make it stand out under srv_admin enum so we're not tempted to mix it with
regular maintenance flags anymore.

Since a4d04c6 was set to be backported in all versions, this patch must
be backported there as well.
2024-10-16 14:26:57 +02:00
Amaury Denoyelle
0918c41ef6 BUG/MEDIUM: quic: support wait-for-handshake
wait-for-handshake http-request action was completely ineffective with
QUIC protocol. This commit implements its support for QUIC.

QUIC MUX layer is extended to support wait-for-handshake. A new function
qcc_handle_wait_for_hs() is executed during qcc_io_process(). It detects
if MUX processing occurs after underlying QUIC handshake completion. If
this is the case, it indicates that early data may be received. As such,
connection is flagged with CO_FL_EARLY_SSL_HS, which is necessary to
block stream processing on wait-for-handshake action.

After this, qcc subscribs on quic_conn layer for RECV notification. This
is used to detect QUIC handshake completion. Thus,
qcc_handle_wait_for_hs() can be reexecuted one last time, to remove
CO_FL_EARLY_SSL_HS and notify every streams flagged as
SE_FL_WAIT_FOR_HS.

This patch must be backported up to 2.6, after a mandatory period of
observation. Note that it relies on the backport of the two previous
patches :
- MINOR: quic: notify connection layer on handshake completion
- BUG/MINOR: stream: unblock stream on wait-for-handshake completion
2024-10-16 11:51:35 +02:00
Willy Tarreau
4eb3ff1d3b MAJOR: mux-h2: make streams use the connection's buffers
For now it seems to work as before, and even when artificially inflating
the number of allocatable buffers per stream. The number of allocated
slots is always the same as the max number of streams, which guarantees
that each stream will find one buffer. we only grant one buffer per
stream at this point, since the goal was to replace the existing single
rxbuf.

A new demux blocking flag, H2_CF_DEM_RXBUF, was added to indicate
a failure to get an rxbuf slot from the connection. It was lightly
tested (by forcing bl_init() to a lower number of buffers). It is not
yet certain whether it's more useful to have a new flag or to reuse
the existing H2_CF_DEM_SFULL which indicates the rxbuf is full,
but at least the new flag more accurately translates the condition,
that may make a difference in the future. However, given that when
RXBUF is set, most of the time it results in a failure to find more
room to demux and it sets SFULL, for now we have to always clear
SFULL when clearing RXBUF as well. This means that most of the time
we'll see 3 combinations:
  - none: everything's OK
  - SFULL: the unique rx buffer is full
  - RXBUF || (RXBUF|SFULL): cannot allocate more entries

Note that we need to be super careful in h2_frt_transfer_data() because
the htx_free_data_space() function doesn't guarantee that the room is
usable, so htx_add_data() may still fail despite an apparent room. For
this reason, h2_frt_transfer_data() maintains a "full" flag to indicate
that a transfer attempt failed and that a new buffer is required.
2024-10-12 16:29:16 +02:00
Willy Tarreau
3b5ac2b553 MINOR: mux-h2: move H2_CF_WAIT_IN_LIST flag away from the demux flags
It's not convenient to have this flag in the middle of the demux flags,
it easily hides other ones that need to be added. Let's move it after
the other ones.
2024-10-12 16:29:16 +02:00
Willy Tarreau
721ea5b06c MINOR: mux-h2: count within a connection, how many streams are receiving data
A stream is receiving data from after the HEADERS frame missing END_STREAM,
to the end of the stream or HREM (the presence of END_STREAM). We're now
adding a flag to the stream that indicates this state, as well as a counter
in the connection of streams currently receiving data. The purpose will be
to gauge at any instant the number of streams that might have to share the
available bandwidth and buffers count in order not to allocate too much flow
control to any single stream. For now the counter is kept up to date, and is
reported in "show fd".
2024-10-12 16:29:16 +02:00
Willy Tarreau
8f09bdce10 MINOR: buffer: add a buffer list type with functions
The buffer ring is problematic in multiple aspects, one of which being
that it is only usable by one entity. With multiplexed protocols, we need
to have shared buffers used by many entities (streams and connection),
and the only way to use the buffer ring model in this case is to have
each entity store its own array, and keep a shared counter on allocated
entries. But even with the default 32 buf and 100 streams per HTTP/2
connection, we're speaking about 32*101*32 bytes = 103424 bytes per H2
connection, just to store up to 32 shared buffers, spread randomly in
these tables. Some users might want to achieve much higher than default
rates over high speed links (e.g. 30-50 MB/s at 100ms), which is 3 to 5
MB storage per connection, hence 180 to 300 buffers. There it starts to
cost a lot, up to 1 MB per connection, just to store buffer indexes.

Instead this patch introduces a variant which we call a buffer list.
That's basically just a free list encoded in an array. Each cell
contains a buffer structure, a next index, and a few flags. The index
could be reduced to 16 bits if needed, in order to make room for a new
struct member. The design permits initializing a whole freelist at once
using memset(0).

The list pointer is stored at a single location (e.g. the connection)
and all users (the streams) will just have indexes referencing their
first and last assigned entries (head and tail). This means that with
a single table we can now have all our buffers shared between multiple
streams, irrelevant to the number of potential streams which would want
to use them. Now the 180 to 300 entries array only costs 7.2 to 12 kB,
or 80 times less.

Two large functions (bl_deinit() & bl_get()) were implemented in buf.c.
A basic doc was added to explain how it works.
2024-10-12 16:29:15 +02:00
Willy Tarreau
ac66df4e2e REORG: buffers: move some of the heavy functions from buf.h to buf.c
Over time, some of the buffer management functions grew quite a bit,
and were still forced to remain inlined since all defined in buf.h.
Let's create buf.c and move the heaviest ones there. All those moved
here were above 200 bytes.
2024-10-12 16:29:15 +02:00
Aurelien DARRAGON
1bdf6e884a MEDIUM: sink: implement sink_find_early()
sink_find_early() is a convenient function that can be used instead of
sink_find() during parsing time in order to try to find a matching
sink even if the sink is not defined yet.

Indeed, if the sink is not defined, sink_find_early() will try to create
it and mark it as forward-declared. It will also save informations from
the caller to better identify it in case of errors.

If the sink happens to be found in the config, it will transition from
forward-declared type to its final type. Else, it means that the sink
was not found in the config, in this case, during postresolve, we raise
an error to indicate that the sink was not found in the configuration.

It should help solve postresolving issue with rings, because for now only
log targets implement proper ring postresolving.. but rings may be used
at different places in the code, such as debug() converter or in "traces"
section.
2024-10-10 16:55:15 +02:00
Aurelien DARRAGON
0e271f1d2a MINOR: log: add do_log_parse_act() helper func
Function may be used from places where per-context actions are usually
registered (tcp_act.c, http_act.c, quic_rules.c.. to name a few) in
order to expose the do_log() action.
2024-10-04 21:38:08 +02:00
Aurelien DARRAGON
e63c7da508 MINOR: log: add do_log() logging helper
do_log() is quite similar to sess_log() or strm_log(), excepts that it
may be called at any time during session handling in an opportunistic
way as long as the session exists (the stream may or may not exist).

Also, it will try to emit the log as INFO by default, unless set-log-level
is used on the stream, or error origin flag is set.
2024-10-04 21:38:02 +02:00
Amaury Denoyelle
f6599cf5a6 MEDIUM: quic: decount out-of-order ACK data range for MUX txbuf window
This commit is the last one of a serie whose objective is to restore
QUIC transfer throughput performance to the state prior to the recent
QUIC MUX buffer allocator rework.

This gain is obtained by reporting received out-of-order ACK data range
to the QUIC MUX which can then decount room in its txbuf window. This is
implemented in QUIC streamdesc layer by adding a new invokation of
notify_room callback. This is done into qc_stream_buf_store_ack() which
handle out-of-order ACK data range.

Previous commit has introduced merging of overlapping ACK data range. As
such, it's easy to only report the newly acknowledged data range.

As with in-order ACKs, this new notification is only performed on
released streambuf. As such, when a streambuf instance is released,
notify_room notification now also reports the total length of
out-of-order ACK data range currently stored. This value is stored in a
new streambuf member <room> to avoid unnecessary tree lookup.

This <room> member also serves on in-order ACK notification to reduce
the notified room. This prevents to report invalid values when overlap
ranges are treated first out-of-order and then in-order, which would
cause an invalid QUIC MUX txbuf window value.

After this change has been implemented, performance has been
significantly improved, both with ngtcp2-client rate usage and on
interop goodput test. These values are now similar to the rate observed
on older haproxy version before QUIC MUX buffer allocator rework.
2024-10-04 18:09:51 +02:00
Amaury Denoyelle
e7578084b0 MINOR: quic: implement dedicated type for out-of-order stream ACK
QUIC streamdesc layer is responsible to handle reception of ACK for
streams. It removes stream data from the underlying buffers on ACK
reception.

Streamdesc layer treats ACK in order at the stream level. Out of order
ACKs are buffered in a tree until they can be handled on older data
acknowledgement reception. Previously, qf_stream instance which comes
from the quic_tx_packet was used as tree node to buffer such ranges.

Introduce a new type dedicated to represent out of order stream ack data
range. This type is named qc_stream_ack. It contains minimal infos only
relative to the acknowledged stream data range.

This allows to reduce size of frequently used quic_frame with the
removal of tree node from qf_stream. Another side effect of this change
is that now quic_frame are always released immediately on ACK reception,
both in-order and out-of-order. This allows to also release the
quic_tx_packet instance which should reduce memory consumption.

The drawback of this change is that qc_stream_ack instance must be
allocated on out-of-order ACK reception. As such, qc_stream_desc_ack()
may fail if an error happens on allocation. For the moment, such error
is silenly recovered up to qc_treat_rx_pkts() with the dropping of the
received packet containing the ACK frame. In the future, it may be
useful to close the connection as this error may only happens on low
memory usage.
2024-10-04 17:56:45 +02:00
Christopher Faulet
15a520d474 MINOR: config/trace: Add a 'traces' section to declare debug traces
It is no longer supported to declare debug traces, via 'trace' directive, in
a global section. A 'traces' directive must be used instead. The syntax of
the 'trace' directive in these sections remains the same. But it is no
longer experimental.

The main reason for this change is to avoid to have a ring section defined
before a global one. Indeed, for now, forward declarations of ring sections
are not supported. So to configure traces, you had to add a ring section
before the global one defining the traces. Most of time, that meant to have
two global sections :

  global
    [...] # global settings

  ring <name>
    [...]

  global
    [...] # trace config

In addition, it will be possible to easily extend the traces section by
adding some new directives.
2024-10-02 10:22:51 +02:00
Amaury Denoyelle
cc4384aeb7 MEDIUM: quic: handle out-of-order ACK at streamdesc layer
qc_stream_desc_ack() is the entrypoint for streamdesc layer to handle a
new acknowledgement of previously emitted STREAM data.

Previously, it was only able to deal with in-order ACK offset. The
caller was responsible to buffer out-of-order ACKs. Change this by
dealing with the latter case directly in qc_stream_desc_ack(). This
notably simplify ACK handling in quic_rx module.
2024-10-01 16:22:20 +02:00
Amaury Denoyelle
62558a9285 MINOR: quic: move buffered ACK to streambuf
QUIC streamdesc layer is used to manage QUIC MUX stream txbuf data
storage until acknowledgment. Currently, it only supports in-order
acknowledgment at the stream level. This requires to be able to buffer
out-of-order ACKs until they can be handled.

Previously, these ACKs were stored in a tree to the streamdesc instance.
Move this indexed storage at the streambuf instance.

This commit is purely an architecture change. However, it will allow to
extend ACK management in future patches, such as the ability to merge
overlapping out-of-order ACKs.
2024-10-01 16:19:42 +02:00
Amaury Denoyelle
943e48dadd MINOR: quic: store streambuf in a streamdesc tree
qc_stream_desc layer is used by QUIC MUX to store emitted STREAM data
until their acknowledgement. Each stream with Tx capability can allocate
its own qc_stream_desc. In turn, each stream desc can have one or
multiple data buffers. This is useful when a MUX stream releases a
buffer and allocate a new one, to preserve bandwith without waiting to
receive all acknowledgement of the previous buffer.

Each buffer is encapsulated in a qc_stream_buf structure. Previously, it
was stored as a list into qc_stream_desc. Change this storage to use a
tree instead. Each buffer is indexed by their offset.

This commit does not introduce functional changes. However, this
rearchitecture will be necessary for future commit to extend ACK
management which require fetching individual buffer instance, not just
the first or last element of a streamdesc, by their offset.
2024-10-01 16:19:41 +02:00
Amaury Denoyelle
f4a83fbb14 MINOR: quic: do not remove qc_stream_desc automatically on ACK handling
qc_stream_desc_ack() is used to handle ACK received for STREAM frame. It
removes acknowledged data from their underlying buffer.

If all data were removed after ACK handling, qc_stream_desc instance
would automatically be freed at the end of qc_stream_desc_ack().
However, this renders the function complicated to use. Simplify this by
removing this automatic removal. Now, caller is responsible to check
after ACK handling if qc_stream_desc instance can be removed. This is
easily done using qc_stream_desc_done() helper.
2024-10-01 16:19:25 +02:00
Amaury Denoyelle
db68f8ed86 MINOR: quic: refactor STREAM room notification
qc_stream_desc is an intermediary layer between QUIC MUX and quic_conn.
It is a facility which permits to store data to emit and keep them for
retransmission until acknowledgment. This layer is responsible to notify
QUIC MUX each time a buffer is freed. This is necessary as MUX buffer
allocation is limited by the underlying congestion window size.

Refactor this to use a mechanism similar to send notification. A new
callback notify_room can now be registered to qc_stream_desc instance.
This is set by QUIC MUX to qmux_ctrl_room(). On MUX QUIC free, special
care is now taken to reset notify_room callback to NULL.

Thanks to this refactoring, further adjustment have been made to refine
the architecture. One of them is the removal of qc_stream_desc
QC_SD_FL_OOB_BUF, which is now converted to a MUX layer flag
QC_SF_TXBUF_OOB.
2024-10-01 16:19:25 +02:00
Amaury Denoyelle
d7f4e5abf0 MEDIUM: quic: strengthen MUX send notification
Previous commit implement a refactor of MUX send notification from
quic_conn layer. With this new architecture, a proper callback is
defined for each qc_stream_desc instance.

This architecture change allows to simplify notification from quic_conn
layer. First, ensure the MUX callback to properly ignore retransmission
of an already emitted frame. Luckily, this can be handled easily by
comparing offsets and FIN status. Also, each QCS instance can now be
unregistered from send notification just prior qc_stream_desc releasing.
This ensures a QCS is never manipulated from quic_conn after its
emission ending. Both these changes render the send notification more
robust. As a nice effect, flag QUIC_FL_CONN_TX_MUX_CONTEXT can be
removed as it is now unneeded.
2024-10-01 16:19:25 +02:00
Amaury Denoyelle
6ad99af0a9 MINOR: quic: refactor MUX send notification
For STREAM emission, MUX QUIC generates one or several frames and emit
them via qc_send_mux(). Lower layer may use them as-is, or split them to
lower chunk to fit in a QUIC packet. It is then responsible to notify
the MUX to report the amount of data sent.

Previously, this was done via a direct call from quic_conn to MUX using
qcc_streams_sent_done(). Modify this to have a better isolation accross
layers. Define a send callback handled by the qc_stream_desc instance.
This allows the MUX to register each QCS instance individually to the
renamved qmux_ctrl_send() which replaces qcc_streams_sent_done().

At quic_conn layer, qc_stream_desc_send() can be used now. This is a
wrapper to qc_stream_desc layer to invoke the send callback if
registered.

This mechanism of qc_stream_desc callback should be extended later to
implement other notifications accross the QUIC stack.
2024-10-01 16:19:25 +02:00
Christopher Faulet
273d322b6f MINOR: stream/stats: Expose the total number of streams ever created in stats
A shared counter is added in the thread context to track the total number of
streams created on the thread. This number is then reported in stats. It
will be a useful information to diagnose some bugs.
2024-09-30 16:55:53 +02:00
Christopher Faulet
18ee22ff76 MINOR: stream/stats: Expose the current number of streams in stats
A shared counter is added in the thread context to track the current number
of streams. This number is then reported in stats. It will be a useful
information to diagnose some bugs.
2024-09-30 16:55:53 +02:00
Christopher Faulet
6a94b7419e MINOR: stream: Support dynamic changes of the number of connection retries
Thanks to the previous patch, it is now possible to add an action to
dynamically change the maxumum number of connection retires for a stream.
"set-retries" action may now be used to do so, from a "tcp-request content"
or a "http-request" rule. This action accepts an expression or an integer
between 0 and 100. The integer value is checked during the configuration
parsing and leads to an error if it is not in the expected range. However,
for the expression, the value is retrieve at runtime. So, invalid value are
just ignored.

Too high value is forbidden to avoid any trouble. 100 retries seems already
be an amazingly hight value. In addition, the option is only available on
backend or listen sections.

Because the max retries is limited to 100 at most, it can be stored as a
unsigned short. This save some space in the stream structure.
2024-09-30 16:55:53 +02:00
Christopher Faulet
91e785edc9 MINOR: stream: Rely on a per-stream max connection retries value
Instead of directly relying on the backend parameter to limit the number of
connection retries, we now use a per-stream value. This value is by default
inherited from the backend value when it is set. So for now, there is no
change except the stream value is used instead of the backend value. But
thanks to this change, it will be possible to dynamically change this value.
2024-09-30 16:55:53 +02:00
Christopher Faulet
0d91de2be4 MINOR: action: Export release_expr_int_action() release function
This function was only used by TCP actions and was private to tcp_act.c
file. However, it make sense to make it public to be used by any action
relying on an int-or-expression argument.
2024-09-30 16:55:53 +02:00
Willy Tarreau
7caf073faa MINOR: tools: do not attempt to use backtrace() on linux without glibc
The function is provided by glibc. Nothing prevents us from using our
own outside of glibc there (tested on aarch64 with musl). We still do
not enable it by default as we don't yet know if all archs work well,
but it's sufficient to pass USE_BACKTRACE=1 when building with musl to
verify it's OK.
2024-09-29 09:52:23 +02:00
Willy Tarreau
1c4776dbc3 BUILD: tools: only include execinfo.h for the real backtrace() function
No need to include this possibly non-existing file when using our own
backtrace() implementation, it's only needed for the libc-provided one.
Because of this it's currently not possible to build musl with backtrace
enabled.
2024-09-29 09:52:23 +02:00
Willy Tarreau
a4d04c649a BUG/MINOR: server: make sure the HMAINT state is part of MAINT
In 1.8 when adding "set server fqdn" with commit b418c1228c ("MINOR:
server: cli: Add server FQDNs to server-state file and stats socket."),
the HMAINT flag was not made part of the MAINT ones, so technically
speaking when changing the FQDN, the server is not completely considered
as in maintenance mode.

In its defense, the code location around that was completely messy, with
the aggregator flag being hidden between other values and purposely but
discretely ignoring one of the flags, so the comments were updated to
make the intent clearer (particularly regarding CMAINT which looked like
it was also forgotten while it was on purpose).

This can be backported anywhere.
2024-09-27 18:40:15 +02:00
Willy Tarreau
b8e3b0a18d BUG/MEDIUM: stream: make stream_shutdown() async-safe
The solution found in commit b500e84e24 ("BUG/MINOR: server: shut down
streams under thread isolation") to deal with inter-thread stream
shutdown doesn't work fine because there exists code paths involving
a server lock which can then deadlock on thread_isolate(). A better
solution then consists in deferring the shutdown to the stream itself
and just wake it up for that.

The only thing is that TASK_WOKEN_OTHER is a bit too generic and we
need to pass at least 2 types of events (SF_ERR_DOWN and SF_ERR_KILLED),
so we're now leveraging the new TASK_F_UEVT1 and _UEVT2 flags on the
task's state to convey these info. The caller only needs to wake the
task up with these flags set, and the stream handler will then finish
the job locally using stream_shutdown_self().

This needs to be carefully backported to all branches affected by the
dequeuing issue and containing any of the 5541d4995d ("BUG/MEDIUM:
queue: deal with a rare TOCTOU in assign_server_and_queue()"), and/or
b11495652e ("BUG/MEDIUM: queue: implement a flag to check for the
dequeuing").
2024-09-27 12:15:41 +02:00
Willy Tarreau
b5281283bb MINOR: task: define two new one-shot events for use with WOKEN_OTHER or MSG
TASK_WOKEN_MSG only says "someone sent you a message" but doesn't convey
any info about the message. TASK_WOKEN_OTHER says "you're woken for another
reason" but doesn't tell which one. Most often they're used as-is by the
task handlers to report very specific situations.

For some important control notifications, having the ability to modulate
the message a little bit is useful, so let's define two user event types
UEVT1 and UEVT2 to be used in conjunction with TASK_WOKEN_MSG or _OTHER
so that the application can know that a specific condition was explicitly
requested. It will be used this way:

  task_wakeup(s->task, TASK_WOKEN_MSG | TASK_F_UEVT1);
or:
  task_wakeup(s->task, TASK_WOKEN_OTHER | TASK_F_UEVT2);

Since events are cumulative, keep in mind not to consider a 3rd value
as the combination of EVT1+EVT2; these really mean that the two events
appeared (though in unspecified order).
2024-09-27 11:56:10 +02:00
Aurelien DARRAGON
4189eb7aca MINOR: log: add log_orig_proxy() helper function
Function may be used on proxy where log-steps are used to check if a given
log origin should be handled or not.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
c043d5d372 MINOR: log: introduce "log-steps" proxy keyword
For now it is only available for proxies with frontend capability because
log-steps are only evaluated under sess_log() or strm_log() which
essentially focus on the frontend side when it comes to log settings so
it's better to keep it this way for better consistency, at least for now.

For now the setting does nothing (it is not considered during runtime),
it will be implemented and documented in upcoming commits.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
9341792baf MINOR: proxy: add log_steps struct member
add proxy->conf.log_steps eb32 root tree which will be used to store the
log origin identifiers that should result in haproxy emitting a log as
configured by the user using upcoming "log-steps" proxy keyword.

It was chosen to use eb32 tree instead of simple bitfield because despite
the slight overhead it is more future-proof given that we already
implemented the prerequisites for seamless custom log origins registration
that will also be usable from "log-steps" proxy keyword.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
b882402a29 MINOR: log: support extra log origins for '%OG' alias
Following previous commits, let's improve log_orig_to_str() so that
extra log origins (registered through log_orig_register()) can be
translated to string from origin ID.

For that, it is required to add eb_32 tree node to log_origin struct in
order to enable quick integer lookup during runtime. Slow name lookup
using the list is acceptable for config parsing, but it is not the case
during runtime when log_orig_to_str() is expected to be used. Also, to
prevent duplicated info, get rid of ->id field and use ->tree.key instead
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
f8bb9d5c57 MINOR: log: explicitly handle extra log origins as error when relevant
Thanks to previous commit, we can know check for log_orig optional flags
in functions taking struct log_orig as parameter. Let's take this
opportunity to add the LOG_ORIG_FL_ERROR flag and check this flag at a
few places to handle the log message differently because if the flag is
set then the caller expects the log to be handled as an error explicitly.

e.g.: in _process_send_log_override(), if the flag is set, use the error
log format instead of the dedicated one.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
3c15ee05e9 MINOR: log: introduce log_orig flags
Rename 'enum log_orig' to 'enum log_orig_id', since this enum specifically
contains the log origin ids.

Add 'struct log_orig' which wraps 'enum log_orig' with optional flags
(no flags defined for now).

Add log_orig() helper func that takes id and flags as parameter and
returns log_orig struct initialized with input arguments.

Update functions taking log origin as parameter so they explicitly take
log orig id or log orig wrapper as argument depending on the level of
context expected by the function.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
818475c5cc MINOR: log: introduce extra log profile steps
add a way to register additional log origins using log_origin_register()
that may be used as log profile steps from log profile sections.

For now this does nothing as no extra origins are registered and extra log
origins are not yet considered for runtime logging paths.

When specifying an extra logging step for on <step> under log-profile
section, the logging step is stored within a binary tree for efficient
lookup during runtime. No performance impact should be expected if extra
log origins are not being used, and slight performance impact if extra
log origins are used.

Don't forget to update the documentation when new log origins are added
(both %OG log alias and on <step> log-profile keyword are concerned.
2024-09-26 16:53:07 +02:00
Oliver Dala
a889413f5e BUG/MEDIUM: cli: Deadlock when setting frontend maxconn
The proxy lock state isn't passed down to relax_listener
through dequeue_proxy_listeners, which causes a deadlock
in relax_listener when it tries to get that lock.

Backporting: Older versions didn't have relax_listener and directly called
resume_listener in dequeue_proxy_listeners. lpx should just be passed directly
to resume_listener then.

The bug was introduced in commit 001328873c

[cf: This patch should fix the issue #2726. It must be backported as far as
2.4]
2024-09-25 17:12:11 +02:00
Christopher Faulet
96edacc546 DEV: flags/applet: decode appctx flags
Decode APPCTX flags via appctx_show_flags() function.
2024-09-24 18:26:36 +02:00
Willy Tarreau
ccd1ecba1d MEDIUM: cfgparse: drop duplicate named defaults sections after use
It has never been permitted to explicitly reference named defaults
sections for which there are duplicate names. This means that when
a duplicate defaults section is found, there's no point in keeping
it since it will never be used for lookups, so it can be dropped.

However, some such defaults sections might have some rules in them
that are implicitly referenced by proxies placed after them. In this
case they cannot be removed.

What is done here is that upon each new named section creation, if
another one is found with the same name, its config location is stored
into the new proxy's {prev_file,prev_line} pair, and the old section is
either destroyed if its refcount is null, or just unindexed. The dup
check when creating a new proxy now consists in checking the prev_line
instead of performing a dup lookup on the defaults section.

This will guarantee that we can't find duplicate defaults sections in
their tree anymore, while still keeping track of what's allocated and
releasing everything upon exit.

Beyond the consistency gain, there are nice savings for large configs
involving many defaults sections: a test with 300k sections saved
about 1.9 GB of RAM, and started 25% faster likely thanks to spending
less time allocating memory.
2024-09-20 16:35:32 +02:00
Willy Tarreau
c8b813771d MINOR: proxy: add a list of orphaned defaults sections
We'll soon delete unreferenced and duplicated named defaults sections
from the list of proxies. The problem with this is that this list (in
fact a name-based tree) is used to release all of them at the end. Let's
add a list of orphaned defaults sections, typically those containing
"http-check send" statements or various other rules, and that are
implicitly inherited by a proxy hence have a non-zero refcount while
also having a name. These now makes it possible to remove them from
the name index while still keeping their memory around for the lifetime
of the process, and cleaning it at the end.
2024-09-20 15:59:04 +02:00
Willy Tarreau
b325453c36 MINOR: proxy: use the global file names for conf->file
Proxy file names are assigned a bit everywhere (resolvers, peers,
cli, logs, proxy). All these elements were enumerated and now use
copy_file_name(). The only ha_free() call was turned to drop_file_name().

As a bonus side effect, a 300k backend config saved 14 MB of RAM.
2024-09-19 15:38:19 +02:00
Willy Tarreau
9ab21a3c2d CLEANUP: stick-table: make the file location point to a global file name
The file name used to point to the calling function's stack for stick
tables, which was OK during parsing but remained dangling afterwards.
At least it was already marked const so as not to accidentally free it.
Let's make it point to a file_name_node now.
2024-09-19 15:38:19 +02:00
Willy Tarreau
d6c060c5ae MINOR: tools: add minimal file name management
In proxies, stick-tables, servers, etc... at plenty of places we store
a file name and a line number. Some file names are the result of strdup()
(e.g. in proxies), others not (e.g. stick-tables) and leave dangling
pointers at the end of parsing. The risk of double-free is not null
either.

In order to stop this, let's first add a simple tool that allows to
register short strings inside a global list, these strings happening
to be server names. The strings are either duplicated and stored upon
failure to find them, or just added to this storage. Since file names
are not expected to disappear before the end of the process, for now
we don't even implement refcounting, and we free them all at the end.
There's already a drop_file_name() function to reset the pointer like
ha_free() used to do, and even if not strictly needed it's a good
habit to get used to doing it.

The strings are returned as const so that they're stored as-is in
structs, and that nasty free() calls are easily caught. The pointer
points to the char[] storage inside the node itself. This way later
if we want to implement refcounting, it will be trivial to just look
up a string and change its associated node's refcount. If needed,
comparisons can also be made on pointers.

For now they're not used yet and are released on deinit().
2024-09-19 15:36:58 +02:00
Willy Tarreau
8df44eea6d BUILD: cebtree: silence a bogus gcc warning on impossible code paths
gcc-12 and above report a wrong warning about a negative length being
passed to memcmp() on an impossible code path when built at -O0. The
pattern is the same at a few places, basically:

  int foo(int op, const void *a, const void *b, size_t size, size_t arg)
  {
      if (op == 1) // arg is a strict multiple of size
          return memcmp(a, b, arg - size);
      return 0;
  }
  ...
  int bar()
  {
     return foo(0, a, b, sizeof(something), 0);
  }

It *might* be possible to invent dummy values for the "len" argument
above in the real code, but that significantly complexifies it and as
usual can easily result in introducing undesired bugs.

Here we take a different approach consisting in shutting the
-Wstringop-overread warning on gcc>=12 at -O0 since that's the only
condition that triggers it. The issue was reported to and confirmed by
the gcc team here:  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114622

No backport needed, but this should be upstreamed into cebtree after
checking that all involved macros are available.
2024-09-18 17:42:52 +02:00
Willy Tarreau
f793845f4a MEDIUM: clock: collect the monotonic time in clock_local_update_date()
Now we collect this clock in clock_local_update_date(), the closest from
the poller, which is also used when busy-polling, and the values is set
into the thread's curr_mono_time which did not exist before. Later,
clock_leaving_poll() just sets the prev_mono_time value from the curr_
one instead of retrieving the time at this specific point. It also means
that the monotonic time will now also cover the time needed to update
the global time, which should be negligible. Note that we don't collect
the CPU time in the clock_local_update_date() function even though it's
tempting, because when doing busy-polling, it would be collected on each
round while being useless.

Doing so will make sure that the local time always knows the monotonic
time when it is available.
2024-09-17 09:08:10 +02:00
Christopher Faulet
5fc12b0afd BUG/MEDIUM: sc_strm/applet: Wake applet after a successfull synchronous send
On a synchronous send from the stream to an applet, if some data were sent,
we must take care to wake the applet up. It is important because if
everything was sent at this stage, there is no other chance to wake the
applet up, mainly because SE_FL_WAIT_DATA flag is set on the applet's sedesc
in sc_update_tx() at the end of process_stream(). This flag prevent any
wakeup of the applet for a send event.

It is not necessary for a mux because the mux stream is called when a
syncrhonous send from the stream is performed. So it is reponsible to wake
the mux connection if necessary.

This patch must be backport to 3.0.
2024-09-16 22:55:40 +02:00
Willy Tarreau
5d350d1e50 OPTIM: vars: use multiple name heads in the vars struct
Given that the original list-based version was using a list head as the
root of the variables, while the tree is using a single pointer, it made
sense to reuse that space to place multiple roots, indexed on the lower
bits of the name hash. Two roots slightly increase the performance level,
but the best gain is obtained with 4 roots. The performance is now always
above that of the list, even with small counts, and with 100 vars, it's
21% higher than before, or 67% higher than with the list.

We keep the same lock (it could have made sense to use one lock per head),
because most of the variables in large configs are attached to a stream
or a session, hence are not shared between threads. Thus there's no point
in sharding the pointer.
2024-09-15 23:51:51 +02:00
Willy Tarreau
47ec7c681e OPTIM: vars: use a cebtree instead of a list for variable names
Configs involving many variables can start to eat a lot of CPU in name
lookups. The reason is that the names themselves are dynamic in that
they are relative to dynamic objects (sessions, streams, etc), so
there's no fixed index for example. The current implementation relies
on a standard linked list, and in order to speed up lookups and avoid
comparing strings, only a 64-bit hash of the variable's name is stored
and compared everywhere.

But with just 100 variables and 1000 accesses in a config, it's clearly
visible that variable name lookup can reach 56% CPU with a config
generated this way:

  for i in {0..100}; do
    printf "\thttp-request set-var(txn.var%04d) int(%d)" $i $i;
    for j in {1..10}; do [ $i -lt $j ] || printf ",add(txn.var%04d)" $((i-j)); done;
    echo;
  done

The performance and a 4-core skylake 4.4 GHz reaches 85k RPS with a perf
profile showing:

  Samples: 170K of event 'cycles', Event count (approx.): 142378815419
  Overhead  Shared Object            Symbol
    56.39%  haproxy                  [.] var_to_smp
     6.65%  haproxy                  [.] var_set.part.0
     5.76%  haproxy                  [.] sample_process_cnv
     3.23%  haproxy                  [.] sample_conv_var2smp
     2.88%  haproxy                  [.] sample_conv_arith_add
     2.33%  haproxy                  [.] __pool_alloc
     2.19%  haproxy                  [.] action_store
     2.13%  haproxy                  [.] vars_get_by_desc
     1.87%  haproxy                  [.] smp_dup

[above, var_to_smp() calls var_get() under the read lock].

By switching to a binary tree, the cost is significantly lower, the
performance reaches 117k RPS (+37%) with this profile:

  Samples: 170K of event 'cycles', Event count (approx.): 142323631229
  Overhead  Shared Object            Symbol
    40.22%  haproxy                  [.] cebu64_lookup
     7.12%  haproxy                  [.] sample_process_cnv
     6.15%  haproxy                  [.] var_to_smp
     4.75%  haproxy                  [.] cebu64_insert
     3.79%  haproxy                  [.] sample_conv_var2smp
     3.40%  haproxy                  [.] cebu64_delete
     3.10%  haproxy                  [.] sample_conv_arith_add
     2.36%  haproxy                  [.] action_store
     2.32%  haproxy                  [.] __pool_alloc
     2.08%  haproxy                  [.] vars_get_by_desc
     1.96%  haproxy                  [.] smp_dup
     1.75%  haproxy                  [.] var_set.part.0
     1.74%  haproxy                  [.] cebu64_first
     1.07%  [kernel]                 [k] aq_hw_read_reg
     1.03%  haproxy                  [.] pool_put_to_cache
     1.00%  haproxy                  [.] sample_process

The performance lowers a bit earlier than with the list however. What
can be seen is that the performance maintains a plateau till 25 vars,
starts degrading a little bit for the tree while it remains stable till
28 vars for the list. Then both cross at 42 vars and the list continues
to degrade doing a hyperbole while the tree resists better. The biggest
loss is at around 32 variables where the list stays 10% higher.

Regardless, given the extremely narrow band where the list is better, it
looks relevant to switch to this in order to preserve the almost linear
performance of large setups. For example at 1000 variables and 10k
lookups, the tree is 18 times faster than the list.

In addition this reduces the size of the struct vars by 8 bytes since
there's a single pointer, though it could make sense to re-invest them
into a secondary head for example.
2024-09-15 23:49:01 +02:00
Willy Tarreau
a0205f9de4 IMPORT: import cebtree (compact elastic binary trees)
This is an import of the compact elastic binary trees at commit
a9cd84a ("OPTIM: descent: better prefetch less and for writes when
deleting")

These will be used to replace certain lists (and possibly certain
tree nodes as well). They're as fast (or even faster) than ebtrees
for lookups, as fast for insertion and slower for deletion, and a
node only uses 2 pointers (like a list).

The only changes were cebtree.h where common/tools.h was replaced
with ebtree.h which we already have and already provides the needed
functions and macros, and the addition of a wrapper cebtree-prv.h in
src/ to redirect to import/cebtree-prv.h.
2024-09-15 23:44:59 +02:00
Willy Tarreau
6e92988e20 MINOR: vars: remove the emptiness tests in callers before pruning
All callers of vars_prune_* currently check the list for emptiness.
Let's leave that to vars_prune() itself, it will ease some changes in
the code. Thanks to the previous inlining of the vars_prune() function,
there's no performance loss, and even a very tiny 0.1% gain.
2024-09-15 23:44:16 +02:00
Willy Tarreau
2c1a9c3a43 OPTIM: vars: inline vars_prune() to avoid many calls
Many configs don't have variables and call it for no reason, and even
configs with variables don't necessarily have some in all scopes.
2024-09-15 23:42:09 +02:00
Willy Tarreau
b11495652e BUG/MEDIUM: queue: implement a flag to check for the dequeuing
As unveiled in GH issue #2711, commit 5541d4995d ("BUG/MEDIUM: queue:
deal with a rare TOCTOU in assign_server_and_queue()") does have some
side effects in that it can occasionally cause an endless loop.

As Christopher analysed it, the problem is that process_srv_queue(),
which uses a trylock in order to leave only one thread in charge of
the dequeueing process, can lose the lock race against pendconn_add().
If this happens on the last served request, then there's no more thread
to deal with the dequeuing, and assign_server_and_queue() will loop
forever on a condition that was initially exepected to be extremely
rare (and still is, except that now it can become sticky). Previously
what was happening is that such queued requests would just time out
and since that was very rare, nobody would notice.

The root of the problem really is that trylock. It was added so that
only one thread dequeues at a time but it doesn't offer only that
guarantee since it also prevents a thread from dequeuing if another
one is in the process of queuing. We need a different criterion.

What we're doing now is to set a flag "dequeuing" in the server, which
indicates that one thread is currently in the process of dequeuing
requests. This one is atomically tested, and only if no thread is in
this process, then the thread grabs the queue's lock and dequeues.
This way it will be serialized with pendconn_add() and no request
addition will be missed.

It is not certain whether the original race covered by the fix above
can still happen with this change, so better keep that fix for now.

Thanks to @Yenya (Jan Kasprzak) for the precise and complete report
allowing to spot the problem.

This patch should be backported wherever the patch above was backported.
2024-09-13 08:35:47 +02:00
Aurelien DARRAGON
68cfb222b5 BUG/MEDIUM: pattern: prevent UAF on reused pattern expr
Since c5959fd ("MEDIUM: pattern: merge same pattern"), UAF (leading to
crash) can be experienced if the same pattern file (and match method) is
used in two default sections and the first one is not referenced later in
the config. In this case, the first default section will be cleaned up.
However, due to an unhandled case in the above optimization, the original
expr which the second default section relies on is mistakenly freed.

This issue was discovered while trying to reproduce GH #2708. The issue
was particularly tricky to reproduce given the config and sequence
required to make the UAF happen. Hopefully, Github user @asmnek not only
provided useful informations, but since he was able to consistently
trigger the crash in his environment he was able to nail down the crash to
the use of pattern file involved with 2 named default sections. Big thanks
to him.

To fix the issue, let's push the logic from c5959fd a bit further. Instead
of relying on "do_free" variable to know if the expression should be freed
or not (which proved to be insufficient in our case), let's switch to a
simple refcounting logic. This way, no matter who owns the expression, the
last one attempting to free it will be responsible for freeing it.
Refcount is implemented using a 32bit value which fills a previous 4 bytes
structure gap:

        int                        mflags;               /*    80     4 */

        /* XXX 4 bytes hole, try to pack */

        long unsigned int          lock;                 /*    88     8 */
(output from pahole)

Even though it was not reproduced in 2.6 or below by @asmnek (the bug was
revealed thanks to another bugfix), this issue theorically affects all
stable versions (up to c5959fd), thus it should be backported to all
stable versions.
2024-09-09 16:07:05 +02:00
Aaron Kuehler
50322dff81 MEDIUM: server: add init-state
Allow the user to set the "initial state" of a server.

Context:

Servers are always set in an UP status by default. In
some cases, further checks are required to determine if the server is
ready to receive client traffic.

This introduces the "init-state {up|down}" configuration parameter to
the server.

- when set to 'fully-up', the server is considered immediately available
  and can turn to the DOWN sate when ALL health checks fail.
- when set to 'up' (the default), the server is considered immediately
  available and will initiate a health check that can turn it to the DOWN
  state immediately if it fails.
- when set to 'down', the server initially is considered unavailable and
  will initiate a health check that can turn it to the UP state immediately
  if it succeeds.
- when set to 'fully-down', the server is initially considered unavailable
  and can turn to the UP state when ALL health checks succeed.

The server's init-state is considered when the HAProxy instance
is (re)started, a new server is detected (for example via service
discovery / DNS resolution), a server exits maintenance, etc.

Link: https://github.com/haproxy/haproxy/issues/51
2024-09-05 11:13:10 +02:00
Ilya Shipitsin
1f6e5f7a61 CLEANUP: assorted typo fixes in the code and comments
This is 43rd iteration of typo fixes
2024-09-03 17:49:21 +02:00
Christopher Faulet
a7f6b0ac03 MEDIUM: stick-table: Add support of a factor for IN/OUT bytes rates
Add a factor parameter to stick-tables, called "brates-factor", that is
applied to in/out bytes rates to work around the 32-bits limit of the
frequency counters. Thanks to this factor, it is possible to have bytes
rates beyond the 4GB. Instead of counting each bytes, we count blocks
of bytes. Among other things, it will be useful for the bwlim filter, to be
able to configure shared limit exceeding the 4GB/s.

For now, this parameter must be in the range ]0-1024].
2024-09-02 15:50:25 +02:00
Aperence
20efb856e1 MEDIUM: protocol: add MPTCP per address support
Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension
that enables a TCP connection to use different paths.

Multipath TCP has been used for several use cases. On smartphones, MPTCP
enables seamless handovers between cellular and Wi-Fi networks while
preserving established connections. This use-case is what pushed Apple
to use MPTCP since 2013 in multiple applications [2]. On dual-stack
hosts, Multipath TCP enables the TCP connection to automatically use the
best performing path, either IPv4 or IPv6. If one path fails, MPTCP
automatically uses the other path.

To benefit from MPTCP, both the client and the server have to support
it. Multipath TCP is a backward-compatible TCP extension that is enabled
by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...).
Multipath TCP is included in the Linux kernel since version 5.6 [3]. To
use it on Linux, an application must explicitly enable it when creating
the socket. No need to change anything else in the application.

This attached patch adds MPTCP per address support, to be used with:

  mptcp{,4,6}@<address>[:port1[-port2]]

MPTCP v4 and v6 protocols have been added: they are mainly a copy of the
TCP ones, with small differences: names, proto, and receivers lists.

These protocols are stored in __protocol_by_family, as an alternative to
TCP, similar to what has been done with QUIC. By doing that, the size of
__protocol_by_family has not been increased, and it behaves like TCP.

MPTCP is both supported for the frontend and backend sides.

Also added an example of configuration using mptcp along with a backend
allowing to experiment with it.

Note that this is a re-implementation of Bjrn's work from 3 years ago
[4], when haproxy's internals were probably less ready to deal with
this, causing his work to be left pending for a while.

Currently, the TCP_MAXSEG socket option doesn't seem to be supported
with MPTCP [5]. This results in a warning when trying to set the MSS of
sockets in proto_tcp:tcp_bind_listener.

This can be resolved by adding two new variables:
sock_inet(6)_mptcp_maxseg_default that will hold the default
value of the TCP_MAXSEG option. Note that for the moment, this
will always be -1 as the option isn't supported. However, in the
future, when the support for this option will be added, it should
contain the correct value for the MSS, allowing to correctly
set the TCP_MAXSEG option.

Link: https://www.rfc-editor.org/rfc/rfc8684.html [1]
Link: https://www.tessares.net/apples-mptcp-story-so-far/ [2]
Link: https://www.mptcp.dev [3]
Link: https://github.com/haproxy/haproxy/issues/1028 [4]
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/515 [5]

Co-authored-by: Dorian Craps <dorian.craps@student.vinci.be>
Co-authored-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
2024-08-30 18:53:49 +02:00
Aperence
38618822e1 MINOR: server: add a alt_proto field for server
Add a new field alt_proto to the server structures that
specify if an alternate protocol should be used for this server.

This field can be transparently passed to protocol_lookup to get
an appropriate protocol structure.

This change allows thus to create servers with different protocols,
and not only TCP anymore.
2024-08-30 18:53:49 +02:00
Aperence
a7b04e383a MINOR: tools: extend str2sa_range to add an alt parameter
Add a new parameter "alt" that will store wether this configuration
use an alternate protocol.

This alt pointer will contain a value that can be transparently
passed to protocol_lookup to obtain an appropriate protocol structure.

This change is needed to allow for example the servers to know if it
need to use an alternate protocol or not.
2024-08-30 18:53:49 +02:00
Frederic Lecaille
f627b9272b BUG/MEDIUM: quic: always validate sender address on 0-RTT
It has been reported by Wedl Michael, a student at the University of Applied
Sciences St. Poelten, a potential vulnerability into haproxy as described below.

An attacker could have obtained a TLS session ticket after having established
a connection to an haproxy QUIC listener, using its real IP address. The
attacker has not even to send a application level request (HTTP3). Then
the attacker could open a 0-RTT session with a spoofed IP address
trusted by the QUIC listen to bypass IP allow/block list and send HTTP3 requests.

To mitigate this vulnerability, one decided to use a token which can be provided
to the client each time it successfully managed to connect to haproxy. These
tokens may be reused for future connections to validate the address/path of the
remote peer as this is done with the Retry token which is used for the current
connection, not the next one. Such tokens are transported by NEW_TOKEN frames
which was not used at this time by haproxy.

So, each time a client connect to an haproxy QUIC listener with 0-RTT
enabled, it is provided with such a token which can be reused for the
next 0-RTT session. If no such a token is presented by the client,
haproxy checks if the session is a 0-RTT one, so with early-data presented
by the client. Contrary to the Retry token, the decision to refuse the
connection is made only when the TLS stack has been provided with
enough early-data from the Initial ClientHello TLS message and when
these data have been accepted. Hopefully, this event arrives fast enough
to allow haproxy to kill the connection if some early-data have been accepted
without token presented by the client.

quic_build_post_handshake_frames() has been modified to build a NEW_TOKEN
frame with this newly implemented token to be transported inside.

quic_tls_derive_retry_token_secret() was renamed to quic_do_tls_derive_token_secre()
and modified to be reused and derive the secret for the new token implementation.

quic_token_validate() has been implemented to validate both the Retry and
the new token implemented by this patch. When this is a non-retry token
which could not be validated, the datagram received is marked as requiring
a Retry packet to be sent, and no connection is created.

When the Initial packet does not embed any non-retry token and if 0-RTT is enabled
the connection is marked with this new flag: QUIC_FL_CONN_NO_TOKEN_RCVD. As soon
as the TLS stack detects that some early-data have been provided and accepted by
the client, the connection is marked to be killed (QUIC_FL_CONN_TO_KILL) from
ha_quic_add_handshake_data(). This is done calling qc_ssl_eary_data_accepted()
new function. The secret TLS handshake is interrupted as soon as possible returnin
0 from ha_quic_add_handshake_data(). The connection is also marked as
requiring a Retry packet to be sent (QUIC_FL_CONN_SEND_RETRY) from
ha_quic_add_handshake_data(). The the handshake I/O handler (quic_conn_io_cb())
knows how to behave: kill the connection after having sent a Retry packet.

About TLS stack compatibility, this patch is supported by aws-lc. It is
disabled for wolfssl which does not support 0-RTT at this time thanks
to HAVE_SSL_0RTT_QUIC.

This patch depends on these commits:

     MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
     MINOR: quic: Implement qc_ssl_eary_data_accepted().
     MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct)
     BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder
     MINOR: quic: Token for future connections implementation.
     MINOR: quic: Implement quic_tls_derive_token_secret().
     MINOR: tools: Implement ipaddrcpy().

Must be backported as far as 2.6.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
609b124561 MINOR: quic: Implement qc_ssl_eary_data_accepted().
This function is a wrapper around SSL_get_early_data_status() for
OpenSSL derived stack and SSL_early_data_accepted() boringSSL derived
stacks like AWS-LC. It returns true for a TLS server if it has
accepted the early data received from a client.

Also implement quic_ssl_early_data_status_str() which is dedicated to be used
for debugging purposes (traces). This function converts the enum returned
by the two function mentionned above to a human readable string.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
e926378375 MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct)
Modify qf_new_token structure to use a static buffer with QUIC_TOKEN_LEN
as size as defined by the token for future connections (quic_token.c).
Modify consequently the NEW_TOKEN frame parser (see quic_parse_new_token_frame()).
Also add comments to denote that the NEW_TOKEN parser function is used only by
clients and that its builder is used only by servers.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
f5b09dc452 MINOR: quic: Token for future connections implementation.
There exist two sorts of token used by QUIC. They are both used to validate
the peer address (path validation). Retry are used for the current
connection the client want to open. This patch implement the other
sort of tokens which after having been received from a connection, may
be provided for the next connection from the same IP address to validate
it (or validate the network path between the client and the server).

The token generation is implemented by quic_generate_token(), and
the token validation by quic_token_chek(). The same method
is used as for Retry tokens to build such tokens to be reused for
future connections. The format is very simple: one byte for the format
identifier to distinguish these new tokens for the Retry token, followed
by a 32bits timestamps. As this part is ciphered with AEAD as cryptographic
algorithm, 16 bytes are needed for the AEAD tag. 16 more random bytes
are added to this token and a salt to derive the AEAD secret used
to cipher the token. In addition to this salt, this is the client IP address
which is used also as AAD to derive the AEAD secret. So, the length of
the token is fixed: 37 bytes.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
74caa0eece MINOR: quic: Implement quic_tls_derive_token_secret().
This is function is similar to quic_tls_derive_retry_token_secret().
Its aim is to derive the secret used to cipher the token to be used
for future connections.

This patch renames quic_tls_derive_retry_token_secret() to a more
and reuses its code to produce a more generic one: quic_do_tls_derive_token_secret().
Two arguments are added to this latter to produce both quic_tls_derive_retry_token_secret()
and quic_tls_derive_token_secret() new function which calls
quic_do_tls_derive_token_secret().
2024-08-30 17:04:09 +02:00
Frederic Lecaille
fb7a092203 MINOR: tools: Implement ipaddrcpy().
Implement ipaddrcpy() new function to copy only the IP address from
a sockaddr_storage struct object into a buffer.
2024-08-30 17:04:09 +02:00
Nicolas CARPi
a33407b499 CLEANUP: mqtt: fix typo in MQTT_REMAINING_LENGHT_MAX_SIZE
There was a typo in the macro name, where LENGTH was incorrectly
written. This didn't cause any issue because the typo appeared in all
occurrences in the codebase.
2024-08-30 14:58:59 +02:00
Christopher Faulet
62c9d51ca4 BUG/MINIR: proxy: Match on 429 status when trying to perform a L7 retry
Support for 429 was recently added to L7 retries (0d142e075 "MINOR: proxy:
Add support of 429-Too-Many-Requests in retry-on status"). But the
l7_status_match() function was not properly updated. The switch statement
must match the 429 status to be able to perform a L7 retry.

This patch must be backported if the commit above is backported. It is
related to #2687.
2024-08-30 12:13:32 +02:00
Christopher Faulet
0d142e0756 MINOR: proxy: Add support of 429-Too-Many-Requests in retry-on status
The "429" status can now be specified on retry-on directives. PR_RE_* flags
were updated to remains sorted.

This patch should fix the issue #2687. It is quite simple so it may safely
be backported to 3.0 if necessary.
2024-08-28 10:05:34 +02:00
William Lallemand
e8fecef0ff MEDIUM: ssl: capture the signature_algorithms extension from Client Hello
Activate the capture of the TLS signature_algorithms extension from the
Client Hello. This list is stored in the ssl_capture buffer when the
global option "tune.ssl.capture-cipherlist-size" is enabled.
2024-08-26 15:17:40 +02:00
William Lallemand
ce7fb6628e MEDIUM: ssl: capture the supported_versions extension from Client Hello
Activate the capture of the TLS supported_versions extension from the
Client Hello. This list is stored in the ssl_capture buffer when the
global option "tune.ssl.capture-cipherlist-size" is enabled.
2024-08-26 15:12:42 +02:00
Valentine Krasnobaeva
7b78e1571b MINOR: mworker: restore initial env before wait mode
This patch is the follow-up of 1811d2a6ba (MINOR: tools: add helpers to
backup/clean/restore env).

In order to avoid unexpected behaviour in master-worker mode during the process
reload with a new configuration, when the old one has contained '*env' keywords,
let's backup its initial environment before calling parse_cfg() and let's clean
and restore it in the context of master process, just before it enters in a wait
polling loop.

This will garantee that new workers will have a new updated environment and not
the previous one inherited from the master, which does not read the configuration,
when it's in a wait-mode.
2024-08-23 17:06:59 +02:00
Valentine Krasnobaeva
1811d2a6ba MINOR: tools: add helpers to backup/clean/restore env
'setenv', 'presetenv', 'unsetenv', 'resetenv' keywords in configuration could
modify the process runtime environment. In case of master-worker mode this
creates a problem, as the configuration is read only once before the forking a
worker and then the master process does the reexec without reading any config
files, just to free the memory. So, during the reload a new worker process will
be created, but it will inherited the previous unchanged environment from the
master in wait mode, thus it won't benefit the changes in configuration,
related to '*env' keywords. This may cause unexpected behavior or some parser
errors in master-worker mode.

So, let's add a helper to backup all process env variables just before it will
read its configuration. And let's also add helpers to clean up the current
runtime environment and to restore it to its initial state (as it was before
parsing the config).
2024-08-23 17:06:33 +02:00
Willy Tarreau
2a799b64b0 MINOR: protocol: add the real address family to the protocol
For custom families, there's sometimes an underlying real address and
it would be nice to be able to directly use the real family in calls
to bind() and connect() without having to add explicit checks for
exceptions everywhere.

Let's add a .real_family field to struct proto_fam for this. For now
it's always equal to the family except for non-transferable ones such
as rhttp where it's equal to the custom one (anything else could fit).
2024-08-21 17:37:46 +02:00
Willy Tarreau
ba4a416c66 MINOR: protocol: add a family lookup
At plenty of places we have access to an address family which may
include some custom addresses but we cannot simply convert them to
the real families without performing some random protocol lookups.

Let's simply add a proto_fam table like we have for the protocols.
The protocols could even be indexed there, but for now it's not worth
it.
2024-08-21 16:46:15 +02:00
Willy Tarreau
732913f848 MINOR: protocol: properly assign the sock_domain and sock_family
When we finally split sock_domain from sock_family in 2.3, something
was not cleanly finished. The family is what should be stored in the
address while the domain is what is supposed to be passed to socket().
But for the custom addresses, we did the opposite, just because the
protocol_lookup() function was acting on the domain, not the family
(both of which are equal for non-custom addresses).

This is an API bug but there's no point backporting it since it does
not have visible effects. It was visible in the code since a few places
were using PF_UNIX while others were comparing the domain against AF_MAX
instead of comparing the family.

This patch clarifies this in the comments on top of proto_fam, addresses
the indexing issue and properly reconfigures the two custom families.
2024-08-21 16:46:15 +02:00
Willy Tarreau
67bf1d6c9e MINOR: quic: support a tolerance for spurious losses
Tests performed between a 1 Gbps connected server and a 100 mbps client,
distant by 95ms showed that:

  - we need 1.1 MB in flight to fill the link
  - rare but inevitable losses are sufficient to make cubic's window
    collapse fast and long to recover
  - a 100 MB object takes 69s to download
  - tolerance for 1 loss between two ACKs suffices to shrink the download
    time to 20-22s
  - 2 losses go to 17-20s
  - 4 losses reach 14-17s

At 100 concurrent connections that fill the server's link:
  - 0 loss tolerance shows 2-3% losses
  - 1 loss tolerance shows 3-5% losses
  - 2 loss tolerance shows 10-13% losses
  - 4 loss tolerance shows 23-29% losses

As such while there can be a significant gain sometimes in setting this
tolerance above zero, it can also significantly waste bandwidth by sending
far more than can be received. While it's probably not a solution to real
world problems, it repeatedly proved to be a very effective troubleshooting
tool helping to figure different root causes of low transfer speeds. In
spirit it is comparable to the no-cc congestion algorithm, i.e. it must
not be used except for experimentation.
2024-08-21 08:34:30 +02:00
Willy Tarreau
fab0e99aa1 MINOR: quic: store the lost packets counter in the quic_cc_event element
Upon loss detection, qc_release_lost_pkts() notifies congestion
controllers about the event and its final time. However it does not
pass the number of lost packets, that can provide useful hints for
some controllers. Let's just pass this option.
2024-08-21 08:02:44 +02:00
Amaury Denoyelle
0d6112b40b MINOR: mux-quic: retry after small buf alloc failure
Previous commit switch to small buffers for HTTP/3 HEADERS emission.
This ensures that several parallel streams can allocate their own buffer
without hitting the connection buffer limit based now on the congestion
window size.

However, this prevents the transmission of responses with uncommonly
large headers. Indeed, if all headers cannot be encoded in a single
buffer, an error is reported which cause the whole connection closure.

Adjust this by implementing a realloc API exposed by QUIC MUX. This
allows application layer to switch from a small to a default buffer and
restart its processing. This guarantees that again headers not longer
than bufsize can be properly transferred.
2024-08-20 18:12:27 +02:00
Amaury Denoyelle
885e4c5cf8 MINOR: quic: support sbuf allocation in quic_stream
This patch extends qc_stream_desc API to be able to allocate small
buffers. QUIC MUX API is similarly updated as ultimatly each application
protocol is responsible to choose between a default or a smaller buffer.

Internally, the type of allocated buffer is remembered via qc_stream_buf
instance. This is mandatory to ensure that the buffer is released in the
correct pool, in particular as small and standard buffers can be
configured with the same size.

This commit is purely an API change. For the moment, small buffers are
not used. This will changed in a dedicated patch.
2024-08-20 18:12:27 +02:00
Amaury Denoyelle
d0d8e57d47 MINOR: quic: define sbuf pool
Define a new buffer pool reserved to allocate smaller memory area. For
the moment, its usage will be restricted to QUIC, as such it is declared
in quic_stream module.

Add a new config option "tune.bufsize.small" to specify the size of the
allocated objects. A special check ensures that it is not greater than
the default bufsize to avoid unexpected effects.
2024-08-20 18:12:27 +02:00
Amaury Denoyelle
1de5f718cf MINOR: quic/config: adapt settings to new conn buffer limit
QUIC MUX buffer allocation limit is now directly based on the underlying
congestion window size. previous static limit based on conn-tx-buffers
is now unused. As such, this commit adds a warning to users to prevent
that it is now obsolete.

Secondly, update max-window-size setting. It is now the main entrypoint
to limit both the maximum congestion window size and the number of QUIC
MUX allocated buffer on emission. Remove its special value '0' which was
used to automatically adjust it on now unused conn-tx-buffers.
2024-08-20 17:59:35 +02:00
Amaury Denoyelle
aeb8c1ddc3 MAJOR: mux-quic: allocate Tx buffers based on congestion window
Each QUIC MUX may allocate buffers for MUX stream emission. These
buffers are then shared with quic_conn to handle ACK reception and
retransmission. A limit on the number of concurrent buffers used per
connection has been defined statically and can be updated via a
configuration option. This commit replaces the limit to instead use the
current underlying congestion window size.

The purpose of this change is to remove the artificial static buffer
count limit, which may be difficult to choose. Indeed, if a connection
performs with minimal loss rate, the buffer count would limit severely
its throughput. It could be increase to fix this, but it also impacts
others connections, even with less optimal performance, causing too many
extra data buffering on the MUX layer. By using the dynamic congestion
window size, haproxy ensures that MUX buffering corresponds roughly to
the network conditions.

Using QCC <buf_in_flight>, a new buffer can be allocated if it is less
than the current window size. If not, QCS emission is interrupted and
haproxy stream layer will subscribe until a new buffer is ready.

One of the criticals parts is to ensure that MUX layer previously
blocked on buffer allocation is properly woken up when sending can be
retried. This occurs on two occasions :

* after an already used Tx buffer is cleared on ACK reception. This case
  is already handled by qcc_notify_buf() via quic_stream layer.

* on congestion window increase. A new qcc_notify_buf() invokation is
  added into qc_notify_send().

Finally, remove <avail_bufs> QCC field which is now unused.

This commit is labelled MAJOR as it may have unexpected effect and could
cause significant behavior change. For example, in previous
implementation QUIC MUX would be able to buffer more data even if the
congestion window is small. With this patch, data cannot be transferred
from the stream layer which may cause more streams to be shut down on
client timeout. Another effect may be more CPU consumption as the
connection limit would be hit more often, causing more streams to be
interrupted and woken up in cycle.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
000976af58 MINOR: mux-quic: define buf_in_flight
Define a new QCC counter named <buf_in_flight>. Its purpose is to
account the current sum of all allocated stream buffer size used on
emission.

For this moment, this counter is updated and buffer allocation and
deallocation. It will be used to replace <avail_bufs> once congestion
window is used as limit for buffer allocation in a future commit.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
4c4bf26f44 MEDIUM: mux-quic: implement API to ignore txbuf limit for some streams
Define a new qc_stream_desc flag QC_SD_FL_OOB_BUF. This is to mark
streams which are not subject to the connection limit on allocated MUX
stream buffer.

The purpose is to simplify handling of QUIC MUX streams which do not
transfer data and as such are not driven by haproxy layer, for example
HTTP/3 control stream. These streams interacts synchronously with QUIC
MUX and cannot retry emission in case of temporary failure.

This commit will be useful once connection buffer allocation limit is
reimplemented to directly rely on the congestion window size. This will
probably cause the buffer limit to be reached more frequently, maybe
even on QUIC MUX initialization. As such, it will be possible to mark
control streams and prevent them to be subject to the buffer limit.

QUIC MUX expose a new function qcs_send_metadata(). It can be used by an
application protocol to specify which streams are used for control
exchanges. For the moment, no such stream use this mechanism.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
f4d1bd0b76 MINOR: mux-quic: account stream txbuf in QCC
A limit per connection is put on the number of buffers allocated by QUIC
MUX for emission accross all its streams. This ensures memory
consumption remains under control. This limit is simply explained as a
count of buffers which can be concurrently allocated for each
connection.

As such, quic_conn structure was used to account currently allocated
buffers. However, a quic_conn nevers allocates new stream buffers. This
is only done at QUIC MUX layer. As such, this commit moves buffer
accounting inside QCC structure. This simplifies the API, most notably
qc_stream_buf_alloc() usage.

Note that this commit inverts the accounting. Previously, it was
initially set to 0 and increment for each allocated buffer. Now, it is
set to the maximum value and decrement for each buf usage. This is
considered as clearer to use.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
c24c8667b2 MINOR: quic: define max-window-size config setting
Define a new global keyword tune.quic.frontend.max-window-size. This
allows to set globally the maximum congestion window size for each QUIC
frontend connections.

The default value is 0. It is a special value which automatically derive
the size from the configured QUIC connection buffer limit. This is
similar to the previous "quic-cc-algo" behavior, which can be used to
override the maximum window size per bind line.
2024-08-20 17:02:29 +02:00
Valentine Krasnobaeva
8b1dfa9def MINOR: cfgparse: limit file size loaded via /dev/stdin
load_cfg_in_mem() can continuously reallocate memory in order to load an
extremely large input from /dev/stdin, until it fails with ENOMEM, which means
that process has consumed all available RAM. In case of containers and
virtualized environments it's not very good.

So, in order to prevent this, let's introduce MAX_CFG_SIZE as 10MB, which will
limit the size of input supplied via /dev/stdin.
2024-08-20 14:28:34 +02:00
Nathan Wehrman
fd48b28315 MINOR: Implements new log format of option tcplog clf
Some systems require log formats in the CLF format and that meant that I
could not send my logs for proxies in mode tcp to those servers.  This
implements a format that uses log variables that are compatble with TCP
mode frontends and replaces traditional HTTP values in the CLF format
to make them stand out. Instead of logging method and URI like this
"GET /example HTTP/1.1" it will log "TCP " and for a response code I
used "000" so it would be easy to separate from legitimate HTTP
traffic. Now your log servers that require a CLF format can see the
timings for TCP traffic as well as HTTP.
2024-08-20 07:46:34 +02:00
Aurelien DARRAGON
f8299bc5ea MINOR: log: "drop" support for log-profile steps
It is now possible to use "drop" keyword for "on" lines under a
log-profile section to specify that no log at all should be emitted for
the specified step (setting an empty format was not sufficient to do so
because only the log payload would be empty, not the log header, thus the
log would still be emitted).

It may be useful to selectively disable logging at specific steps for a
given log target (since the log profile may be set on log directives):

log-profile myprof
  on request format "blabla" sd "custom sd"
  on response drop

New testcase was added to reg-tests/log/log_profiles.vtc
2024-08-19 18:53:01 +02:00
William Lallemand
b2a8e8731d MINOR: channel: implement ci_insert() function
ci_insert() is a function which allows to insert a string <str> of size
<len> at <pos> of the input buffer. This is the equivalent of
ci_insert_line2() but without inserting '\r\n'
2024-08-08 17:29:37 +02:00
Valentine Krasnobaeva
c6cfa7cb4a MINOR: startup: rename readcfgfile in parse_cfg
As readcfgfile no longer opens configuration files and reads them with fgets,
but performs only the parsing of provided data, let's rename it to parse_cfg by
analogy with read_cfg in haproxy.c.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
5b52df4c4d MEDIUM: startup: load and parse configs from memory
Let's call load_cfg_in_ram() helper for each configuration file to load it's
content in some area in memory. Adapt readcfgfile() parser function
respectively. In order to limit changes in its scope we give as an argument a
cfgfile structure, already filled in init_args() and in load_cfg_in_ram() with
file metadata and content.

Parser function (readcfgfile()) uses now fgets_from_mem() instead of standard
fgets from libc implementations.

SPOE filter parses its own configuration file, pointed by 'config' keyword in
the configuration already loaded in memory. So, let's allocate and fill for
this a supplementary cfgfile structure, which is not referenced in cfg_cfgfiles
list. This structure and the memory with content of SPOE filter configuration
are freed immediately in parse_spoe_flt(), when readcfgfile() returns.

HAProxy OpenTracing filter also uses its own configuration file. So, let's
follow the same logic as we do for SPOE filter.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
007f7f2f02 MINOR: tools: add fgets_from_mem
Add fgets_from_mem() helper to read lines from configuration files, stored now
as memory chunks. In order to limit changes in the first-level parser code
(readcfgfile()), it is better to reimplement the standard fgets, i.e. to
have a fgets, which can read the serialized data line by line from some memory
area, instead of file stream, and can keep the same behaviour as libc
implementations fgets.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
5b9ed6e4be MINOR: cfgparse: add load_cfg_in_mem
Add load_cfg_in_mem() helper, which allows to store the content of a given file
in memory.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
bafb0ce272 MINOR: startup: adapt list_append_word to use cfgfile
list_append_word() helper was used before only to chain configuration file names
in a list. As now we start to use cfgfile structure which represents entire file
in memory and its metadata, let's adapt this helper to use this structure and
let's rename it to list_append_cfgfile().

Adapt functions, which process configuration files and directories to use
cfgfile structure and list_append_cfgfile() instead of wordlist.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
39f2a19620 REORG: tools: move list_append_word to cfgparse
Let's move list_append_word to cfgparse.c as it is used only to fill
cfg_cfgfiles list with configuration file names.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
70b842e847 MINOR: cfgparse: add struct cfgfile to represent config in memory
This and following commits serve to prepare loading configuration files in
memory, before parsing them, as we may need to parse some parts of
configuration in different moments of the startup sequence. This is a case of
the new master-worker initialization process. Here we need to read at first
only the global and the program sections and only after some steps
(forking worker, etc) the rest of the configuration.

Add a new structure cfgfile to keep configuration files metadata and content,
loaded somewhere in a memory. Instances of filled cfgfile structures could be
chained in a list, as the order in which they were loaded is important.
2024-08-07 18:41:41 +02:00
Willy Tarreau
10c8baca44 MINOR: trace: add a per-source helper to pre-fill the context
Now sources which want to do it can provide a helper that can pre-fill
some fields in the context based on their knowledge (e.g. mux streams).
2024-08-07 16:02:59 +02:00
Willy Tarreau
7d55a70f5a MINOR: trace: move the known trace context into a dedicated struct
We now have a trace_ctx to hold the sess, conn, qc, stream and so on.
This will allow us to pass it across layers so that other helpers can
help fill them.

Ideally it should be passed as an argument to __trace_enabled() by
__trace() so that it can be passed back to the trace callback. But
it seems that trace callbacks are smart enough to figure all their
info when they need them.
2024-08-07 16:02:59 +02:00
Willy Tarreau
d465610ec3 MEDIUM: trace: implement a "follow" mechanism
With "follow" from one source to another, it becomes possible for a
source to automatically follow another source's tracked pointer. The
best example is the session:
  - the "session" source is enabled and has a "lockon session"
    -> its lockon_ptr is equal to the session when valid
  - other sources (h1,h2,h3 etc) are configured for "follow session"
    and will then automatically check if session's lockon_ptr matches
    its own session, in which case tracing will be enabled for that
    trace (no state change).

It's not necessary to start/pause/stop traces when using this, only
"follow" followed by a source with lockon enabled is needed. Some
combinations might work better than others. At the moment the session
is almost never known from the backend, but this may improve.

The meta-source "all" is supported for the follower so that all sources
will follow the tracked one.
2024-08-07 16:02:59 +02:00
Amaury Denoyelle
9f829ea3f3 MINOR: mux-quic: measure QCS lifetime and its blocking state
Reuse newly defined tot_time structure to measure various values related
to a QCS lifetime.

First, a timer is used to comptabilize the total QCS lifetime. Then, two
other timers are used to account the total time during which Tx from
stream layer to MUX is blocked, either on lack of buffer or due to
flow-control.

These three timers are reported in qmux_dump_qcs_info(). Thus, they are
available in traces and for QUIC MUX debug string sample.
2024-08-07 15:40:52 +02:00
Amaury Denoyelle
a6e2523ca1 MINOR: time: define tot_time structure
Define a new utility type tot_time. Its purpose is to be able to account
elapsed time accross multiple periods. Functions are defined to easily
start and stop measures, and return the current value.
2024-08-07 15:40:52 +02:00
Amaury Denoyelle
663416b4ef MINOR: quic: dump quic_conn debug string for logs
Define a new xprt_ops callback named dump_info. This can be used to
extend MUX debug string with infos from the lower layer.

Implement dump_info for QUIC stack. For now, only minimal info are
reported : bytes in flight and size of the sending window. This should
allow to detect if the congestion controller is fine. These info are
reported via QUIC MUX debug string sample.
2024-08-07 15:40:52 +02:00
Amaury Denoyelle
eb4dfa3b36 MINOR: mux-quic: define dump functions for QCC and QCS
Extract trace code to dump QCC and QCS instances into dedicated
functions named qmux_dump_qc{c,s}_info(). This will allow to easily
print QCC/QCS infos outside of traces.
2024-08-07 15:40:52 +02:00
Willy Tarreau
921e04bf87 MINOR: stconn: add a new pair of sf functions {bs,fs}.debug_str
These are passed to the underlying mux to retrieve debug information
at the mux level (stream/connection) as a string that's meant to be
added to logs.

The API is quite complex just because we can't pass any info to the
bottom function. So we construct a union and pass the argument as an
int, and expect the callee to fill that with its buffer in return.

Most likely the mux->ctl and ->sctl API should be reworked before
the release to simplify this.

The functions take an optional argument that is a bit mask of the
layers to dump:
  muxs=1
  muxc=2
  xprt=4
  conn=8
  sock=16

The default (0) logs everything available.
2024-08-07 14:07:41 +02:00
Amaury Denoyelle
e177cf341c BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM
STREAM frames have dedicated handling on retransmission. A special check
is done to remove data already acked in case of duplicated frames, thus
only unacked data are retransmitted.

This handling is faulty in case of an empty STREAM frame with FIN set.
On retransmission, this frame does not cover any unacked range as it is
empty and is thus discarded. This may cause the transfer to freeze with
the client waiting indefinitely for the FIN notification.

To handle retransmission of empty FIN STREAM frame, qc_stream_desc layer
have been extended. A new flag QC_SD_FL_WAIT_FOR_FIN is set by MUX QUIC
when FIN has been transmitted. If set, it prevents qc_stream_desc to be
freed until FIN is acknowledged. On retransmission side,
qc_stream_frm_is_acked() has been updated. It now reports false if
FIN bit is set on the frame and qc_stream_desc has QC_SD_FL_WAIT_FOR_FIN
set.

This must be backported up to 2.6. However, this modifies heavily
critical section for ACK handling and retransmission. As such, it must
be backported only after a period of observation.

This issue can be reproduced by using the following socat command as
server to add delay between the response and connection closure :
  $ socat TCP-LISTEN:<port>,fork,reuseaddr,crlf SYSTEM:'echo "HTTP/1.1 200 OK"; echo ""; sleep 1;'

On the client side, ngtcp2 can be used to simulate packet drop. Without
this patch, connection will be interrupted on QUIC idle timeout or
haproxy client timeout with ERR_DRAINING on ngtcp2 :
  $ ngtcp2-client --exit-on-all-streams-close -r 0.3 <host> <port> "http://<host>:<port>/?s=32o"

Alternatively to ngtcp2 random loss, an extra haproxy patch can also be
used to force skipping the emission of the empty STREAM frame :

diff --git a/include/haproxy/quic_tx-t.h b/include/haproxy/quic_tx-t.h
index efbdfe687..1ff899acd 100644
--- a/include/haproxy/quic_tx-t.h
+++ b/include/haproxy/quic_tx-t.h
@@ -26,6 +26,8 @@ extern struct pool_head *pool_head_quic_cc_buf;
 /* Flag a sent packet as being probing with old data */
 #define QUIC_FL_TX_PACKET_PROBE_WITH_OLD_DATA (1UL << 5)

+#define QUIC_FL_TX_PACKET_SKIP_SENDTO (1UL << 6)
+
 /* Structure to store enough information about TX QUIC packets. */
 struct quic_tx_packet {
 	/* List entry point. */
diff --git a/src/quic_tx.c b/src/quic_tx.c
index 2f199ac3c..2702fc9b9 100644
--- a/src/quic_tx.c
+++ b/src/quic_tx.c
@@ -318,7 +318,7 @@ static int qc_send_ppkts(struct buffer *buf, struct ssl_sock_ctx *ctx)
 		tmpbuf.size = tmpbuf.data = dglen;

 		TRACE_PROTO("TX dgram", QUIC_EV_CONN_SPPKTS, qc);
-		if (!skip_sendto) {
+		if (!skip_sendto && !(first_pkt->flags & QUIC_FL_TX_PACKET_SKIP_SENDTO)) {
 			int ret = qc_snd_buf(qc, &tmpbuf, tmpbuf.data, 0, gso);
 			if (ret < 0) {
 				if (gso && ret == -EIO) {
@@ -354,6 +354,7 @@ static int qc_send_ppkts(struct buffer *buf, struct ssl_sock_ctx *ctx)
 					qc->cntrs.sent_bytes_gso += ret;
 			}
 		}
+		first_pkt->flags &= ~QUIC_FL_TX_PACKET_SKIP_SENDTO;

 		b_del(buf, dglen + QUIC_DGRAM_HEADLEN);
 		qc->bytes.tx += tmpbuf.data;
@@ -2066,6 +2067,17 @@ static int qc_do_build_pkt(unsigned char *pos, const unsigned char *end,
 				continue;
 			}

+			switch (cf->type) {
+			case QUIC_FT_STREAM_8 ... QUIC_FT_STREAM_F:
+				if (!cf->stream.len && (qc->flags & QUIC_FL_CONN_TX_MUX_CONTEXT)) {
+					TRACE_USER("artificially drop packet with empty STREAM frame", QUIC_EV_CONN_TXPKT, qc);
+					pkt->flags |= QUIC_FL_TX_PACKET_SKIP_SENDTO;
+				}
+				break;
+			default:
+				break;
+			}
+
 			quic_tx_packet_refinc(pkt);
 			cf->pkt = pkt;
 		}
2024-08-07 11:03:32 +02:00
Amaury Denoyelle
714009b7bc MINOR: quic: implement function to check if STREAM is fully acked
When a STREAM frame is retransmitted, a check is performed to remove
range of data already acked from it. This is useful when STREAM frames
are duplicated and splitted to cover different data ranges. The newly
retransmitted frame contains only unacked data.

This process is performed similarly in qc_dup_pkt_frms() and
qc_build_frms(). Refactor the code into a new function named
qc_stream_frm_is_acked(). It returns true if frame data are already
fully acked and retransmission can be avoided. If only a partial range
of data is acknowledged, frame content is updated to only cover the
unacked data.

This patch does not have any functional change. However, it simplifies
retransmission for STREAM frames. Also, it will be reused to fix
retransmission for empty STREAM frames with FIN set from the following
patch :
  BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM

As such, it must be backported prior to it.
2024-08-07 10:57:10 +02:00
Amaury Denoyelle
bb9ac256a1 MINOR: quic: convert qc_stream_desc release field to flags
qc_stream_desc had a field <release> used as a boolean. Convert it with
a new <flags> field and QC_SD_FL_RELEASE value as equivalent.

The purpose of this patch is to be able to extend qc_stream_desc by
adding newer flags values. This patch is required for the following
patch
  BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM

As such, it must be backported prior to it.
2024-08-06 18:00:17 +02:00
Amaury Denoyelle
7b89aa5b19 BUG/MINOR: h1: do not forward h2c upgrade header token
haproxy supports tunnel establishment through HTTP Upgrade mechanism.
Since the following commit, extended CONNECT is also supported for
HTTP/2 both on frontend and backend side.

  commit 9bf957335e
  MEDIUM: mux_h2: generate Extended CONNECT from htx upgrade

As specified by HTTP/2 rfc, "h2c" can be used by an HTTP/1.1 client to
request an upgrade to HTTP/2. In haproxy, this is not supported so it
silently ignores this. However, Connection and Upgrade headers are
forwarded as-is on the backend side.

If using HTTP/1 on the backend side and the server supports this upgrade
mechanism, haproxy won't be able to parse the HTTP response. If using
HTTP/2, mux backend tries to incorrectly convert the request to an
Extended CONNECT with h2c protocol, which may also prevent the response
to be transmitted.

To fix this, flag HTTP/1 request with "h2c" or "h2" token in an upgrade
header. On converting the header list to HTX, the upgrade header is
skipped if any of this token is present and the H1_MF_CONN_UPG flag is
removed.

This issue can easily be reproduced using curl --http2 argument to
connect to an HTTP/1 frontend.

This must be backported up to 2.4 after a period of observation.
2024-08-01 18:23:32 +02:00
Amaury Denoyelle
4b0bda42f7 MINOR: flags/mux-quic: decode qcc and qcs flags
Decode QUIC MUX connection and stream elements via qcc_show_flags() and
qcs_show_flags(). Flags definition have been moved outside of USE_QUIC
to ease compilation of flags binary.
2024-07-31 17:59:35 +02:00
Frederic Lecaille
1733dff42a MINOR: tcp_sample: Move TCP low level sample fetch function to control layer
Add ->get_info() new control layer callback definition to protocol struct to
retreive statiscal counters information at transport layer (TCPv4/TCPv6) identified by
an integer into a long long int.
Move the TCP specific code from get_tcp_info() to the tcp_get_info() control layer
function (src/proto_tcp.c) and define it as  the ->get_info() callback for
TCPv4 and TCPv6.
Note that get_tcp_info() is called for several TCP sample fetches.
This patch is useful to support some of these sample fetches for QUIC and to
keep the code simple and easy to maintain.
2024-07-31 10:29:42 +02:00
William Lallemand
f76e8e50f4 BUILD: ssl: replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC
Replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC in the code source, so we
won't need to set USE_OPENSSL_AWSLC in the Makefile on the long term.
2024-07-30 18:53:08 +02:00
William Lallemand
56eefd6827 BUG/MEDIUM: ssl: reactivate 0-RTT for AWS-LC
Then reactivate HAVE_SSL_0RTT and HAVE_SSL_0RTT_QUIC for AWS-LC, which
were wrongly deactivated in f5353f2c ("MINOR: ssl: add HAVE_SSL_0RTT
constant").

Must be backported to 3.0.
2024-07-30 18:53:08 +02:00
Willy Tarreau
1a8f3a368f MINOR: queue: add a function to check for TOCTOU after queueing
There's a rare TOCTOU case that happens from time to time with maxconn 1
and multiple threads. Between the moment we see the queue full and the
moment we queue a request, it's possible that the last request on the
server or proxy ended and that no other one is left to offer it its place.

Given that all this code path is performance-critical and we cannot afford
to increase the lock duration, better recheck for the condition after
queueing. For this we need to be able to check for the condition and
cleanly dequeue a request. That's what this patch provides via the new
function pendconn_must_try_again(). It will catch more requests than
absolutely needed though it will catch them all. It may find that around
1/1000 of requests are at risk, though testing shows that in practice,
it's around 1 per million that really gets stuck (other ones benefit
from timing and finishing late requests). Maybe in the future some
conditions might be refined but it's harmless.

What happens to such requests is that they're dequeued and their pendconn
freed, so that the caller can decide to try to LB or queue them again. For
now the function is not used, it's just added separately for easier tracking.
2024-07-29 09:27:01 +02:00
Frederic Lecaille
76ff8afa2d MINOR: quic: Add information to "show quic" for CUBIC cc.
Add ->state_cli() new callback to quic_cc_algo struct to define a
function called by the "show quic (cc|full)" commands to dump some information
about the congestion algorithm internal state currently in use by the QUIC
connections.

Implement this callback for CUBIC algorithm to dump its internal variables:
   - K: (the time to reach the cubic curve inflexion point),
   - last_w_max: the last maximum window value reached before intering
     the last recovery period. This is also the window value at the
     inflexion point of the cubic curve,
   - wdiff: the difference between the current window value and last_w_max.
     So negative before the inflexion point, and positive after.
2024-07-26 16:42:44 +02:00
Willy Tarreau
2dab1ba84b MEDIUM: h1: allow to preserve keep-alive on T-E + C-L
In 2.5-dev9, commit 631c7e866 ("MEDIUM: h1: Force close mode for invalid
uses of T-E header") enforced a recently arrived new security rule in the
HTTP specification aiming at preventing a class of content-smuggling
attacks involving HTTP/1.0 agents. It consists in handling the very rare
T-E + C-L requests or responses in close mode.

It happens it does have an impact of a rare few and very old clients
(probably running insecure TLS stacks by the way) that continue to send
both with their POST requests. The impact is that for each and every
request they'll have to reconnect, possibly negotiating a full TLS
handshake that becomes harmful to the machine in terms of CPU computation.

This commit adds a new option "h1-do-not-close-on-insecure-transfer-encoding"
that does exactly what it says, it just asks not to close on such messages,
even though the message continues to be sanitized and C-L dropped. It means
that the risk is only between the sender and haproxy, which is limited, and
might be the only acceptable solution for such environments having to deal
with broken implementations.

The cases are so rare that it should not need to be backported, or in the
worst case, to the latest LTS if there is any demand.
2024-07-26 15:59:35 +02:00
Amaury Denoyelle
08515af9df MINOR: quic: implement send-retry quic-initial rules
Define a new quic-initial "send-retry" rule. This allows to force the
emission of a Retry packet on an initial without token instead of
instantiating a new QUIC connection.
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
69d7e9f3b7 MINOR: quic: implement reject quic-initial action
Define a new quic-initial action named "reject". Contrary to dgram-drop,
the client is notified of the rejection by a CONNECTION_CLOSE with
CONNECTION_REFUSED error code.

To be able to emit the necessary CONNECTION_CLOSE frame, quic_conn is
instantiated, contrary to dgram-drop action. quic_set_connection_close()
is called immediatly after qc_new_conn() which prevents the handshake
startup.
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
f91be2657e MINOR: quic: pass quic_dgram as obj_type for quic-initial rules
To extend quic-initial rules, pass quic_dgram instance to argument for
the various actions. As such, quic_dgram is now supported as an obj_type
and can be used in session origin field.
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
1259700763 MINOR: quic: support ACL for quic-initial rules
Add ACL condition support for quic-initial rules. This requires the
extension of quic_parse_quic_initial() to parse an extra if/unless
block.

Only layer4 client samples are allowed to be used with quic-initial
rules. However, due to the early execution of quic-initial rules prior
to any connection instantiation, some samples are non supported.

To be able to use the 4 described samples, a dummy session is
instantiated before quic-initial rules execution. Its src and dst fields
are set from the received datagram values.
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
cafe596608 MEDIUM: quic: implement quic-initial rules
Implement a new set of rules labelled as quic-initial.

These rules as specific to QUIC. They are scheduled to be executed early
on Initial packet parsing, prior a new QUIC connection instantiation.
Contrary to tcp-request connection, this allows to reject traffic
earlier, most notably by avoiding unnecessary QUIC SSL handshake
processing.

A new module quic_rules is created. Its main function
quic_init_exec_rules() is called on Initial packet parsing in function
quic_rx_pkt_retrieve_conn().

For the moment, only "accept" and "dgram-drop" are valid actions. Both
are final. The latter drops silently the Initial packet instead of
allocating a new QUIC connection.
2024-07-25 15:39:39 +02:00
William Lallemand
28cb01f8e8 MEDIUM: quic: implement CHACHA20_POLY1305 for AWS-LC
With AWS-LC, the aead part is covered by the EVP_AEAD API which
provides the correct EVP_aead_chacha20_poly1305(), however for header
protection it does not provides an EVP_CIPHER for chacha20.

This patch implements exceptions in the header protection code and use
EVP_CIPHER_CHACHA20 and EVP_CIPHER_CTX_CHACHA20 placeholders so we can
use the CRYPTO_chacha_20() primitive manually instead of the EVP_CIPHER
API.

This requires to check if we are using EVP_CIPHER_CTX_CHACHA20 when
doing EVP_CIPHER_CTX_free().
2024-07-25 13:45:39 +02:00
William Lallemand
177c84808c MEDIUM: quic: add key argument to header protection crypto functions
In order to prepare the code for using Chacha20 with the EVP_AEAD API,
both quic_tls_hp_decrypt() and quic_tls_hp_encrypt() need an extra key
argument.

Indeed Chacha20 does not exists as an EVP_CIPHER in AWS-LC, so the key
won't be embedded into the EVP_CIPHER_CTX, so we need an extra parameter
to use it.
2024-07-25 13:45:39 +02:00
William Lallemand
d55a297b85 MINOR: quic: rename confusing wording aes to hp
Some of the crypto functions used for headers protection in QUIC are
named with an "aes" name even thought they are not used for AES
encryption only.

This patch renames these "aes" to "hp" so it is clearer.
2024-07-25 13:45:38 +02:00
William Lallemand
31c831e29b MEDIUM: ssl/quic: implement quic crypto with EVP_AEAD
The QUIC crypto is using the EVP_CIPHER API in order to achieve
authenticated encryption, this was the API which was used with OpenSSL.
With libraries that inspires from BoringSSL (libreSSL and AWS-LC), the
AEAD algorithms are implemented using the EVP_AEAD API.

This patch converts the call to the EVP_CIPHER API when called in the
contex of AEAD cryptography for QUIC.

The patch defines some QUIC_AEAD macros that can be either EVP_CIPHER or
EVP_AEAD depending on the library.

This was mainly done for AWS-LC but this could be useful for other
libraries. This should finally allow to use CHACHA20_POLY1305 with
AWS-LC.

This patch allows to use the following ciphers with the EVP_AEAD API:
- TLS1_3_CK_AES_128_GCM_SHA256
- TLS1_3_CK_AES_256_GCM_SHA384

AWS-LC does not implement TLS1_3_CK_AES_128_CCM_SHA256 and
TLS1_3_CK_CHACHA20_POLY1305_SHA256 requires some hack for headers
protection which will come in another patch.
2024-07-25 13:45:38 +02:00
Aurelien DARRAGON
709b3db941 MINOR: sink: add processed events counter in sft
Add a new struct member to sft structure named e_processed in order to
track the total number of events processed by sft applets.

sink_forward_oc_io_handler() and sink_forward_io_handler() now make use
of ring_dispatch_messages() optional value added in the previous commit
in order to increase the number of processed events.
2024-07-24 17:59:08 +02:00
Aurelien DARRAGON
47323e64ad MINOR: ring: count processed messages in ring_dispatch_messages()
ring_dispatch_messages() now takes an optional argument <processed> which
must point to a size_t counter when provided.

When provided, the value is updated to the number of messages processed
by the function.
2024-07-24 17:59:03 +02:00
Christopher Faulet
2f3c4d1b6c MINOR: spoe: export the list of SPOP error reasons
The strings representing the human-readable version for SPOP errors are now
exported. It is now an array of IST to ease manipulation.
2024-07-24 14:19:10 +02:00
Christopher Faulet
f8fed07d3a MINOR: spoe: Add a function to validate a version is supported
spoe_check_vsn() function can now be used to check if a version, converted
to an integer, via spoe_str_to_vsn() for instance, is supported. To do so,
the list of all supported version is now exported.
2024-07-24 14:19:10 +02:00
Christopher Faulet
f93828f229 MEDIUM: vars: Be able to parse parent scopes for variables
Add session/stream scopes related to the parent. To do so, "psess", "ptxn",
"preq" or "pres" must be used instead of tranditionnal scopes (without the
first "p"). the "proc" scope is not concerned by this change because it is
not linked to a stream. When such scopes are used, a specific flags is added
on the variable description during the variable parsing.

For now, theses scopes are parsed and the variable description is updated
accordingly. But at the end, any operation on the variable value fails.
2024-07-18 16:39:39 +02:00
Christopher Faulet
d430edcda3 MINOR: vars: Use a description to set/unset a variable instead of its hash and scope
Now a variable description is retrieved when a variable is parsed, we can
use it to set or unset the variable value. It is mandatory to be able to
know the parent stream, if any, must be used, instead of the current one.
2024-07-18 16:39:38 +02:00
Christopher Faulet
eb2d71614f MINOR: vars: Fill a description instead of hash and scope when a name is parsed
A variable description is now used to parse a variable and extract its name
and its scope. It is mandatory to be able to add some flags on the variable
when it is evaluated (set or get). Among other things, this will be used to
know the parent stream, if any, must be used, instead of the current one.
2024-07-18 16:39:38 +02:00
Christopher Faulet
b020bb73a0 MINOR: stream: Add a pointer to set the parent stream
A pointer to a parent stream was added in the stream structure. For now,
this pointer is never set, but the idea is to have an access to a stream
environment from another one from the moment there is a parent/child
relationship betwee these streams.

Concretely, for now, there is nothing to formalize this relationship.
2024-07-18 16:39:38 +02:00
Aurelien DARRAGON
d3d35f0fc6 BUILD: tree-wide: cast arguments to tolower/toupper to unsigned char (2)
Fix build warning on NetBSD by reapplying f278eec37a ("BUILD: tree-wide:
cast arguments to tolower/toupper to unsigned char").

This should fix issue #2551.
2024-07-18 13:29:52 +02:00
William Lallemand
344c3ce8fc MEDIUM: ssl: add extra_chain to ckch_data
The extra_chain member is a pointer to the 'issuers-chain-path' file
that completed the chain.

This is useful to get what chain file was used.
2024-07-17 16:52:06 +02:00
Valentine Krasnobaeva
665dde6481 MINOR: debug: use LIM2A to show limits
It is more handy to use LIM2A in debug_parse_cli_show_dev(), as it allows to
show a custom string ("unlimited"), if a given limit value equals to 0.

normalize_rlim() handler is needed to convert properly RLIM_INFINITY to zero,
with the respect of type sizes, as rlim_t is always 4 bytes on 32bit and
64bit arch.
2024-07-16 14:04:41 +02:00
Willy Tarreau
75b335abc7 MINOR: fd: don't scan the full fdtab on all threads
During tests, it's pretty visible that with many threads and a large
number of FDs, the process may take time to be ready. The reason for
this is that the full fdtab array is scanned by each and every thread
at boot in fd_reregister_all() in order to make each thread-local
poller adopt the FDs that are relevant to it. The problem is that
when dealing with 1-2M FDs and 64+ threads, it starts to represent
quite a number of loops, and usually the fdtab array doesn't entirely
fit in the CPU's L3 cache, causing extra memory accesses.

It's particularly visible when issuing debugging commands to the CLI
because usually the first one fails while the CPU is at 100% for half
a second (which also is socat's timeout). A quick test with this:

    global
        stats socket /tmp/sock1 level admin mode 666
        stats timeout 1h
        maxconn 2000000

And the following script started in another window:

    while ! time socat -t5 - /tmp/sock1 <<< "show version";do date -Ins;done

shows that it takes 1.58s for the socat instance that succeeds on an
Ampere Altra with 80 cores, this requires to change the timeout (defaults
to half a second) otherwise it returns nothing. In addition it also means
that during reloads, some CPU spikes will be noticed.

Adding a prefetch of the current FD + 16 improves the startup time by 30%
but that's far from being sufficient.

In practice all of this is performed at boot time, a moment at which we
know that extremely few FDs are registered (basically just the listeners),
so FD numbers are usually very low and the rest of the table is scanned
for no benefit. Ideally, knowing upfront how many FDs we have should be
sufficient.

A first approach would consist in counting the entries on a single thread
before registering pollers. It's not necessarily efficient and would take
time anyway.

This patch takes a different approach. It consists in keeping a thread-local
max ("fd_highest") that is updated whenever fd_insert() is called with a
larger number. Of course this is not correct once all threads have started,
but it will remain valid during boot since the same value is used during
startup and is cloned for each thread, and no scheduling happens anywhere
during this period, so that all threads are aware of the highest FD they've
seen registered, even if it had been done in some init code, and this without
having to deal with a shared variable.

Here on the test platform, the script gets its response in 10ms vs 1580
before.
2024-07-15 19:19:13 +02:00
Christopher Faulet
a492e08e62 CLEANUP: spoe: Uniformize function definitions
SPOE functions definitions were splitted on 2 or more lines, with the return
type alone on the first line. It is unusual in the HAProxy code.

The related issue is #2502.
2024-07-12 15:27:05 +02:00
Christopher Faulet
cab98784d8 MAJOR: spoe: Rewrite SPOE applet to use the SPOP mux
It is the huge part of the series. The patch is not so huge, it removes
functions to produce or consume frames. The SPOE applet is pretty light
now. But since this patch, the SPOP multiplexer is now used. The SPOP mode
is now automatically ised for SPOP backends. So if there are bugs in the
SPOP multiplexer, they will be visible now.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
1bea73612a MEDIUM: check/spoe: Use SPOP multiplexer to perform SPOP health-checks
The SPOP health-checks are now performed using the SPOP multiplexer. This
will be fixed later, but for now, it is considered as a L4 health-check and
no specific status code is reported. It means the corresponding vtest script
is marked as broken for now.

Functionnaly speaking, the same is performed. A connection is opened, a
HELLO frame is sent to the agent and we wait for the HELLO frame from the
agent in reply. But only L4OK, L4KO or L4TOUT will be reported.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
7e1bb7283b MEDIUM: mux-spop: Introduce the SPOP multiplexer
It is no possible yet to use it. Idles connections and pipelining mode are
not supported for now. But it should be possible to open a SPOP connection,
perform the HELLO handshake, send a NOTIFY frame based on data produced by
the client side and receive the corresponding ACK frame to transfer its
content to the client side.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
d0d23a7a66 MINOR: spoe: Move spoe_str_to_vsn() into the header file
The function used to convert the SPOE version from a string to an integer is
now located in spoe-t.h header file.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
08b522d6ac MINOR: spoe: Move all stuff regarding the filter/applet in the C file
Structures describing the SPOE applet context, the SPOE filter configuration
and context and the SPOE messages and groups are moved in the C file. In
spoe-t.h file, it remains the structure describing an SPOE agent and flags
used by both sides.

In addition, the SPOE frontend, created for a given SPOE engine, is moved
from the SPOE filter configuration to the SPOE agent structure.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
e6145a0ea1 MINOR: spoe: Dynamically alloc the message list per event of an agent
The inline array used to store, the configured messages per event in the
SPOE agent structure, is replaced by a dynamic array, allocated during the
configuration parsing. The main purpose of this change is to be able to move
all stuff regarding the SPOE filter and applet in the C file.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
ce53bb6284 MINOR: spoe: Rename some flags and constant to use SPOP prefix
A SPOP multiplexer will be added. Many flags, constants and structures will
be remove from the applet scope. So the "SPOP" prefix is used instead of
"SPOE", to be consistent.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
51ebf644e5 MINOR: stconn: Use a dedicated function to get the opposite sedesc
se_opposite() function is added to let an endpoint retrieve the opposite
endpoint descriptor. Muxes supportng the zero-copy forwarding can now use
it. The se_shutdown() function too. This will be use by the SPOP multiplexer
to be able to retrieve the SPOE agent configuration attached to the applet
on client side.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
4b8098bf48 MINOR: connection: No longer include stconn type header in connection-t.h
It is a small change, but it is cleaner to no include stconn-t.h header in
connection-t.h, mainly to avoid circular definitions.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
33ac3dabcb MEDIUM: applet: Add a .shut callback function for applets
Applets can now define a shutdown callback function, just like the
multiplexer. It is especially usefull to get the abort reason. This will be
pretty useful to get the status code from the SPOP stream to report it at
the SPOe filter level.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
1538c4aa82 MEDIUM: proxy/spoe: Add a SPOP mode
The SPOE was significantly lightened. It is now possible to refactor it to
use a dedicated multiplexer. The first step is to add a SPOP mode for
proxies. The corresponding multiplexer mode is also added.

For now, there is no SPOP multiplexer, so it is only declarative. But at the
end, the SPOP multiplexer will be automatically selected for servers inside
a SPOP backend.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
b986952a75 MINOR: spoe: Remove the dedicated SPOE applet task
The dedicated task per SPOE applet is no longer used. So it is removed.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
4e589095d9 MAJOR: spoe: Remove idle applets and pipelining support
Management of idle applets is removed. Consequently, the pipelining support
is also removed. It is a huge change but it should be transparent for the
agents, except regarding the performances. Of course, being able to reuse
already openned connections and being able to multiplex frames on a given
connection is a must have. These features will be restored later.

hello and idle timeout are not longer used. Because an applet is spawned to
process a NOTIFY frame and closed after receiving the ACK reply, the
processing timeout is the only one required. In addition, the parameters to
limit the SPOE applet creation are no longer used too.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
2405881ab0 MINOR: spoe: Remove debugging
All the SPOE debugging is removed. The code will be easier to rework this
way and the debugging will be mainly moved in the SPOP multiplexter via the
trace API.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
d37489abef MINOR: spoe: Use only a global engine-id per agent
Because the async mode was removed, it is no longer mandatory to announce a
different engine identifiers per thread for a given SPOE agent. This was
used to be sure requests and the corresponding responses are stuck on the
same thread.

So, now, a SPOE agent only announces one engine identifier on all
connections. No changes should be expected for agents.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
52ad7eb79e MEDIUM: spoe: Remove async mode support
The support for asynchronous mode, the ability to send messages on a
connection and receive the responses on any other connections, is removed.
It appears this feature was a bit overkill. And it is a problem for this
refactoring. This feature is removed and will not be restored at the end.

It is not a big deal for agent supporting the async mode because it is
usable if it is announced on both sides. HAProxy stops to announce it. This
should be transparent for agents.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
e3c92209f7 MEDIUM: spoe: Remove fragmentation support
It is the first patch of a long series to refactor the SPOE filter. The idea
is to rely on a dedicated multiplexer instead of hakcing HAProxy with a list
of applets processing a message queue.

First of all, optionnal features will be removed. Some will be restored at
the end, some others will just be removed. It is the case here. The frame
fragmentation support is removed. The only purpose of this feature is to be
able to support the streaming. Because it is out of the scope of this
refactoring, the fragmentation is removed.

The related issue is #2502.
2024-07-12 15:27:04 +02:00
Christopher Faulet
249a547f37 CLEANUP: stconn: Fix a typo in comments for SE_ABRT_SRC_*
Just a little typo: s/set bu/ set by/
2024-07-12 15:27:04 +02:00
Valentine Krasnobaeva
9302869c95 BUG/MINOR: limits: fix license type in limits.h
Need to use LGPL-2.1-or-later in headers since our hedaers default
to LGPL.
2024-07-11 18:15:48 +02:00
Amaury Denoyelle
3be58fc720 CLEANUP: quic: rename TID affinity elements
This commit is the renaming counterpart of the previous one, this time
for quic_conn module. Several elements related to TID affinity update
from quic_conn has been renamed : public functions, but also flag
renamed to QUIC_FL_CONN_TID_REBIND and trace event to
QUIC_EV_CONN_BIND_TID.

This should be backported with the same instruction as the previous
commit.
2024-07-11 15:14:06 +02:00
Amaury Denoyelle
9fbe8b0334 CLEANUP: proto: rename TID affinity callbacks
Since the following patch, protocol API to update a connection TID
affinity has been extended.
  commit 1a43b9f32c
  MINOR: proto: extend connection thread rebind API

The single callback set_affinity has been splitted in 3 different
functions which are called at different stages during listener_accept(),
depending on accept queue push success or not. However, the naming was
rendered confusing by the usage of function prefix 1 and 2.

Rename proto callback related to TID affinity update and use the
following names :

* bind_tid_prep
* bind_tid_commit
* bind_tid_reset

This commit should probably be backported at least up to 3.0 with the
above patch. This is because the fix was recently backported and it
would allow to keep changes minimal between the two versions. It could
even be backported up to 2.8 if there is no major conflict.
2024-07-11 15:14:06 +02:00
Amaury Denoyelle
b0990b38f8 MINOR: quic: add counters of sent bytes with and without GSO
Add a sent bytes counter for each quic_conn instance. A secondary field
which only account bytes sent via GSO which is useful to ensure if this
is activated.

For the moment, these counters are reported on "show quic" but not
aggregated on proxy quic module stats.
2024-07-11 11:02:44 +02:00
Amaury Denoyelle
d0ea173e35 MEDIUM: quic: implement GSO fallback mechanism
UDP GSO on Linux is not implemented in every network devices. For
example, this is not available for veth devices frequently used in
container environment. In such case, EIO is reported on send()
invocation.

It is impossible to test at startup for proper GSO support in this case
as a listener may be bound on multiple network interfaces. Furthermore,
network interfaces may change during haproxy lifetime.

As such, the only option is to react on send syscall error when GSO is
used. The purpose of this patch is to implement a fallback when
encountering such conditions. Emission can be retried immediately by
trying to send each prepared datagrams individually.

To support this, qc_send_ppkts() is able to iterate over each datagram
in a so-called non-GSO fallback mode. Between each emission, a datagram
header is rewritten in front of the buffer which allows the sending loop
to proceed until last datagram is emitted.

To complement this, quic_conn listener is flagged on first GSO send
error with value LI_F_UDP_GSO_NOTSUPP. This completely disables GSO for
all future emission with QUIC connections using this listener.

For the moment, non-GSO fallback mode is activated when EIO is reported
after GSO has been set. This is the error reported for the veth usage
described above.
2024-07-11 11:02:44 +02:00
Amaury Denoyelle
448d3d388a MINOR: quic: add GSO parameter on quic_sock send API
Add <gso_size> parameter to qc_snd_buf(). When non-null, this specifies
the value for socket option SOL_UDP/UDP_SEGMENT. This allows to send
several datagrams in a single call by splitting data multiple times at
<gso_size> boundary.

For now, <gso_size> remains set to 0 by caller, as such there should not
be any functional change.
2024-07-11 11:02:44 +02:00
Amaury Denoyelle
96a34d79d9 MINOR: quic: define quic_cc_path MTU as constant
Future commits will implement GSO support to be able to emit multiple
datagrams in a single syscall invocation. This will be used every time
there is more data to sent than the UDP network MTU.

No change will be done for Tx buffer encoding, in particular when using
extra metadata datagram header. When GSO will be used, length field will
contain the total length of all datagrams to emit in a single GSO
syscall send. As such, QUIC send functions will detect that GSO is in
use if total length is greater than MTU.

This last assumption forces to ensure that MTU is constant. Indeed, in
case qc_send() is interrupted, Tx buffer will be left with prepared
datagrams. These datagrams will be emitted at the next qc_send()
invocation. If MTU would change during these two calls, it would be
impossible to know if GSO was used or not. To prevent this, mark <mtu>
field of quic_cc_path as constant.
2024-07-11 11:02:44 +02:00
Amaury Denoyelle
35470d5185 MINOR: quic: activate UDP GSO for QUIC if supported
Add a startup test for GSO support in quic_test_socketopts() and
automatically activate it in qc_prep_pkts() when building datagrams as
big as MTU.

Also define a new config option tune.quic.disable-udp-gso. This is
useful to prevent warning on older platform or to debug an issue which
may be related to GSO.
2024-07-11 11:02:44 +02:00