Commit graph

18279 commits

Author SHA1 Message Date
Valentine Krasnobaeva
df68f7ec96 BUG/MINOR: cfgparse-global: fix allowed args number for setenv
Keywords setenv and presetenv take 2 arguments: variable name and value.
So the total number that should be passed to alertif_too_many_args is 2
("setenv <name> <value>") instead of 3. For alertif_too_many_args, the first
argument index is 0.
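
For reference, a minimal global section using both keywords (variable names
and values below are just placeholders):

   global
      setenv    APP_DIR  /opt/app
      presetenv APP_MODE production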

This should be backported to all stable versions.
2024-10-01 10:35:09 +02:00
Christopher Faulet
273d322b6f MINOR: stream/stats: Expose the total number of streams ever created in stats
A shared counter is added in the thread context to track the total number of
streams created on the thread. This number is then reported in stats. It
will be a useful information to diagnose some bugs.
2024-09-30 16:55:53 +02:00
Christopher Faulet
18ee22ff76 MINOR: stream/stats: Expose the current number of streams in stats
A shared counter is added in the thread context to track the current number
of streams. This number is then reported in stats. It will be a useful
information to diagnose some bugs.
2024-09-30 16:55:53 +02:00
Christopher Faulet
6a94b7419e MINOR: stream: Support dynamic changes of the number of connection retries
Thanks to the previous patch, it is now possible to add an action to
dynamically change the maximum number of connection retries for a stream.
The "set-retries" action may now be used to do so, from a "tcp-request content"
or an "http-request" rule. This action accepts an expression or an integer
between 0 and 100. The integer value is checked during configuration
parsing and leads to an error if it is not in the expected range. However,
for the expression, the value is retrieved at runtime, so invalid values are
just ignored.

Too high a value is forbidden to avoid any trouble. 100 retries already
seems to be an amazingly high value. In addition, the option is only
available in backend or listen sections.

Because the max retries is limited to 100 at most, it can be stored as an
unsigned short. This saves some space in the stream structure.
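
As an illustration, here is a hedged configuration sketch (the exact action
syntax is assumed from the description above):

   backend app
      retries 3
      # fixed value, validated at configuration parsing time
      http-request set-retries 10 if { path_beg /payments }
      # expression evaluated at runtime; out-of-range values are ignored
      tcp-request content set-retries int(5)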
2024-09-30 16:55:53 +02:00
Christopher Faulet
91e785edc9 MINOR: stream: Rely on a per-stream max connection retries value
Instead of directly relying on the backend parameter to limit the number of
connection retries, we now use a per-stream value. This value is by default
inherited from the backend value when it is set. So for now, there is no
change except the stream value is used instead of the backend value. But
thanks to this change, it will be possible to dynamically change this value.
2024-09-30 16:55:53 +02:00
Christopher Faulet
0d91de2be4 MINOR: action: Export release_expr_int_action() release function
This function was only used by TCP actions and was private to the tcp_act.c
file. However, it makes sense to make it public so it can be used by any action
relying on an int-or-expression argument.
2024-09-30 16:55:53 +02:00
Christopher Faulet
688abb6f30 BUG/MINOR: mcli: Pretend the mux have more data to deliver between two commands
Since the commit "OPTIM: stconn: Don't pretend mux have more data to deliver
on EOI/EOS/ERROR", the SC no longer pretend its mux have more data to
deliver when one of EOI/EOS/ERROR flags are set on its sedesc.

However, for the master cli, it is an issue because any EOI/EOS at the end
of a command is in fact detected on the attempt to get the next command. To
do so, the stream is reset. Because if the commit above, the next received
is never performed. To fix the issue, when the stream is reset, the front SC
pretend its mux have more data to deliver.

This patch must only be backported if the commit above is backported.
2024-09-30 16:55:53 +02:00
Christopher Faulet
bca5e14235 OPTIM: stconn: Don't pretend mux have more data to deliver on EOI/EOS/ERROR
Doing some benchmarks on 3.0, we encountered a small loss in requests/sec on
small objects compared to 2.8. After bisecting the issue, it appeared
that this was introduced when the mux-to-mux zero-copy data forwarding was
implemented in 2.9-dev8. Extra subscribes on receives at the end of the
message were responsible for the loss.

A basic configuration, sending H2 requests to an H1 server returning
responses without payload, is enough to observe the issue. With the following
command, we can observe a huge increase in epoll_ctl calls on 2.9/3.x:

  h2load -c 100 -m 10 -n 100000 http://...

On 2.8 we have around 3200 calls to epoll_ctl against more than 20k on 3.1.

The fix seems obvious. After a receive, there is no reason to state that a mux
has more data to deliver if an EOI/EOS/ERROR flag was set on the
stream-endpoint descriptor. With this change, the extra calls to epoll_ctl
disappear. However, it is a sensitive part, so it is important to keep an eye
on it and to not backport it.

Thanks to Willy and Emeric for having spotted the issue.
2024-09-30 16:55:48 +02:00
Willy Tarreau
11051ed9c7 OPTIM: channel: speed up co_getline()'s search of the end of line
Previously, co_getline() was essentially used for occasional parsing
in the peers' banner or Lua, so it could afford to read one character at
a time. However now it's also used on the TCP log path, where it can
consume up to 40% CPU as mentioned in GH issue #2731. Let's speed it
up by using memchr() to look for the LF, and copying the data at once
using memcpy().

Previously it would take 2.44s to consume 1 GB of log on a single
thread of a Core i7-8650U, now it takes 1.56s (-36%).
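
A minimal sketch of the idea (illustration only, not the actual haproxy code,
and assuming a contiguous input area while the real channel buffer may wrap):

  #include <string.h>

  /* Copy one line (up to and including the LF, or the whole input if no LF
   * is found) from <in> into <out>, locating the LF with memchr() instead
   * of scanning one character at a time. Returns the number of bytes copied.
   */
  static size_t getline_fast(const char *in, size_t in_len, char *out)
  {
          const char *lf = memchr(in, '\n', in_len);
          size_t len = lf ? (size_t)(lf - in) + 1 : in_len;

          memcpy(out, in, len);
          return len;
  }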
2024-09-30 11:36:39 +02:00
Willy Tarreau
1d403caf8a MINOR: server: make srv_shutdown_sessions() call pendconn_redistribute()
When shutting down server sessions, the queue was not considered, which
is a problem if some element reached the queue at the moment the server
was going down, because there will be no more requests to kick them out
of it. Let's always make sure we scan the queue to kick these streams
out of it and that they can possibly find a more suitable server. This
may make a difference in the time it takes to shut down a server on the
CLI when lots of servers are in the queue.

It might be interesting to backport this to 3.0 but probably not much
further.
2024-09-27 19:01:38 +02:00
Willy Tarreau
1385e33eb0 BUG/MINOR: queue: make sure that maintenance redispatches server queue
Turning a server to maintenance currently doesn't redispatch the server
queue unless there's an explicit "option redispatch" and no "option
persist", while the former has never really been the purpose of this
test. Better refine this so that forced maintenance also causes the
queue to be flushed, and possibly redispatched unless the proxy has
option persist. This way now when turning a server to maintenance,
the queue is immediately flushed and streams can decide what to do.

This can be backported, though there's no need to go far since it was
never directly reported and only noticed as part of debugging some
rare "shutdown sessions" strangeness, which it might have contributed to.
2024-09-27 18:54:07 +02:00
Willy Tarreau
b8e3b0a18d BUG/MEDIUM: stream: make stream_shutdown() async-safe
The solution found in commit b500e84e24 ("BUG/MINOR: server: shut down
streams under thread isolation") to deal with inter-thread stream
shutdown doesn't work fine because there exists code paths involving
a server lock which can then deadlock on thread_isolate(). A better
solution then consists in deferring the shutdown to the stream itself
and just wake it up for that.

The only thing is that TASK_WOKEN_OTHER is a bit too generic and we
need to pass at least 2 types of events (SF_ERR_DOWN and SF_ERR_KILLED),
so we're now leveraging the new TASK_F_UEVT1 and _UEVT2 flags on the
task's state to convey this information. The caller only needs to wake the
task up with these flags set, and the stream handler will then finish
the job locally using stream_shutdown_self().

This needs to be carefully backported to all branches affected by the
dequeuing issue and containing any of the 5541d4995d ("BUG/MEDIUM:
queue: deal with a rare TOCTOU in assign_server_and_queue()"), and/or
b11495652e ("BUG/MEDIUM: queue: implement a flag to check for the
dequeuing").
2024-09-27 12:15:41 +02:00
Willy Tarreau
d1c398b786 Revert "BUG/MINOR: server: shut down streams under thread isolation"
This reverts commit b500e84e24.

Thread isolation does not work well for this, there exists code paths
which already hold the server's lock and result in a deadlock. Let's
revert that and address it better without isolation.
2024-09-27 10:17:31 +02:00
Aurelien DARRAGON
e3eb6a9035 MEDIUM: log: consider log-steps proxy setting for existing log origins
During tcp/http transaction processing, haproxy may produce logs at
different steps during the processing (accept, connect, request,
response, close). But the behavior is hardly configurable because
haproxy will only emit a single log per transaction, and by default
it will try to produce the log once all log aliases or fetches used
in the logformat could be satisfied, which means the log is often
emitted during connection teardown, unless "option logasap" is used.

We were often asked to have a way to emit multiple logs for a single
transaction, for instance emitting a log during accept, then request,
response and close; see GH #401 for more context.

Thanks to "log-steps" keyword introduced by commit "MINOR: log:
introduce "log-steps" proxy keyword", it is now possible to explictly
configure when logs should be generated by haproxy when processing a
transaction. This commit adds the required checks so that log-steps
proxy option is properly considered for existing logs generated by
haproxy. If "log-steps" is not specified on the proxy, the old behavior
is preserved.

Note: a slight cpu overhead should only be visible when "log-steps"
keyword will be used due to the implementation relying on eb32 lookup
instead of basic bitfield check as described in "MINOR: proxy: add
log_steps struct member". However, the default behavior shouldn't be
affected.

When combining log-steps with log-profiles, user has the ability to
explicitly control how and when haproxy should generate logs during
requests handling.
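
As an illustration, a hedged configuration sketch (step names are taken from
the origins listed above; the exact keyword syntax may differ):

   frontend fe_main
      bind :8080
      log 127.0.0.1:514 local0
      # emit one log line at each of these processing steps
      log-steps request,response,close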
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
4189eb7aca MINOR: log: add log_orig_proxy() helper function
Function may be used on proxy where log-steps are used to check if a given
log origin should be handled or not.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
c043d5d372 MINOR: log: introduce "log-steps" proxy keyword
For now it is only available for proxies with frontend capability because
log-steps are only evaluated under sess_log() or strm_log() which
essentially focus on the frontend side when it comes to log settings so
it's better to keep it this way for better consistency, at least for now.

For now the setting does nothing (it is not considered during runtime),
it will be implemented and documented in upcoming commits.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
9341792baf MINOR: proxy: add log_steps struct member
Add the proxy->conf.log_steps eb32 root tree, which will be used to store the
log origin identifiers that should result in haproxy emitting a log, as
configured by the user using the upcoming "log-steps" proxy keyword.

It was chosen to use an eb32 tree instead of a simple bitfield because despite
the slight overhead it is more future-proof given that we already
implemented the prerequisites for seamless custom log origins registration
that will also be usable from "log-steps" proxy keyword.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
b882402a29 MINOR: log: support extra log origins for '%OG' alias
Following previous commits, let's improve log_orig_to_str() so that
extra log origins (registered through log_orig_register()) can be
translated to string from origin ID.

For that, it is required to add an eb32 tree node to the log_origin struct in
order to enable quick integer lookups during runtime. Slow name lookup
using the list is acceptable for config parsing, but it is not the case
during runtime when log_orig_to_str() is expected to be used. Also, to
prevent duplicated info, get rid of the ->id field and use ->tree.key instead.
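
For reference, '%OG' can then be used like any other alias in a log-format
string, e.g. (illustrative only):

   log-format "%ci:%cp [%tr] %OG %ST %B"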
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
f8bb9d5c57 MINOR: log: explicitly handle extra log origins as error when relevant
Thanks to the previous commit, we can now check for log_orig optional flags
in functions taking struct log_orig as parameter. Let's take this
opportunity to add the LOG_ORIG_FL_ERROR flag and check this flag at a
few places to handle the log message differently, because if the flag is
set then the caller expects the log to be handled explicitly as an error.

e.g.: in _process_send_log_override(), if the flag is set, use the error
log format instead of the dedicated one.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
3c15ee05e9 MINOR: log: introduce log_orig flags
Rename 'enum log_orig' to 'enum log_orig_id', since this enum specifically
contains the log origin ids.

Add 'struct log_orig' which wraps 'enum log_orig' with optional flags
(no flags defined for now).

Add log_orig() helper func that takes id and flags as parameters and
returns a log_orig struct initialized with the input arguments.

Update functions taking log origin as parameter so they explicitly take
log orig id or log orig wrapper as argument depending on the level of
context expected by the function.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
6567e37680 MINOR: log: handle extra log origins in _process_send_log_override()
Thanks to the previous commit, it is now possible to register additional
log origins that may be used from log-profile section as 'on' steps.

As such, let's make _process_send_log_override() function aware of them
by trying to lookup in the tree of extra logging steps in the default
switch-case catchall. If the log origin id matches with the id of the
extra logging step, we use the associated log format instead of the
"any" log format.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
818475c5cc MINOR: log: introduce extra log profile steps
Add a way to register additional log origins using log_origin_register();
these may be used as log profile steps from log-profile sections.

For now this does nothing as no extra origins are registered and extra log
origins are not yet considered for runtime logging paths.

When specifying an extra logging step for "on <step>" under a log-profile
section, the logging step is stored within a binary tree for efficient
lookup during runtime. No performance impact should be expected if extra
log origins are not being used, and a slight performance impact if extra
log origins are used.

Don't forget to update the documentation when new log origins are added
(both the %OG log alias and the "on <step>" log-profile keyword are concerned).
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
facf259d88 MINOR: log: fix indent in strm_log()
8f34320e15 ("MINOR: log: provide log origin in logformat expressions
using '%OG'") caused wrong indent in strm_log()
2024-09-26 16:53:07 +02:00
Oliver Dala
a889413f5e BUG/MEDIUM: cli: Deadlock when setting frontend maxconn
The proxy lock state isn't passed down to relax_listener
through dequeue_proxy_listeners, which causes a deadlock
in relax_listener when it tries to get that lock.

Backporting: Older versions didn't have relax_listener and directly called
resume_listener in dequeue_proxy_listeners. lpx should just be passed directly
to resume_listener then.

The bug was introduced in commit 001328873c

[cf: This patch should fix the issue #2726. It must be backported as far as
2.4]
2024-09-25 17:12:11 +02:00
Christopher Faulet
14a413033c BUG/MEDIUM: cli: Be sure to catch immediate client abort
A client abort while nothing was sent is properly handled except when it
happens immediately after the connection was accepted. The read0 event is
caught before the CLI applet is created. In that case, the shutdown is not
handled and the applet is no longer woken up. In that case, the stream remains
blocked and no timeout is armed.

The bug was due to the fact that when the applet I/O handler was called for
the first time, the applet context was initialized and nothing more was
performed. A shutdown, if any, would be handled on the next call. In that
case, it was too late.

Now, after the init step, we loop to evaluate the first command. There is no
command here but the shutdown will be tested.

This patch should fix the issue #2727. It must be backported to 3.0.
2024-09-24 18:01:38 +02:00
Aurelien DARRAGON
d622f9d5b6 MEDIUM: mailers: warn about deprecated legacy mailers
As mentioned in 2.8 announce on the mailing list [1] and on the wiki [2],
use of legacy mailers is now deprecated and will not be supported anymore
starting with version 3.3. Use of Lua script (AKA Lua mailers) is now
encouraged (and fully supported since 2.8) for this purpose, as it offers
more flexibility (e.g: alerts can be customized) and is more future-proof.

Configurations relying on legacy mailers will now raise a warning.

Users willing to keep their existing mailers config in a working state
should simply add the following line to their global section:

   # mailers.lua file as provided in the git repository
   # adjust path as needed
   lua-load examples/lua/mailers.lua

[1]: https://www.mail-archive.com/haproxy@formilux.org/msg43600.html
[2]: https://github.com/haproxy/wiki/wiki/Breaking-changes
2024-09-23 20:16:27 +02:00
Willy Tarreau
fdf38ed7fc BUG/MINOR: proxy: also make the cli and resolvers use the global name
As detected by ASAN on the CI, two places still using strdup() on the
proxy names were left by commit b325453c3 ("MINOR: proxy: use the global
file names for conf->file").

No backport is needed.
2024-09-21 20:08:06 +02:00
Willy Tarreau
b500e84e24 BUG/MINOR: server: shut down streams under thread isolation
Since the beginning of thread support, the shutdown of streams attached
to a server was run under the server's lock, but that's not sufficient.
It indeed turns out that shutting down streams (either from the CLI using
"shutdown sessions server XXX" or due to "on-error shutdown-sessions")
iterates over all the streams to shut them down, but stream_shutdown()
has no way to protect its actions against concurrent actions from the
stream itself on another thread, and streams offer no such provisions
anyway.

The impact is some rare but possible crashes when shutting down streams
from the CLI in competition with high server traffic. The probability
is low enough to mark it minor, though it was observed in the field.

At least since 2.4 the streams are arranged in per-thread lists, so it
likely would be possible using the event subsystem to delegate these
events to dedicated per-thread tasks which would address the problem.
But server streams don't get killed often enough to justify such extra
complexity, so better just run the loop under thread isolation.

It also shows that the internal API could probably be improved to
support a lighter thread exclusion instead of full isolation: various
places want to only exclude one thread and here it could work. But
again there's no point doing this for now.

This patch should be backported to all stable branches. It's important
to carefully check that this srv_shutdowns_streams() function is never
called itself under isolation in older versions (though at first glance
it looks OK).
2024-09-21 19:35:35 +02:00
Willy Tarreau
e77c73316a MEDIUM: cfgparse: warn about deprecated use of duplicate server names
As discussed below, there are too many problems and limitations caused
by still supporting duplicate server names. That's already particularly
complicated and dissuasive to use since it requires these servers to
have explicit IDs to be accepted. Let's now warn on any duplicate, even
with explicit IDs, and remind that this will become forbidden in 3.3.
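
For example, a configuration like the following, which previously required
explicit IDs to be accepted, now triggers the deprecation warning:

   backend app
      server web1 192.0.2.10:80 id 1
      server web1 192.0.2.11:80 id 2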

Link: https://www.mail-archive.com/haproxy@formilux.org/msg45185.html
2024-09-20 17:15:11 +02:00
Willy Tarreau
029d75df1e OPTIM: cfgparse: speed up duplicate server detection
Surprisingly, the duplicate server name detection has never made use
of the names tree, so lookups were still in O(N^2). It took 1 second
to validate 50k servers spread into 25 backends at 2k per backend.

By simply using the tree (and since the current server already is in
the tree), we just have to walk using ebpt_prev_dup to visit previous
servers with the same name. We can then detect which ones conflict
without having an ID set, and report an error. The config check time is now 1/4
of the previous one for 2k servers per backend, and more importantly
it will make it simpler to check for any duplicates later.
2024-09-20 17:14:50 +02:00
Willy Tarreau
ccd1ecba1d MEDIUM: cfgparse: drop duplicate named defaults sections after use
It has never been permitted to explicitly reference named defaults
sections for which there are duplicate names. This means that when
a duplicate defaults section is found, there's no point in keeping
it since it will never be used for lookups, so it can be dropped.

However, some such defaults sections might have some rules in them
that are implicitly referenced by proxies placed after them. In this
case they cannot be removed.

What is done here is that upon each new named section creation, if
another one is found with the same name, its config location is stored
into the new proxy's {prev_file,prev_line} pair, and the old section is
either destroyed if its refcount is null, or just unindexed. The dup
check when creating a new proxy now consists in checking the prev_line
instead of performing a dup lookup on the defaults section.

This will guarantee that we can't find duplicate defaults sections in
their tree anymore, while still keeping track of what's allocated and
releasing everything upon exit.

Beyond the consistency gain, there are nice savings for large configs
involving many defaults sections: a test with 300k sections saved
about 1.9 GB of RAM, and started 25% faster likely thanks to spending
less time allocating memory.
2024-09-20 16:35:32 +02:00
Willy Tarreau
c8b813771d MINOR: proxy: add a list of orphaned defaults sections
We'll soon delete unreferenced and duplicated named defaults sections
from the list of proxies. The problem with this is that this list (in
fact a name-based tree) is used to release all of them at the end. Let's
add a list of orphaned defaults sections, typically those containing
"http-check send" statements or various other rules, and that are
implicitly inherited by a proxy hence have a non-zero refcount while
also having a name. This now makes it possible to remove them from
the name index while still keeping their memory around for the lifetime
of the process, and cleaning it at the end.
2024-09-20 15:59:04 +02:00
Willy Tarreau
cb4c236fac BUG/MINOR: cfgparse: detect another uncaught case of duplicate defaults
The following sequence was not properly caught:

   defaults def
   backend back from def
   defaults def

But this one was:

   defaults def
   defaults def
   backend back from def

Let's check when defaults are declared that they're not already
referenced.

Better not backport this. While it will catch broken configs (possibly
some with backends pasted after the wrong defaults), these might still
work by accident. It may be reported as a diag warning though.
2024-09-20 15:58:10 +02:00
Willy Tarreau
5b221d1e41 CLEANUP: cfgparse: factor proxy vs log-forward collisions
This simplifies the check added in 1a38684fbc ("MEDIUM: cfgparse:
detect collisions between defaults and log-forward"), by factoring it
with the other existing one.

The tests are ugly in that code because a first block tests pure
proxies, a second one proxies or defaults and inside that one we
have special cases for defaults. Let's just move the tests to the
"any proxy type" block.
2024-09-20 14:13:14 +02:00
Willy Tarreau
b325453c36 MINOR: proxy: use the global file names for conf->file
Proxy file names are assigned a bit everywhere (resolvers, peers,
cli, logs, proxy). All these elements were enumerated and now use
copy_file_name(). The only ha_free() call was turned to drop_file_name().

As a bonus side effect, a 300k backend config saved 14 MB of RAM.
2024-09-19 15:38:19 +02:00
Willy Tarreau
9ab21a3c2d CLEANUP: stick-table: make the file location point to a global file name
The file name used to point to the calling function's stack for stick
tables, which was OK during parsing but remained dangling afterwards.
At least it was already marked const so as not to accidentally free it.
Let's make it point to a file_name_node now.
2024-09-19 15:38:19 +02:00
Willy Tarreau
d6c060c5ae MINOR: tools: add minimal file name management
In proxies, stick-tables, servers, etc... at plenty of places we store
a file name and a line number. Some file names are the result of strdup()
(e.g. in proxies), others not (e.g. stick-tables) and leave dangling
pointers at the end of parsing. The risk of double-free is not null
either.

In order to stop this, let's first add a simple tool that allows registering
short strings inside a global list, these strings happening
to be file names. The strings are either duplicated and stored upon
failure to find them, or just added to this storage. Since file names
are not expected to disappear before the end of the process, for now
we don't even implement refcounting, and we free them all at the end.
There's already a drop_file_name() function to reset the pointer like
ha_free() used to do, and even if not strictly needed it's a good
habit to get used to doing it.

The strings are returned as const so that they're stored as-is in
structs, and that nasty free() calls are easily caught. The pointer
points to the char[] storage inside the node itself. This way later
if we want to implement refcounting, it will be trivial to just look
up a string and change its associated node's refcount. If needed,
comparisons can also be made on pointers.

For now they're not used yet and are released on deinit().
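
A minimal sketch of the string-interning idea (generic illustration only, not
the actual haproxy API; names are hypothetical):

  #include <stdlib.h>
  #include <string.h>

  struct name_node {
          struct name_node *next;
          char name[];                  /* string stored inline in the node */
  };

  static struct name_node *names_head;

  /* Return a long-lived const pointer for <file>, reusing the stored copy
   * if the name was already registered, duplicating it otherwise.
   */
  static const char *intern_file_name(const char *file)
  {
          struct name_node *n;

          for (n = names_head; n; n = n->next)
                  if (strcmp(n->name, file) == 0)
                          return n->name;

          n = malloc(sizeof(*n) + strlen(file) + 1);
          if (!n)
                  return NULL;
          strcpy(n->name, file);
          n->next = names_head;
          names_head = n;
          return n->name;
  }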
2024-09-19 15:36:58 +02:00
Willy Tarreau
1a38684fbc MEDIUM: cfgparse: detect collisions between defaults and log-forward
Sadly, when log-forward sections were introduced, they took great care to avoid
collisions with regular proxies, but defaults were missed (they need to be
explicitly checked for). So now we have to move them to a warning for 3.1
instead of rejecting them.
2024-09-18 18:08:15 +02:00
Willy Tarreau
d8f4b07e40 MEDIUM: cfgparse: warn about colliding names between defaults and proxies
In order to complete the checks added in 303a66573d ("MEDIUM: cfgparse:
warn about proxies having the same names"), we also need to warn about
regular proxies having the same name as defaults sections as well as
defaults sections having the same name as proxies, since defaults
sections are inherently proxies, albeit stored in a separate list for
now.
2024-09-18 18:08:06 +02:00
Amaury Denoyelle
fcd6d29acf BUG/MINOR: mux-quic: report glitches to session
A glitch counter was implemented for QUIC/HTTP3. The counter is stored in
the QCC MUX connection instance. However, it is never reported at the
session level, which is necessary if the glitch counter is tracked via a
stick-table.

To fix this, use session_add_glitch_ctr() in various QUIC MUX functions
which may increment glitch counter.

This should be backported up to 3.0.
2024-09-18 16:11:03 +02:00
Willy Tarreau
303a66573d MEDIUM: cfgparse: warn about proxies having the same names
As discussed below, there are too many problems and uncaught bugs
in the parser when trying to support proxies having similar names
but different types. There's specific code to detect the presence
of stick-tables in a pair of such proxies for example. It's even
possible that certain combinations of backend+listen that were not
previously detected have some nasty side effects.

According to the proposal in the discussion, this is now deprecated
in 3.1 (thus we emit a warning) and will become forbidden in 3.3.

A backport might be useful, but reporting a diag_warning only, not a
classical warning, so as not to break setups running in zero-warning
mode.

It was verified with a config involving all 9 combinations of
(frontend,backend,listen) followed by one of the same three that all
collisions are now properly blocked and that only back+front are kept
and emit a warning.

Link: https://www.mail-archive.com/haproxy@formilux.org/msg45185.html
2024-09-17 19:55:00 +02:00
Willy Tarreau
c70906c8a1 BUG/MINOR: cfgparse: detect incorrect overlap of same backend names
As reported below, it's possible to declare a backend then a proxy with
the same name, because for the proxy we check a frontend capability (the
first one to be tested):

   backend b
   listen b
        bind :8888

Let's check the two capabilities in this case and not just the frontend.

Better not backport this, as there's a risk of breakage of existing
setups that work by accident. It might make sense to report them as
diag warnings though.

Link: https://www.mail-archive.com/haproxy@formilux.org/msg45185.html
2024-09-17 19:55:00 +02:00
Aurelien DARRAGON
17e52c922b BUG/MINOR: cfgparse-listen: fix option httpslog override warning message
"option httpslog" override warning messaged used to be reported as
"option httplog", probably as a result of copy paste without adjusting
the context. Let's fix that to prevent emitting confusing warning messages

The issue exists since 98b930d ("MINOR: ssl: Define a default https log
format"), thus it should be backported up to 2.6
2024-09-17 15:40:02 +02:00
Aurelien DARRAGON
bc4bf5779f BUG/MINOR: fix missing "'option httpslog' overrides previous 'option tcplog clf'..." detection
Same as b85edd44db0 ("BUG/MINOR: fix missing "log-format overrides
previous 'option tcplog clf'..." detection") but for "option httpslog"
keyword.

No backport needed unless fd48b28 ("MINOR: Implements new log format of
option tcplog clf") is.
2024-09-17 15:40:02 +02:00
Aurelien DARRAGON
607b9adc9b BUG/MINOR: fix missing "log-format overrides previous 'option tcplog clf'..." detection
In commit fd48b28315 ("MINOR: Implements new log format of option tcplog clf")
"option tcplog clf" detection was correcly added for "option tcplog" and
"option httplog", but "log-format" case was overlooked. Thus, this config
would report erroneous warning message:

  defaults
    option tcplog clf
    log-format "ok"

[WARNING]  (727893) : config : parsing [test.conf:3]: 'log-format' overrides previous 'log-format' in 'defaults' section.

No backport needed unless fd48b28315 is.
2024-09-17 14:41:58 +02:00
Willy Tarreau
499e057644 MEDIUM: clock: don't compute before_poll when using monotonic clock
There's no point keeping both clocks up to date; if the monotonic clock
is ticking, let's just refrain from updating the wall clock one before
polling since we won't use it. We still do it after polling however as
we need a wall clock time to communicate with outside.

This saves one gettimeofday() call per loop and two timeval comparisons.
2024-09-17 09:08:10 +02:00
Willy Tarreau
24496803d1 MEDIUM: clock: use the monotonic clock for idle time calculation
By just keeping a copy of the last known value before entering
polling, we can apply the same algorithm as we're currently using,
except that it's now applied to the monotonic clock instead of the
wall clock, when it's detected that it's ticking. This improves
idle time calculation accuracy by making it independent of the
wall clock.
2024-09-17 09:08:10 +02:00
Willy Tarreau
4150851ce5 MEDIUM: clock: opportunistically use CLOCK_MONOTONIC for the internal time
We already collect CLOCK_MONOTONIC when it's available when leaving the
poller, but it's only used for profiling. The functions that return it
set the value to zero when it's not available, so we can use that to
detect if it works or not. The idea is that if the monotonic time is
non-zero, it is ticking and usable, then we use it for now_ns, otherwise
we use the corrected date. We continue to apply the now_offset to the
returned value because it helps forcing an early time wrap-around.

Proceeding like this presents two benefits:
  - on systems supporting this, the time is much more robust against
    time changes
  - when it works, it saves us from having to go through the time
    correction code, which is usually cheap, but better avoided anyway.

Note that idle time calculation continues to rely on the wall-clock
time.
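
A minimal sketch of the detection logic (illustration only, not the actual
haproxy code):

  #include <stdint.h>
  #include <time.h>

  /* Return the monotonic time in nanoseconds, or 0 when CLOCK_MONOTONIC is
   * unavailable or failing, so the caller can fall back to the corrected
   * wall-clock date.
   */
  static uint64_t mono_time_ns(void)
  {
          struct timespec ts;

          if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
                  return 0;
          return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
  }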
2024-09-17 09:08:10 +02:00
Willy Tarreau
f793845f4a MEDIUM: clock: collect the monotonic time in clock_local_update_date()
Now we collect this clock in clock_local_update_date(), the closest to
the poller, which is also used when busy-polling, and the value is set
into the thread's curr_mono_time, which did not exist before. Later,
clock_leaving_poll() just sets the prev_mono_time value from the curr_
one instead of retrieving the time at this specific point. It also means
that the monotonic time will now also cover the time needed to update
the global time, which should be negligible. Note that we don't collect
the CPU time in the clock_local_update_date() function even though it's
tempting, because when doing busy-polling, it would be collected on each
round while being useless.

Doing so will make sure that the local time always knows the monotonic
time when it is available.
2024-09-17 09:08:10 +02:00
Willy Tarreau
42e699903e MINOR: clock: test all clock_gettime() return values
Till now we were only using clock_gettime() for profiling, so if it
would fail it was no big deal. We intend to use it as the main clock
as well now, so we need to more reliably detect its absence or failure
and gracefully fall back to other options. Without the test we would
return anything present in the stack, which is neither clean nor easy
to detect.
2024-09-17 09:08:10 +02:00
Christopher Faulet
afc50f2445 BUG/MEDIUM: cache/stats: Wait to have the request before sending the response
It seems obvious. On a classical workflow, the request headers analysis is
finished when these applets are woken up for the first time. So they don't
take care to really have the request before starting to process it and to send
the response. But with a filter, it is possible to stop the request analysis
after the applet creation.

If this happens for the stats applet, this leads to a crash because we
retrieve the request start-line without checking if it is available. For the
cache applet, the response is just immediately sent. And here it is a problem
if the compression is enabled. In that case too, this may lead to a crash
because the compression may be enabled but not initialized.

For a true server, there is no issue because the connection cannot be
established. The server is chosen only after the request analysis. The issue
with applets is that once created, an applet is quickly switched to the
established state. So it is probably a point that must be carefully reviewed
and probably reworked.

In the meantime, as a fix, in the cache and the stats applets, we just take
care to have the request before sending the response. This will do the
trick.

The patch must be backported as far as 2.6. On 2.6, the patch must be adapted.
2024-09-16 22:55:40 +02:00
Christopher Faulet
4de6632693 MINOR: proxy: Rename accept-invalid-http-* options
With these options, it is possible to accept some invalid messages that may
be considered unsafe and may result in vulnerabilities. The naming is not
explicit enough on this point. These options must really be considered
dangerous and only used as a temporary workaround. Unfortunately, when used,
it is probably because there are some legacy and unsupported applications in
place. Nevermind. The documentation warns about the use of these
options. Now the name of the options itself is a warning.

So now, "accept-invalid-http-request" and "accept-invalid-http-response"
options are deprecated and replaced by
"accept-unsafe-violations-in-http-request" and
"accept-unsafe-violations-in-http-response" options.
2024-09-16 22:55:25 +02:00
Aurelien DARRAGON
1e0920f855 BUG/MINOR: peers: local entries updates may not be advertised after resync
Since commit 864ac3117 ("OPTIM: stick-tables: check the stksess without
taking the read lock"), when entries for a local table are learned from
another peer upon resynchro, and this is the only peer haproxy speaks to,
local updates on such entries are not advertised to the peer anymore,
until they eventually expire and can be recreated upon local updates.

This is due to the fact that ts->seen is always set to 0 when creating
new entry, and also when touch_remote is performed on the entry.

Indeed, while 864ac3117 attempts to avoid useless updates, it didn't
consider entries learned from a remote peer. Such entries are exclusively
learned in peer_treat_updatemsg(): once the entry is created (or updated)
with new data, touch_remote is used to commit the change. However, unlike
touch_local, entries committed using touch_remote will not be advertised
to the peer from which the entry was just learned (otherwise we would
enter a looping situation). Due to the above patch, once an entry is
learned from the (unique) remote peer, 'seen' will be stuck to 0 so it
will never be advertised for its whole lifetime.

Instead, when entries are learned from a peer, we should consider that
the peer that taught us the entry has seen it.

To do this, let's set seen=1 in peer_treat_updatemsg() after calling
touch_remote(). This way, if we happen to perform updates on this entry,
it will be properly advertised to relevant peers. This patch should not
affect the performance gain documented in 864ac3117 given that the test
scenario didn't involve entries learned by remote peers, but solely
locally created entries advertised to remote peers upon updates.

This should be backported to 3.0 with 864ac3117.
2024-09-16 14:06:39 +02:00
Willy Tarreau
5d350d1e50 OPTIM: vars: use multiple name heads in the vars struct
Given that the original list-based version was using a list head as the
root of the variables, while the tree is using a single pointer, it made
sense to reuse that space to place multiple roots, indexed on the lower
bits of the name hash. Two roots slightly increase the performance level,
but the best gain is obtained with 4 roots. The performance is now always
above that of the list, even with small counts, and with 100 vars, it's
21% higher than before, or 67% higher than with the list.

We keep the same lock (it could have made sense to use one lock per head),
because most of the variables in large configs are attached to a stream
or a session, hence are not shared between threads. Thus there's no point
in sharding the pointer.
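
A minimal sketch of the root selection (illustration only, not the actual
haproxy code; the root count and names are hypothetical):

  #include <stdint.h>

  #define VAR_ROOTS 4   /* must be a power of two */

  /* pick one of the roots from the low bits of the 64-bit name hash,
   * so a lookup only walks roughly a quarter of the variables
   */
  static inline unsigned int var_root_idx(uint64_t name_hash)
  {
          return (unsigned int)(name_hash & (VAR_ROOTS - 1));
  }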
2024-09-15 23:51:51 +02:00
Willy Tarreau
47ec7c681e OPTIM: vars: use a cebtree instead of a list for variable names
Configs involving many variables can start to eat a lot of CPU in name
lookups. The reason is that the names themselves are dynamic in that
they are relative to dynamic objects (sessions, streams, etc), so
there's no fixed index for example. The current implementation relies
on a standard linked list, and in order to speed up lookups and avoid
comparing strings, only a 64-bit hash of the variable's name is stored
and compared everywhere.

But with just 100 variables and 1000 accesses in a config, it's clearly
visible that variable name lookup can reach 56% CPU with a config
generated this way:

  for i in {0..100}; do
    printf "\thttp-request set-var(txn.var%04d) int(%d)" $i $i;
    for j in {1..10}; do [ $i -lt $j ] || printf ",add(txn.var%04d)" $((i-j)); done;
    echo;
  done

The performance on a 4-core Skylake at 4.4 GHz reaches 85k RPS, with a perf
profile showing:

  Samples: 170K of event 'cycles', Event count (approx.): 142378815419
  Overhead  Shared Object            Symbol
    56.39%  haproxy                  [.] var_to_smp
     6.65%  haproxy                  [.] var_set.part.0
     5.76%  haproxy                  [.] sample_process_cnv
     3.23%  haproxy                  [.] sample_conv_var2smp
     2.88%  haproxy                  [.] sample_conv_arith_add
     2.33%  haproxy                  [.] __pool_alloc
     2.19%  haproxy                  [.] action_store
     2.13%  haproxy                  [.] vars_get_by_desc
     1.87%  haproxy                  [.] smp_dup

[above, var_to_smp() calls var_get() under the read lock].

By switching to a binary tree, the cost is significantly lower, the
performance reaches 117k RPS (+37%) with this profile:

  Samples: 170K of event 'cycles', Event count (approx.): 142323631229
  Overhead  Shared Object            Symbol
    40.22%  haproxy                  [.] cebu64_lookup
     7.12%  haproxy                  [.] sample_process_cnv
     6.15%  haproxy                  [.] var_to_smp
     4.75%  haproxy                  [.] cebu64_insert
     3.79%  haproxy                  [.] sample_conv_var2smp
     3.40%  haproxy                  [.] cebu64_delete
     3.10%  haproxy                  [.] sample_conv_arith_add
     2.36%  haproxy                  [.] action_store
     2.32%  haproxy                  [.] __pool_alloc
     2.08%  haproxy                  [.] vars_get_by_desc
     1.96%  haproxy                  [.] smp_dup
     1.75%  haproxy                  [.] var_set.part.0
     1.74%  haproxy                  [.] cebu64_first
     1.07%  [kernel]                 [k] aq_hw_read_reg
     1.03%  haproxy                  [.] pool_put_to_cache
     1.00%  haproxy                  [.] sample_process

The performance lowers a bit earlier than with the list, however. What
can be seen is that the performance maintains a plateau till 25 vars,
starts degrading a little bit for the tree while it remains stable till
28 vars for the list. Then both cross at 42 vars and the list continues
to degrade following a hyperbola while the tree resists better. The biggest
loss is at around 32 variables where the list stays 10% higher.

Regardless, given the extremely narrow band where the list is better, it
looks relevant to switch to this in order to preserve the almost linear
performance of large setups. For example at 1000 variables and 10k
lookups, the tree is 18 times faster than the list.

In addition this reduces the size of the struct vars by 8 bytes since
there's a single pointer, though it could make sense to re-invest them
into a secondary head for example.
2024-09-15 23:49:01 +02:00
Willy Tarreau
a0205f9de4 IMPORT: import cebtree (compact elastic binary trees)
This is an import of the compact elastic binary trees at commit
a9cd84a ("OPTIM: descent: better prefetch less and for writes when
deleting")

These will be used to replace certain lists (and possibly certain
tree nodes as well). They're as fast (or even faster) than ebtrees
for lookups, as fast for insertion and slower for deletion, and a
node only uses 2 pointers (like a list).

The only changes were cebtree.h where common/tools.h was replaced
with ebtree.h which we already have and already provides the needed
functions and macros, and the addition of a wrapper cebtree-prv.h in
src/ to redirect to import/cebtree-prv.h.
2024-09-15 23:44:59 +02:00
Willy Tarreau
6e92988e20 MINOR: vars: remove the emptiness tests in callers before pruning
All callers of vars_prune_* currently check the list for emptiness.
Let's leave that to vars_prune() itself, it will ease some changes in
the code. Thanks to the previous inlining of the vars_prune() function,
there's no performance loss, and even a very tiny 0.1% gain.
2024-09-15 23:44:16 +02:00
Willy Tarreau
2c1a9c3a43 OPTIM: vars: inline vars_prune() to avoid many calls
Many configs don't have variables and call it for no reason, and even
configs with variables don't necessarily have some in all scopes.
2024-09-15 23:42:09 +02:00
Willy Tarreau
aad6b771dd OPTIM: vars: remove the unneeded lock in vars_prune_*
vars_prune() and vars_prune_all() take the variable lock while purging
all variables from a head. However this is not needed:
  - proc scope variables are only purged during deinit, hence no lock
    is needed ;
  - all other scopes are attached to entities bound to a single thread
    so no lock is needed either.

Removing the lock saves about 0.5% CPU on variables-intensive setups,
but above all simplifies the code, so let's do it.
2024-09-15 23:05:50 +02:00
Willy Tarreau
51ade2f1db OPTIM: sample: don't check casts for samples of same type
Originally when converters were created, they were mostly for casting
types. Nowadays we have many arithmetic converters to perform operations
on integers, and a number of converters operating on strings. Both of
these categories most often do not need any cast since the input and
output types are the same, which is visible as the cast function is
c_none. However, profiling shows that when heavily using arithmetic
converters, it's possible to spend up to ~7% of the time in
sample_process_cnv(), a good part of which is only in accessing the
sample_casts[] array. Simply avoiding this lookup when input and output
types are equal saves about 2% CPU on such setups doing intensive use
of converters.
2024-09-15 12:43:56 +02:00
Willy Tarreau
b11495652e BUG/MEDIUM: queue: implement a flag to check for the dequeuing
As unveiled in GH issue #2711, commit 5541d4995d ("BUG/MEDIUM: queue:
deal with a rare TOCTOU in assign_server_and_queue()") does have some
side effects in that it can occasionally cause an endless loop.

As Christopher analysed it, the problem is that process_srv_queue(),
which uses a trylock in order to leave only one thread in charge of
the dequeueing process, can lose the lock race against pendconn_add().
If this happens on the last served request, then there's no more thread
to deal with the dequeuing, and assign_server_and_queue() will loop
forever on a condition that was initially expected to be extremely
rare (and still is, except that now it can become sticky). Previously
what was happening is that such queued requests would just time out
and since that was very rare, nobody would notice.

The root of the problem really is that trylock. It was added so that
only one thread dequeues at a time but it doesn't offer only that
guarantee since it also prevents a thread from dequeuing if another
one is in the process of queuing. We need a different criterion.

What we're doing now is to set a flag "dequeuing" in the server, which
indicates that one thread is currently in the process of dequeuing
requests. This one is atomically tested, and only if no thread is in
this process, then the thread grabs the queue's lock and dequeues.
This way it will be serialized with pendconn_add() and no request
addition will be missed.
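
A minimal sketch of this kind of guard using C11 atomics (generic
illustration only, not the actual haproxy code; names are hypothetical):

  #include <stdatomic.h>

  static atomic_flag dequeuing = ATOMIC_FLAG_INIT;

  /* Only one thread at a time runs the dequeue loop, but a thread that is
   * merely adding to the queue no longer prevents dequeuing from happening.
   */
  static void process_queue(void)
  {
          if (atomic_flag_test_and_set(&dequeuing))
                  return;          /* another thread is already dequeuing */

          /* ... take the queue's lock and dequeue pending requests ... */

          atomic_flag_clear(&dequeuing);
  }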

It is not certain whether the original race covered by the fix above
can still happen with this change, so better keep that fix for now.

Thanks to @Yenya (Jan Kasprzak) for the precise and complete report
allowing to spot the problem.

This patch should be backported wherever the patch above was backported.
2024-09-13 08:35:47 +02:00
Willy Tarreau
adaba6f904 BUG/MINOR: clock: validate that now_offset still applies to the current date
We want to make sure that now_offset is still valid for the current
date: another thread could very well have updated it by detecting a
backwards jump, and at the very same moment the time got fixed again,
that we retrieve and add to the new offset, which results in a larger
jump. Normally, for this to happen, it would mean that before_poll
was also affected by the jump and was detected before and bounded
within 2 seconds, resulting in max 2 seconds perturbations.

Here we try to detect this situation and fall back to re-adjusting the
offset instead.

It's more of a strengthening of what's done by commit e8b1ad4c2b
("BUG/MEDIUM: clock: also update the date offset on time jumps") than a
pure fix, in that the issue was not directly observed but it's visibly
possible by reading the code, so this should be backported along with
the patch above. This is related to issue GH #2704.

Note that this could be simplified in terms of operations by migrating
the deadlines to nanoseconds, but this was the path to least intrusive
changes.
2024-09-12 19:09:19 +02:00
Willy Tarreau
af48e4cc6b BUG/MINOR: clock: make time jump corrections a bit more accurate
Since commit e8b1ad4c2b ("BUG/MEDIUM: clock: also update the date offset
on time jumps") we try to update the now_offet based on the last known
valid date. But if it's off compared to the global_now_ns date shared
by other threads, we'll get the time off a little bit. When this happens,
we should consider the most recent of these dates so that if the global
date was already known to be more recent, we should use it and stick to
it. This will avoid setting too large an offset that could in turn provoke
a larger jump on another thread.

This is related to issue GH #2704.

This can be backported to other branches having the patch above.
2024-09-12 18:27:03 +02:00
Willy Tarreau
ad98edd00a BUG/MINOR: polling: fix time reporting when using busy polling
Since commit beb859abce ("MINOR: polling: add an option to support
busy polling") the time and status passed to clock_update_local_date()
were incorrect. Indeed, what is considered is the before_poll date
related to the configured timeout which does not correspond to what
is passed to the poller. That's not correct because before_poll+the
syscall's timeout will be crossed by the current date 100 ms after
the start of the poller. In practice it didn't happen when the poller
was limited to 1s timeout but at one minute it happens all the time.

That's particularly visible when running a multi-threaded setup with
busy polling and only half of the threads working (bind ... thread even).
In this case, the fixup code of clock_update_local_date() is executed
for each round of busy polling. The issue was made really visible
starting with recent commit e8b1ad4c2b ("BUG/MEDIUM: clock: also
update the date offset on time jumps") because upon a jump, the
shared offset is reset, while it should not be in this specific
case.

What needs to be done instead is to pass the configured timeout of
the poller (and not of the syscall), and always pass "interrupted"
set so as to claim we got an event (which is sort of true as it just
means the poller returned instantly). In this case we can still
detect backwards/forward jumps and will use a correct boundary
for the maximum date that covers the whole loop.

This can be backported to all versions since the issue was introduced
with busy-polling in 1.9-dev8.
2024-09-12 17:47:13 +02:00
Christopher Faulet
1900ca475f MEDIUM: h1: Accept invalid T-E values with accept-invalid-http-response option
Since 2.6, a parsing error is reported when the chunked encoding is
found twice. As stated in RFC9112, a sender must not apply the chunked
transfer coding more than once to a message body. It means only one chunked
coding must be found. In addition, empty values are also rejected because it
is forbidden by RFC9110.

However, in both cases, it may be useful to relax the rules for trusted
legacy servers when the accept-invalid-http-response option is set, especially
because it was accepted on 2.4 and older. In addition, the T-E header is now
sanitized before sending it. It is not a problem because it is a hop-by-hop
header.

Note that it remains invalid on client side because there is no good reason
to relax the parsing on this side. We can argue a server is trusted so we
can decide to support some legacy behavior. It is not true on client side
and it is highly suspicious if a client is sending an invalid T-E header.

Note also that we continue to reject unsupported T-E values (so all codings
except "chunked"). Because the "TE" header is sanitized and cannot contain any
value other than "Trailers", there is absolutely no reason for a server to use something
else.

This patch should fix the issue #2677. It could probably be backported as
far as 2.6 if necessary.
2024-09-12 09:21:57 +02:00
Willy Tarreau
2b95c77c08 DOC: server: document what to check for when adding new server keywords
It's too easy to overlook the dynamic servers when adding new server
keywords, and the fields on each keyword line are totally obscure. This
commit adds a title to each column of the table and explains what is
expected and what to check for when adding a keyword.
2024-09-10 18:50:12 +02:00
Damien Claisse
ce6a621ae3 MINOR: server: allow init-state for dynamic servers
Commit 50322df introduced the init-state keyword, but it didn't enable
it for dynamic servers. However, this feature is perfectly desirable
for virtual servers too, where someone would like a server re-enabled
through "set server be1/srv1 state ready" to be put out of maintenance
in down state until the next health check succeeds.
From reading the code, it seems that it's only a matter of allowing this
keyword for dynamic servers, as the current code path calls
srv_adm_set_ready() which incidentally triggers a call to
_srv_update_status_adm().
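
For reference, a hedged example (the "down" value is assumed from the
init-state documentation introduced by commit 50322df):

   backend be1
      # equivalent dynamic form, via the CLI:
      #   add server be1/srv1 192.0.2.10:80 check init-state down
      server srv1 192.0.2.10:80 check init-state down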
2024-09-10 18:18:38 +02:00
Willy Tarreau
9f8d9c9e8b BUG/MINOR: pattern: do not leave a leading comma on "set" error messages
Commit 4f2493f355 ("BUG/MINOR: pattern: pat_ref_set: fix UAF reported by
coverity") dropped the condition to concatenate error messages and as
such introduced a leading comma in front of all of them. Then commit
911f4d93d4 ("BUG/MINOR: pattern: pat_ref_set: return 0 if err was found")
changed the behavior to stop at the first error anyway, so all the
mechanics dedicated to the concatenation of error messages is no longer
needed and we can simply return the error as-is, without inserting any
comma.

This should be backported where the patches above are backported.
2024-09-10 08:55:29 +02:00
Christopher Faulet
a99d58819f BUG/MINOR: h1-htx: Don't flag response as bodyless when a tunnel is established
This reverts commit 225a4d02e1.

When a 200-OK response is replied to a CONNECT request or a
101-Switching-protocol, a tunnel is considered as established between the
client and the server. However, we must not declare the response as
bodyless. Of course, there is no payload, but tunneled data are expected.

Because of this bug, the zero-copy forwarding is disabled on the server
side.

This patch must be backported as far as 2.9.
2024-09-09 19:01:47 +02:00
Christopher Faulet
f6e193f1b0 BUG/MAJOR: mux-h1: Wake SC to perform 0-copy forwarding in CLOSING state
When the mux is woken up on I/O events, if the zero-copy forwarding is
enabled, receives are blocked. In this case, the SC is woken up to be able
to perform 0-copy forwarding to the other side. This works well, except for
the H1C in CLOSING state.

Indeed, in that case, in h1_process(), the SC is not woken up because only
RUNNING H1 connections are considered. As a consequence, the mux will ignore
the connection closure. The H1 connection remains blocked, waiting for the
shutdown timeout. If no timeout is configured, the H1 connection is never
closed, leading to a leak.

This patch should fix the leak reported by Damien Claisse in issue #2697. It
should be backported as far as 2.8.
2024-09-09 19:01:47 +02:00
William Lallemand
021ac6a108 MEDIUM: ssl/cli: "dump ssl cert" allow to dump a certificate in PEM format
The new "dump ssl cert" CLI command allows to dump a certificate stored
into HAProxy memory. Until now it was only possible to dump the
description of the certificate using "show ssl cert", but with this new
command you can dump the PEM content on the filesystem.

This command is only available on a admin stats socket.

$ echo "@1 dump ssl cert cert.pem" | socat /tmp/master.sock -
-----BEGIN PRIVATE KEY-----
[...]
-----END PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
[...]
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
[...]
-----END CERTIFICATE-----
2024-09-09 16:54:48 +02:00
Aurelien DARRAGON
68cfb222b5 BUG/MEDIUM: pattern: prevent UAF on reused pattern expr
Since c5959fd ("MEDIUM: pattern: merge same pattern"), UAF (leading to
crash) can be experienced if the same pattern file (and match method) is
used in two default sections and the first one is not referenced later in
the config. In this case, the first default section will be cleaned up.
However, due to an unhandled case in the above optimization, the original
expr which the second default section relies on is mistakenly freed.

This issue was discovered while trying to reproduce GH #2708. The issue
was particularly tricky to reproduce given the config and sequence
required to make the UAF happen. Fortunately, GitHub user @asmnek not only
provided useful information, but since he was able to consistently
trigger the crash in his environment, he was able to nail down the crash to
the use of a pattern file involved with 2 named default sections. Big thanks
to him.

To fix the issue, let's push the logic from c5959fd a bit further. Instead
of relying on "do_free" variable to know if the expression should be freed
or not (which proved to be insufficient in our case), let's switch to a
simple refcounting logic. This way, no matter who owns the expression, the
last one attempting to free it will be responsible for freeing it.
Refcount is implemented using a 32bit value which fills a previous 4 bytes
structure gap:

        int                        mflags;               /*    80     4 */

        /* XXX 4 bytes hole, try to pack */

        long unsigned int          lock;                 /*    88     8 */
(output from pahole)

Even though it was not reproduced in 2.6 or below by @asmnek (the bug was
revealed thanks to another bugfix), this issue theoretically affects all
stable versions (up to c5959fd), thus it should be backported to all
stable versions.
2024-09-09 16:07:05 +02:00
Aurelien DARRAGON
8157c1caf2 BUG/MEDIUM: pattern: prevent uninitialized reads in pat_match_{str,beg}
Using valgrind when running map_beg or map_str, the following error is
reported:

==242644== Conditional jump or move depends on uninitialised value(s)
==242644==    at 0x2E4AB1: pat_match_str (pattern.c:457)
==242644==    by 0x2E81ED: pattern_exec_match (pattern.c:2560)
==242644==    by 0x343176: sample_conv_map (map.c:211)
==242644==    by 0x27522F: sample_process_cnv (sample.c:1330)
==242644==    by 0x2752DB: sample_process (sample.c:1373)
==242644==    by 0x319917: action_store (vars.c:814)
==242644==    by 0x24D451: http_req_get_intercept_rule (http_ana.c:2697)

In fact, the error is legit, because in pat_match_{beg,str}, we
dereference the buffer at len+1 to check if a value was previously set,
and then decide to force a NULL byte if it wasn't set.

But the approach is no longer compatible with the current architecture:
data past str.data is not guaranteed to be initialized in the buffer.
Thus we cannot dereference the value, else we expose ourselves to
uninitialized read errors. Moreover, the check is useless, because we
systematically set the ending byte to 0 when the conditions are met.

Finally, restoring the older value after the lookup is not relevant:
indeed, either the sample is marked as const, in which case it
is already duplicated, or the sample is not const and we forcefully add
a terminating NULL byte outside of the actual string bytes (since we're
past str.data). Since we didn't alter the effective string data, and data
past str.data cannot be dereferenced anyway as it isn't guaranteed to be
initialized, there's no point in restoring previous uninitialized data.

It could be backported to all stable versions. But since this was only
detected by valgrind and isn't known to cause issues in existing
deployments, it's probably better to wait a bit before backporting it
to avoid any breakage, although the fix should be theoretically harmless.
2024-09-09 15:57:30 +02:00
Aurelien DARRAGON
3449525a02 BUG/MINOR: pattern: prevent const sample from being tampered in pat_match_beg()
This is a complementary patch to a68affeaa ("BUG/MINOR: pattern: a sample
marked as const could be written"). Indeed the same logic from
pat_match_str() is used there, but we lack the check to ensure that the
sample is not const before writing data to it.

It could be backported to all stable versions.
2024-09-09 15:57:23 +02:00
Willy Tarreau
ef8d8215de BUG/MEDIUM: clock: detect and cover jumps during execution
After commit e8b1ad4c2 ("BUG/MEDIUM: clock: also update the date offset
on time jumps"), @firexinghe mentioned that the issue was still present
in their case. In fact it depends on the load, which affects the
probability that the time changes between two poll() calls vs that it
changes during poll(). The time correction code used to only deal with
the latter. But under load if it changes between two poll() calls, what
happens then is that before_poll is off, and after returning from poll(),
the date is within bounds defined by before_poll, so no correction is
applied.

After many tests, it turns out that the most reliable solution without
using CLOCK_MONOTONIC is to prevent before_poll from being earlier than
the previous after_poll (trivial), and to cover forward jumps, we need
to enforce a margin. Given that the watchdog kills a looping task within
2 seconds and that no sane setup triggers it, it seems that 2 seconds
remains a safe enough margin. This means that in the worst case, some
forward jumps of up to 2 seconds will not be corrected, leading to an
apparent fast time and low rates. But this is supposed to be an exceptional
event anyway (typically an admin or crontab running ntpdate).

For future versions, given that we now opportunistically call
now_mono_time() before and after poll(), that returns zero if not
supported, we could imagine relying on this one for the thread's local
time when it's non-null.
2024-09-08 19:15:38 +02:00
Christopher Faulet
001fb1a548 BUG/MEDIUM: mux-h1/mux-h2: Reject upgrades with payload on H2 side only
Since 1d2d77b27 ("MEDIUM: mux-h1: Return a 501-not-implemented for upgrade
requests with a body"), it is no longer possible to perform a protocol
upgrade for requests with a payload. The main reason was to be able to
support protocol upgrade for an H1 client requesting an H2 server. In that case,
the upgrade request is converted to a CONNECT request. So, it is not
possible to convey a payload in that case.

But, it is a problem for anyone wanting to perform upgrades on an H1 server
using requests with a payload. It is uncommon but valid. So, now, it is the
H2 multiplexer's responsibility to reject upgrade requests, on the server
side, if there is a payload. An INTERNAL_ERROR is returned for the H2S in
that case. On the H1 side, the upgrade is now allowed, but only if the
server waits for the end of the request to return the 101-Switching-Protocols
response. Indeed, it is quite hard to synchronise the frontend side and the
backend side in that case. Asking servers to fully consume the request
payload before returning the response seems reasonable.

This patch should fix the issue #2684. It could be backported after a period
of observation, as far as 2.4 if possible. But only if it is not too
hard. It depends on "MINOR: mux-h1: Set EOI on SE during demux when both
side are in DONE state".
2024-09-06 09:16:18 +02:00
Christopher Faulet
ad1ef94612 MINOR: mux-h1: Set EOI on SE during demux when both side are in DONE state
For now, this case is already handled for all requests except for those
waiting for a tunnel establishment (CONNECT and protocol upgrades). It is
not an issue because only bodyless requests are supported in these cases. So
the request is always finished at the end of headers and therefore before
the response.

However, to relax conditions for full H1 protocol upgrades (H1 client and
server), this case will be necessary. Indeed, the idea is to be able to
perform protocol upgrades for requests with a payload. Today, the "Upgrade:"
header is removed before sending the request to the server. But to support
this case, this patch is required to properly finish the transaction when
the server does not perform the upgrade.
2024-09-06 09:00:13 +02:00
Aaron Kuehler
50322dff81 MEDIUM: server: add init-state
Allow the user to set the "initial state" of a server.

Context:

Servers are always set in an UP status by default. In
some cases, further checks are required to determine if the server is
ready to receive client traffic.

This introduces the "init-state {up|down}" configuration parameter to
the server.

- when set to 'fully-up', the server is considered immediately available
  and can turn to the DOWN sate when ALL health checks fail.
- when set to 'up' (the default), the server is considered immediately
  available and will initiate a health check that can turn it to the DOWN
  state immediately if it fails.
- when set to 'down', the server initially is considered unavailable and
  will initiate a health check that can turn it to the UP state immediately
  if it succeeds.
- when set to 'fully-down', the server is initially considered unavailable
  and can turn to the UP state when ALL health checks succeed.

The server's init-state is considered when the HAProxy instance
is (re)started, a new server is detected (for example via service
discovery / DNS resolution), a server exits maintenance, etc.

Link: https://github.com/haproxy/haproxy/issues/51
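
For illustration only, a hypothetical configuration sketch (backend name,
server address and check settings are placeholders, not taken from the patch):

    backend app
        server app1 192.0.2.10:8080 check init-state down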
2024-09-05 11:13:10 +02:00
Willy Tarreau
e8b1ad4c2b BUG/MEDIUM: clock: also update the date offset on time jumps
In GH issue #2704, @swimlessbird and @xanoxes reported problems handling
time jumps. Indeed, since 2.7 with commit 4eaf85f5d9 ("MINOR: clock: do
not update the global date too often") we refrain from updating the global
offset in case it didn't change. But there's a catch: in case of a large
time jump, if the poller was interrupted, the local time remains the same
and we return immediately from there without updating the offset. It then
becomes incorrect regarding the "date" value, and upon subsequent calls to
the poller, there's no way to detect a jump anymore so we apply the old,
incorrect offset and the date becomes wrong. Worse, going back to the
original time (then in the past), global_now_ns remains higher than the
local time and neither gets updated anymore.

What is missing in practice is to immediately update the offset when
detecting a time jump. In an ideal world, the offset would be updated
upon every call, that's what was being done prior to commit above but
it's extremely CPU intensive on large systems. However we can perfectly
afford to update the offset every time we detect a time jump, as it's
not as common.

This needs to be backported as far as 2.8. Thanks to both participants
above for providing very helpful details.
2024-09-04 16:55:43 +02:00
Ilya Shipitsin
1f6e5f7a61 CLEANUP: assorted typo fixes in the code and comments
This is the 43rd iteration of typo fixes.
2024-09-03 17:49:21 +02:00
Christopher Faulet
e1cae42879 BUG/MEDIUM: mux-pt: Fix condition to perform a shutdown for writes in mux_pt_shut()
A regression was introduced in the commit 76fa71f7a ("BUG/MEDIUM: mux-pt:
Never fully close the connection on shutdown") because of a typo on the
connection flags. CO_FL_SOCK_WR_SH flag must be tested to prevent a call to
conn_sock_shutw() and not CO_FL_SOCK_RD_SH.

Concretely, most of the time, it is harmless because the shutdown for writes
is always performed before any shutdown for reads, except in the case
described by the commit above. But it is not clear if it has an impact or not.

This patch must be backported with the commit above, so as far as 2.9.
2024-09-03 15:25:05 +02:00
Frederic Lecaille
7e19432fd4 BUG/MINOR: Crash on O-RTT RX packet after dropping Initial pktns
This bug arrived with this naive commit:

    BUG/MINOR: quic: Too shord datagram during O-RTT handshakes (aws-lc only)

which omitted to consider the case where the Initial packet number space
could be discarded before receiving 0-RTT packets.

To fix this, append/insert the O-RTT (early-data) packet number space
into the encryption level list depending on the presence or not of
the Initial packet number space.

This issue was revealed when using aws-lc as TLS stack in GH #2701 issue.
Thank you to @Tristan971 for having reported this issue.

Must be backported where the commit mentioned above is supposed to be
backported: as far as 2.9.
2024-09-03 15:23:06 +02:00
Willy Tarreau
f8bff3b531 BUG/MINOR: mux-spop: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf
That's the equivalent of the mux-h2 one, except that here there's no
real risk to loop since normally we cannot feed data that bypass the
closed state check (e.g. no zero-copy forward). But it still remains
dirty to be able to leave an empty mbuf with MFULL and MROOM set, so
better clear them as well.

No backport is needed since this is only in 3.1.
2024-09-03 14:39:04 +02:00
Willy Tarreau
830e50561c BUG/MAJOR: mux-h2: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf
There exists an extremely tricky code path that was revealed in 3.0 by
the glitches feature, though it might theoretically have existed before.

TL;DR: a mux mbuf may be full after successfully sending GOAWAY, and
discard its remaining contents without clearing H2_CF_MUX_MFULL and
H2_CF_DEM_MROOM, then endlessly loop in h2_send(), until the watchdog
takes care of it.

What can happen is the following: Some data are received, h2_io_cb() is
called. h2_recv() is called to receive the incoming data. Then
h2_process() is called and in turn calls h2_process_demux() to process
input data. At some point, a glitch limit is reached and h2c_error() is
called to close the connection. The input frame was incomplete, so some
data are left in the demux buffer. Then h2_send() is called, which in
turn calls h2_process_mux(), which manages to queue the GOAWAY frame,
turning the state to H2_CS_ERROR2. The frame is sent, and h2_process()
calls h2_send() a last time (doing nothing) and leaves. The streams
are all woken up to notify about the error.

Multiple backend streams were waiting to be scheduled and are woken up
in turn, before their parents being notified, and communicate with the
h2 mux in zero-copy-forward mode, request a buffer via h2_nego_ff(),
fill it, and commit it with h2_done_ff(). At some point the mux's output
buffer is full, and gets flags H2_CF_MUX_MFULL.

The io_cb is called again to process more incoming data. h2_send() isn't
called (polled) or does nothing (e.g. TCP socket buffers full). h2_recv()
may or may not do anything (doesn't matter). h2_process() is called since
some data remain in the demux buf. It goes till the end, where it finds
st0 == H2_CS_ERROR2 and clears the mbuf. We're now in a situation where
the mbuf is empty and MFULL is still present.

Then it calls h2_send(), which doesn't call h2_process_mux() due to
MFULL, doesn't enter the for() loop since all buffers are empty, then
keeps sent=0, which doesn't allow to clear the MFULL flag, and since
"done" was not reset, it loops forever there.

Note that the glitches make the issue more reproducible but theoretically
it could happen with any other GOAWAY (e.g. PROTOCOL_ERROR). What makes
it not happen with the data produced on the parsing side is that we
process a single buffer of input at once, and there's no way to amplify
this to 30 buffers of responses (RST_STREAM, GOAWAY, SETTINGS ACK,
WINDOW_UPDATE, PING ACK etc are all quite small), and since the mbuf is
cleared upon every exit from h2_process() once the error was sent, it is
not possible to accumulate response data across multiple calls. And the
regular h2_snd_buf() path checks for st0 >= H2_CS_ERROR so it will not
produce any data there either.

Probably that h2_nego_ff() should check for H2_CS_ERROR before accepting
to deliver a buffer, but this needs to be carefully studied. In the mean
time the real problem is that the MFULL flag was kept when clearing the
buffer, making the two inconsistent.

Since it doesn't seem possible to trigger this sequence without the
zero-copy-forward mechanism, this fix needs to be backported as far as
2.9, along with previous commit "MINOR: mux-h2: try to clear DEM_MROOM
and MUX_MFULL at more places" which will strengthen the consistency
between these checks.

Many thanks to Annika Wickert for her detailed report that allowed to
diagnose this problem. CVE-2024-45506 was assigned to this problem.
2024-09-03 14:39:04 +02:00
Willy Tarreau
e9cdedb39b MINOR: mux-h2: try to clear DEM_MROOM and MUX_MFULL at more places
The code leading to H2_CF_MUX_MFULL and H2_CF_DEM_MROOM being cleared
is quite complex and assumptions about its state are extremely difficult
when reading the code. There are indeed long sequences where the mux might
possibly be empty, still having the flag set until it reaches h2_send()
which will clear it after the last send. Even then it's not obvious whether
it's always guaranteed to release the flag when invoked in multiple passes.
Let's just simplify the condition so that h2_send() does not depend on
"sent" anymore and that h2_timeout_task() doesn't leave the flags set on
the buffer on emptiness. While it doesn't seem to fix anything, it will
make the code more robust against future changes.
2024-09-03 14:39:04 +02:00
Christopher Faulet
0d4271cdae BUG/MEDIUM: mux-h1: Properly handle empty message when an error is triggered
When a 400/408/500/501 error is returned by the H1 multiplexer, we first try
to get the error message of the proxy before using the default one. This may
be configured to be mapped on /dev/null or on an empty file. In that case,
no message is emitted, as expected. But everything is handled as if the error
was successfully sent.

However, there is a bug here. In the h1_send_error() function, this case is not
properly handled. The flag H1C_F_ABRTED is not set on the H1 connection as it
should be and the h1_close() function is not called, leaving the H1 connection
in an undefined state.

It is especially an issue when an "empty" 408-Request-Time-out error is emitted
while there are data blocked in the output buffer. In that case, the connection
remains open until the client closes and a "cR--"/408 is logged repeatedly, every
time the client timeout is reached.

This patch must be backported as far as 2.8.
2024-09-03 14:28:42 +02:00
Frederic Lecaille
15a737eb5f BUG/MINOR: quic: unexploited retransmission cases for Initial pktns.
qc_prep_hdshk_fast_retrans()'s job is to pick some packets to be retransmitted
from the Initial and Handshake packet number spaces. A packet may be coalesced
with a first one into the same datagram. When a coalesced packet is inspected
for retransmission, it is skipped if its length would make the total length of
the datagram it is attached to exceed the anti-amplification limit. But in this
case, the first packet must be kept for the current retransmission. This is
tracked by this trace statement:
    TRACE_PROTO("will probe Initial packet number space", QUIC_EV_CONN_SPPKTS, qc);
This was not the case because of the wrong "goto end" statement. The latter
must be run only if the Initial packet number space must not be probed with
the first packet found coalesced to another one which must be skipped.

This bug was revealed by AWS-LC interop runner with handshakeloss and
handshakecorruption which always fail because this stack leads the server
to send more Initial packets.

Thank you to Ilya (@chipitsine) for this issue report in GH #2663.

Must be backported as far as 2.6.
2024-09-03 11:47:51 +02:00
Christopher Faulet
d4781bd5e7 BUG/MEDIUM: cli: Always release back endpoint between two commands on the mcli
When several commands are chained on the master CLI, the same client
connection is used. Because it is a TCP connection, the mux PT is used. It
means there is no stream at the mux level. It is not possible to release the
applicative stream between each command as for HTTP. So, to work around
this limitation, between two commands, the master CLI is resetting the
stream. It does exactly what was performed for HTTP to manage keep-alive
connections on old HAProxy versions.

But this part was copied from code dealing with connections only, while the
back endpoint can be an applet or a mux for the master CLI. The previous fix
on the mux PT ("BUG/MEDIUM: mux-pt: Never fully close the connection on
shutdown") revealed a bug. Between two commands, the back endpoint was only
released if the connection's XPRT was closed. This works if the back
endpoint is an applet because there is no connection. But for commands sent
to a worker, a connection is used. At this stage, this only works if the
connection's XPRT is closed. Otherwise, the old endpoint is never detached
leading to undefined behavior on the next command execution (most probably a
crash).

Without the commit above, the connection's XPRT is always closed on
shutdown. It is no longer true. At this stage, we must unconditionally
release the back endpoint by resetting the corresponding sedesc to fix the
bug.

This patch must be backported with the commit above in all stable
versions. On 2.4 and lower, it will need to be adapted.
2024-09-02 18:31:35 +02:00
Christopher Faulet
76fa71f7a8 BUG/MEDIUM: mux-pt: Never fully close the connection on shutdown
When a shutdown is reported to the mux (shutdown for reads or writes), the
connection is immediately fully closed if the mux detects the connection is
closed in both directions. Only the passthrough multiplexer is able to
perform this action at this stage because there is no stream and no internal
data. Other muxes perform a full connection close during the mux's release
stage. It had been working quite well until recently. But, in theory, the bug is
quite old.

In fact, it seems possible for the lower layer to report an error on the
connection at the same time a shutdown is performed on the mux. Depending on how
events are scheduled, the following may happen:

 1. A connection error is detected at the fd layer and a wakeup is
    scheduled on the mux to handle the event.

 2. A shutdown for writes is performed on the mux. Here the mux decides to
    fully close the connection. If the xprt is not used to log info, it is
    released.

 3. The mux is finally woken up. It tries to retrieve data from the xprt
    because it is not aware there was an error. This leads to a crash
    because of a NULL-deref.

By reading the code, it is not obvious. But it seems possible with SSL
connection when the handshake is rearmed. It happens when a
SSL_ERROR_WANT_WRITE is reported on a SSL_read() attempt or a
SSL_ERROR_WANT_READ on a SSL_write() attempt.

This bug is only visible if the XPRT is not used to log info. So it is not so
common.

This patch should fix the 2nd crash reported in the issue #2656. It must
first be backported as far as 2.9 and then slowly to all stable versions.
2024-09-02 15:50:25 +02:00
Christopher Faulet
f9adcdf039 MEDIUM: bwlim: Use a read-lock on the sticky session to apply a shared limit
There is no reason to acquire a write-lock on the sticky session when a
shared limit is applied because only the frequency is updated. The sticky
session itself is not modified. We must just take care it is not removed in
the mean time. So a read-lock may be used instead.
2024-09-02 15:50:25 +02:00
Christopher Faulet
a7f6b0ac03 MEDIUM: stick-table: Add support of a factor for IN/OUT bytes rates
Add a factor parameter to stick-tables, called "brates-factor", that is
applied to in/out bytes rates to work around the 32-bit limit of the
frequency counters. Thanks to this factor, it is possible to have byte
rates beyond 4GB/s. Instead of counting each byte, we count blocks
of bytes. Among other things, it will be useful for the bwlim filter, to be
able to configure shared limits exceeding 4GB/s.
For now, this parameter must be in the range ]0-1024].
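
As an illustration only (table parameters are arbitrary examples and the exact
placement of the keyword on the stick-table line is an assumption):

    backend app
        stick-table type ip size 100k expire 30m store bytes_out_rate(1m) brates-factor 1024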
2024-09-02 15:50:25 +02:00
Frederic Lecaille
db13df3d6e BUG/MINOR: quic: Crash from trace dumping SSL eary data status (AWS-LC)
This bug follows this patch:
     MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
where a new third variable was added to be dumped from the QUIC_EV_CONN_IO_CB
trace event. The quic_trace() code did not reveal there was already another
variable passed as the third argument but not dumped. This led to a crash when
dereferencing a pointer to an int in place of a pointer to an SSL object.

This issue was reproduced only by handshakecorruption aws-lc interop test with
s2n-quic as client.

Note that this patch must be backported with this one:
     BUG/MEDIUM: quic: always validate sender address on 0-RTT
which depends on the commit mentioned above.
2024-09-02 10:01:41 +02:00
Aperence
20efb856e1 MEDIUM: protocol: add MPTCP per address support
Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension
that enables a TCP connection to use different paths.

Multipath TCP has been used for several use cases. On smartphones, MPTCP
enables seamless handovers between cellular and Wi-Fi networks while
preserving established connections. This use-case is what pushed Apple
to use MPTCP since 2013 in multiple applications [2]. On dual-stack
hosts, Multipath TCP enables the TCP connection to automatically use the
best performing path, either IPv4 or IPv6. If one path fails, MPTCP
automatically uses the other path.

To benefit from MPTCP, both the client and the server have to support
it. Multipath TCP is a backward-compatible TCP extension that is enabled
by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...).
Multipath TCP is included in the Linux kernel since version 5.6 [3]. To
use it on Linux, an application must explicitly enable it when creating
the socket. No need to change anything else in the application.

This attached patch adds MPTCP per address support, to be used with:

  mptcp{,4,6}@<address>[:port1[-port2]]

MPTCP v4 and v6 protocols have been added: they are mainly a copy of the
TCP ones, with small differences: names, proto, and receivers lists.

These protocols are stored in __protocol_by_family, as an alternative to
TCP, similar to what has been done with QUIC. By doing that, the size of
__protocol_by_family has not been increased, and it behaves like TCP.

MPTCP is both supported for the frontend and backend sides.

Also added an example of configuration using mptcp along with a backend
allowing to experiment with it.

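For illustration only (this is not the example shipped with the patch; names
and addresses are placeholders):

    frontend fe
        bind mptcp@:8443
        default_backend be

    backend be
        server srv1 mptcp4@192.0.2.10:8080
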
Note that this is a re-implementation of Björn's work from 3 years ago
[4], when haproxy's internals were probably less ready to deal with
this, causing his work to be left pending for a while.

Currently, the TCP_MAXSEG socket option doesn't seem to be supported
with MPTCP [5]. This results in a warning when trying to set the MSS of
sockets in proto_tcp:tcp_bind_listener.

This can be resolved by adding two new variables:
sock_inet(6)_mptcp_maxseg_default, which will hold the default
value of the TCP_MAXSEG option. Note that for the moment, this
will always be -1 as the option isn't supported. However, in the
future, when support for this option is added, it should
contain the correct value for the MSS, allowing the TCP_MAXSEG
option to be set correctly.

Link: https://www.rfc-editor.org/rfc/rfc8684.html [1]
Link: https://www.tessares.net/apples-mptcp-story-so-far/ [2]
Link: https://www.mptcp.dev [3]
Link: https://github.com/haproxy/haproxy/issues/1028 [4]
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/515 [5]

Co-authored-by: Dorian Craps <dorian.craps@student.vinci.be>
Co-authored-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
2024-08-30 18:53:49 +02:00
Aperence
2f171fe36a MEDIUM: sock: use protocol when creating socket
Use the protocol configured for a connection when creating the socket,
instead of always using 0.

This change is needed to allow new protocols, such as MPTCP, to be used
when creating the sockets. Note however that this patch won't change
anything for now, as the only other value that proto->sock_prot could
hold is IPPROTO_TCP, which has the same behavior as 0 when passed to
socket().
2024-08-30 18:53:49 +02:00
Aperence
38618822e1 MINOR: server: add a alt_proto field for server
Add a new field alt_proto to the server structure that
specifies if an alternate protocol should be used for this server.

This field can be transparently passed to protocol_lookup to get
an appropriate protocol structure.

This change thus allows creating servers with different protocols,
and not only TCP anymore.
2024-08-30 18:53:49 +02:00
Aperence
a7b04e383a MINOR: tools: extend str2sa_range to add an alt parameter
Add a new parameter "alt" that will store wether this configuration
use an alternate protocol.

This alt pointer will contain a value that can be transparently
passed to protocol_lookup to obtain an appropriate protocol structure.

This change is needed to allow for example the servers to know if it
need to use an alternate protocol or not.
2024-08-30 18:53:49 +02:00
Willy Tarreau
2bc513dd31 BUILD: quic: fix build errors on FreeBSD since recent GSO changes
The following commits broke the build on FreeBSD when QUIC is enabled:

  35470d518 ("MINOR: quic: activate UDP GSO for QUIC if supported")
  448d3d388 ("MINOR: quic: add GSO parameter on quic_sock send API")

Indeed, it turns out that netinet/udp.h requires sys/types.h to be
included before. Let's just change the includes order to fix the build.
No backport is needed.
2024-08-30 18:53:49 +02:00
Frederic Lecaille
f627b9272b BUG/MEDIUM: quic: always validate sender address on 0-RTT
Wedl Michael, a student at the University of Applied Sciences St. Poelten,
reported a potential vulnerability in haproxy as described below.

An attacker could have obtained a TLS session ticket after having established
a connection to an haproxy QUIC listener, using its real IP address. The
attacker does not even have to send an application level request (HTTP3). Then
the attacker could open a 0-RTT session with a spoofed IP address
trusted by the QUIC listener to bypass IP allow/block lists and send HTTP3
requests.

To mitigate this vulnerability, it was decided to use a token which can be
provided to the client each time it successfully manages to connect to haproxy.
These tokens may be reused for future connections to validate the address/path
of the remote peer, as this is done with the Retry token which is used for the
current connection, not the next one. Such tokens are transported by NEW_TOKEN
frames which were not used until now by haproxy.

So, each time a client connects to an haproxy QUIC listener with 0-RTT
enabled, it is provided with such a token which can be reused for the
next 0-RTT session. If no such token is presented by the client,
haproxy checks if the session is a 0-RTT one, i.e. with early-data presented
by the client. Contrary to the Retry token, the decision to refuse the
connection is made only when the TLS stack has been provided with
enough early-data from the Initial ClientHello TLS message and when
these data have been accepted. Hopefully, this event arrives fast enough
to allow haproxy to kill the connection if some early-data have been accepted
without a token presented by the client.

quic_build_post_handshake_frames() has been modified to build a NEW_TOKEN
frame with this newly implemented token to be transported inside.

quic_tls_derive_retry_token_secret() was renamed to quic_do_tls_derive_token_secret()
and modified to be reused and to derive the secret for the new token implementation.

quic_token_validate() has been implemented to validate both the Retry and
the new token implemented by this patch. When this is a non-retry token
which could not be validated, the datagram received is marked as requiring
a Retry packet to be sent, and no connection is created.

When the Initial packet does not embed any non-retry token and if 0-RTT is enabled
the connection is marked with this new flag: QUIC_FL_CONN_NO_TOKEN_RCVD. As soon
as the TLS stack detects that some early-data have been provided and accepted by
the client, the connection is marked to be killed (QUIC_FL_CONN_TO_KILL) from
ha_quic_add_handshake_data(). This is done by calling the new
qc_ssl_eary_data_accepted() function. The TLS handshake is interrupted as soon
as possible, returning 0 from ha_quic_add_handshake_data(). The connection is
also marked as requiring a Retry packet to be sent (QUIC_FL_CONN_SEND_RETRY)
from ha_quic_add_handshake_data(). Then the handshake I/O handler (quic_conn_io_cb())
knows how to behave: kill the connection after having sent a Retry packet.

About TLS stack compatibility, this patch is supported by aws-lc. It is
disabled for wolfssl which does not support 0-RTT at this time thanks
to HAVE_SSL_0RTT_QUIC.

This patch depends on these commits:

     MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
     MINOR: quic: Implement qc_ssl_eary_data_accepted().
     MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct)
     BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder
     MINOR: quic: Token for future connections implementation.
     MINOR: quic: Implement quic_tls_derive_token_secret().
     MINOR: tools: Implement ipaddrcpy().

Must be backported as far as 2.6.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
8854cef036 MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
Dump the early data status from QUIC_EV_CONN_IO_CB trace event.
This is very helpful to know if the QUIC server has accepted the
early data received from clients.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
e926378375 MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct)
Modify qf_new_token structure to use a static buffer with QUIC_TOKEN_LEN
as size as defined by the token for future connections (quic_token.c).
Modify consequently the NEW_TOKEN frame parser (see quic_parse_new_token_frame()).
Also add comments to denote that the NEW_TOKEN parser function is used only by
clients and that its builder is used only by servers.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
76c80605a6 BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder
quic_build_new_token_frame() is the function which is called to build
a NEW_TOKEN frame into a buffer. The position pointer for this buffer
was not updated, leading the NEW_TOKEN frame to be malformed.

Must be backported as far as 2.6.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
f5b09dc452 MINOR: quic: Token for future connections implementation.
There exist two sorts of tokens used by QUIC. They are both used to validate
the peer address (path validation). Retry tokens are used for the current
connection the client wants to open. This patch implements the other
sort of tokens which, after having been received from a connection, may
be provided for the next connection from the same IP address to validate
it (or validate the network path between the client and the server).

The token generation is implemented by quic_generate_token(), and
the token validation by quic_token_chek(). The same method
is used as for Retry tokens to build such tokens to be reused for
future connections. The format is very simple: one byte for the format
identifier to distinguish these new tokens from the Retry token, followed
by a 32-bit timestamp. As this part is ciphered with AEAD as the cryptographic
algorithm, 16 bytes are needed for the AEAD tag. 16 more random bytes
are added to this token and serve as a salt to derive the AEAD secret used
to cipher the token. In addition to this salt, the client IP address
is also used as AAD to derive the AEAD secret. So, the length of
the token is fixed: 37 bytes.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
74caa0eece MINOR: quic: Implement quic_tls_derive_token_secret().
This function is similar to quic_tls_derive_retry_token_secret().
Its aim is to derive the secret used to cipher the token to be used
for future connections.

This patch renames quic_tls_derive_retry_token_secret() to a more
generic one, quic_do_tls_derive_token_secret(), and reuses its code.
Two arguments are added to the latter to produce both quic_tls_derive_retry_token_secret()
and the new quic_tls_derive_token_secret() function, which call
quic_do_tls_derive_token_secret().
2024-08-30 17:04:09 +02:00
Frederic Lecaille
fb7a092203 MINOR: tools: Implement ipaddrcpy().
Implement the new ipaddrcpy() function to copy only the IP address from
a sockaddr_storage struct object into a buffer.
2024-08-30 17:04:09 +02:00
Nicolas CARPi
a33407b499 CLEANUP: mqtt: fix typo in MQTT_REMAINING_LENGHT_MAX_SIZE
There was a typo in the macro name, where LENGTH was incorrectly
written. This didn't cause any issue because the typo appeared in all
occurrences in the codebase.
2024-08-30 14:58:59 +02:00
Nicolas CARPi
534e7e4598 CLEANUP: haproxy: fix typos in code comment
Use "from" instead of "form" in ha_random_boot function code comments.
2024-08-30 14:58:59 +02:00
Christopher Faulet
e4812404c5 BUG/MEDIUM: stream: Prevent mux upgrades if client connection is no longer ready
If an early error occurred on the client connection, we must prevent any
multiplexer upgrades. Indeed, it is unexpected for a mux to be initialized
with no xprt. On a normal workflow it is impossible. So it is not an
issue. But if a mux upgrade is performed at the stream level, an early error
on the connection may have already been handled by the previous mux and the
connection may be already fully closed. If the mux upgrade is still
performed, a crash can be experienced.

It is possible to have a crash with an implicit TCP>HTTP upgrade if there is no
data in the input buffer. But it is also possible to get a crash with an
explicit "switch-mode http" rule.

It must be backported to all stable versions. In 2.2, the patch must be
applied directly in stream_set_backend() function.
2024-08-28 16:38:20 +02:00
Christopher Faulet
4ef5251c44 BUG/MEDIUM: mux-h2: Set ES flag when necessary on 0-copy data forwarding
When DATA frames are sent via the 0-copy data forwarding, we must take care
to set the ES flag on the last DATA frame. It should be performed in
h2_done_ff() when IOBUF_FL_EOI flag was set by the producer. This flag is
here to know when the producer has reached the end of input. When this
happens, the h2s state is also updated. It is switched to "half-closed
local" or "closed" state depending on its previous state.

It is mainly an issue on uploads because the server may be blocked waiting
for the end of the request. A workaround is to disable the 0-copy forwarding
support for the H2 mux by setting the "tune.h2.zero-copy-fwd-send" directive
to off in your global section.

This patch should fix the issue #2665. It must be backported as far as 2.9.
2024-08-28 10:05:34 +02:00
Christopher Faulet
0d142e0756 MINOR: proxy: Add support of 429-Too-Many-Requests in retry-on status
The "429" status can now be specified on retry-on directives. PR_RE_* flags
were updated to remain sorted.

This patch should fix the issue #2687. It is quite simple so it may safely
be backported to 3.0 if necessary.
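
A hypothetical configuration sketch (retry count, retry-on events and server
address are arbitrary examples):

    backend app
        retries 3
        retry-on conn-failure 429
        server app1 192.0.2.10:8080 check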
2024-08-28 10:05:34 +02:00
William Lallemand
d2fc1ab66e MEDIUM: ssl/sample: add ssl_fc_sigalgs_bin sample fetch
This new sample fetch allows extracting the binary list contained in the
signature_algorithms (13) TLS extension.

https://datatracker.ietf.org/doc/html/rfc8446#section-4.2.3
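
As an illustration only (header name, capture size and certificate path are
placeholders; the capture must be enabled globally as described in the related
commit):

    global
        tune.ssl.capture-cipherlist-size 800

    frontend fe
        bind :8443 ssl crt /etc/haproxy/site.pem
        http-request set-header X-SSL-SigAlgs %[ssl_fc_sigalgs_bin,hex]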
2024-08-26 15:17:40 +02:00
William Lallemand
e8fecef0ff MEDIUM: ssl: capture the signature_algorithms extension from Client Hello
Activate the capture of the TLS signature_algorithms extension from the
Client Hello. This list is stored in the ssl_capture buffer when the
global option "tune.ssl.capture-cipherlist-size" is enabled.
2024-08-26 15:17:40 +02:00
William Lallemand
ac5c7158f9 MEDIUM: ssl/sample: add ssl_fc_supported_versions_bin sample fetch
This new sample fetch allows extracting the binary list contained in the
supported_versions (43) TLS extension.

https://datatracker.ietf.org/doc/html/rfc8446#section-4.2.1
2024-08-26 15:17:40 +02:00
William Lallemand
ce7fb6628e MEDIUM: ssl: capture the supported_versions extension from Client Hello
Activate the capture of the TLS supported_versions extension from the
Client Hello. This list is stored in the ssl_capture buffer when the
global option "tune.ssl.capture-cipherlist-size" is enabled.
2024-08-26 15:12:42 +02:00
William Lallemand
3c0a0f1e1b CLEANUP: ssl: cleanup the clienthello capture
In order to add more extensions, clean up the clienthello capture
function a little bit.
2024-08-26 15:12:42 +02:00
Frederic Lecaille
414e3aa6bc BUILD: quic: 32bits build broken by wrong integer conversions for printf()
Since these commits, the 32-bit build is broken due to several errors as follows:

CC      src/quic_cli.o
src/quic_cli.c: In function ‘dump_quic_full’:
src/quic_cli.c:285:94: error: format ‘%ld’ expects argument of type ‘long int’,
        but argument 5 has type ‘uint64_t’ {aka ‘long long unsigned int’} [-Werror=format=]
  285 |                         chunk_appendf(&trash, "  [initl] rx.ackrng=%-6zu tx.inflight=%-6zu(%ld%%)\n",
      |                                                                                            ~~^
      |                                                                                              |
      |                                                                                              long int
      |                                                                                            %lld
  286 |                                       pktns->rx.arngs.sz, pktns->tx.in_flight,
  287 |                                       pktns->tx.in_flight * 100 / qc->path->cwnd);
      |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                                 |
      |                                                                 uint64_t {aka long long unsigned int}

Replace several %ld by %llu with ull as printf conversion in quic_cli.c and a
%ld by %lld with (long long) as printf conversion in quic_cc_cubic.c.

Thank you to Ilya (@chipitsine) for having reported this issue in GH #2689.

Must be backported to 3.0.
2024-08-26 11:21:48 +02:00
William Lallemand
7a03ab426f BUILD: tools: environ is not defined in OS X and BSD
Add 'extern char **environ' in order to build the new functions that
manipulate the environment.

Indeed the variable environ is not required to be declared by POSIX, so
it needs to be declared manually:

"In addition, the following variable, which must be declared by the user if it is to be used directly:

extern char **environ;"

https://pubs.opengroup.org/onlinepubs/9699919799/functions/environ.html
2024-08-23 19:39:57 +02:00
Valentine Krasnobaeva
28ca7fc594 BUG/MINOR: haproxy: free init_env in deinit only if allocated
This fixes 7b78e1571 (" MINOR: mworker: restore initial env before wait
mode").

In cases when haproxy starts without any configuration, for example
'haproxy -vv', the init_env array used to back up env variables is never
allocated. So, we need to check in deinit(), when we free its memory, that
init_env is not a NULL ptr.
2024-08-23 19:08:53 +02:00
Valentine Krasnobaeva
7b78e1571b MINOR: mworker: restore initial env before wait mode
This patch is the follow-up of 1811d2a6ba (MINOR: tools: add helpers to
backup/clean/restore env).

In order to avoid unexpected behaviour in master-worker mode during a process
reload with a new configuration, when the old one contained '*env' keywords,
let's back up the initial environment before calling parse_cfg() and let's clean
and restore it in the context of the master process, just before it enters its
wait polling loop.

This will guarantee that new workers will have a new updated environment and not
the previous one inherited from the master, which does not read the configuration
when it's in wait mode.
2024-08-23 17:06:59 +02:00
Valentine Krasnobaeva
1811d2a6ba MINOR: tools: add helpers to backup/clean/restore env
'setenv', 'presetenv', 'unsetenv', 'resetenv' keywords in the configuration could
modify the process runtime environment. In master-worker mode this
creates a problem, as the configuration is read only once before forking a
worker, and then the master process does the reexec without reading any config
files, just to free the memory. So, during the reload a new worker process will
be created, but it will inherit the previous unchanged environment from the
master in wait mode, thus it won't benefit from the changes in configuration
related to '*env' keywords. This may cause unexpected behavior or some parser
errors in master-worker mode.

So, let's add a helper to back up all process env variables just before the
process reads its configuration. And let's also add helpers to clean up the
current runtime environment and to restore it to its initial state (as it was
before parsing the config).
2024-08-23 17:06:33 +02:00
Amaury Denoyelle
960d68a5af MINOR: mux-quic: correct qcc_bufwnd_full() documentation
Fix the returned value comment of qcc_bufwnd_full() which was incorrect.
2024-08-23 16:25:04 +02:00
Amaury Denoyelle
ecfedc2570 MINOR: mux-quic: add buf_in_flight to QCC debug infos
Dump <buf_in_flight> QCC field both in QUIC MUX traces and "show quic".
This could help to detect if MUX does not allocate enough buffers
compared to quic_conn current congestion window.
2024-08-22 17:48:23 +02:00
Nathan Wehrman
5c07d58e08 MINOR: config: Created env variables for http and tcp clf formats
Since we already have variables for the other formats and the
change is trivial, I thought it would be a nice addition for
completeness.
2024-08-22 09:15:58 +02:00
Willy Tarreau
9911b53d75 CLEANUP: protocol: no longer initialize .receivers nor .nb_receivers
Protocol definitions no longer need to initialize these internal fields,
as they're now properly initialized during protocol registration.
2024-08-21 17:37:46 +02:00
Willy Tarreau
1cb3b0b745 MINOR: protocol: always initialize the receivers list on registration
Till now, protocols were required to self-initialize their receivers
list head, which is not very convenient, and is quite error prone.
Indeed, it's too easy to copy-paste a protocol definition and forget
to update the .receivers field to point to itself, resulting in mixed
lists. Let's just do that in protocol_register(). And while we're at
it, let's also zero the nb_receivers entry that works with it, so that
the protocol definition isn't required to pre-initialize stuff related
to internal book-keeping.
2024-08-21 17:37:46 +02:00
Willy Tarreau
034974106f MINOR: socket: don't ban all custom families from reuseport
The test on ss_family >= AF_MAX is too strict if we want to support new
custom families, let's apply this to the real_family instead so that we
check that the underlying socket supports reuseport.
2024-08-21 17:37:46 +02:00
Willy Tarreau
2a799b64b0 MINOR: protocol: add the real address family to the protocol
For custom families, there's sometimes an underlying real address and
it would be nice to be able to directly use the real family in calls
to bind() and connect() without having to add explicit checks for
exceptions everywhere.

Let's add a .real_family field to struct proto_fam for this. For now
it's always equal to the family except for non-transferable ones such
as rhttp where it's equal to the custom one (anything else could fit).
2024-08-21 17:37:46 +02:00
Willy Tarreau
d592ebdbeb MEDIUM: socket: always properly use the sock_domain for requested families
Now we make sure to always look up the protocol's domain for an address
family. Previously we would use it as-is, which prevented from properly
using custom addresses (which is when they differ).

This removes some hard-coded tests such as in log.c where UNIX vs UDP
was explicitly checked for example. It requires a bit of care, however,
so as to properly pass value 1 in the 3rd arg of the protocol_lookup()
for DGRAM stuff. Maybe one day we'll change these for defines or enums
to limit mistakes.
2024-08-21 17:36:58 +02:00
Willy Tarreau
ba4a416c66 MINOR: protocol: add a family lookup
At plenty of places we have access to an address family which may
include some custom addresses but we cannot simply convert them to
the real families without performing some random protocol lookups.

Let's simply add a proto_fam table like we have for the protocols.
The protocols could even be indexed there, but for now it's not worth
it.
2024-08-21 16:46:15 +02:00
Willy Tarreau
732913f848 MINOR: protocol: properly assign the sock_domain and sock_family
When we finally split sock_domain from sock_family in 2.3, something
was not cleanly finished. The family is what should be stored in the
address while the domain is what is supposed to be passed to socket().
But for the custom addresses, we did the opposite, just because the
protocol_lookup() function was acting on the domain, not the family
(both of which are equal for non-custom addresses).

This is an API bug but there's no point backporting it since it does
not have visible effects. It was visible in the code since a few places
were using PF_UNIX while others were comparing the domain against AF_MAX
instead of comparing the family.

This patch clarifies this in the comments on top of proto_fam, addresses
the indexing issue and properly reconfigures the two custom families.
2024-08-21 16:46:15 +02:00
Willy Tarreau
67bf1d6c9e MINOR: quic: support a tolerance for spurious losses
Tests performed between a 1 Gbps connected server and a 100 Mbps client,
distant by 95 ms, showed that:

  - we need 1.1 MB in flight to fill the link
  - rare but inevitable losses are sufficient to make cubic's window
    collapse fast and long to recover
  - a 100 MB object takes 69s to download
  - tolerance for 1 loss between two ACKs suffices to shrink the download
    time to 20-22s
  - 2 losses go to 17-20s
  - 4 losses reach 14-17s

At 100 concurrent connections that fill the server's link:
  - 0 loss tolerance shows 2-3% losses
  - 1 loss tolerance shows 3-5% losses
  - 2 loss tolerance shows 10-13% losses
  - 4 loss tolerance shows 23-29% losses

As such while there can be a significant gain sometimes in setting this
tolerance above zero, it can also significantly waste bandwidth by sending
far more than can be received. While it's probably not a solution to real
world problems, it repeatedly proved to be a very effective troubleshooting
tool helping to figure different root causes of low transfer speeds. In
spirit it is comparable to the no-cc congestion algorithm, i.e. it must
not be used except for experimentation.
2024-08-21 08:34:30 +02:00
Willy Tarreau
fab0e99aa1 MINOR: quic: store the lost packets counter in the quic_cc_event element
Upon loss detection, qc_release_lost_pkts() notifies congestion
controllers about the event and its final time. However it does not
pass the number of lost packets, that can provide useful hints for
some controllers. Let's just pass this option.
2024-08-21 08:02:44 +02:00
Valentine Krasnobaeva
2e6e159ac4 BUG/MINOR: cfgparse-global: remove tune.fast-forward from common_kw_list
Remove tune.fast-forward from common_kw_list. It was replaced by
'tune.disable-fast-forward' and it's no longer present in "if..else if.."
parser from cfg_parse_global(). Otherwise, it may be shown as the best-match
keyword for some tune options, which is now wrong.

Should be backported in versions 2.9 and 3.0.
2024-08-20 19:16:34 +02:00
Valentine Krasnobaeva
731ef865e3 MINOR: cfgparse-global: move unsupported keywords in global list
Following the previous commits and in order to clean up cfg_parse_global(),
let's move unsupported keywords into the global list and let's add a dedicated
parser for them.
2024-08-20 19:16:33 +02:00
Valentine Krasnobaeva
55309592db MINOR: cfgparse-global: move tune options in global keywords list
In order to clean up cfg_parse_global() and to add support for the new
MODE_DISCOVERY in configuration parsing, let's move the keywords related to
tune options into the global keywords list and let's add two dedicated
parsers for them. Tune option keywords are split between the two parsers
depending on the number of parameters a given tune option needs.

The tune options parser is called by the section parser and follows the common
API, i.e. it returns -1 on failure, 0 on success and 1 on recoverable error.
In case of a recoverable error we previously returned ERR_ALERT (0x10) and
emitted an alert message at startup. The section parser treats all rc > 0 as
ERR_WARN. So, if some tune option is set twice in the global section, the tune
options parser will return 1 (in order to respect the common API), the section
parser will treat this as ERR_WARN and a warning message will be emitted during
process startup instead of an alert, as it was before.
2024-08-20 19:16:32 +02:00
Valentine Krasnobaeva
c46497f16f MINOR: cfgparse-global: move 'expose-*' in global keywords list
Following the previous commit, let's also move 'expose-*' keywords into the global
cfg_kws list and let's add a dedicated parser for them. This will simplify the
configuration parsing in the new MODE_DISCOVERY, which allows reading only the
keywords needed at the early start of the haproxy process (i.e. modes, pidfile,
chosen poller).
2024-08-20 19:16:31 +02:00
Valentine Krasnobaeva
450ce3e61b MINOR: cfgparse-global: move 'pidfile' in global keywords list
This commit cleans up cfg_parse_global() and prepares the config parser to
support MODE_DISCOVERY. This step is needed in the early starting stage, just to
figure out in which mode the process was started, to set some necessary
parameters needed for this mode and to continue the initialization
stage.

'pidfile' is part of such common keywords, which need to be parsed
very early and which are used in almost all process modes (except the
foreground, '-d').

The 'pidfile' keyword parser is called by the section parser and follows the
common API, i.e. it returns -1 on failure, 0 on success and 1 on recoverable
error. In case of a recoverable error we previously returned ERR_ALERT (0x10)
and emitted an alert message at startup. The section parser treats all rc > 0
as ERR_WARN. So, if the pidfile was already specified via the command line, the
keyword parser will return 1 (in order to respect the common API), the section
parser will treat this as ERR_WARN and a warning message will be emitted during
process startup instead of an alert, as it was before.
2024-08-20 19:16:30 +02:00
Valentine Krasnobaeva
f29be97ac7 BUG/MINOR: cfgparse-global: remove redundant goto
In the case when the given keyword is found in the global 'cfg_kws' list, we
go to the 'out' label anyway, after testing the rc returned by the keyword's
parser. So there is not much gain in performing the 'goto out' jump specifically
when rc > 0.
2024-08-20 19:16:29 +02:00
Valentine Krasnobaeva
74bc6f3d66 BUG/MINOR: cfgparse-global: clean common_kw_list
This patch fixes commits 118ac11ce
("MINOR: cfgparse-global: move mode's keywords in cfg_kw_list") and 83ff4db18
(MINOR: cfgparse-global: move no<poller_name> in cfg_kw_list).

'common_kw_list' serves to show the best-match keyword in cfg_parse_global(), if
the given keyword was not parsed in "if..else if.." cases. cfg_parse_global()
is still used as a parser for some keywords from the global section.

Mode-specific and no<poller_name> keywords now have their own parsers. They no
longer take place in the "if..else if.." from cfg_parse_global() and they are
registered in the 'cfg_kws' list. So, there is no longer need to duplicate
them in the 'common_kw_list'. Otherwise, they will be shown twice in parser
error message.
2024-08-20 19:16:28 +02:00
Valentine Krasnobaeva
4291d10b44 BUG/MINOR: cfgparse-global: fix err msg in mworker keyword parser
This patch fixes the commit 118ac11ce
("cfgparse-global: move mode's keywords in cfg_kw_list"). Error message
delivered by keyword parser in **err is always shown with ha_alert() by the
caller cfg_parse_global(). The caller always supplies these alerts with the
filename and the line number.
2024-08-20 19:16:27 +02:00
Amaury Denoyelle
0d6112b40b MINOR: mux-quic: retry after small buf alloc failure
The previous commit switched to small buffers for HTTP/3 HEADERS emission.
This ensures that several parallel streams can allocate their own buffer
without hitting the connection buffer limit based now on the congestion
window size.

However, this prevents the transmission of responses with uncommonly
large headers. Indeed, if all headers cannot be encoded in a single
buffer, an error is reported which causes the whole connection to be closed.

Adjust this by implementing a realloc API exposed by QUIC MUX. This
allows the application layer to switch from a small to a default buffer and
restart its processing. This guarantees again that headers no longer
than bufsize can be properly transferred.
2024-08-20 18:12:27 +02:00
Amaury Denoyelle
b355e89bf9 MEDIUM: h3: allocate small buffers for headers frames
A major change was recently implemented to change QUIC MUX Tx buffer
allocation limit, which is now based on the current connection
congestion window size. As this size may be smaller than the previous
static value, it is likely that the limit will be reached more
frequently.

When using HTTP/3, the majority of request streams are used for small
object exchanges. Every response starts with a HEADERS frame which
should be much smaller in size than the default buffer. But as the whole
buffer size is accounted against the congestion window, a single stream
can block others even if only emitting a single HEADERS frame, which is
suboptimal for bandwidth usage, if the congestion window is small enough.

To adapt to this new situation, rely on the newly available small
buffers to transfer the HEADERS frame of the response. This at least
guarantees that several parallel streams can allocate their own buffer for
the first part of the response, even with a small congestion window.

The situation could be further improved by using various indications of the
data size and selecting a small buffer if sufficient. This could be done
for example via the Content-Length value or the HTX extra field. However
this must be the subject of a dedicated patch.
2024-08-20 18:12:27 +02:00
Amaury Denoyelle
885e4c5cf8 MINOR: quic: support sbuf allocation in quic_stream
This patch extends the qc_stream_desc API to be able to allocate small
buffers. The QUIC MUX API is similarly updated as ultimately each application
protocol is responsible for choosing between a default or a smaller buffer.

Internally, the type of allocated buffer is remembered via qc_stream_buf
instance. This is mandatory to ensure that the buffer is released in the
correct pool, in particular as small and standard buffers can be
configured with the same size.

This commit is purely an API change. For the moment, small buffers are
not used. This will changed in a dedicated patch.
2024-08-20 18:12:27 +02:00
Amaury Denoyelle
d0d8e57d47 MINOR: quic: define sbuf pool
Define a new buffer pool reserved for allocating smaller memory areas. For
the moment, its usage is restricted to QUIC; as such it is declared
in the quic_stream module.

Add a new config option "tune.bufsize.small" to specify the size of the
allocated objects. A special check ensures that it is not greater than
the default bufsize, to avoid unexpected effects.
2024-08-20 18:12:27 +02:00
Amaury Denoyelle
1de5f718cf MINOR: quic/config: adapt settings to new conn buffer limit
The QUIC MUX buffer allocation limit is now directly based on the underlying
congestion window size. The previous static limit based on conn-tx-buffers
is now unused. As such, this commit adds a warning to inform users
that this setting is now obsolete.

Secondly, update the max-window-size setting. It is now the main entrypoint
to limit both the maximum congestion window size and the number of QUIC
MUX buffers allocated for emission. Remove its special value '0' which was
used to automatically adjust it on the now unused conn-tx-buffers.
2024-08-20 17:59:35 +02:00
Amaury Denoyelle
aeb8c1ddc3 MAJOR: mux-quic: allocate Tx buffers based on congestion window
Each QUIC MUX may allocate buffers for MUX stream emission. These
buffers are then shared with quic_conn to handle ACK reception and
retransmission. A limit on the number of concurrent buffers used per
connection has been defined statically and can be updated via a
configuration option. This commit replaces the limit to instead use the
current underlying congestion window size.

The purpose of this change is to remove the artificial static buffer
count limit, which may be difficult to choose. Indeed, if a connection
performs with a minimal loss rate, the buffer count would severely limit
its throughput. It could be increased to fix this, but it would also
impact other connections, even those with less optimal performance,
causing too much extra data buffering on the MUX layer. By using the
dynamic congestion window size, haproxy ensures that MUX buffering
corresponds roughly to the network conditions.

Using the QCC <buf_in_flight> counter, a new buffer can be allocated only
if its value is less than the current window size. If not, QCS emission is
interrupted and the haproxy stream layer will subscribe until a new buffer
is ready.
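
In simplified terms (toy structure and field names, not the real QCC), the
gating described above boils down to the following check:

  /* toy sketch: a stream may allocate a new Tx buffer only while the
   * bytes already held in allocated Tx buffers stay below the window */
  struct toy_qcc {
          unsigned long long buf_in_flight; /* bytes in allocated Tx buffers */
          unsigned long long cwnd;          /* current congestion window size */
  };

  static int can_alloc_tx_buf(const struct toy_qcc *qcc)
  {
          /* otherwise emission is interrupted and the stream layer
           * subscribes until a buffer is released or the window grows */
          return qcc->buf_in_flight < qcc->cwnd;
  }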

One of the critical parts is to ensure that the MUX layer previously
blocked on buffer allocation is properly woken up when sending can be
retried. This occurs on two occasions :

* after an already used Tx buffer is cleared on ACK reception. This case
  is already handled by qcc_notify_buf() via quic_stream layer.

* on congestion window increase. A new qcc_notify_buf() invocation is
  added into qc_notify_send().

Finally, remove <avail_bufs> QCC field which is now unused.

This commit is labelled MAJOR as it may have unexpected effects and could
cause a significant behavior change. For example, in the previous
implementation QUIC MUX would be able to buffer more data even if the
congestion window is small. With this patch, data cannot be transferred
from the stream layer, which may cause more streams to be shut down on
client timeout. Another effect may be more CPU consumption as the
connection limit would be hit more often, causing more streams to be
interrupted and woken up in cycles.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
000976af58 MINOR: mux-quic: define buf_in_flight
Define a new QCC counter named <buf_in_flight>. Its purpose is to
account for the current sum of all allocated stream buffer sizes used for
emission.

For the moment, this counter is only updated on buffer allocation and
deallocation. It will be used to replace <avail_bufs> once the congestion
window is used as the limit for buffer allocation in a future commit.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
f9777bea30 MINOR: h3: mark control stream as metadata
Work is currently in progress to change the QUIC MUX buffer allocation
limit from a configurable static value to the size of the congestion
window instead. This change may cause the buffer allocation limit to be
triggered more frequently.

To ensure HTTP/3 control emission is not perturbed by this change, mark
the stream with qcs_send_metadata(). This ensures that buffer allocation
for this stream won't be subject to the connection limit. This is
necessary to guarantee that SETTINGS and GOAWAY frames are emitted.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
4c4bf26f44 MEDIUM: mux-quic: implement API to ignore txbuf limit for some streams
Define a new qc_stream_desc flag QC_SD_FL_OOB_BUF. This is to mark
streams which are not subject to the connection limit on allocated MUX
stream buffers.

The purpose is to simplify the handling of QUIC MUX streams which do not
transfer data and as such are not driven by the haproxy layer, for example
the HTTP/3 control stream. These streams interact synchronously with the
QUIC MUX and cannot retry emission in case of a temporary failure.

This commit will be useful once the connection buffer allocation limit is
reimplemented to directly rely on the congestion window size. This will
probably cause the buffer limit to be reached more frequently, maybe
even on QUIC MUX initialization. As such, it will be possible to mark
control streams and prevent them from being subject to the buffer limit.

The QUIC MUX exposes a new function qcs_send_metadata(). It can be used by
an application protocol to specify which streams are used for control
exchanges. For the moment, no stream uses this mechanism.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
f4d1bd0b76 MINOR: mux-quic: account stream txbuf in QCC
A limit per connection is put on the number of buffers allocated by the
QUIC MUX for emission across all its streams. This ensures memory
consumption remains under control. This limit is simply expressed as a
count of buffers which can be concurrently allocated for each
connection.

As such, the quic_conn structure was used to account for currently
allocated buffers. However, a quic_conn never allocates new stream
buffers. This is only done at the QUIC MUX layer. As such, this commit
moves buffer accounting inside the QCC structure. This simplifies the
API, most notably qc_stream_buf_alloc() usage.

Note that this commit inverts the accounting. Previously, it was
initially set to 0 and incremented for each allocated buffer. Now, it is
set to the maximum value and decremented for each buf usage. This is
considered clearer to use.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
635fbaaa4a MINOR: quic: allocate stream txbuf via qc_stream_desc API
This commit simply adjusts QUIC stream buffer allocation. This operation
is conducted by the QUIC MUX using the qc_stream_desc layer. Previously,
qc_stream_buf_alloc() would return a qc_stream_buf instance and the QUIC
MUX would finalize the buffer area allocation. Change this to perform the
buffer allocation directly in qc_stream_buf_alloc().

This patch clarifies the interaction between the QUIC MUX and
qc_stream_desc. It is cleaner to allocate the buffer via qc_stream_desc
as it is already responsible for freeing the buffer.

It also ensures that connection buffer accounting is only done after the
whole qc_stream_buf and its buffer are allocated. Previously, the
increment operation was performed between the two steps. This was not an
issue, as this kind of error triggers the whole connection closure.
However, if in the future this is handled as a stream closure instead,
this commit ensures that the buffer remains valid in all cases.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
c24c8667b2 MINOR: quic: define max-window-size config setting
Define a new global keyword tune.quic.frontend.max-window-size. This
allows setting globally the maximum congestion window size for each QUIC
frontend connection.

The default value is 0. It is a special value which automatically derives
the size from the configured QUIC connection buffer limit. This is
similar to the previous "quic-cc-algo" behavior, which can be used to
override the maximum window size per bind line.
2024-08-20 17:02:29 +02:00
Amaury Denoyelle
280b61468a MINOR: quic: extract config window-size parsing
quic-cc-algo is a bind line keyword which allows selecting a QUIC
congestion algorithm. It can take an optional integer to specify the
maximum window size. This value is an integer and supports the suffixes
'k', 'm' and 'g' to specify kilobytes, megabytes and gigabytes
respectively.

Extract the maximum window size parsing into a dedicated function named
parse_window_size(). It accepts as input an integer value with an
optional suffix, 'k', 'm' or 'g'. The first invalid character is
returned by the function to the caller.
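
A minimal self-contained sketch of this parsing technique (illustrative
only, not the actual parse_window_size() code; binary units assumed):

  #include <stdlib.h>

  /* parse an integer with an optional 'k', 'm' or 'g' suffix and report
   * the first invalid character back to the caller via <end> */
  static unsigned long long parse_size_suffix(const char *str, const char **end)
  {
          char *stop;
          unsigned long long value = strtoull(str, &stop, 10);

          switch (*stop) {
          case 'g': value <<= 10; /* fall through */
          case 'm': value <<= 10; /* fall through */
          case 'k': value <<= 10; stop++; break;
          default: break;
          }

          *end = stop; /* first character not consumed by the parser */
          return value;
  }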

No functional change. This commit will allow quickly implementing a new
keyword to configure a default congestion window size in the global
section.
2024-08-20 16:07:22 +02:00
Nicolas CARPi
bba679026c BUG/MINOR: stats: add lang attribute to html tag
The "html" element of the stats page was missing a "lang" attribute.
This change specifies the "en" value, which corresponds to english
language.

It is also a required element for WCAG Success Criterion 3.1.1, which
renders the web more accessible through a set of requirements. In this
case it allows assistive technologies such as screen readers to
determine the language of the page.

MDN page: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/lang
HTML standard: https://html.spec.whatwg.org/multipage/dom.html#attr-lang
WCAG criterion: https://www.w3.org/WAI/WCAG22/Understanding/language-of-page.html
2024-08-20 15:55:45 +02:00
Nicolas CARPi
9318a624a1 CLEANUP: stats: use modern DOCTYPE tag
Switching the stats page doctype to the modern standard makes it shorter and
less complex, and it is the doctype recommended by the current HTML standard.
It makes it clear that we do not want to run in quirks mode. More information below.

Quirks mode: https://developer.mozilla.org/en-US/docs/Web/HTML/Quirks_Mode_and_Standards_Mode
HTML Standard: https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
2024-08-20 15:55:31 +02:00
Nicolas CARPi
c63d558e41 BUG/MINOR: stats: fix color of input elements in dark mode
Previously the text color was dark on a dark background; this change makes it
white, and thus readable. This is visible on the "Scope" input field.
2024-08-20 15:55:14 +02:00
Valentine Krasnobaeva
8b1dfa9def MINOR: cfgparse: limit file size loaded via /dev/stdin
load_cfg_in_mem() can continuously reallocate memory in order to load an
extremely large input from /dev/stdin, until it fails with ENOMEM, which means
that the process has consumed all available RAM. In containers and
virtualized environments this is particularly harmful.

So, in order to prevent this, let's introduce MAX_CFG_SIZE, set to 10MB, which
limits the size of input supplied via /dev/stdin.
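
A minimal sketch of such a bounded read (illustrative only, not the
actual load_cfg_in_mem() code; the 10MB cap mirrors the value above):

  #include <stdio.h>
  #include <stdlib.h>

  #define MAX_CFG_SIZE (10 * 1024 * 1024)

  /* read at most MAX_CFG_SIZE bytes from a stream of unknown size */
  static char *read_bounded(FILE *f, size_t *len)
  {
          size_t cap = 65536, used = 0;
          char *buf = malloc(cap);

          while (buf && !feof(f) && !ferror(f)) {
                  if (used == cap) {
                          char *tmp;

                          if (cap >= MAX_CFG_SIZE) { /* input too large */
                                  free(buf);
                                  return NULL;
                          }
                          cap = cap * 2 < MAX_CFG_SIZE ? cap * 2 : MAX_CFG_SIZE;
                          tmp = realloc(buf, cap);
                          if (!tmp) {
                                  free(buf);
                                  return NULL;
                          }
                          buf = tmp;
                  }
                  used += fread(buf + used, 1, cap - used, f);
          }
          *len = used;
          return buf;
  }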
2024-08-20 14:28:34 +02:00
Nathan Wehrman
fd48b28315 MINOR: Implements new log format of option tcplog clf
Some systems require logs in the CLF format, which meant that I could not
send my logs for proxies in mode tcp to those servers. This implements a
format that uses log variables that are compatible with TCP mode frontends
and replaces the traditional HTTP values in the CLF format to make them
stand out. Instead of logging the method and URI like
"GET /example HTTP/1.1", it will log "TCP ", and for the response code I
used "000" so it would be easy to separate from legitimate HTTP traffic.
Now log servers that require a CLF format can see the timings for TCP
traffic as well as HTTP.
2024-08-20 07:46:34 +02:00
Aurelien DARRAGON
f8299bc5ea MINOR: log: "drop" support for log-profile steps
It is now possible to use "drop" keyword for "on" lines under a
log-profile section to specify that no log at all should be emitted for
the specified step (setting an empty format was not sufficient to do so
because only the log payload would be empty, not the log header, thus the
log would still be emitted).

It may be useful to selectively disable logging at specific steps for a
given log target (since the log profile may be set on log directives):

log-profile myprof
  on request format "blabla" sd "custom sd"
  on response drop

New testcase was added to reg-tests/log/log_profiles.vtc
2024-08-19 18:53:01 +02:00
Aurelien DARRAGON
41ca89bc6f MEDIUM: log: relax some checks and emit diag warnings instead in lf_expr_postcheck()
With 7a21c3a ("MAJOR: log: implement proper postparsing for logformat
expressions"), which finally made postparsing checks reliable, we started
to get reports from users that couldn't start haproxy 3.0 with configs that
used to work in the past. The current situation is described in GH #2642.

While the checks are mostly relevant, it turns out they are not strictly
needed anymore from a technical point of view. Most of them were useful in
the early logformat implementation to prevent runtime bugs due to the use of
an alias or fetch at runtime from an incompatible proxy. For a few
versions already, the code handling fetches and log aliases has been robust
enough to support fetches/aliases used from the wrong context: all it
does is that the fetch/alias will silently fail if it's not available.

This can be proved by the fact that even if the postparsing checks were
partially broken in the past, it didn't cause runtime issues (at least
on recent haproxy versions).

Most of these checks can now be seen as configuration hints: when a check
triggers, it will indicate a configuration inconsistency in most cases,
but there are some corner cases where it is not possible to know at config
time if the conditions will be met for the alias/fetch to work properly.
So instead of failing with a hard error like we did so far, let's just be
more permissive and report our findings using "diag_warning": such
warnings are only emitted when haproxy is started with the '-dD' cli option.

We also took this opportunity to improve messages clarity and make them
more precise (report the offending item instead of complaining about the
whole expression because of a single element).

With this patch, configs that used to start before 7a21c3a shouldn't
trigger hard errors anymore.

This may be backported in 3.0.
2024-08-16 14:25:10 +02:00
Valentine Krasnobaeva
911f4d93d4 BUG/MINOR: pattern: pat_ref_set: return 0 if err was found
pat_ref_set_elt() returns 0 if we run out of memory or can't parse a new
map value. Any error message emitted by pat_ref_set_elt() is saved in the err
buffer, if it is provided by the caller. These error messages are accumulated
during the loop.

pat_ref_set() is used to update the values in a map that refer to the same
given key. If pat_ref_set_elt() fails during the update, let's return 0 to
the caller immediately. We have the same non-unique key and the same new
value in each loop iteration. So it seems quite odd to accumulate the same
error messages and print them in the CLI:

        > add map @1 mytest.map <<
        + 1.0.1.11 TestA
        + 1.0.1.11 TESTA
        + 1.0.1.11 test_a
        +

        > set map mytest.map 1.0.1.11 15
         unable to parse '15' unable to parse '15' unable to parse '15'.

cli_parse_set_map(), which calls pat_ref_set() to update map, will return only
one error message with this patch:

> set map mytest.map 1.0.1.11 15
 unable to parse '15'.

hlua_set_map() and http_action_set_map() don't provide an error buffer and
will just exit on the first error.

This should be backported in all stable versions.
2024-08-13 16:13:43 +02:00
Valentine Krasnobaeva
4f2493f355 BUG/MINOR: pattern: pat_ref_set: fix UAF reported by coverity
memprintf() performs a realloc and then updates the pointer to the output
buffer where it has written the data. So free() is called on the previous
buffer address, if one was provided.

pat_ref_set_elt() uses memprintf() to write its error message, as does
pat_ref_set(). So, when we re-enter the while loop for the second time and
pat_ref_set_elt() has returned, the *err ptr (the previous value of *merr) has
already been freed by memprintf() from pat_ref_set_elt().

The 'if (!found)' condition is false at this point, because we've found a
node during the first loop iteration. So, the second memprintf(), in order to
write its error message, calls free(*err) again.

This should be backported in all stable versions.
2024-08-13 16:13:41 +02:00
Willy Tarreau
0982bfd999 BUG/MINOR: tools: make fgets_from_mem() stop at the end of the input
The memchr() used to look for the LF character must consider the end of
input, not just the output buffer size.

This was found by oss-fuzz:
   https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=71096

No backport is needed.
2024-08-11 14:44:28 +02:00
William Lallemand
75944e266e CLEANUP: mworker/cli: clean up the mode handling
Clean up the mode handling by refactoring the string constants
that are written multiple times.
2024-08-09 17:47:20 +02:00
Amaury Denoyelle
48514c118c BUG/MINOR: h3: properly reject too long header responses
When encoding HTX to HTTP/3 headers on the response path, a bunch of
ABORT_NOW() were used when there was not enough buffer room. In most cases
this is safe as the output buffer has just been allocated and so is empty at
the start of the function. However, with a header list longer than a
whole buffer, this would cause an unexpected crash.

Fix this by replacing the ABORT_NOW() statements with a proper error return
path. For the moment, this causes the whole connection to be closed
rather than only the stream. This may be further improved in the future.

Also remove ABORT_NOW() when encoding the frame length at the end of headers
or trailers encoding. Buffer room is sufficient there as it was already
checked earlier in the same function.

This should be backported up to 2.6. Special care should be taken
however as this code path has changed frequently :
* for 2.9 and older, the extra following statement must be inserted
  prior each newly added goto statement :
  h3c->err = H3_INTERNAL_ERROR;
* for 2.6, trailers support is not implemented. As such, related chunks
  should just be ignored when backporting.
2024-08-09 17:41:16 +02:00
Amaury Denoyelle
8939d8e473 MINOR: mux-quic: do not trace error in qcc_send_frames() on empty list
qcc_send_frames() can be called with an empty list and returns
immediately with an error code. This is convenient so that it can be called
in a while loop.

Remove the "error" trace emitted in this case and replace it
with a less alarming "leaving on..." message. This should help debugging
when traces are active.
2024-08-09 17:41:16 +02:00
Valentine Krasnobaeva
9fc69ebc0a MINOR: proto_uxst: copy errno in errmsg for syscalls
Let's copy errno into the error messages which we emit when listen() or
connect() fail. This is helpful for debugging.
2024-08-09 17:38:42 +02:00
Valentine Krasnobaeva
16e89f6b5c BUG/MINOR: cfgparse: parse_cfg: fix null ptr dereference reported by coverity
This commit fixes potential null ptr dereferences reported by coverity; see
more details about it in issues #2676 and #2668.

The 'outline' ptr, which is explicitly initialized to NULL as a temporary
buffer to store split keywords, may in theory be implicitly dereferenced in
some corner cases (which we haven't encountered yet with real-world
configurations) in 'if (!**args)'. The parse_line() code, called before
under some conditions, assigns args[arg] = outline + outpos, and outpos's
initial value is 0.
2024-08-09 15:43:29 +02:00
Valentine Krasnobaeva
eb82358690 BUG/MINOR: proto_uxst: delete fd from fdtab if listen() fails
This patch is done mostly as a safeguard in order not to trigger the
BUG_ON(fdtab[fd].owner != NULL) check if listen() fails on a UNIX domain
socket.

In uxst_bind_listener(), pretty much the same logic of closing the socket on
the error path was kept as in tcp_bind_listener() before. The use of
fd_delete() was not generalized when support for the UNIX sock_stream
protocol was implemented. So, let's remove the fd from fdtab on failure,
instead of closing it. Otherwise, uxst_bind_listener(), which may be called
in a loop for each receiver, will obtain the same fd via socket() for the
next receiver. Then it will bind it again and will try to re-insert it in
fdtab.

This can be backported to all stable versions.
2024-08-09 15:23:28 +02:00
Amaury Denoyelle
f3c75a52df BUG/MINOR: mux-quic: do not send too big MAX_STREAMS ID
QUIC stream IDs are expressed as QUIC variable-length integers, which cover
the range from 0 to 2^62 - 1. As such, it is forbidden to send an ID in a
MAX_STREAMS flow-control frame which would allow exceeding this value.
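
For reference, the largest value encodable as a QUIC variable-length
integer is 2^62 - 1, so a compliant sender has to keep any advertised
limit within that range; a trivial sketch (not the actual fix):

  #include <stdint.h>

  #define QUIC_VARINT_MAX ((1ULL << 62) - 1)

  /* clamp a flow-control value so that it stays encodable as a varint */
  static inline uint64_t clamp_varint(uint64_t v)
  {
          return v > QUIC_VARINT_MAX ? QUIC_VARINT_MAX : v;
  }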

This patch fixes MAX_STREAMS emission to ensure the sent value is valid.
This also ensures that the peer cannot open a stream with an invalid ID,
as this would now cause a flow-control violation instead.

This must be backported up to 2.6.
2024-08-09 14:33:49 +02:00
Valentine Krasnobaeva
aae2ff7691 MINOR: startup: fix unused value reported by coverity
An unused 0 was assigned to ret, as it is overwritten by the error code of
read_cfg(). This issue was reported by coverity.
2024-08-08 19:54:12 +02:00
Valentine Krasnobaeva
da82f08055 MINOR: cfgparse: load_cfg_in_mem: fix null ptr dereference reported by coverity
This helps to optimize load_cfg_in_mem() a bit and fixes a potential null ptr
dereference in the fread() call. If (read_bytes + bytes_to_read) equals the
initial chunk_size (zero), realloc is never called and *cfg_content keeps its
NULL value.

So, let's ensure that the initial number of bytes to read
(read_bytes + bytes_to_read) is strictly positive when we enter the loop for
the first time.
2024-08-08 19:54:12 +02:00
William Lallemand
b75edf2f11 BUG/MEDIUM: mworker/cli: fix pipelined modes on master CLI
Since commit 3d93ecc ("BUG/MAJOR: cli: Restore non-interactive mode
behavior with pipelined commands") and commit 598c7f16 ("BUG/MEDIUM:
cli: Warn if pipelined commands are delimited by a \n"), the pipelined
command on the master CLI are either broken or emit warnings depending
on which version.

The reason is that the modes applied on the master CLI are saved in the
current CLI session, and then reinserted for each pipelined command;
however, these commands were inserted as new lines.

For example:

 "@1; expert-mode on; debug dev log foo; debug dev log bar"

 Would be sent as:

  "expert mode on\ndebug dev log foo"
  "expert mode on\ndebug dev log bar"

This patch fixes the issue by using the new ci_insert() function, which
inserts a string instead of a newline, and the commands are now suffixed
by ';' upon insertion, allowing a correct pipelined command chain.

This must be backported with the previous commit introducing ci_insert()
in every stable version.

This is broken since the 3.0 version, but it emits a warning in every
version below, because 598c7f164 was backported.
2024-08-08 17:29:37 +02:00
William Lallemand
b2a8e8731d MINOR: channel: implement ci_insert() function
ci_insert() is a function which allows inserting a string <str> of size
<len> at position <pos> of the input buffer. It is the equivalent of
ci_insert_line2(), but without inserting '\r\n'.
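
The underlying operation is a plain in-place insertion; a self-contained
toy sketch of the idea (flat buffer, not the real channel API):

  #include <string.h>

  /* insert <len> bytes from <str> at offset <pos> of a buffer holding
   * <used> bytes out of <size>; returns the new length, or 0 if there
   * is not enough room or <pos> is out of range */
  static size_t buf_insert(char *buf, size_t size, size_t used,
                           size_t pos, const char *str, size_t len)
  {
          if (used + len > size || pos > used)
                  return 0;
          memmove(buf + pos + len, buf + pos, used - pos); /* shift the tail */
          memcpy(buf + pos, str, len);                     /* copy new bytes */
          return used + len;
  }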
2024-08-08 17:29:37 +02:00
Valentine Krasnobaeva
46181e730a MINOR: proto_tcp: tcp_bind_listener: copy errno in errmsg
Let's copy errno into the errmsg produced by tcp_bind_listener() if it fails
in a syscall. This is helpful to debug issues while binding listeners.
2024-08-08 16:34:13 +02:00
Valentine Krasnobaeva
81f48395b3 BUG/MINOR: proto_tcp: keep error msg if listen() fails
If listen() fails, we need to keep the message about it, which is then copied
into the errmsg buffer on the error path. This buffer is properly provided by
the caller (protocol_bind_all()) and reallocated if needed by memprintf(), but
it was deleted without being returned.

This can be backported to all stable versions.
2024-08-08 16:34:06 +02:00
Valentine Krasnobaeva
308c6881c0 BUG/MINOR: proto_tcp: delete fd from fdtab if listen() fails
If listen() fails, the fd should be deleted from fdtab, not just closed.
Otherwise, sock_inet_bind_receiver(), which is called in a loop for each
receiver, will obtain the same fd via socket() for the next receiver
registered in the receivers list. Then it will bind it again and try to
re-insert it in fdtab, and fd_insert() will trigger the
BUG_ON(fdtab[fd].owner != NULL) check.

When the tcp_bind_listener() code was implemented, the use of fd_delete()
was not yet generalized and this spot remained overlooked.

This can be backported to all stable versions.
2024-08-08 16:33:53 +02:00
Valentine Krasnobaeva
c6cfa7cb4a MINOR: startup: rename readcfgfile in parse_cfg
As readcfgfile no longer opens configuration files and reads them with fgets,
but performs only the parsing of provided data, let's rename it to parse_cfg by
analogy with read_cfg in haproxy.c.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
5b52df4c4d MEDIUM: startup: load and parse configs from memory
Let's call the load_cfg_in_mem() helper for each configuration file to load
its content into some area in memory. Adapt the readcfgfile() parser function
accordingly. In order to limit changes in its scope, we give as an argument a
cfgfile structure, already filled in init_args() and in load_cfg_in_mem() with
the file metadata and content.

The parser function (readcfgfile()) now uses fgets_from_mem() instead of the
standard fgets from libc implementations.

The SPOE filter parses its own configuration file, pointed to by the 'config'
keyword, in the configuration already loaded in memory. So, let's allocate and
fill a supplementary cfgfile structure for it, which is not referenced in the
cfg_cfgfiles list. This structure and the memory holding the SPOE filter
configuration content are freed immediately in parse_spoe_flt(), when
readcfgfile() returns.

The HAProxy OpenTracing filter also uses its own configuration file. So, let's
follow the same logic as for the SPOE filter.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
2bb34edb0b MEDIUM: startup: make read_cfg() return immediately on ENOMEM
This commit prepares read_cfg() to call the load_cfg_in_mem() helper in order
to load configuration files in memory. Before, read_cfg() called the parser
for all files from the cfg_cfgfiles list and accumulated the parser's errors
and memprintf's errors in a for_each loop. memprintf's errors did not stop
this loop and were accounted for just after it.

Now, as we plan to load configuration files in memory, we stop the loop if
memprintf() fails and we show an appropriate error message with ha_alert. The
process then terminates. So not all accumulated syntax-related errors will be
shown before exiting in this case, but we have to stop because we have run out
of memory.

If we can't open the current file or we fail to allocate memory to store some
configuration line, the previous behaviour is kept: the process emits an
appropriate alert message and exits.

If the parser returns some syntax-related error on the current file, the
previous behaviour is kept as well. We accumulate such errors for all parsed
files and check them just after the loop. All syntax-related errors for all
files are then shown, as before, in ha_alert messages line by line during
startup. Then the process exits with 1.

As the cfg_cfgfiles list now contains many pointers to memory areas with
configuration file content, and this content could be big, it's better to
free the list explicitly when parsing is finished. So, let's change
read_cfg() to return an integer value to its caller init(), and let's perform
the free routine at the caller level, as the cfg_cfgfiles list was initialized
and initially filled at this level.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
007f7f2f02 MINOR: tools: add fgets_from_mem
Add an fgets_from_mem() helper to read lines from configuration files, which
are now stored as memory chunks. In order to limit changes in the first-level
parser code (readcfgfile()), it is better to reimplement the standard fgets,
i.e. to have an fgets which can read serialized data line by line from a
memory area instead of a file stream, while keeping the same behaviour as
libc's fgets implementations.
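
The idea can be sketched as follows (simplified and self-contained, not
the exact fgets_from_mem() code):

  /* fgets-like read from a memory area: copy at most <size> - 1 bytes
   * starting at <*pos>, up to and including the next '\n', NUL-terminate
   * <dst>, advance <*pos>, and return <dst>, or NULL once <*pos> has
   * reached <end> (the end of the loaded file) */
  static char *mem_fgets(char *dst, int size, const char **pos, const char *end)
  {
          const char *cur = *pos;
          int i = 0;

          if (size <= 0 || cur >= end)
                  return NULL;

          while (i < size - 1 && cur < end) {
                  dst[i++] = *cur;
                  if (*cur++ == '\n')
                          break;
          }
          dst[i] = '\0';
          *pos = cur;
          return dst;
  }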
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
03e63b98ca MINOR: cfgparse: load_cfg_in_mem: take in account file size
Let's take into account the given file size, when it is reported via stat.

It's very convenient for large configuration files, as this allows performing
only one memory allocation call for precisely the needed file size. This also
allows performing only one call to fread().

We need to provide fread() with file_stat.st_size + 1 to be able to catch EOF.
This way it sets the feof(f) flag, which allows exiting from the loop
immediately, just after the fread call.

If /dev/stdin or /dev/null is provided as a file, we continue to read the
configuration chunk by chunk, since stat doesn't report a size.
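
As a rough illustration of that single-allocation path (simplified, with a
hypothetical helper name; error handling and the chunk-by-chunk fallback
omitted):

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>

  /* read a regular file of known size with one malloc() and one fread() */
  static char *read_whole_file(const char *path, size_t *len)
  {
          struct stat st;
          FILE *f = fopen(path, "r");
          char *buf = NULL;

          *len = 0;
          if (f && stat(path, &st) == 0 && st.st_size > 0) {
                  buf = malloc(st.st_size + 1);
                  if (buf) {
                          /* asking for one extra byte makes fread() hit EOF
                           * and set feof(f) in a single call */
                          size_t n = fread(buf, 1, (size_t)st.st_size + 1, f);

                          *len = n <= (size_t)st.st_size ? n : (size_t)st.st_size;
                          buf[*len] = '\0';
                  }
          }
          if (f)
                  fclose(f);
          return buf;
  }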
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
5b9ed6e4be MINOR: cfgparse: add load_cfg_in_mem
Add the load_cfg_in_mem() helper, which allows storing the content of a given
file in memory.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
bafb0ce272 MINOR: startup: adapt list_append_word to use cfgfile
The list_append_word() helper was previously used only to chain configuration
file names in a list. As we now start to use the cfgfile structure, which
represents an entire file in memory along with its metadata, let's adapt this
helper to use this structure and rename it to list_append_cfgfile().

Adapt the functions which process configuration files and directories to use
the cfgfile structure and list_append_cfgfile() instead of a wordlist.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
39f2a19620 REORG: tools: move list_append_word to cfgparse
Let's move list_append_word to cfgparse.c as it is used only to fill
cfg_cfgfiles list with configuration file names.
2024-08-07 18:41:41 +02:00
Aurelien DARRAGON
a6d1eb8f5d MINOR: server: ensure max_events_at_once > 0 in server_atomic_sync()
In 8f1fd96 ("BUG/MEDIUM: server/addr: fix tune.events.max-events-at-once
event miss and leak"), we added a comment saying that
tune.events.max-events-at-once is assumed to be strictly positive.

It is so because the keyword parser forces values between 1 and 10000:
we don't want less than 1 because it wouldn't make any sense, and 10k
max because beyond that we could create contention in server_atomic_sync()

Now as the above commit implements a do..while it heavily relies on the
fact that the budget is at least 1. Upon soft-stop, we break away from
the loop without decrementing the budget. With all that in mind, it is
safe to assume that the 'remain' counter will only fall to 0 if the task
runs out of budget while doing work, in which case the task still exists
and must be rescheduled.

As seen in GH #2667 this assumption was ambiguous, so let's make it
official by adding a pair of BUG_ON() that make it explicit that it
works because remain 'cannot' be 0 unless the entire budget was
consumed.

No backport needed.
2024-08-07 18:31:35 +02:00
Amaury Denoyelle
3ef1ee477d BUG/MINOR: quic: prevent freeze after early QCS closure
A connection freeze may occur if a QCS is released before transmitting
any data. This can happen when an error is detected early by the stream,
for example during HTTP response headers encoding, forcing the whole
connection to be closed.

In this case, a connection error is registered by the QUIC MUX to the
lower layer. The MUX is then released and the xprt layer is notified to
prepare CONNECTION_CLOSE emission. However, this is prevented because the
quic_conn streams tree is not empty, as it still contains the qc_stream_desc
previously attached to the failed QCS instance. The connection will freeze
until the QUIC idle timeout.

This situation is caused by an omission during the qc_stream_desc release
operation. In the described situation, the qc_stream_desc current buffer is
empty and can thus be removed, which is the purpose of this patch. This
unblocks the previously failing situation, with the qc_stream_desc removal
from the quic_conn tree.

This issue can be reproduced by modifying H3/QPACK code to return an
early error during HEADERS response processing.

This must be backported up to 2.6, after a period of observation.
2024-08-07 18:14:29 +02:00
Willy Tarreau
d5da87b5dc MINOR: mux-h3/trace: add a state trace on stream creation/destruction
Logging below the developer level doesn't always yield very convenient
traces as we don't know well where streams are allocated nor released.
Let's just make that more explicit by using state-level traces for these
important steps.
2024-08-07 16:02:59 +02:00
Willy Tarreau
23417ab9d4 MINOR: mux-h2/trace: add a state trace on stream creation/destruction
Logging below the developer level doesn't always yield very convenient
traces as we don't know well where streams are allocated nor released.
Let's just make that more explicit by using state-level traces for these
important steps.
2024-08-07 16:02:59 +02:00
Willy Tarreau
cc12d1b253 MINOR: mux-h1/trace: add a state trace on stream creation/upgrade
Logging below the developer level doesn't always yield very convenient
traces as we don't know well where streams are allocated nor released.
Let's just make that more explicit by using state-level traces. Note that
h1s destruction was already logged as closing connection or switching
to idle mode.
2024-08-07 16:02:59 +02:00
Willy Tarreau
6191de6aa6 MINOR: mux-quic: add a trace context filling helper
This helper is able to find a connection, a session, a stream, or a
frontend from its args.
2024-08-07 16:02:59 +02:00
Willy Tarreau
b2cede590b MINOR: mux-quic: don't leave dangling pointer after freeing qcs->sd
In qcs_free() we're calling a few other functions after releasing
qcs->sd. None of them make use of it for now but with traces that
will change. Make sure to clear qcs->sd after releasing it.
2024-08-07 16:02:59 +02:00
Willy Tarreau
adfe0a30e1 MINOR: mux-h1: add a trace context filling helper
This helper is able to find a connection, a session, a stream, a
frontend or a backend from its args.
2024-08-07 16:02:59 +02:00
Willy Tarreau
6c6ef5ae12 MINOR: mux-h2: add a trace context filling helper
This helper is able to find a connection, a session, a stream, a
frontend or a backend from its args.

Note that this required to always make sure that h2s->sess is reset on
allocation because it's normally initialized later for backend streams,
and producing traces between the two could pre-fill a bad pointer in
the trace_ctx.
2024-08-07 16:02:59 +02:00
Willy Tarreau
10c8baca44 MINOR: trace: add a per-source helper to pre-fill the context
Now sources which want to do it can provide a helper that can pre-fill
some fields in the context based on their knowledge (e.g. mux streams).
2024-08-07 16:02:59 +02:00
Willy Tarreau
7d55a70f5a MINOR: trace: move the known trace context into a dedicated struct
We now have a trace_ctx to hold the sess, conn, qc, stream and so on.
This will allow us to pass it across layers so that other helpers can
help fill them.

Ideally it should be passed as an argument to __trace_enabled() by
__trace() so that it can be passed back to the trace callback. But
it seems that trace callbacks are smart enough to figure all their
info when they need them.
2024-08-07 16:02:59 +02:00
Willy Tarreau
d465610ec3 MEDIUM: trace: implement a "follow" mechanism
With "follow" from one source to another, it becomes possible for a
source to automatically follow another source's tracked pointer. The
best example is the session:
  - the "session" source is enabled and has a "lockon session"
    -> its lockon_ptr is equal to the session when valid
  - other sources (h1,h2,h3 etc) are configured for "follow session"
    and will then automatically check if session's lockon_ptr matches
    its own session, in which case tracing will be enabled for that
    trace (no state change).

It's not necessary to start/pause/stop traces when using this, only
"follow" followed by a source with lockon enabled is needed. Some
combinations might work better than others. At the moment the session
is almost never known from the backend, but this may improve.

The meta-source "all" is supported for the follower so that all sources
will follow the tracked one.
2024-08-07 16:02:59 +02:00
Willy Tarreau
abb07af67e MINOR: session/trace: enable very minimal session tracing
By having traces at the session level, it becomes possible to start
traces on session creation and pause them on session end. Doing so
will soon open new possibilities to synchronize multiple traces.
2024-08-07 16:02:59 +02:00
Willy Tarreau
d2a49de9c7 MINOR: trace: support setting the sink and level for all sources at once
It's extremely painful to have to set "trace <src> sink buf1" for all
sources, then to do the same for "level developer" (for example). Let's
have a possibility via a meta-source "all" to apply the change to all
sources at once. This currently supports level and sink, which are not
dependent on the source, this is a good start.
2024-08-07 16:02:59 +02:00
Willy Tarreau
6bf50dfccc BUG/MINOR: quic/trace: make quic_conn_enc_level_init() emit NEW not CLOSE
The event emitted by this trace was of type CLOSE instead of NEW, which
would sometimes temporarily pause a started trace.

This can be backported to 3.0, probably 2.6.
2024-08-07 16:02:59 +02:00
Willy Tarreau
7a22fbd453 BUG/MINOR: trace/quic: make "qconn" selectable as a lockon criterion
The test was performed but there's no way to set the option! Let's
just add "qconn" to select the quic conn when the source supports it.

This can be backported at least to 3.0, probably 2.6.
2024-08-07 16:02:59 +02:00
Willy Tarreau
0406efe9ad BUG/MINOR: trace: automatically start in waiting mode with "start <evt>"
The doc clearly says that "start <evt>" should leave the trace in pause
mode until the indicated event appears. However it's not what's happening,
the state is not changed until one command uses "now", so it's typically
needed to configure the events with "start <evt>" then enable the waiting
mode using "pause now". This is counter-intuitive and does not match the
doc, so let's fix it so that "start <evt>" switches from stopped to waiting
as long as at least one event is enabled.

This can be backported to all versions.
2024-08-07 16:02:59 +02:00
Willy Tarreau
b5df6b5a31 BUG/MEDIUM: trace: fix null deref in lockon mechanism since TRACE_ENABLED()
When calling TRACE_ENABLED(), which is called by TRACE_PRINTF(), we pass
a NULL plockptr to __trace_enabled(). This argument is used when lockon
is active, and may update the pointer. This is an overlook which also
broke the lockon mechanism because now for calls from __trace(), it
dereferences a pointer pointing to NULL, and never updates it due to the
broken condition, so that trace() never sets up src->lockon_ptr.

The bug was introduced in 2.8 by commit 8f9a9704bb ("MINOR: trace: add a
TRACE_ENABLED() macro to determine if a trace is active"), so the fix must
be backported there.
2024-08-07 16:02:59 +02:00
Willy Tarreau
88a752ca78 BUG/MINOR: trace/quic: permit to lock on frontend/connect/session etc
These ones were not proposed in the list of trackable elements. Note
that this depends on previous commit:

    BUG/MINOR: trace/quic: enable conn/session pointer recovery from quic_conn

This should be backported to at least 3.0, maybe even 2.6.
2024-08-07 16:02:59 +02:00
Willy Tarreau
aa1915a9f5 BUG/MINOR: trace/quic: enable conn/session pointer recovery from quic_conn
In __trace_enabled(), a quic_conn was detected, but it was not possible
to derive the connection nor the session from it, which was quite limiting
in terms of ability to track a same instance.

This should be backported to at least 3.0, maybe even 2.6.
2024-08-07 16:02:59 +02:00
Amaury Denoyelle
9f829ea3f3 MINOR: mux-quic: measure QCS lifetime and its blocking state
Reuse the newly defined tot_time structure to measure various values related
to a QCS lifetime.

First, a timer is used to account for the total QCS lifetime. Then, two
other timers are used to account for the total time during which Tx from the
stream layer to the MUX is blocked, either due to a lack of buffer or due to
flow control.

These three timers are reported in qmux_dump_qcs_info(). Thus, they are
available in traces and for the QUIC MUX debug string sample.
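
The pattern behind those timers is a simple accumulating stopwatch; a toy
sketch (not the actual tot_time implementation), assuming the caller passes
a monotonic millisecond timestamp:

  #include <stdint.h>

  /* toy accumulating timer: total time spent between start() and stop() */
  struct acc_timer {
          uint64_t total;   /* accumulated time, in ms */
          uint64_t start;   /* timestamp of the current period */
          int running;      /* 1 while a period is being measured */
  };

  static void acc_start(struct acc_timer *t, uint64_t now_ms)
  {
          if (!t->running) {
                  t->start = now_ms;
                  t->running = 1;
          }
  }

  static void acc_stop(struct acc_timer *t, uint64_t now_ms)
  {
          if (t->running) {
                  t->total += now_ms - t->start;
                  t->running = 0;
          }
  }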
2024-08-07 15:40:52 +02:00
Amaury Denoyelle
663416b4ef MINOR: quic: dump quic_conn debug string for logs
Define a new xprt_ops callback named dump_info. This can be used to
extend the MUX debug string with info from the lower layer.

Implement dump_info for the QUIC stack. For now, only minimal info is
reported : bytes in flight and the size of the sending window. This should
allow detecting whether the congestion controller is fine. This info is
reported via the QUIC MUX debug string sample.
2024-08-07 15:40:52 +02:00
Amaury Denoyelle
630fa53c51 MINOR: mux-quic: implement debug string for logs
Implement MUX_SCTL_DBG_STR for QUIC MUX. This returns info for the
current QCS and QCC instances, reusing qmux_dump_qc{c,s}_info functions
already used for traces, and the connection flags.

This stream operation is useful for debug string sample support.
2024-08-07 15:40:52 +02:00
Amaury Denoyelle
eb4dfa3b36 MINOR: mux-quic: define dump functions for QCC and QCS
Extract trace code to dump QCC and QCS instances into dedicated
functions named qmux_dump_qc{c,s}_info(). This will allow to easily
print QCC/QCS infos outside of traces.
2024-08-07 15:40:52 +02:00
Willy Tarreau
490cb16d3a MINOR: mux-h2: implement the debug string for logs
Now it is possible to have this for a frontend and a backend:

<134>Jul 30 19:32:53 haproxy[24405]: 127.0.0.1:64860 [30/Jul/2024:19:32:53.732] test2 test2/s1 0/0/0/0/0 200 130 - - ---- 2/1/0/0/0 0/0 "GET /blah HTTP/2.0"  h2s.id=1 .st=CLO .flg=0x7003 .rxbuf=0@(nil)+0/0 .sc=0x1e03fb0(.flg=0x00034482 .app=0x1e04020) .sd=0x1e03f30(.flg=0x50405601) .subs=(nil) h2c.st0=FRH .err=0 .maxid=1 .lastid=-1 .flg=0x100e00 .nbst=0 .nbsc=1, .glitches=0 .fctl_cnt=0 .send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=1 .dsi=1 .dbuf=0@(nil)+0/0 .mbuf=[1..1|32],h=[0@(nil)+0/0],t=[0@(nil)+0/0] .task=(nil) conn.flg=0x80000300
<134>Jul 30 19:32:53 haproxy[24405]: 127.0.0.1:65246 [30/Jul/2024:19:32:53.732] test1 test1/s1 0/0/0/0/0 200 130 - - ---- 2/1/0/0/0 0/0 "GET /blah HTTP/1.1"  h2s.id=1 .st=CLO .flg=0x7003 .rxbuf=0@(nil)+0/0 .sc=0x1dfc7b0(.flg=0x0006d01b .app=0x1c65fe0) .sd=0x1dfc820(.flg=0x1040ca01) .subs=(nil) h2c.st0=FRH .err=0 .maxid=1 .lastid=-1 .flg=0x108e00 .nbst=0 .nbsc=1, .glitches=0 .fctl_cnt=0 .send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=1 .dsi=1 .dbuf=0@(nil)+0/0 .mbuf=[1..1|32],h=[0@(nil)+0/0],t=[0@(nil)+0/0] .task=(nil) conn.flg=0x000300

Just with this in the front and back proxies respectively:
  log-format "$HAPROXY_HTTP_LOG_FMT %[bs.debug_str(15)]"
  log-format "$HAPROXY_HTTP_LOG_FMT %[fs.debug_str(15)]"

For now the mux only implements muxs, muxc, conn. Xprt is ignored.
2024-08-07 14:07:41 +02:00
Willy Tarreau
921e04bf87 MINOR: stconn: add a new pair of sf functions {bs,fs}.debug_str
These are passed to the underlying mux to retrieve debug information
at the mux level (stream/connection) as a string that's meant to be
added to logs.

The API is quite complex just because we can't pass any info to the
bottom function. So we construct a union and pass the argument as an
int, and expect the callee to fill that with its buffer in return.

Most likely the mux->ctl and ->sctl API should be reworked before
the release to simplify this.

The functions take an optional argument that is a bit mask of the
layers to dump:
  muxs=1
  muxc=2
  xprt=4
  conn=8
  sock=16

The default (0) logs everything available.
2024-08-07 14:07:41 +02:00
Amaury Denoyelle
b2282082dd MINOR: quic: enforce ACK reception is handled in order
Add a new BUG_ON() in qc_stream_desc_ack(). It ensures that
acknowledgements are always notified in order. This is because out-of-order
ACKs cannot be handled by the qc_stream_desc layer, which does not support
gaps in sent STREAM data.

Prior to this fix, out-of-order ACKs were simply ignored without any
error. This currently cannot happen thanks to careful
qc_stream_desc_ack() invocation. If this assumption is broken in the
future by inattention, this would cause a loss of ACK notifications which
would prevent qc_stream_desc release.
2024-08-07 11:08:20 +02:00
Amaury Denoyelle
e177cf341c BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM
STREAM frames have dedicated handling on retransmission. A special check
is done to remove data already acked in case of duplicated frames, thus
only unacked data is retransmitted.

This handling is faulty in the case of an empty STREAM frame with FIN set.
On retransmission, this frame does not cover any unacked range as it is
empty and is thus discarded. This may cause the transfer to freeze with
the client waiting indefinitely for the FIN notification.

To handle retransmission of empty FIN STREAM frames, the qc_stream_desc
layer has been extended. A new flag QC_SD_FL_WAIT_FOR_FIN is set by the
QUIC MUX when the FIN has been transmitted. If set, it prevents the
qc_stream_desc from being freed until the FIN is acknowledged. On the
retransmission side, qc_stream_frm_is_acked() has been updated. It now
reports false if the FIN bit is set on the frame and the qc_stream_desc
has QC_SD_FL_WAIT_FOR_FIN set.

This must be backported up to 2.6. However, this modifies heavily
critical section for ACK handling and retransmission. As such, it must
be backported only after a period of observation.

This issue can be reproduced by using the following socat command as
server to add delay between the response and connection closure :
  $ socat TCP-LISTEN:<port>,fork,reuseaddr,crlf SYSTEM:'echo "HTTP/1.1 200 OK"; echo ""; sleep 1;'

On the client side, ngtcp2 can be used to simulate packet drop. Without
this patch, connection will be interrupted on QUIC idle timeout or
haproxy client timeout with ERR_DRAINING on ngtcp2 :
  $ ngtcp2-client --exit-on-all-streams-close -r 0.3 <host> <port> "http://<host>:<port>/?s=32o"

Alternatively to ngtcp2 random loss, an extra haproxy patch can also be
used to force skipping the emission of the empty STREAM frame :

diff --git a/include/haproxy/quic_tx-t.h b/include/haproxy/quic_tx-t.h
index efbdfe687..1ff899acd 100644
--- a/include/haproxy/quic_tx-t.h
+++ b/include/haproxy/quic_tx-t.h
@@ -26,6 +26,8 @@ extern struct pool_head *pool_head_quic_cc_buf;
 /* Flag a sent packet as being probing with old data */
 #define QUIC_FL_TX_PACKET_PROBE_WITH_OLD_DATA (1UL << 5)

+#define QUIC_FL_TX_PACKET_SKIP_SENDTO (1UL << 6)
+
 /* Structure to store enough information about TX QUIC packets. */
 struct quic_tx_packet {
 	/* List entry point. */
diff --git a/src/quic_tx.c b/src/quic_tx.c
index 2f199ac3c..2702fc9b9 100644
--- a/src/quic_tx.c
+++ b/src/quic_tx.c
@@ -318,7 +318,7 @@ static int qc_send_ppkts(struct buffer *buf, struct ssl_sock_ctx *ctx)
 		tmpbuf.size = tmpbuf.data = dglen;

 		TRACE_PROTO("TX dgram", QUIC_EV_CONN_SPPKTS, qc);
-		if (!skip_sendto) {
+		if (!skip_sendto && !(first_pkt->flags & QUIC_FL_TX_PACKET_SKIP_SENDTO)) {
 			int ret = qc_snd_buf(qc, &tmpbuf, tmpbuf.data, 0, gso);
 			if (ret < 0) {
 				if (gso && ret == -EIO) {
@@ -354,6 +354,7 @@ static int qc_send_ppkts(struct buffer *buf, struct ssl_sock_ctx *ctx)
 					qc->cntrs.sent_bytes_gso += ret;
 			}
 		}
+		first_pkt->flags &= ~QUIC_FL_TX_PACKET_SKIP_SENDTO;

 		b_del(buf, dglen + QUIC_DGRAM_HEADLEN);
 		qc->bytes.tx += tmpbuf.data;
@@ -2066,6 +2067,17 @@ static int qc_do_build_pkt(unsigned char *pos, const unsigned char *end,
 				continue;
 			}

+			switch (cf->type) {
+			case QUIC_FT_STREAM_8 ... QUIC_FT_STREAM_F:
+				if (!cf->stream.len && (qc->flags & QUIC_FL_CONN_TX_MUX_CONTEXT)) {
+					TRACE_USER("artificially drop packet with empty STREAM frame", QUIC_EV_CONN_TXPKT, qc);
+					pkt->flags |= QUIC_FL_TX_PACKET_SKIP_SENDTO;
+				}
+				break;
+			default:
+				break;
+			}
+
 			quic_tx_packet_refinc(pkt);
 			cf->pkt = pkt;
 		}
2024-08-07 11:03:32 +02:00
Amaury Denoyelle
714009b7bc MINOR: quic: implement function to check if STREAM is fully acked
When a STREAM frame is retransmitted, a check is performed to remove the
range of data already acked from it. This is useful when STREAM frames
are duplicated and split to cover different data ranges. The newly
retransmitted frame contains only unacked data.

This process is performed similarly in qc_dup_pkt_frms() and
qc_build_frms(). Refactor the code into a new function named
qc_stream_frm_is_acked(). It returns true if the frame data is already
fully acked and retransmission can be avoided. If only a partial range
of data is acknowledged, the frame content is updated to only cover the
unacked data.

This patch does not introduce any functional change. However, it simplifies
retransmission for STREAM frames. Also, it will be reused to fix
retransmission for empty STREAM frames with FIN set in the following
patch :
  BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM

As such, it must be backported prior to it.
2024-08-07 10:57:10 +02:00
Amaury Denoyelle
bb9ac256a1 MINOR: quic: convert qc_stream_desc release field to flags
qc_stream_desc had a field <release> used as a boolean. Convert it into a
new <flags> field, with QC_SD_FL_RELEASE as the equivalent value.

The purpose of this patch is to be able to extend qc_stream_desc by
adding new flag values. This patch is required for the following
patch
  BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM

As such, it must be backported prior to it.
2024-08-06 18:00:17 +02:00
Aurelien DARRAGON
8f1fd96d17 BUG/MEDIUM: server/addr: fix tune.events.max-events-at-once event miss and leak
An issue has been introduced with cd99440 ("BUG/MAJOR: server/addr: fix
a race during server addr:svc_port updates").

Indeed, in the above commit we implemented the atomic_sync task which is
responsible for consuming pending server events to apply the changes
atomically. For now only server's addr updates are concerned.

To prevent the task from causing contention, a budget was assigned to it.
It can be controlled with the global tunable
'tune.events.max-events-at-once': the task may not process more than this
number of events at once.

However, a bug was introduced with this budget logic: each time the task
has to be interrupted because it runs out of budget, we reschedule the
task to finish where it left off, but the current event which was already
removed from the queue wasn't processed yet. This means that this pending
event (each tune.events.max-events-at-once) is effectively lost.

When the atomic_sync task deals with large number of concurrent events,
this bug has 2 known consequences: first a server's addr/port update
will be lost every 'tune.events.max-events-at-once'. This can of course
cause reliability issues because if the event is not republished
periodically, the server could stay in a stale state for an indefinite amount
of time. This is the case when the DNS server flaps for instance: some
servers may not come back UP after the incident as described in GH #2666.

Another issue is that the lost event was not cleaned up, resulting in a
small memory leak. So in the end, it means that the bug is likely to
cause more and more degradation over time until haproxy is restarted.

As a workaround, 'tune.events.max-events-at-once' may be set to the
maximum number of events expected per batch. Note however that this value
cannot exceed 10 000, otherwise it could cause the watchdog to trigger due
to the task being busy for too long and preventing other threads from
making any progress. Setting higher values may not be optimal for common
workloads so it should only be used to mitigate the bug while waiting for
this fix.

Since tune.events.max-events-at-once defaults to 100, this bug only
affects configs that involve more than 100 servers whose addr:port
properties are likely to be updated at the same time (batched updates
from cli, lua, dns..)

To fix the bug, we move the budget check after the current event is fully
handled. For that we went from a basic 'while' to 'do..while' loop as we
assume from the config that 'tune.events.max-events-at-once' cannot be 0.
While at it, we reschedule the task once thread isolation ends (it was not
required to perform the reschedule while under isolation) to give the hand
back faster to waiting threads.

This patch should be backported up to 2.9 with cd99440. It should fix
GH #2666.
2024-08-06 16:41:37 +02:00
Ilia Shipitsin
aaaacaaf4b BUG/MINOR: fcgi-app: handle a possible strdup() failure
This defect was found by the coccinelle script "unchecked-strdup.cocci".
It can be backported to 2.2.
2024-08-06 08:21:49 +02:00
Frederic Lecaille
eb1a097a66 BUG/MINOR: quic: Too short datagram during packet building failures (aws-lc only)
This issue was reported by Ilya (@Chipitsine) when building haproxy against
aws-lc in GH #2663 where handshakeloss and handshakecorruption interop tests could
lead haproxy to crash after having built too short datagrams:

FATAL: bug condition "first_pkt->type == QUIC_PACKET_TYPE_INITIAL && (first_pkt->flags & (1UL << 0)) && length < 1200" matched at src/quic_tx.c:163
call trace(13):
| 0x55f4ee4dcc02 [ba d9 00 00 00 48 8d 35]: main-0x195bf2
| 0x55f4ee4e3112 [83 3d 2f 16 35 00 00 0f]: qc_send+0x11f3/0x1b5d
| 0x55f4ee4e9ab4 [85 c0 0f 85 00 f6 ff ff]: quic_conn_io_cb+0xab1/0xf1c
| 0x55f4ee6efa82 [48 c7 c0 f8 55 ff ff 64]: run_tasks_from_lists+0x173/0x9c2
| 0x55f4ee6f05d3 [8b 7d a0 29 c7 85 ff 0f]: process_runnable_tasks+0x302/0x6e6
| 0x55f4ee671bb7 [83 3d 86 72 44 00 01 0f]: run_poll_loop+0x6e/0x57b
| 0x55f4ee672367 [48 8b 1d 22 d4 1d 00 48]: main-0x48d
| 0x55f4ee6755e0 [b8 00 00 00 00 e8 08 61]: main+0x2dec/0x335d

This could happen after Handshake packet building failures which follow a successful
Initial packet in the same datagram. In this case, the datagram could be emitted
with too short a length (<1200 bytes).

To fix this, store the datagram only if the first packet is not an Initial packet
or if its length is big enough (>=1200 bytes).

Must be backported as far as 2.6.
2024-08-05 13:40:51 +02:00
Frederic Lecaille
e12620a8a9 BUG/MINOR: quic: Too short datagram during O-RTT handshakes (aws-lc only)
By "aws-lc only", one means that this bug was first revealed by aws-lc stack.
This does not mean it will not appeared for new versions of other TLS stacks which
have never revealed this bug.

This bug was reported by Ilya (@chipitsine) in GH #2657 where some QUIC interop
tests (resumption, zerortt) could lead to crash with haproxy compiled against
aws-lc TLS stack. These crashed were triggered by this BUG_ON() which detects
that too short datagrams with at least one ack-eliciting Initial packet inside
could be built.

  <0>2024-07-31T15:13:42.562717+02:00 [01|quic|5|quic_tx.c:739] qc_prep_pkts():
  next encryption level : qc@0x61d000041080 idle_timer_task@0x60d000006b80 flags=0x6000058

  FATAL: bug condition "first_pkt->type == QUIC_PACKET_TYPE_INITIAL && (first_pkt->flags & (1UL << 0)) && length < 1200" matched at src/quic_tx.c:163
  call trace(12):
  | 0x563ea447bc02 [ba d9 00 00 00 48 8d 35]: main-0x1958ce
  | 0x563ea4482703 [e9 73 fe ff ff ba 03 00]: qc_send+0x17e4/0x1b5d
  | 0x563ea4488ab4 [85 c0 0f 85 00 f6 ff ff]: quic_conn_io_cb+0xab1/0xf1c
  | 0x563ea468e6f9 [48 c7 c0 f8 55 ff ff 64]: run_tasks_from_lists+0x173/0x9c2
  | 0x563ea468f24a [8b 7d a0 29 c7 85 ff 0f]: process_runnable_tasks+0x302/0x6e6
  | 0x563ea4610893 [83 3d aa 65 44 00 01 0f]: run_poll_loop+0x6e/0x57b
  | 0x563ea4611043 [48 8b 1d 46 c7 1d 00 48]: main-0x48d
  | 0x7f64d05fb609 [64 48 89 04 25 30 06 00]: libpthread:+0x8609
  | 0x7f64d0520353 [48 89 c7 b8 3c 00 00 00]: libc:clone+0x43/0x5e

That said, everything was correctly done by qc_prep_pkts() to prevent such a case.
But this relied on the hypothesis that the list of encryption levels it used
was always built in the same order as follows for 0-RTT sessions:

    initial, early-data, handshake, application

But this order is determined by the order in which the TLS stack derives the
secrets for these encryption levels. For aws-lc, this order is not the same but
as follows:

During 0-RTT sessions, the server may have to build three ack-eliciting packets
(with CRYPTO data inside) to reply to the first client packet: initial, handshake,
application. qc_prep_pkts() adds a PADDING frame to the last built packet
for the last encryption level in the list. But after the application encryption
level, there is the early-data encryption level. This prevented qc_prep_pkts()
from building a padded application level last packet to send a 1200-byte datagram.

To fix this, always insert the early-data encryption level after the initial
encryption level into the encryption levels list when initializing this encryption
level from quic_conn_enc_level_init().

Must be backported as far as 2.9.
2024-08-02 15:25:26 +02:00
Christopher Faulet
78b8b60030 BUG/MEDIUM: peer: Notify the applet won't consume data when it waits for sync
When the peer applet is waiting for a synchronisation with the global sync
task, we must notify that it won't consume data. Otherwise, if some data are
already waiting in the input buffer, the applet will be woken up in a loop and
this will trigger the watchdog. Once synchronized, the applet is woken up. In
that case, the peer applet must indicate it is going to consume data again.
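
For illustration, a simplified standalone version of that hint (the helper
names mimic haproxy's applet API but are redefined here as stand-ins):

  #include <stdbool.h>

  struct appctx { bool wants_input; };

  /* stand-ins for the applet wake-up hints */
  static void applet_wont_consume(struct appctx *a) { a->wants_input = false; }
  static void applet_will_consume(struct appctx *a) { a->wants_input = true; }

  /* While waiting for the global resync, stop being woken up for input
   * data that will not be read; resume consumption once synchronized.
   */
  static void peer_sync_consume_hint(struct appctx *appctx, bool waiting_for_resync)
  {
          if (waiting_for_resync)
                  applet_wont_consume(appctx);
          else
                  applet_will_consume(appctx);
  }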

This patch should fix the issue #2656. It must be backported to 3.0.
2024-08-02 08:42:29 +02:00
Christopher Faulet
184f16ded7 BUG/MEDIUM: mux-h2: Propagate term flags to SE on error in h2s_wake_one_stream
When a stream is explicitly woken up by the H2 connection, if an error
condition is detected, the corresponding error flag is set on the SE: either
SE_FL_ERROR or SE_FL_ERR_PENDING, depending on whether the end of stream was
reported or not.

However, there is no attempt to propagate other termination flags. We must
be sure to properly set SE_FL_EOI and SE_FL_EOS when appropriate to be able
to switch a pending error to a fatal error.
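
A standalone sketch of that propagation with simplified flags (the real code
uses the SE_FL_* flags and the se_fl_set() helpers on the stream-endpoint
descriptor):

  #include <stdint.h>

  #define SK_ERR_PENDING  0x01u   /* stand-in for SE_FL_ERR_PENDING */
  #define SK_ERROR        0x02u   /* stand-in for SE_FL_ERROR */
  #define SK_EOI          0x04u   /* stand-in for SE_FL_EOI */
  #define SK_EOS          0x08u   /* stand-in for SE_FL_EOS */

  /* On error, also propagate end-of-input/end-of-stream when known, so
   * that a pending error can be switched to a fatal one.
   */
  static void propagate_term_flags(uint32_t *flags, int eoi, int eos)
  {
          if (eoi)
                  *flags |= SK_EOI;
          if (eos)
                  *flags |= SK_EOS;
          if ((*flags & SK_ERR_PENDING) && (*flags & SK_EOS))
                  *flags = (*flags & ~SK_ERR_PENDING) | SK_ERROR;
  }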

Because of this bug, the SE remains with a pending error and no end of
stream, preventing the applicative stream from truly aborting it. This means
that in some abort scenarios, it is possible to block a stream infinitely.

This patch must be backported at least as far as 2.8. No bug was observed on
older versions while the same code is in use.
2024-08-02 08:42:28 +02:00
Christopher Faulet
6743e128f3 BUG/MEDIUM: h2: Only report early HTX EOM for tunneled streams
For regular H2 messages, the HTX EOM flag is synonymous with the end of input.
So the SE_FL_EOI flag must also be set on the stream-endpoint descriptor.
However, there is an exception. For tunneled streams, the end of message is
reported on the HTX message just after the headers. But in that case, no end
of input is reported on the SE.

But here, there is a bug. The "early" EOM is also reported on HTX messages
when there is no payload (for instance a content-length set to 0). If there
is no ES flag on the H2 HEADERS frame, it is an unexpected case. Because for
the applicative stream, and most probably for the opposite endpoint, the
message is considered as finished. It is switched to its DONE state (or the
equivalent on the endpoint). But, if an extra H2 frame with the ES flag is
received, a TRAILERS frame or an empty DATA frame, an extra EOT HTX block is
pushed to carry the HTX EOM flag. So an extra HTX block is emitted for a
regular HTX message. It is totally invalid, it must never happen.

Because it is an undefined behavior, it is difficult to predict the result.
But it definitely prevents the applicative stream from properly handling aborts
and errors because data remain blocked in the channel buffer. Indeed, the
end of the message was seen, so no more data are forwarded.

It seems to be an issue for 2.8 and above. It is harder to evaluate for older
versions.

This patch must be backported as far as 2.4.
2024-08-02 08:42:28 +02:00
Christopher Faulet
0ba6202796 BUG/MEDIUM: http-ana: Report error on write error waiting for the response
When we are waiting for the server response, if an error is pending on the
frontend side (a write error on the client), it is handled as an abort and all
regular response analyzers are removed, except the one responsible for
releasing the filters, if any. However, while it is handled as an abort, the
error is not reported, as usual, via the http_reply_and_close() function. It
is an issue because, in that case, the channel buffers are not reset.

Because of this bug, it is possible to block a stream infinitely. The
request side is waiting for the response side and the response side is
blocked because filters must be released and this cannot be done because
data remain blocked in the channel buffers.

So, in that case, calling http_reply_and_close() with no message is enough
to unblock the stream.

This patch must be backported as far as 2.8.
2024-08-02 08:42:28 +02:00
Amaury Denoyelle
7a5a30d28a BUG/MINOR: h2: reject extended connect for h2c protocol
This commit prevents forwarding of an HTTP/2 Extended CONNECT when the "h2c"
or "h2" token is set as the targeted protocol. Contrary to the previous
commit which deals with the HTTP/1 mux, this time the request is rejected
and a RESET_STREAM is reported to the client.

This must be backported up to 2.4 after a period of observation.
2024-08-01 18:23:44 +02:00
Amaury Denoyelle
7b89aa5b19 BUG/MINOR: h1: do not forward h2c upgrade header token
haproxy supports tunnel establishment through HTTP Upgrade mechanism.
Since the following commit, extended CONNECT is also supported for
HTTP/2 both on frontend and backend side.

  commit 9bf957335e
  MEDIUM: mux_h2: generate Extended CONNECT from htx upgrade

As specified by the HTTP/2 RFC, "h2c" can be used by an HTTP/1.1 client to
request an upgrade to HTTP/2. In haproxy, this is not supported, so it is
silently ignored. However, Connection and Upgrade headers are
forwarded as-is on the backend side.

If using HTTP/1 on the backend side and the server supports this upgrade
mechanism, haproxy won't be able to parse the HTTP response. If using
HTTP/2, the backend mux incorrectly tries to convert the request to an
Extended CONNECT with the h2c protocol, which may also prevent the response
from being transmitted.

To fix this, flag an HTTP/1 request carrying the "h2c" or "h2" token in an
upgrade header. On converting the header list to HTX, the upgrade header is
skipped if any of these tokens is present and the H1_MF_CONN_UPG flag is
removed.

This issue can easily be reproduced using curl --http2 argument to
connect to an HTTP/1 frontend.

This must be backported up to 2.4 after a period of observation.
2024-08-01 18:23:32 +02:00
Amaury Denoyelle
a7a2db4ad5 BUG/MINOR: quic: fix fc_lost
Control layer callback get_info has recently been implemented for QUIC.
However, fc_lost always returned 0. This is because quic_get_info() does
not use the correct input argument value to identify the lost counter.

This does not need to be backported.
2024-08-01 11:35:27 +02:00
Amaury Denoyelle
522c3bea2c BUG/MINOR: quic: fix fc_rtt/srtt values
QUIC has recently implemented the get_info callback to return RTT/sRTT values.
However, it uses milliseconds, contrary to TCP which uses microseconds.
This causes smp fetch functions to return invalid values. Fix this by
converting QUIC values to microseconds.

This does not need to be backported.
2024-08-01 11:35:27 +02:00
Frederic Lecaille
f7f76b8b0d MINOR: quic: Define ->get_info() control layer callback for QUIC
This low level callback may be called by several sample fetches for
frontend connections like "fc_rtt", "fc_rttvar" etc.
Define this callback for the QUIC protocol as a pointer to quic_get_info().
The latter supports these sample fetches:
   "fc_lost", "fc_reordering", "fc_rtt" and "fc_rttvar".

Update the documentation consequently.
2024-07-31 10:29:42 +02:00
Frederic Lecaille
1733dff42a MINOR: tcp_sample: Move TCP low level sample fetch function to control layer
Add a new ->get_info() control layer callback definition to the protocol struct
to retrieve statistical counter information at the transport layer (TCPv4/TCPv6),
identified by an integer, into a long long int.
Move the TCP specific code from get_tcp_info() to the tcp_get_info() control layer
function (src/proto_tcp.c) and define it as the ->get_info() callback for
TCPv4 and TCPv6.
Note that get_tcp_info() is called for several TCP sample fetches.
This patch is useful to support some of these sample fetches for QUIC and to
keep the code simple and easy to maintain.
2024-07-31 10:29:42 +02:00
Amaury Denoyelle
bba6baff30 BUG/MEDIUM: quic: prevent conn freeze on 0RTT undeciphered content
Received QUIC packets are stored in quic_conn Rx buffer after header
protection removal in qc_rx_pkt_handle(). These packets are then removed
after quic_conn IO handler via qc_treat_rx_pkts().

If HP cannot be removed, packets are still copied into quic_conn Rx
buffer. This can happen if encryption level TLS keys are not yet
available. The packet remains in the buffer until HP can be removed and
its content processed.

An issue occurs if client emits a 0-RTT packet but haproxy does not have
the shared secret, for example after a haproxy process restart. In this
case, the packet is copied in quic_conn Rx buffer but its HP won't ever
be removed. This prevents the buffer from being purged. After some time, if
the client has emitted enough packets, Rx buffer won't have any space
left and received packets are dropped. This will cause the connection to
freeze.

To fix this, remove any 0-RTT buffered packets on handshake completion.
At this stage, 0-RTT packets are unnecessary anymore. The client is
expected to reemit its content in 1-RTT packets which are properly
deciphered.

This can easily be reproduced with HTTP/3 POST requests or by retrieving a big
enough object, which will fill the Rx buffer with ACK frames. Here is a
picoquic command to provoke the issue on haproxy startup:

$ picoquicdemo -Q -v 00000001 -a h3 <hostname> 20443 "/?s=1g"

Note that allow-0rtt must be present on the bind line to trigger the
issue. Otherwise, haproxy will reject any 0-RTT packets.

This must be backported up to 2.6.

This could be one of the reasons for GitHub issue #2549, but it's unsure
for now.
2024-07-31 10:24:53 +02:00
William Lallemand
f76e8e50f4 BUILD: ssl: replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC
Replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC in the source code, so we
won't need to set USE_OPENSSL_AWSLC in the Makefile in the long term.
2024-07-30 18:53:08 +02:00
William Lallemand
1889b86561 BUG/MEDIUM: ssl: 0-RTT initialized at the wrong place for AWS-LC
Revert patch fcc8255 "MINOR: ssl_sock: Early data disabled during
SSL_CTX switching (aws-lc)". The patch was done in the wrong callback
which is never built for AWS-LC, and applies options on the SSL_CTX
instead of the SSL, which should never be done elsewhere than in the
configuration parsing.

This was probably triggered by successfully linking haproxy against
AWS-LC without using USE_OPENSSL_AWSLC.

The patch also reintroduced SSL_CTX_set_early_data_enabled() in
ssl_quic_initial_ctx() and ssl_sock_initial_ctx(). So the initial_ctx
does have the right setting, but it still needs to be applied to the
SSL_CTX selected in the clienthello callback.

Must be backported to 3.0. (ssl_clienthello.c part was in ssl_sock.c)
2024-07-30 18:53:08 +02:00
Willy Tarreau
376b147fff BUG/MINOR: stconn: bs.id and fs.id had their dependencies incorrect
The backend depends on the response and the frontend on the request, not
the other way around. In addition, they used to depend on L6 (hence
contents in the channel buffers) while they should only depend on L5
(permanent info known in the mux).

This came in 2.9 with commit 24059615a7 ("MINOR: Add sample fetches to
get the frontend and backend stream ID") so this can be backported there.

(cherry picked from commit 61dd0156c82ea051779e6524cad403871c31fc5a)
Signed-off-by: Willy Tarreau <w@1wt.eu>
2024-07-30 18:39:29 +02:00
Christopher Faulet
d9f41b1d6e BUILD: mux-pt: Use the right name for the sedesc variable
A typo was introduced in 760d26a86 ("BUG/MEDIUM: mux-pt/mux-h1: Release the
pipe on connection error on sending path"). The sedesc variable is 'sd', not
'se'.

This patch must be backported with the commit above.
2024-07-30 10:44:00 +02:00
Christopher Faulet
760d26a862 BUG/MEDIUM: mux-pt/mux-h1: Release the pipe on connection error on sending path
When data are sent using kernel splicing, if a connection error
occurred, the pipe must be released. Indeed, in that case, no more data can
be sent and there is no reason not to release the pipe. But it is in fact an
issue for the stream because the channel will appear as not empty. This may
prevent the stream from being released. This happens on 2.8 when a filter is
also attached to it. On 2.9 and above, there seems to be no issue. But it is
hard to be sure and the current patch remains valid in all cases. On 2.6 and
lower, the code is not the same and, AFAIK, there is no issue.

This patch must be backported to 2.8. However, on 2.8, there is no zero-copy
data forwarding, so the patch must be adapted. There are no done_ff/resume_ff
callback functions for muxes. The pipe must be released in sc_conn_send() when
an error flag is set on the SE, after the call to the snd_pipe callback
function.
2024-07-30 09:05:25 +02:00
Christopher Faulet
5dc45445ff BUG/MEDIUM: stconn: Report error on SC on send if a previous SE error was set
When a send on a connection is performed, if a SE error (or a pending error)
was already reported earlier, we leave immediately. No send is performed.
However, we must be sure to report the error at the SC level if necessary.
Indeed, the SE error may have been reported during the zero-copy data
forwarding, i.e. during a receive on the opposite side. In that case, we may
have missed the opportunity to report it at the SC level.

The patch must be backported as far as 2.8.
2024-07-30 09:05:25 +02:00
Willy Tarreau
5541d4995d BUG/MEDIUM: queue: deal with a rare TOCTOU in assign_server_and_queue()
After checking that a server or backend is full, it remains possible
to call pendconn_add() just after the last pending request finishes, so
that there's no more connection on the server for very low maxconn (typically
1), leaving new ones in the queue until the timeout.

The approach depends on where the request was queued, though:
  - when queued on a server, we can simply detect that we may dequeue
    pending requests and wake them up, it will wake our request and
    that's fine. This needs to be done in srv_redispatch_connect() when
    the server is set.

  - when queued on a backend, it means that all servers are done with
    their requests. It means that all servers were full before the
    check and all were empty after. In practice this will only concern
    configs with less servers than threads. It's where the issue was
    first spotted, and it's very hard to reproduce with more than one
    server. In this case we need to load-balance again in order to find
    a spare server (or even to fail). For this, we call the newly added
    dedicated function pendconn_must_try_again() that tells whether or
    not a blocked pending request was dequeued and needs to be retried.

This should be backported along with pendconn_must_try_again() to all
stable versions, but with extreme care because over time the queue's
locking evolved.
2024-07-29 09:27:01 +02:00
Willy Tarreau
1a8f3a368f MINOR: queue: add a function to check for TOCTOU after queueing
There's a rare TOCTOU case that happens from time to time with maxconn 1
and multiple threads. Between the moment we see the queue full and the
moment we queue a request, it's possible that the last request on the
server or proxy ended and that no other one is left to offer it its place.

Given that all this code path is performance-critical and we cannot afford
to increase the lock duration, better recheck for the condition after
queueing. For this we need to be able to check for the condition and
cleanly dequeue a request. That's what this patch provides via the new
function pendconn_must_try_again(). It will catch more requests than
strictly needed, but it will catch them all. It may find that around
1/1000 of requests are at risk, though testing shows that in practice,
it's around 1 per million that really gets stuck (the other ones benefit
from timing and from late requests finishing). Maybe in the future some
conditions might be refined but it's harmless.

What happens to such requests is that they're dequeued and their pendconn
freed, so that the caller can decide to try to LB or queue them again. For
now the function is not used, it's just added separately for easier tracking.
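
A rough standalone model of that recheck-after-queueing pattern (stand-in
types only; the real code works on the server/backend queues with
pendconn_add() and pendconn_must_try_again()):

  #include <stdbool.h>

  struct toy_queue { int served; int maxconn; bool queued; };

  /* Queue the request, then recheck whether a slot was freed in between;
   * if so, dequeue it and tell the caller to retry load-balancing instead
   * of waiting for a timeout.
   */
  static bool enqueue_then_recheck(struct toy_queue *q)
  {
          q->queued = true;                 /* like pendconn_add() */

          if (q->served < q->maxconn) {     /* TOCTOU window closed under us */
                  q->queued = false;        /* like pendconn_must_try_again() */
                  return true;              /* caller should retry the LB */
          }
          return false;                     /* stay queued, will be woken up */
  }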
2024-07-29 09:27:01 +02:00
Willy Tarreau
4316ef2eab BUILD: cfgparse-quic: fix build error on Solaris due to missing netinet/in.h
Since commit 35470d518 ("MINOR: quic: activate UDP GSO for QUIC if
supported"), Solaris build fails due to netinet/udp.h being included
without netinet/in.h. Adding it is sufficient to fix the problem. No
backport is needed.
2024-07-28 14:59:23 +02:00
Christopher Faulet
46b1fec0e9 BUG/MEDIUM: jwt: Clear SSL error queue on error when checking the signature
When the signature included in a JWT is verified, if an error occurred, one
or more SSL errors are queued and never cleared. These errors may then be
caught by the SSL stack and a fatal SSL error may be erroneously reported
during an SSL receive or send.

So we must take care to clear the SSL error queue when the signature
verification failed.
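
Illustration only (not the exact haproxy code): the OpenSSL side of such a
cleanup typically looks like this, draining the thread-local error queue after
a failed verification:

  #include <stddef.h>
  #include <openssl/err.h>
  #include <openssl/evp.h>

  static int verify_and_clear(EVP_MD_CTX *ctx, const unsigned char *sig, size_t siglen)
  {
          int ret = EVP_DigestVerifyFinal(ctx, sig, siglen);

          if (ret != 1)
                  ERR_clear_error(); /* drop errors queued by the failed check */
          return ret == 1;
  }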

This patch should fix issue #2643. It must be backported as far as 2.6.
2024-07-26 16:59:00 +02:00
Frederic Lecaille
4abaadd842 MINOR: quic: Dump TX in flight bytes vs window values ratio.
Display, per packet number space, the ratio of the number of bytes in flight
versus the current window value, as a percentage.
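
The displayed metric boils down to something like this (names are stand-ins):

  /* percentage of the congestion window currently occupied by the bytes
   * in flight of one packet number space
   */
  static unsigned int inflight_ratio_pct(unsigned long long in_flight,
                                         unsigned long long cwnd)
  {
          return cwnd ? (unsigned int)(in_flight * 100ULL / cwnd) : 0;
  }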
2024-07-26 16:42:44 +02:00
Frederic Lecaille
76ff8afa2d MINOR: quic: Add information to "show quic" for CUBIC cc.
Add a new ->state_cli() callback to the quic_cc_algo struct to define a
function called by the "show quic (cc|full)" commands to dump some information
about the congestion algorithm internal state currently in use by the QUIC
connections.

Implement this callback for CUBIC algorithm to dump its internal variables:
   - K: the time to reach the cubic curve inflexion point,
   - last_w_max: the last maximum window value reached before entering
     the last recovery period. This is also the window value at the
     inflexion point of the cubic curve,
   - wdiff: the difference between the current window value and last_w_max.
     So negative before the inflexion point, and positive after.
2024-07-26 16:42:44 +02:00
Willy Tarreau
2dab1ba84b MEDIUM: h1: allow to preserve keep-alive on T-E + C-L
In 2.5-dev9, commit 631c7e866 ("MEDIUM: h1: Force close mode for invalid
uses of T-E header") enforced a recently arrived new security rule in the
HTTP specification aiming at preventing a class of content-smuggling
attacks involving HTTP/1.0 agents. It consists in handling the very rare
T-E + C-L requests or responses in close mode.

It happens to have an impact on a rare few very old clients
(probably running insecure TLS stacks, by the way) that continue to send
both headers with their POST requests. The impact is that for each and every
request they'll have to reconnect, possibly negotiating a full TLS
handshake that becomes harmful to the machine in terms of CPU computation.

This commit adds a new option "h1-do-not-close-on-insecure-transfer-encoding"
that does exactly what it says, it just asks not to close on such messages,
even though the message continues to be sanitized and C-L dropped. It means
that the risk is only between the sender and haproxy, which is limited, and
might be the only acceptable solution for such environments having to deal
with broken implementations.
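
A hypothetical configuration sketch, assuming the keyword is used like other
h1 options in a proxy section (check the documentation for the exact
placement):

  defaults
      mode http
      # keep keep-alive for old clients sending both T-E and C-L;
      # the message is still sanitized and C-L dropped
      option h1-do-not-close-on-insecure-transfer-encoding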

The cases are so rare that it should not need to be backported, or in the
worst case, to the latest LTS if there is any demand.
2024-07-26 15:59:35 +02:00
Amaury Denoyelle
85131f91bf BUG/MEDIUM: quic: fix invalid conn reject with CONNECTION_REFUSED
quic-initial rules were implemented just recently. For some actions, a
new flags field was added in quic_dgram structure. This is used to
report the result of the rules execution.

However, this flags field was left uninitialized. Depending on its
value, it may cause the connection to be wrongly rejected via
CONNECTION_REFUSED. Fix this by properly setting the flags value to 0.

No need to backport.
2024-07-26 15:24:35 +02:00
Amaury Denoyelle
08515af9df MINOR: quic: implement send-retry quic-initial rules
Define a new quic-initial "send-retry" rule. This allows to force the
emission of a Retry packet on an initial without token instead of
instantiating a new QUIC connection.
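
A hypothetical configuration sketch (bind parameters are placeholders):

  frontend quic_in
      bind quic4@:443 ssl crt example.pem alpn h3
      # force address validation: answer tokenless Initials with a Retry
      quic-initial send-retry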
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
69d7e9f3b7 MINOR: quic: implement reject quic-initial action
Define a new quic-initial action named "reject". Contrary to dgram-drop,
the client is notified of the rejection by a CONNECTION_CLOSE with
CONNECTION_REFUSED error code.

To be able to emit the necessary CONNECTION_CLOSE frame, quic_conn is
instantiated, contrary to dgram-drop action. quic_set_connection_close()
is called immediately after qc_new_conn(), which prevents the handshake
startup.
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
f91be2657e MINOR: quic: pass quic_dgram as obj_type for quic-initial rules
To extend quic-initial rules, pass the quic_dgram instance as an argument to
the various actions. As such, quic_dgram is now supported as an obj_type
and can be used in the session origin field.
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
1259700763 MINOR: quic: support ACL for quic-initial rules
Add ACL condition support for quic-initial rules. This requires the
extension of quic_parse_quic_initial() to parse an extra if/unless
block.

Only layer4 client samples are allowed to be used with quic-initial
rules. However, due to the early execution of quic-initial rules prior
to any connection instantiation, some samples are not supported.

To be able to use the 4 described samples, a dummy session is
instantiated before quic-initial rules execution. Its src and dst fields
are set from the received datagram values.
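
A hypothetical configuration sketch using a layer4 client sample (file and
bind parameters are placeholders):

  frontend quic_in
      bind quic4@:443 ssl crt example.pem alpn h3
      # silently drop Initial datagrams coming from blocked networks
      quic-initial dgram-drop if { src -f /etc/haproxy/blocked-nets.lst }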
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
cafe596608 MEDIUM: quic: implement quic-initial rules
Implement a new set of rules labelled as quic-initial.

These rules are specific to QUIC. They are scheduled to be executed early,
on Initial packet parsing, prior to a new QUIC connection instantiation.
Contrary to tcp-request connection, this allows rejecting traffic
earlier, most notably by avoiding unnecessary QUIC SSL handshake
processing.

A new module quic_rules is created. Its main function
quic_init_exec_rules() is called on Initial packet parsing in function
quic_rx_pkt_retrieve_conn().

For the moment, only "accept" and "dgram-drop" are valid actions. Both
are final. The latter drops silently the Initial packet instead of
allocating a new QUIC connection.
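
A hypothetical configuration sketch of these first two actions (bind
parameters are placeholders; conditions only became usable with the ACL
support added separately):

  frontend quic_in
      bind quic4@:443 ssl crt example.pem alpn h3
      # final actions: either accept the Initial packet...
      quic-initial accept
      # ...or silently drop the whole datagram (no connection is allocated)
      #quic-initial dgram-drop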
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
a72e82c382 MINOR: quic: delay Retry emission on quic-force-retry
Currently, quic Retry packets are emitted for two different reasons
after processing an Initial without token:
- quic-force-retry is set on the bind line
- an abnormal number of half-open connections is currently detected

Previously, these two conditions were checked separately in different
functions during datagram parsing. Uniformize this by moving the
quic-force-retry check into quic_rx_pkt_retrieve_conn() along with the second
condition check.

The purpose of this patch is to uniformize datagram parsing stages. It
is necessary to implement quic-initial rules in
quic_rx_pkt_retrieve_conn() prior to any Retry emission. This prevents
emitting an unnecessary Retry if an Initial is subject to a reject rule.
2024-07-25 15:29:50 +02:00
Aurelien DARRAGON
e328056ddc MEDIUM: sink: assume sft appctx stickiness
As mentioned in b40d804 ("MINOR: sink: add some comments about sft->appctx
usage in applet handlers"), there are few places in the code where it
looks like we assumed that the applet callbacks such as
sink_forward_session_init() or sink_forward_io_handler() could be
executing an appctx whose sft is detached from the appctx
(appctx != sft->appctx).

In practice this should not be happening since an appctx sticks to the
same thread for its entire lifetime, and the only times sft->appctx is
effectively assigned are during the session/appctx creation (in
process_sink_forward()) or release.

Thus if sft->appctx didn't point to the appctx that the sft was bound
to after appctx creation, it would probably indicate a bug rather than
an expected condition. To further emphasize that and prevent the
confusion, and since 3.1-dev4 was released, let's remove such checks and
instead add a BUG_ON to ensure this never happens.

In _sink_forward_io_handler(), the "hard_close" label was removed since
there are no more uses for it (no hard errors may be caught from the
function for now).
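
The assertion amounts to something like this (standalone sketch with a local
stand-in for haproxy's BUG_ON() macro):

  #include <stdio.h>
  #include <stdlib.h>

  #define BUG_ON_STANDIN(cond) do {                                            \
          if (cond) {                                                          \
                  fprintf(stderr, "FATAL: bug condition \"%s\" matched\n", #cond); \
                  abort();                                                     \
          }                                                                    \
  } while (0)

  struct sft { void *appctx; };

  /* an appctx is expected to stay bound to the same sft for its whole
   * lifetime; anything else is a bug, not a condition to work around
   */
  static void check_sft_binding(const struct sft *sft, void *appctx)
  {
          BUG_ON_STANDIN(sft->appctx != appctx);
  }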
2024-07-25 14:56:19 +02:00