Commit graph

18279 commits

Author SHA1 Message Date
Valentine Krasnobaeva
df68f7ec96 BUG/MINOR: cfgparse-global: fix allowed args number for setenv
Keywords setenv and presetenv take 2 arguments: variable name and value.
So the total number that should be passed to alertif_too_many_args is 2
("setenv <name> <value>") instead of 3. For alertif_too_many_args, the first
argument index is 0.
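
For reference, a minimal global section using both keywords (variable names
and values below are just placeholders):

   global
      setenv    APP_DIR  /opt/app
      presetenv APP_MODE production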

This should be backported to all stable versions.
2024-10-01 10:35:09 +02:00
Christopher Faulet
273d322b6f MINOR: stream/stats: Expose the total number of streams ever created in stats
A shared counter is added in the thread context to track the total number of
streams created on the thread. This number is then reported in stats. It
will be a useful information to diagnose some bugs.
2024-09-30 16:55:53 +02:00
Christopher Faulet
18ee22ff76 MINOR: stream/stats: Expose the current number of streams in stats
A shared counter is added in the thread context to track the current number
of streams. This number is then reported in stats. It will be a useful
information to diagnose some bugs.
2024-09-30 16:55:53 +02:00
Christopher Faulet
6a94b7419e MINOR: stream: Support dynamic changes of the number of connection retries
Thanks to the previous patch, it is now possible to add an action to
dynamically change the maximum number of connection retries for a stream.
The "set-retries" action may now be used to do so, from a "tcp-request content"
or an "http-request" rule. This action accepts an expression or an integer
between 0 and 100. The integer value is checked during configuration
parsing and leads to an error if it is not in the expected range. However,
for the expression, the value is retrieved at runtime, so invalid values are
just ignored.

Too high a value is forbidden to avoid any trouble. 100 retries already
seems to be an amazingly high value. In addition, the option is only
available in backend or listen sections.

Because the max retries is limited to 100 at most, it can be stored as an
unsigned short. This saves some space in the stream structure.
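
As an illustration, here is a hedged configuration sketch (the exact action
syntax is assumed from the description above):

   backend app
      retries 3
      # fixed value, validated at configuration parsing time
      http-request set-retries 10 if { path_beg /payments }
      # expression evaluated at runtime; out-of-range values are ignored
      tcp-request content set-retries int(5)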
2024-09-30 16:55:53 +02:00
Christopher Faulet
91e785edc9 MINOR: stream: Rely on a per-stream max connection retries value
Instead of directly relying on the backend parameter to limit the number of
connection retries, we now use a per-stream value. This value is by default
inherited from the backend value when it is set. So for now, there is no
change except the stream value is used instead of the backend value. But
thanks to this change, it will be possible to dynamically change this value.
2024-09-30 16:55:53 +02:00
Christopher Faulet
0d91de2be4 MINOR: action: Export release_expr_int_action() release function
This function was only used by TCP actions and was private to the tcp_act.c
file. However, it makes sense to make it public so it can be used by any action
relying on an int-or-expression argument.
2024-09-30 16:55:53 +02:00
Christopher Faulet
688abb6f30 BUG/MINOR: mcli: Pretend the mux have more data to deliver between two commands
Since the commit "OPTIM: stconn: Don't pretend mux have more data to deliver
on EOI/EOS/ERROR", the SC no longer pretend its mux have more data to
deliver when one of EOI/EOS/ERROR flags are set on its sedesc.

However, for the master cli, it is an issue because any EOI/EOS at the end
of a command is in fact detected on the attempt to get the next command. To
do so, the stream is reset. Because if the commit above, the next received
is never performed. To fix the issue, when the stream is reset, the front SC
pretend its mux have more data to deliver.

This patch must only be backported if the commit above is backported.
2024-09-30 16:55:53 +02:00
Christopher Faulet
bca5e14235 OPTIM: stconn: Don't pretend mux have more data to deliver on EOI/EOS/ERROR
Doing some benchmarks on 3.0, we encountered a small loss in requests/sec on
small objects compared to 2.8. After bisecting the issue, it appeared
that this was introduced when the mux-to-mux zero-copy data forwarding was
implemented in 2.9-dev8. Extra subscribes on receives at the end of the
message were responsible for the loss.

A basic configuration, sending H2 requests to an H1 server returning
responses without payload, is enough to observe the issue. With the following
command, we can observe a huge increase in epoll_ctl calls on 2.9/3.x:

  h2load -c 100 -m 10 -n 100000 http://...

On 2.8 we have around 3200 calls to epoll_ctl against more than 20k on 3.1.

The fix seems obvious. After a receive, there is no reason to state that a mux
has more data to deliver if an EOI/EOS/ERROR flag was set on the
stream-endpoint descriptor. With this change, the extra calls to epoll_ctl
disappear. However, it is a sensitive part, so it is important to keep an eye
on it and to not backport it.

Thanks to Willy and Emeric for having spotted the issue.
2024-09-30 16:55:48 +02:00
Willy Tarreau
11051ed9c7 OPTIM: channel: speed up co_getline()'s search of the end of line
Previously, co_getline() was essentially used for occasional parsing
in the peers' banner or Lua, so it could afford to read one character at
a time. However now it's also used on the TCP log path, where it can
consume up to 40% CPU as mentioned in GH issue #2731. Let's speed it
up by using memchr() to look for the LF, and copying the data at once
using memcpy().

Previously it would take 2.44s to consume 1 GB of log on a single
thread of a Core i7-8650U, now it takes 1.56s (-36%).
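
A minimal sketch of the idea (illustration only, not the actual haproxy code,
and assuming a contiguous input area while the real channel buffer may wrap):

  #include <string.h>

  /* Copy one line (up to and including the LF, or the whole input if no LF
   * is found) from <in> into <out>, locating the LF with memchr() instead
   * of scanning one character at a time. Returns the number of bytes copied.
   */
  static size_t getline_fast(const char *in, size_t in_len, char *out)
  {
          const char *lf = memchr(in, '\n', in_len);
          size_t len = lf ? (size_t)(lf - in) + 1 : in_len;

          memcpy(out, in, len);
          return len;
  }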
2024-09-30 11:36:39 +02:00
Willy Tarreau
1d403caf8a MINOR: server: make srv_shutdown_sessions() call pendconn_redistribute()
When shutting down server sessions, the queue was not considered, which
is a problem if some element reached the queue at the moment the server
was going down, because there will be no more requests to kick them out
of it. Let's always make sure we scan the queue to kick these streams
out of it and that they can possibly find a more suitable server. This
may make a difference in the time it takes to shut down a server on the
CLI when lots of servers are in the queue.

It might be interesting to backport this to 3.0 but probably not much
further.
2024-09-27 19:01:38 +02:00
Willy Tarreau
1385e33eb0 BUG/MINOR: queue: make sure that maintenance redispatches server queue
Turning a server to maintenance currently doesn't redispatch the server
queue unless there's an explicit "option redispatch" and no "option
persist", while the former has never really been the purpose of this
test. Better refine this so that forced maintenance also causes the
queue to be flushed, and possibly redispatched unless the proxy has
option persist. This way now when turning a server to maintenance,
the queue is immediately flushed and streams can decide what to do.

This can be backported, though there's no need to go far since it was
never directly reported and only noticed as part of debugging some
rare "shutdown sessions" strangeness, which it might have contributed to.
2024-09-27 18:54:07 +02:00
Willy Tarreau
b8e3b0a18d BUG/MEDIUM: stream: make stream_shutdown() async-safe
The solution found in commit b500e84e24 ("BUG/MINOR: server: shut down
streams under thread isolation") to deal with inter-thread stream
shutdown doesn't work fine because there exists code paths involving
a server lock which can then deadlock on thread_isolate(). A better
solution then consists in deferring the shutdown to the stream itself
and just wake it up for that.

The only thing is that TASK_WOKEN_OTHER is a bit too generic and we
need to pass at least 2 types of events (SF_ERR_DOWN and SF_ERR_KILLED),
so we're now leveraging the new TASK_F_UEVT1 and _UEVT2 flags on the
task's state to convey this information. The caller only needs to wake the
task up with these flags set, and the stream handler will then finish
the job locally using stream_shutdown_self().

This needs to be carefully backported to all branches affected by the
dequeuing issue and containing any of the 5541d4995d ("BUG/MEDIUM:
queue: deal with a rare TOCTOU in assign_server_and_queue()"), and/or
b11495652e ("BUG/MEDIUM: queue: implement a flag to check for the
dequeuing").
2024-09-27 12:15:41 +02:00
Willy Tarreau
d1c398b786 Revert "BUG/MINOR: server: shut down streams under thread isolation"
This reverts commit b500e84e24.

Thread isolation does not work well for this, there exists code paths
which already hold the server's lock and result in a deadlock. Let's
revert that and address it better without isolation.
2024-09-27 10:17:31 +02:00
Aurelien DARRAGON
e3eb6a9035 MEDIUM: log: consider log-steps proxy setting for existing log origins
During tcp/http transaction processing, haproxy may produce logs at
different steps during the processing (accept, connect, request,
response, close). But the behavior is hardly configurable because
haproxy will only emit a single log per transaction, and by default
it will try to produce the log once all log aliases or fetches used
in the logformat could be satisfied, which means the log is often
emitted during connection teardown, unless "option logasap" is used.

We were often asked to have a way to emit multiple logs for a single
transaction, for instance emitting a log during accept, then request,
response and close; see GH #401 for more context.

Thanks to "log-steps" keyword introduced by commit "MINOR: log:
introduce "log-steps" proxy keyword", it is now possible to explictly
configure when logs should be generated by haproxy when processing a
transaction. This commit adds the required checks so that log-steps
proxy option is properly considered for existing logs generated by
haproxy. If "log-steps" is not specified on the proxy, the old behavior
is preserved.

Note: a slight cpu overhead should only be visible when "log-steps"
keyword will be used due to the implementation relying on eb32 lookup
instead of basic bitfield check as described in "MINOR: proxy: add
log_steps struct member". However, the default behavior shouldn't be
affected.

When combining log-steps with log-profiles, user has the ability to
explicitly control how and when haproxy should generate logs during
requests handling.
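
As an illustration, a hedged configuration sketch (step names are taken from
the origins listed above; the exact keyword syntax may differ):

   frontend fe_main
      bind :8080
      log 127.0.0.1:514 local0
      # emit one log line at each of these processing steps
      log-steps request,response,close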
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
4189eb7aca MINOR: log: add log_orig_proxy() helper function
Function may be used on proxy where log-steps are used to check if a given
log origin should be handled or not.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
c043d5d372 MINOR: log: introduce "log-steps" proxy keyword
For now it is only available for proxies with frontend capability because
log-steps are only evaluated under sess_log() or strm_log() which
essentially focus on the frontend side when it comes to log settings so
it's better to keep it this way for better consistency, at least for now.

For now the setting does nothing (it is not considered during runtime),
it will be implemented and documented in upcoming commits.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
9341792baf MINOR: proxy: add log_steps struct member
Add the proxy->conf.log_steps eb32 root tree, which will be used to store the
log origin identifiers that should result in haproxy emitting a log, as
configured by the user using the upcoming "log-steps" proxy keyword.

It was chosen to use an eb32 tree instead of a simple bitfield because despite
the slight overhead it is more future-proof given that we already
implemented the prerequisites for seamless custom log origins registration
that will also be usable from "log-steps" proxy keyword.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
b882402a29 MINOR: log: support extra log origins for '%OG' alias
Following previous commits, let's improve log_orig_to_str() so that
extra log origins (registered through log_orig_register()) can be
translated to string from origin ID.

For that, it is required to add an eb32 tree node to the log_origin struct in
order to enable quick integer lookups during runtime. Slow name lookup
using the list is acceptable for config parsing, but it is not the case
during runtime when log_orig_to_str() is expected to be used. Also, to
prevent duplicated info, get rid of the ->id field and use ->tree.key instead.
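
For reference, '%OG' can then be used like any other alias in a log-format
string, e.g. (illustrative only):

   log-format "%ci:%cp [%tr] %OG %ST %B"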
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
f8bb9d5c57 MINOR: log: explicitly handle extra log origins as error when relevant
Thanks to the previous commit, we can now check for log_orig optional flags
in functions taking struct log_orig as parameter. Let's take this
opportunity to add the LOG_ORIG_FL_ERROR flag and check this flag at a
few places to handle the log message differently, because if the flag is
set then the caller expects the log to be handled explicitly as an error.

e.g.: in _process_send_log_override(), if the flag is set, use the error
log format instead of the dedicated one.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
3c15ee05e9 MINOR: log: introduce log_orig flags
Rename 'enum log_orig' to 'enum log_orig_id', since this enum specifically
contains the log origin ids.

Add 'struct log_orig' which wraps 'enum log_orig' with optional flags
(no flags defined for now).

Add log_orig() helper func that takes id and flags as parameters and
returns a log_orig struct initialized with the input arguments.

Update functions taking log origin as parameter so they explicitly take
log orig id or log orig wrapper as argument depending on the level of
context expected by the function.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
6567e37680 MINOR: log: handle extra log origins in _process_send_log_override()
Thanks to the previous commit, it is now possible to register additional
log origins that may be used from log-profile section as 'on' steps.

As such, let's make _process_send_log_override() function aware of them
by trying to lookup in the tree of extra logging steps in the default
switch-case catchall. If the log origin id matches with the id of the
extra logging step, we use the associated log format instead of the
"any" log format.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
818475c5cc MINOR: log: introduce extra log profile steps
Add a way to register additional log origins using log_origin_register();
these may be used as log profile steps from log-profile sections.

For now this does nothing as no extra origins are registered and extra log
origins are not yet considered for runtime logging paths.

When specifying an extra logging step for "on <step>" under a log-profile
section, the logging step is stored within a binary tree for efficient
lookup during runtime. No performance impact should be expected if extra
log origins are not being used, and a slight performance impact if extra
log origins are used.

Don't forget to update the documentation when new log origins are added
(both the %OG log alias and the "on <step>" log-profile keyword are concerned).
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
facf259d88 MINOR: log: fix indent in strm_log()
8f34320e15 ("MINOR: log: provide log origin in logformat expressions
using '%OG'") caused wrong indent in strm_log()
2024-09-26 16:53:07 +02:00
Oliver Dala
a889413f5e BUG/MEDIUM: cli: Deadlock when setting frontend maxconn
The proxy lock state isn't passed down to relax_listener
through dequeue_proxy_listeners, which causes a deadlock
in relax_listener when it tries to get that lock.

Backporting: Older versions didn't have relax_listener and directly called
resume_listener in dequeue_proxy_listeners. lpx should just be passed directly
to resume_listener then.

The bug was introduced in commit 001328873c

[cf: This patch should fix the issue #2726. It must be backported as far as
2.4]
2024-09-25 17:12:11 +02:00
Christopher Faulet
14a413033c BUG/MEDIUM: cli: Be sure to catch immediate client abort
A client abort while nothing was sent is properly handled except when it
happens immediately after the connection was accepted. The read0 event is
caught before the CLI applet is created. In that case, the shutdown is not
handled and the applet is no longer woken up. In that case, the stream remains
blocked and no timeout is armed.

The bug was due to the fact that when the applet I/O handler was called for
the first time, the applet context was initialized and nothing more was
performed. A shutdown, if any, would be handled on the next call. In that
case, it was too late.

Now, after the init step, we loop to evaluate the first command. There is no
command here but the shutdown will be tested.

This patch should fix the issue #2727. It must be backported to 3.0.
2024-09-24 18:01:38 +02:00
Aurelien DARRAGON
d622f9d5b6 MEDIUM: mailers: warn about deprecated legacy mailers
As mentioned in 2.8 announce on the mailing list [1] and on the wiki [2],
use of legacy mailers is now deprecated and will not be supported anymore
starting with version 3.3. Use of Lua script (AKA Lua mailers) is now
encouraged (and fully supported since 2.8) for this purpose, as it offers
more flexibility (e.g: alerts can be customized) and is more future-proof.

Configurations relying on legacy mailers will now raise a warning.

Users willing to keep their existing mailers config in a working state
should simply add the following line to their global section:

   # mailers.lua file as provided in the git repository
   # adjust path as needed
   lua-load examples/lua/mailers.lua

[1]: https://www.mail-archive.com/haproxy@formilux.org/msg43600.html
[2]: https://github.com/haproxy/wiki/wiki/Breaking-changes
2024-09-23 20:16:27 +02:00
Willy Tarreau
fdf38ed7fc BUG/MINOR: proxy: also make the cli and resolvers use the global name
As detected by ASAN on the CI, two places still using strdup() on the
proxy names were left by commit b325453c3 ("MINOR: proxy: use the global
file names for conf->file").

No backport is needed.
2024-09-21 20:08:06 +02:00
Willy Tarreau
b500e84e24 BUG/MINOR: server: shut down streams under thread isolation
Since the beginning of thread support, the shutdown of streams attached
to a server was run under the server's lock, but that's not sufficient.
It indeed turns out that shutting down streams (either from the CLI using
"shutdown sessions server XXX" or due to "on-error shutdown-sessions")
iterates over all the streams to shut them down, but stream_shutdown()
has no way to protect its actions against concurrent actions from the
stream itself on another thread, and streams offer no such provisions
anyway.

The impact is some rare but possible crashes when shutting down streams
from the CLI in competition with high server traffic. The probability
is low enough to mark it minor, though it was observed in the field.

At least since 2.4 the streams are arranged in per-thread lists, so it
likely would be possible using the event subsystem to delegate these
events to dedicated per-thread tasks which would address the problem.
But server streams don't get killed often enough to justify such extra
complexity, so better just run the loop under thread isolation.

It also shows that the internal API could probably be improved to
support a lighter thread exclusion instead of full isolation: various
places want to only exclude one thread and here it could work. But
again there's no point doing this for now.

This patch should be backported to all stable branches. It's important
to carefully check that this srv_shutdowns_streams() function is never
called itself under isolation in older versions (though at first glance
it looks OK).
2024-09-21 19:35:35 +02:00
Willy Tarreau
e77c73316a MEDIUM: cfgparse: warn about deprecated use of duplicate server names
As discussed below, there are too many problems and limitations caused
by still supporting duplicate server names. That's already particularly
complicated and dissuasive to use since it requires these servers to
have explicit IDs to be accepted. Let's now warn on any duplicate, even
with explicit IDs, and remind that this will become forbidden in 3.3.
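
For example, a configuration like the following, which previously required
explicit IDs to be accepted, now triggers the deprecation warning:

   backend app
      server web1 192.0.2.10:80 id 1
      server web1 192.0.2.11:80 id 2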

Link: https://www.mail-archive.com/haproxy@formilux.org/msg45185.html
2024-09-20 17:15:11 +02:00
Willy Tarreau
029d75df1e OPTIM: cfgparse: speed up duplicate server detection
Surprisingly, the duplicate server name detection has never made use
of the names tree, so lookups were still in O(N^2). It took 1 second
to validate 50k servers spread into 25 backends at 2k per backend.

By simply using the tree (and since the current server already is in
the tree), we just have to walk using ebpt_prev_dup to visit previous
servers with the same name. We can then detect which ones conflict
without having an ID set, and report an error. The config check time is now 1/4
of the previous one for 2k servers per backend, and more importantly
it will make it simpler to check for any duplicates later.
2024-09-20 17:14:50 +02:00
Willy Tarreau
ccd1ecba1d MEDIUM: cfgparse: drop duplicate named defaults sections after use
It has never been permitted to explicitly reference named defaults
sections for which there are duplicate names. This means that when
a duplicate defaults section is found, there's no point in keeping
it since it will never be used for lookups, so it can be dropped.

However, some such defaults sections might have some rules in them
that are implicitly referenced by proxies placed after them. In this
case they cannot be removed.

What is done here is that upon each new named section creation, if
another one is found with the same name, its config location is stored
into the new proxy's {prev_file,prev_line} pair, and the old section is
either destroyed if its refcount is null, or just unindexed. The dup
check when creating a new proxy now consists in checking the prev_line
instead of performing a dup lookup on the defaults section.

This will guarantee that we can't find duplicate defaults sections in
their tree anymore, while still keeping track of what's allocated and
releasing everything upon exit.

Beyond the consistency gain, there are nice savings for large configs
involving many defaults sections: a test with 300k sections saved
about 1.9 GB of RAM, and started 25% faster likely thanks to spending
less time allocating memory.
2024-09-20 16:35:32 +02:00
Willy Tarreau
c8b813771d MINOR: proxy: add a list of orphaned defaults sections
We'll soon delete unreferenced and duplicated named defaults sections
from the list of proxies. The problem with this is that this list (in
fact a name-based tree) is used to release all of them at the end. Let's
add a list of orphaned defaults sections, typically those containing
"http-check send" statements or various other rules, and that are
implicitly inherited by a proxy hence have a non-zero refcount while
also having a name. This now makes it possible to remove them from
the name index while still keeping their memory around for the lifetime
of the process, and cleaning it at the end.
2024-09-20 15:59:04 +02:00
Willy Tarreau
cb4c236fac BUG/MINOR: cfgparse: detect another uncaught case of duplicate defaults
The following sequence was not properly caught:

   defaults def
   backend back from def
   defaults def

But this one was:

   defaults def
   defaults def
   backend back from def

Let's check when defaults are declared that they're not already
referenced.

Better not backport this. While it will catch broken configs (possibly
some with backends pasted after the wrong defaults), these might still
work by accident. It may be reported as a diag warning though.
2024-09-20 15:58:10 +02:00
Willy Tarreau
5b221d1e41 CLEANUP: cfgparse: factor proxy vs log-forward collisions
This simplifies the check added in 1a38684fbc ("MEDIUM: cfgparse:
detect collisions between defaults and log-forward"), by factoring it
with the other existing one.

The tests are ugly in that code because a first block tests pure
proxies, a second one proxies or defaults and inside that one we
have special cases for defaults. Let's just move the tests to the
"any proxy type" block.
2024-09-20 14:13:14 +02:00
Willy Tarreau
b325453c36 MINOR: proxy: use the global file names for conf->file
Proxy file names are assigned a bit everywhere (resolvers, peers,
cli, logs, proxy). All these elements were enumerated and now use
copy_file_name(). The only ha_free() call was turned to drop_file_name().

As a bonus side effect, a 300k backend config saved 14 MB of RAM.
2024-09-19 15:38:19 +02:00
Willy Tarreau
9ab21a3c2d CLEANUP: stick-table: make the file location point to a global file name
The file name used to point to the calling function's stack for stick
tables, which was OK during parsing but remained dangling afterwards.
At least it was already marked const so as not to accidentally free it.
Let's make it point to a file_name_node now.
2024-09-19 15:38:19 +02:00
Willy Tarreau
d6c060c5ae MINOR: tools: add minimal file name management
In proxies, stick-tables, servers, etc... at plenty of places we store
a file name and a line number. Some file names are the result of strdup()
(e.g. in proxies), others not (e.g. stick-tables) and leave dangling
pointers at the end of parsing. The risk of double-free is not null
either.

In order to stop this, let's first add a simple tool that allows registering
short strings inside a global list, these strings happening
to be file names. The strings are either duplicated and stored upon
failure to find them, or just added to this storage. Since file names
are not expected to disappear before the end of the process, for now
we don't even implement refcounting, and we free them all at the end.
There's already a drop_file_name() function to reset the pointer like
ha_free() used to do, and even if not strictly needed it's a good
habit to get used to doing it.

The strings are returned as const so that they're stored as-is in
structs, and that nasty free() calls are easily caught. The pointer
points to the char[] storage inside the node itself. This way later
if we want to implement refcounting, it will be trivial to just look
up a string and change its associated node's refcount. If needed,
comparisons can also be made on pointers.

For now they're not used yet and are released on deinit().
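
A minimal sketch of the string-interning idea (generic illustration only, not
the actual haproxy API; names are hypothetical):

  #include <stdlib.h>
  #include <string.h>

  struct name_node {
          struct name_node *next;
          char name[];                  /* string stored inline in the node */
  };

  static struct name_node *names_head;

  /* Return a long-lived const pointer for <file>, reusing the stored copy
   * if the name was already registered, duplicating it otherwise.
   */
  static const char *intern_file_name(const char *file)
  {
          struct name_node *n;

          for (n = names_head; n; n = n->next)
                  if (strcmp(n->name, file) == 0)
                          return n->name;

          n = malloc(sizeof(*n) + strlen(file) + 1);
          if (!n)
                  return NULL;
          strcpy(n->name, file);
          n->next = names_head;
          names_head = n;
          return n->name;
  }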
2024-09-19 15:36:58 +02:00
Willy Tarreau
1a38684fbc MEDIUM: cfgparse: detect collisions between defaults and log-forward
Sadly, when log-forward sections were introduced, they took great care to avoid
collisions with regular proxies, but defaults were missed (they need to be
explicitly checked for). So now we have to move them to a warning for 3.1
instead of rejecting them.
2024-09-18 18:08:15 +02:00
Willy Tarreau
d8f4b07e40 MEDIUM: cfgparse: warn about colliding names between defaults and proxies
In order to complete the checks added in 303a66573d ("MEDIUM: cfgparse:
warn about proxies having the same names"), we also need to warn about
regular proxies having the same name as defaults sections as well as
defaults sections having the same name as proxies, since defaults
sections are inherently proxies, albeit stored in a separate list for
now.
2024-09-18 18:08:06 +02:00
Amaury Denoyelle
fcd6d29acf BUG/MINOR: mux-quic: report glitches to session
A glitch counter was implemented for QUIC/HTTP3. The counter is stored in
the QCC MUX connection instance. However, it is never reported at the
session level, which is necessary if the glitch counter is tracked via a
stick-table.

To fix this, use session_add_glitch_ctr() in various QUIC MUX functions
which may increment glitch counter.

This should be backported up to 3.0.
2024-09-18 16:11:03 +02:00
Willy Tarreau
303a66573d MEDIUM: cfgparse: warn about proxies having the same names
As discussed below, there are too many problems and uncaught bugs
in the parser when trying to support proxies having similar names
but different types. There's specific code to detect the presence
of stick-tables in a pair of such proxies for example. It's even
possible that certain combinations of backend+listen that were not
previously detected have some nasty side effects.

According to the proposal in the discussion, this is now deprecated
in 3.1 (thus we emit a warning) and will become forbidden in 3.3.

A backport might be useful, but reporting a diag_warning only, not a
classical warning, so as not to break setups running in zero-warning
mode.

It was verified with a config involving all 9 combinations of
(frontend,backend,listen) followed by one of the same three that all
collisions are now properly blocked and that only back+front are kept
and emit a warning.

Link: https://www.mail-archive.com/haproxy@formilux.org/msg45185.html
2024-09-17 19:55:00 +02:00
Willy Tarreau
c70906c8a1 BUG/MINOR: cfgparse: detect incorrect overlap of same backend names
As reported below, it's possible to declare a backend then a proxy with
the same name, because for the proxy we check a frontend capability (the
first one to be tested):

   backend b
   listen b
        bind :8888

Let's check the two capabilities in this case and not just the frontend.

Better not backport this, as there's a risk of breakage of existing
setups that work by accident. It might make sense to report them as
diag warnings though.

Link: https://www.mail-archive.com/haproxy@formilux.org/msg45185.html
2024-09-17 19:55:00 +02:00
Aurelien DARRAGON
17e52c922b BUG/MINOR: cfgparse-listen: fix option httpslog override warning message
"option httpslog" override warning messaged used to be reported as
"option httplog", probably as a result of copy paste without adjusting
the context. Let's fix that to prevent emitting confusing warning messages

The issue exists since 98b930d ("MINOR: ssl: Define a default https log
format"), thus it should be backported up to 2.6
2024-09-17 15:40:02 +02:00
Aurelien DARRAGON
bc4bf5779f BUG/MINOR: fix missing "'option httpslog' overrides previous 'option tcplog clf'..." detection
Same as b85edd44db0 ("BUG/MINOR: fix missing "log-format overrides
previous 'option tcplog clf'..." detection") but for "option httpslog"
keyword.

No backport needed unless fd48b28 ("MINOR: Implements new log format of
option tcplog clf") is.
2024-09-17 15:40:02 +02:00
Aurelien DARRAGON
607b9adc9b BUG/MINOR: fix missing "log-format overrides previous 'option tcplog clf'..." detection
In commit fd48b28315 ("MINOR: Implements new log format of option tcplog clf")
"option tcplog clf" detection was correcly added for "option tcplog" and
"option httplog", but "log-format" case was overlooked. Thus, this config
would report erroneous warning message:

  defaults
    option tcplog clf
    log-format "ok"

[WARNING]  (727893) : config : parsing [test.conf:3]: 'log-format' overrides previous 'log-format' in 'defaults' section.

No backport needed unless fd48b28315 is.
2024-09-17 14:41:58 +02:00
Willy Tarreau
499e057644 MEDIUM: clock: don't compute before_poll when using monotonic clock
There's no point keeping both clocks up to date; if the monotonic clock
is ticking, let's just refrain from updating the wall clock one before
polling since we won't use it. We still do it after polling however as
we need a wall clock time to communicate with outside.

This saves one gettimeofday() call per loop and two timeval comparisons.
2024-09-17 09:08:10 +02:00
Willy Tarreau
24496803d1 MEDIUM: clock: use the monotonic clock for idle time calculation
By just keeping a copy of the last known value before entering
polling, we can apply the same algorithm as we're currently using,
except that it's now applied to the monotonic clock instead of the
wall clock, when it's detected that it's ticking. This improves
idle time calculation accuracy by making it independent of the
wall clock.
2024-09-17 09:08:10 +02:00
Willy Tarreau
4150851ce5 MEDIUM: clock: opportunistically use CLOCK_MONOTONIC for the internal time
We already collect CLOCK_MONOTONIC when it's available when leaving the
poller, but it's only used for profiling. The functions that return it
set the value to zero when it's not available, so we can use that to
detect if it works or not. The idea is that if the monotonic time is
non-zero, it is ticking and usable, then we use it for now_ns, otherwise
we use the corrected date. We continue to apply the now_offset to the
returned value because it helps forcing an early time wrap-around.

Proceeding like this presents two benefits:
  - on systems supporting this, the time is much more robust against
    time changes
  - when it works, it saves us from having to go through the time
    correction code, which is usually cheap, but better avoided anyway.

Note that idle time calculation continues to rely on the wall-clock
time.
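
A minimal sketch of the detection logic (illustration only, not the actual
haproxy code):

  #include <stdint.h>
  #include <time.h>

  /* Return the monotonic time in nanoseconds, or 0 when CLOCK_MONOTONIC is
   * unavailable or failing, so the caller can fall back to the corrected
   * wall-clock date.
   */
  static uint64_t mono_time_ns(void)
  {
          struct timespec ts;

          if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
                  return 0;
          return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
  }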
2024-09-17 09:08:10 +02:00
Willy Tarreau
f793845f4a MEDIUM: clock: collect the monotonic time in clock_local_update_date()
Now we collect this clock in clock_local_update_date(), the closest to
the poller, which is also used when busy-polling, and the value is set
into the thread's curr_mono_time, which did not exist before. Later,
clock_leaving_poll() just sets the prev_mono_time value from the curr_
one instead of retrieving the time at this specific point. It also means
that the monotonic time will now also cover the time needed to update
the global time, which should be negligible. Note that we don't collect
the CPU time in the clock_local_update_date() function even though it's
tempting, because when doing busy-polling, it would be collected on each
round while being useless.

Doing so will make sure that the local time always knows the monotonic
time when it is available.
2024-09-17 09:08:10 +02:00
Willy Tarreau
42e699903e MINOR: clock: test all clock_gettime() return values
Till now we were only using clock_gettime() for profiling, so if it
would fail it was no big deal. We intend to use it as the main clock
as well now, so we need to more reliably detect its absence or failure
and gracefully fall back to other options. Without the test we would
return anything present in the stack, which is neither clean nor easy
to detect.
2024-09-17 09:08:10 +02:00
Christopher Faulet
afc50f2445 BUG/MEDIUM: cache/stats: Wait to have the request before sending the response
It seems obvious. On a classical workflow, the request headers analysis is
finished when these applets are woken up for the first time. So they don't
take care to really have the request before starting to process it and to send
the response. But with a filter, it is possible to stop the request analysis
after the applet creation.

If this happens for the stats applet, this leads to a crash because we
retrieve the request start-line without checking if it is available. For the
cache applet, the response is just immediately sent. And here it is a problem
if the compression is enabled. In that case too, this may lead to a crash
because the compression may be enabled but not initialized.

For a true server, there is no issue because the connection cannot be
established. The server is chosen only after the request analysis. The issue
with applets is that once created, an applet is quickly switched to the
established state. So it is probably a point that must be carefully reviewed
and probably reworked.

In the meantime, as a fix, in the cache and the stats applets, we just take
care to have the request before sending the response. This will do the
trick.

The patch must be backported as far as 2.6. On 2.6, the patch must be adapted.
2024-09-16 22:55:40 +02:00
Christopher Faulet
4de6632693 MINOR: proxy: Rename accept-invalid-http-* options
With these options, it is possible to accept some invalid messages that may
be considered unsafe and may result in vulnerabilities. The naming is not
explicit enough on this point. These options must really be considered
dangerous and only used as a temporary workaround. Unfortunately, when used,
it is probably because there are some legacy and unsupported applications in
place. Nevermind. The documentation warns about the use of these
options. Now the name of the options itself is a warning.

So now, "accept-invalid-http-request" and "accept-invalid-http-response"
options are deprecated and replaced by
"accept-unsafe-violations-in-http-request" and
"accept-unsafe-violations-in-http-response" options.
2024-09-16 22:55:25 +02:00
Aurelien DARRAGON
1e0920f855 BUG/MINOR: peers: local entries updates may not be advertised after resync
Since commit 864ac3117 ("OPTIM: stick-tables: check the stksess without
taking the read lock"), when entries for a local table are learned from
another peer upon resynchro, and this is the only peer haproxy speaks to,
local updates on such entries are not advertised to the peer anymore,
until they eventually expire and can be recreated upon local updates.

This is due to the fact that ts->seen is always set to 0 when creating
new entry, and also when touch_remote is performed on the entry.

Indeed, while 864ac3117 attempts to avoid useless updates, it didn't
consider entries learned from a remote peer. Such entries are exclusively
learned in peer_treat_updatemsg(): once the entry is created (or updated)
with new data, touch_remote is used to commit the change. However, unlike
touch_local, entries committed using touch_remote will not be advertised
to the peer from which the entry was just learned (otherwise we would
enter a looping situation). Due to the above patch, once an entry is
learned from the (unique) remote peer, 'seen' will be stuck to 0 so it
will never be advertised for its whole lifetime.

Instead, when entries are learned from a peer, we should consider that
the peer that taught us the entry has seen it.

To do this, let's set seen=1 in peer_treat_updatemsg() after calling
touch_remote(). This way, if we happen to perform updates on this entry,
it will be properly advertised to relevant peers. This patch should not
affect the performance gain documented in 864ac3117 given that the test
scenario didn't involve entries learned by remote peers, but solely
locally created entries advertised to remote peers upon updates.

This should be backported to 3.0 with 864ac3117.
2024-09-16 14:06:39 +02:00
Willy Tarreau
5d350d1e50 OPTIM: vars: use multiple name heads in the vars struct
Given that the original list-based version was using a list head as the
root of the variables, while the tree is using a single pointer, it made
sense to reuse that space to place multiple roots, indexed on the lower
bits of the name hash. Two roots slightly increase the performance level,
but the best gain is obtained with 4 roots. The performance is now always
above that of the list, even with small counts, and with 100 vars, it's
21% higher than before, or 67% higher than with the list.

We keep the same lock (it could have made sense to use one lock per head),
because most of the variables in large configs are attached to a stream
or a session, hence are not shared between threads. Thus there's no point
in sharding the pointer.
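
A minimal sketch of the root selection (illustration only, not the actual
haproxy code; the root count and names are hypothetical):

  #include <stdint.h>

  #define VAR_ROOTS 4   /* must be a power of two */

  /* pick one of the roots from the low bits of the 64-bit name hash,
   * so a lookup only walks roughly a quarter of the variables
   */
  static inline unsigned int var_root_idx(uint64_t name_hash)
  {
          return (unsigned int)(name_hash & (VAR_ROOTS - 1));
  }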
2024-09-15 23:51:51 +02:00
Willy Tarreau
47ec7c681e OPTIM: vars: use a cebtree instead of a list for variable names
Configs involving many variables can start to eat a lot of CPU in name
lookups. The reason is that the names themselves are dynamic in that
they are relative to dynamic objects (sessions, streams, etc), so
there's no fixed index for example. The current implementation relies
on a standard linked list, and in order to speed up lookups and avoid
comparing strings, only a 64-bit hash of the variable's name is stored
and compared everywhere.

But with just 100 variables and 1000 accesses in a config, it's clearly
visible that variable name lookup can reach 56% CPU with a config
generated this way:

  for i in {0..100}; do
    printf "\thttp-request set-var(txn.var%04d) int(%d)" $i $i;
    for j in {1..10}; do [ $i -lt $j ] || printf ",add(txn.var%04d)" $((i-j)); done;
    echo;
  done

The performance on a 4-core Skylake at 4.4 GHz reaches 85k RPS, with a perf
profile showing:

  Samples: 170K of event 'cycles', Event count (approx.): 142378815419
  Overhead  Shared Object            Symbol
    56.39%  haproxy                  [.] var_to_smp
     6.65%  haproxy                  [.] var_set.part.0
     5.76%  haproxy                  [.] sample_process_cnv
     3.23%  haproxy                  [.] sample_conv_var2smp
     2.88%  haproxy                  [.] sample_conv_arith_add
     2.33%  haproxy                  [.] __pool_alloc
     2.19%  haproxy                  [.] action_store
     2.13%  haproxy                  [.] vars_get_by_desc
     1.87%  haproxy                  [.] smp_dup

[above, var_to_smp() calls var_get() under the read lock].

By switching to a binary tree, the cost is significantly lower, the
performance reaches 117k RPS (+37%) with this profile:

  Samples: 170K of event 'cycles', Event count (approx.): 142323631229
  Overhead  Shared Object            Symbol
    40.22%  haproxy                  [.] cebu64_lookup
     7.12%  haproxy                  [.] sample_process_cnv
     6.15%  haproxy                  [.] var_to_smp
     4.75%  haproxy                  [.] cebu64_insert
     3.79%  haproxy                  [.] sample_conv_var2smp
     3.40%  haproxy                  [.] cebu64_delete
     3.10%  haproxy                  [.] sample_conv_arith_add
     2.36%  haproxy                  [.] action_store
     2.32%  haproxy                  [.] __pool_alloc
     2.08%  haproxy                  [.] vars_get_by_desc
     1.96%  haproxy                  [.] smp_dup
     1.75%  haproxy                  [.] var_set.part.0
     1.74%  haproxy                  [.] cebu64_first
     1.07%  [kernel]                 [k] aq_hw_read_reg
     1.03%  haproxy                  [.] pool_put_to_cache
     1.00%  haproxy                  [.] sample_process

The performance lowers a bit earlier than with the list, however. What
can be seen is that the performance maintains a plateau till 25 vars,
starts degrading a little bit for the tree while it remains stable till
28 vars for the list. Then both cross at 42 vars and the list continues
to degrade following a hyperbola while the tree resists better. The biggest
loss is at around 32 variables where the list stays 10% higher.

Regardless, given the extremely narrow band where the list is better, it
looks relevant to switch to this in order to preserve the almost linear
performance of large setups. For example at 1000 variables and 10k
lookups, the tree is 18 times faster than the list.

In addition this reduces the size of the struct vars by 8 bytes since
there's a single pointer, though it could make sense to re-invest them
into a secondary head for example.
2024-09-15 23:49:01 +02:00
Willy Tarreau
a0205f9de4 IMPORT: import cebtree (compact elastic binary trees)
This is an import of the compact elastic binary trees at commit
a9cd84a ("OPTIM: descent: better prefetch less and for writes when
deleting")

These will be used to replace certain lists (and possibly certain
tree nodes as well). They're as fast (or even faster) than ebtrees
for lookups, as fast for insertion and slower for deletion, and a
node only uses 2 pointers (like a list).

The only changes were cebtree.h where common/tools.h was replaced
with ebtree.h which we already have and already provides the needed
functions and macros, and the addition of a wrapper cebtree-prv.h in
src/ to redirect to import/cebtree-prv.h.
2024-09-15 23:44:59 +02:00
Willy Tarreau
6e92988e20 MINOR: vars: remove the emptiness tests in callers before pruning
All callers of vars_prune_* currently check the list for emptiness.
Let's leave that to vars_prune() itself, it will ease some changes in
the code. Thanks to the previous inlining of the vars_prune() function,
there's no performance loss, and even a very tiny 0.1% gain.
2024-09-15 23:44:16 +02:00
Willy Tarreau
2c1a9c3a43 OPTIM: vars: inline vars_prune() to avoid many calls
Many configs don't have variables and call it for no reason, and even
configs with variables don't necessarily have some in all scopes.
2024-09-15 23:42:09 +02:00
Willy Tarreau
aad6b771dd OPTIM: vars: remove the unneeded lock in vars_prune_*
vars_prune() and vars_prune_all() take the variable lock while purging
all variables from a head. However this is not needed:
  - proc scope variables are only purged during deinit, hence no lock
    is needed ;
  - all other scopes are attached to entities bound to a single thread
    so no lock is needed either.

Removing the lock saves about 0.5% CPU on variables-intensive setups,
but above all simplifies the code, so let's do it.
2024-09-15 23:05:50 +02:00
Willy Tarreau
51ade2f1db OPTIM: sample: don't check casts for samples of same type
Originally when converters were created, they were mostly for casting
types. Nowadays we have many arithmetic converters to perform operations
on integers, and a number of converters operating on strings. Both of
these categories most often do not need any cast since the input and
output types are the same, which is visible as the cast function is
c_none. However, profiling shows that when heavily using arithmetic
converters, it's possible to spend up to ~7% of the time in
sample_process_cnv(), a good part of which is only in accessing the
sample_casts[] array. Simply avoiding this lookup when input and output
types are equal saves about 2% CPU on such setups doing intensive use
of converters.
2024-09-15 12:43:56 +02:00
Willy Tarreau
b11495652e BUG/MEDIUM: queue: implement a flag to check for the dequeuing
As unveiled in GH issue #2711, commit 5541d4995d ("BUG/MEDIUM: queue:
deal with a rare TOCTOU in assign_server_and_queue()") does have some
side effects in that it can occasionally cause an endless loop.

As Christopher analysed it, the problem is that process_srv_queue(),
which uses a trylock in order to leave only one thread in charge of
the dequeueing process, can lose the lock race against pendconn_add().
If this happens on the last served request, then there's no more thread
to deal with the dequeuing, and assign_server_and_queue() will loop
forever on a condition that was initially expected to be extremely
rare (and still is, except that now it can become sticky). Previously
what was happening is that such queued requests would just time out
and since that was very rare, nobody would notice.

The root of the problem really is that trylock. It was added so that
only one thread dequeues at a time but it doesn't offer only that
guarantee since it also prevents a thread from dequeuing if another
one is in the process of queuing. We need a different criterion.

What we're doing now is to set a flag "dequeuing" in the server, which
indicates that one thread is currently in the process of dequeuing
requests. This one is atomically tested, and only if no thread is in
this process, then the thread grabs the queue's lock and dequeues.
This way it will be serialized with pendconn_add() and no request
addition will be missed.
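
A minimal sketch of this kind of guard using C11 atomics (generic
illustration only, not the actual haproxy code; names are hypothetical):

  #include <stdatomic.h>

  static atomic_flag dequeuing = ATOMIC_FLAG_INIT;

  /* Only one thread at a time runs the dequeue loop, but a thread that is
   * merely adding to the queue no longer prevents dequeuing from happening.
   */
  static void process_queue(void)
  {
          if (atomic_flag_test_and_set(&dequeuing))
                  return;          /* another thread is already dequeuing */

          /* ... take the queue's lock and dequeue pending requests ... */

          atomic_flag_clear(&dequeuing);
  }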

It is not certain whether the original race covered by the fix above
can still happen with this change, so better keep that fix for now.

Thanks to @Yenya (Jan Kasprzak) for the precise and complete report
allowing to spot the problem.

This patch should be backported wherever the patch above was backported.
2024-09-13 08:35:47 +02:00
Willy Tarreau
adaba6f904 BUG/MINOR: clock: validate that now_offset still applies to the current date
We want to make sure that now_offset is still valid for the current
date: another thread could very well have updated it by detecting a
backwards jump, and at the very same moment the time got fixed again,
that we retrieve and add to the new offset, which results in a larger
jump. Normally, for this to happen, it would mean that before_poll
was also affected by the jump and was detected before and bounded
within 2 seconds, resulting in max 2 seconds perturbations.

Here we try to detect this situation and fall back to re-adjusting the
offset instead.

It's more of a strengthening of what's done by commit e8b1ad4c2b
("BUG/MEDIUM: clock: also update the date offset on time jumps") than a
pure fix, in that the issue was not directly observed but it's visibly
possible by reading the code, so this should be backported along with
the patch above. This is related to issue GH #2704.

Note that this could be simplified in terms of operations by migrating
the deadlines to nanoseconds, but this was the path to least intrusive
changes.
2024-09-12 19:09:19 +02:00
Willy Tarreau
af48e4cc6b BUG/MINOR: clock: make time jump corrections a bit more accurate
Since commit e8b1ad4c2b ("BUG/MEDIUM: clock: also update the date offset
on time jumps") we try to update the now_offet based on the last known
valid date. But if it's off compared to the global_now_ns date shared
by other threads, we'll get the time off a little bit. When this happens,
we should consider the most recent of these dates so that if the global
date was already known to be more recent, we should use it and stick to
it. This will avoid setting too large an offset that could in turn provoke
a larger jump on another thread.

This is related to issue GH #2704.

This can be backported to other branches having the patch above.
2024-09-12 18:27:03 +02:00
Willy Tarreau
ad98edd00a BUG/MINOR: polling: fix time reporting when using busy polling
Since commit beb859abce ("MINOR: polling: add an option to support
busy polling") the time and status passed to clock_update_local_date()
were incorrect. Indeed, what is considered is the before_poll date
related to the configured timeout which does not correspond to what
is passed to the poller. That's not correct because before_poll+the
syscall's timeout will be crossed by the current date 100 ms after
the start of the poller. In practice it didn't happen when the poller
was limited to 1s timeout but at one minute it happens all the time.

That's particularly visible when running a multi-threaded setup with
busy polling and only half of the threads working (bind ... thread even).
In this case, the fixup code of clock_update_local_date() is executed
for each round of busy polling. The issue was made really visible
starting with recent commit e8b1ad4c2b ("BUG/MEDIUM: clock: also
update the date offset on time jumps") because upon a jump, the
shared offset is reset, while it should not be in this specific
case.

What needs to be done instead is to pass the configured timeout of
the poller (and not of the syscall), and always pass "interrupted"
set so as to claim we got an event (which is sort of true as it just
means the poller returned instantly). In this case we can still
detect backwards/forward jumps and will use a correct boundary
for the maximum date that covers the whole loop.

This can be backported to all versions since the issue was introduced
with busy-polling in 1.9-dev8.
2024-09-12 17:47:13 +02:00
Christopher Faulet
1900ca475f MEDIUM: h1: Accept invalid T-E values with accept-invalid-http-response option
Since 2.6, a parsing error is reported when the chunked encoding is
found twice. As stated in RFC9112, a sender must not apply the chunked
transfer coding more than once to a message body. It means only one chunked
coding must be found. In addition, empty values are also rejected because it
is forbidden by RFC9110.

However, in both cases, it may be useful to relax the rules for trusted
legacy servers when the accept-invalid-http-response option is set, especially
because it was accepted on 2.4 and older. In addition, the T-E header is now
sanitized before sending it. It is not a problem because it is a hop-by-hop
header.

Note that it remains invalid on client side because there is no good reason
to relax the parsing on this side. We can argue a server is trusted so we
can decide to support some legacy behavior. It is not true on client side
and it is highly suspicious if a client is sending an invalid T-E header.

Note also that we continue to reject unsupported T-E values (so all codings
except "chunked"). Because the "TE" header is sanitized and cannot contain any
value other than "Trailers", there is absolutely no reason for a server to use something
else.

This patch should fix the issue #2677. It could probably be backported as
far as 2.6 if necessary.
2024-09-12 09:21:57 +02:00
Willy Tarreau
2b95c77c08 DOC: server: document what to check for when adding new server keywords
It's too easy to overlook the dynamic servers when adding new server
keywords, and the fields on each keyword line are totally obscure. This
commit adds a title to each column of the table and explains what is
expected and what to check for when adding a keyword.
2024-09-10 18:50:12 +02:00
Damien Claisse
ce6a621ae3 MINOR: server: allow init-state for dynamic servers
Commit 50322df introduced the init-state keyword, but it didn't enable
it for dynamic servers. However, this feature is perfectly desirable
for virtual servers too, where someone would like a server re-enabled
through "set server be1/srv1 state ready" to be put out of maintenance
in down state until the next health check succeeds.
From reading the code, it seems that it's only a matter of allowing this
keyword for dynamic servers, as the current code path calls
srv_adm_set_ready() which incidentally triggers a call to
_srv_update_status_adm().
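
For reference, a hedged example (the "down" value is assumed from the
init-state documentation introduced by commit 50322df):

   backend be1
      # equivalent dynamic form, via the CLI:
      #   add server be1/srv1 192.0.2.10:80 check init-state down
      server srv1 192.0.2.10:80 check init-state down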
2024-09-10 18:18:38 +02:00
Willy Tarreau
9f8d9c9e8b BUG/MINOR: pattern: do not leave a leading comma on "set" error messages
Commit 4f2493f355 ("BUG/MINOR: pattern: pat_ref_set: fix UAF reported by
coverity") dropped the condition to concatenate error messages and as
such introduced a leading comma in front of all of them. Then commit
911f4d93d4 ("BUG/MINOR: pattern: pat_ref_set: return 0 if err was found")
changed the behavior to stop at the first error anyway, so all the
mechanics dedicated to the concatenation of error messages is no longer
needed and we can simply return the error as-is, without inserting any
comma.

This should be backported where the patches above are backported.
2024-09-10 08:55:29 +02:00
Christopher Faulet
a99d58819f BUG/MINOR: h1-htx: Don't flag response as bodyless when a tunnel is established
This reverts commit 225a4d02e1.

When a 200-OK response is replied to a CONNECT request or a
101-Switching-protocol, a tunnel is considered as established between the
client and the server. However, we must not declare the response as
bodyless. Of course, there is no payload, but tunneled data are expected.

Because of this bug, the zero-copy forwarding is disabled on the server
side.

This patch must be backported as far as 2.9.
2024-09-09 19:01:47 +02:00
Christopher Faulet
f6e193f1b0 BUG/MAJOR: mux-h1: Wake SC to perform 0-copy forwarding in CLOSING state
When the mux is woken up on I/O events, if the zero-copy forwarding is
enabled, receives are blocked. In this case, the SC is woken up to be able
to perform 0-copy forwarding to the other side. This works well, except for
the H1C in CLOSING state.

Indeed, in that case, in h1_process(), the SC is not woken up because only
RUNNING H1 connections are considered. As a consequence, the mux will ignore
the connection closure. The H1 connection remains blocked, waiting for the
shutdown timeout. If no timeout is configured, the H1 connection is never
closed, leading to a leak.

This patch should fix the leak reported by Damien Claisse in issue #2697. It
should be backported as far as 2.8.
2024-09-09 19:01:47 +02:00
William Lallemand
021ac6a108 MEDIUM: ssl/cli: "dump ssl cert" allow to dump a certificate in PEM format
The new "dump ssl cert" CLI command allows to dump a certificate stored
into HAProxy memory. Until now it was only possible to dump the
description of the certificate using "show ssl cert", but with this new
command you can dump the PEM content on the filesystem.

This command is only available on a admin stats socket.

$ echo "@1 dump ssl cert cert.pem" | socat /tmp/master.sock -
-----BEGIN PRIVATE KEY-----
[...]
-----END PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
[...]
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
[...]
-----END CERTIFICATE-----
2024-09-09 16:54:48 +02:00
Aurelien DARRAGON
68cfb222b5 BUG/MEDIUM: pattern: prevent UAF on reused pattern expr
Since c5959fd ("MEDIUM: pattern: merge same pattern"), UAF (leading to
crash) can be experienced if the same pattern file (and match method) is
used in two default sections and the first one is not referenced later in
the config. In this case, the first default section will be cleaned up.
However, due to an unhandled case in the above optimization, the original
expr which the second default section relies on is mistakenly freed.

This issue was discovered while trying to reproduce GH #2708. The issue
was particularly tricky to reproduce given the config and sequence
required to make the UAF happen. Fortunately, GitHub user @asmnek not only
provided useful information, but since he was able to consistently
trigger the crash in his environment, he was able to nail down the crash to
the use of a pattern file involved with 2 named default sections. Big thanks
to him.

To fix the issue, let's push the logic from c5959fd a bit further. Instead
of relying on "do_free" variable to know if the expression should be freed
or not (which proved to be insufficient in our case), let's switch to a
simple refcounting logic. This way, no matter who owns the expression, the
last one attempting to free it will be responsible for freeing it.
Refcount is implemented using a 32bit value which fills a previous 4 bytes
structure gap:

        int                        mflags;               /*    80     4 */

        /* XXX 4 bytes hole, try to pack */

        long unsigned int          lock;                 /*    88     8 */
(output from pahole)

Even though it was not reproduced in 2.6 or below by @asmnek (the bug was
revealed thanks to another bugfix), this issue theoretically affects all
stable versions (up to c5959fd), thus it should be backported to all
stable versions.
2024-09-09 16:07:05 +02:00
Aurelien DARRAGON
8157c1caf2 BUG/MEDIUM: pattern: prevent uninitialized reads in pat_match_{str,beg}
Using valgrind when running map_beg or map_str, the following error is
reported:

==242644== Conditional jump or move depends on uninitialised value(s)
==242644==    at 0x2E4AB1: pat_match_str (pattern.c:457)
==242644==    by 0x2E81ED: pattern_exec_match (pattern.c:2560)
==242644==    by 0x343176: sample_conv_map (map.c:211)
==242644==    by 0x27522F: sample_process_cnv (sample.c:1330)
==242644==    by 0x2752DB: sample_process (sample.c:1373)
==242644==    by 0x319917: action_store (vars.c:814)
==242644==    by 0x24D451: http_req_get_intercept_rule (http_ana.c:2697)

In fact, the error is legit, because in pat_match_{beg,str}, we
dereference the buffer at len+1 to check if a value was previously set,
and then decide to force a NULL byte if it wasn't set.

But the approach is no longer compatible with the current architecture:
data past str.data is not guaranteed to be initialized in the buffer.
Thus we cannot dereference the value, else we expose ourselves to
uninitialized read errors. Moreover, the check is useless, because we
systematically set the ending byte to 0 when the conditions are met.

Finally, restoring the older value after the lookup is not relevant:
indeed, either the sample is marked as const, in which case it
is already duplicated, or the sample is not const and we forcefully add
a terminating NULL byte outside of the actual string bytes (since we're
past str.data). Since we didn't alter the effective string data, and data
past str.data cannot be dereferenced anyway as it isn't guaranteed to be
initialized, there's no point in restoring previous uninitialized data.

It could be backported to all stable versions. But since this was only
detected by valgrind and isn't known to cause issues in existing
deployments, it's probably better to wait a bit before backporting it
to avoid any breakage, although the fix should be theoretically harmless.
2024-09-09 15:57:30 +02:00
Aurelien DARRAGON
3449525a02 BUG/MINOR: pattern: prevent const sample from being tampered in pat_match_beg()
This is a complementary patch to a68affeaa ("BUG/MINOR: pattern: a sample
marked as const could be written"). Indeed the same logic from
pat_match_str() is used there, but we lack the check to ensure that the
sample is not const before writing data to it.

It could be backported to all stable versions.
2024-09-09 15:57:23 +02:00
Willy Tarreau
ef8d8215de BUG/MEDIUM: clock: detect and cover jumps during execution
After commit e8b1ad4c2 ("BUG/MEDIUM: clock: also update the date offset
on time jumps"), @firexinghe mentioned that the issue was still present
in their case. In fact it depends on the load, which affects the
probability that the time changes between two poll() calls vs that it
changes during poll(). The time correction code used to only deal with
the latter. But under load if it changes between two poll() calls, what
happens then is that before_poll is off, and after returning from poll(),
the date is within bounds defined by before_poll, so no correction is
applied.

After many tests, it turns out that the most reliable solution without
using CLOCK_MONOTONIC is to prevent before_poll from being earlier than
the previous after_poll (trivial), and to cover forward jumps, we need
to enforce a margin. Given that the watchdog kills a looping task within
2 seconds and that no sane setup triggers it, it seems that 2 seconds
remains a safe enough margin. This means that in the worst case, some
forward jumps of up to 2 seconds will not be corrected, leading to an
apparent fast time and low rates. But this is supposed to be an exceptional
event anyway (typically an admin or crontab running ntpdate).

For future versions, given that we now opportunistically call
now_mono_time() before and after poll(), that returns zero if not
supported, we could imagine relying on this one for the thread's local
time when it's non-null.
2024-09-08 19:15:38 +02:00
Christopher Faulet
001fb1a548 BUG/MEDIUM: mux-h1/mux-h2: Reject upgrades with payload on H2 side only
Since 1d2d77b27 ("MEDIUM: mux-h1: Return a 501-not-implemented for upgrade
requests with a body"), it is no longer possible to perform a protocol
upgrade for requests with a payload. The main reason was to be able to
support protocol upgrade for an H1 client requesting an H2 server. In that case,
the upgrade request is converted to a CONNECT request. So, it is not
possible to convey a payload in that case.

But, it is a problem for anyone wanting to perform upgrades on an H1 server
using requests with a payload. It is uncommon but valid. So, now, it is the
H2 multiplexer's responsibility to reject upgrade requests, on the server
side, if there is a payload. An INTERNAL_ERROR is returned for the H2S in
that case. On the H1 side, the upgrade is now allowed, but only if the
server waits for the end of the request to return the 101-Switching-Protocols
response. Indeed, it is quite hard to synchronise the frontend side and the
backend side in that case. Asking servers to fully consume the request
payload before returning the response seems reasonable.

This patch should fix the issue #2684. It could be backported after a period
of observation, as far as 2.4 if possible. But only if it is not too
hard. It depends on "MINOR: mux-h1: Set EOI on SE during demux when both
side are in DONE state".
2024-09-06 09:16:18 +02:00
Christopher Faulet
ad1ef94612 MINOR: mux-h1: Set EOI on SE during demux when both side are in DONE state
For now, this case is already handled for all requests except for those
waiting for a tunnel establishment (CONNECT and protocol upgrades). It is
not an issue because only bodyless requests are supported in these cases. So
the request is always finished at the end of headers and therefore before
the response.

However, to relax conditions for full H1 protocol upgrades (H1 client and
server), this case will be necessary. Indeed, the idea is to be able to
perform protocol upgrades for requests with a payload. Today, the "Upgrade:"
header is removed before sending the request to the server. But to support
this case, this patch is required to properly finish the transaction when
the server does not perform the upgrade.
2024-09-06 09:00:13 +02:00
Aaron Kuehler
50322dff81 MEDIUM: server: add init-state
Allow the user to set the "initial state" of a server.

Context:

Servers are always set in an UP status by default. In
some cases, further checks are required to determine if the server is
ready to receive client traffic.

This introduces the "init-state {up|down}" configuration parameter to
the server.

- when set to 'fully-up', the server is considered immediately available
  and can turn to the DOWN sate when ALL health checks fail.
- when set to 'up' (the default), the server is considered immediately
  available and will initiate a health check that can turn it to the DOWN
  state immediately if it fails.
- when set to 'down', the server initially is considered unavailable and
  will initiate a health check that can turn it to the UP state immediately
  if it succeeds.
- when set to 'fully-down', the server is initially considered unavailable
  and can turn to the UP state when ALL health checks succeed.

The server's init-state is considered when the HAProxy instance
is (re)started, a new server is detected (for example via service
discovery / DNS resolution), a server exits maintenance, etc.

Link: https://github.com/haproxy/haproxy/issues/51
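
For illustration only, a hypothetical configuration sketch (backend name,
server address and check settings are placeholders, not taken from the patch):

    backend app
        server app1 192.0.2.10:8080 check init-state down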
2024-09-05 11:13:10 +02:00
Willy Tarreau
e8b1ad4c2b BUG/MEDIUM: clock: also update the date offset on time jumps
In GH issue #2704, @swimlessbird and @xanoxes reported problems handling
time jumps. Indeed, since 2.7 with commit 4eaf85f5d9 ("MINOR: clock: do
not update the global date too often") we refrain from updating the global
offset in case it didn't change. But there's a catch: in case of a large
time jump, if the poller was interrupted, the local time remains the same
and we return immediately from there without updating the offset. It then
becomes incorrect regarding the "date" value, and upon subsequent calls to
the poller, there's no way to detect a jump anymore so we apply the old,
incorrect offset and the date becomes wrong. Worse, going back to the
original time (then in the past), global_now_ns remains higher than the
local time and neither gets updated anymore.

What is missing in practice is to immediately update the offset when
detecting a time jump. In an ideal world, the offset would be updated
upon every call, that's what was being done prior to commit above but
it's extremely CPU intensive on large systems. However we can perfectly
afford to update the offset every time we detect a time jump, as it's
not as common.

This needs to be backported as far as 2.8. Thanks to both participants
above for providing very helpful details.
2024-09-04 16:55:43 +02:00
Ilya Shipitsin
1f6e5f7a61 CLEANUP: assorted typo fixes in the code and comments
This is the 43rd iteration of typo fixes.
2024-09-03 17:49:21 +02:00
Christopher Faulet
e1cae42879 BUG/MEDIUM: mux-pt: Fix condition to perform a shutdown for writes in mux_pt_shut()
A regression was introduced in the commit 76fa71f7a ("BUG/MEDIUM: mux-pt:
Never fully close the connection on shutdown") because of a typo on the
connection flags. CO_FL_SOCK_WR_SH flag must be tested to prevent a call to
conn_sock_shutw() and not CO_FL_SOCK_RD_SH.

Concretely, most of the time, it is harmless because the shutdown for writes
is always performed before any shutdown for reads, except in the case
described by the commit above. But it is not clear if it has an impact or not.

This patch must be backported with the commit above, so as far as 2.9.
2024-09-03 15:25:05 +02:00
Frederic Lecaille
7e19432fd4 BUG/MINOR: Crash on O-RTT RX packet after dropping Initial pktns
This bug arrived with this naive commit:

    BUG/MINOR: quic: Too shord datagram during O-RTT handshakes (aws-lc only)

which omitted to consider the case where the Initial packet number space
could be discarded before receiving 0-RTT packets.

To fix this, append/insert the O-RTT (early-data) packet number space
into the encryption level list depending on the presence or not of
the Initial packet number space.

This issue was revealed when using aws-lc as TLS stack in GH #2701 issue.
Thank you to @Tristan971 for having reported this issue.

Must be backported where the commit mentioned above is supposed to be
backported: as far as 2.9.
2024-09-03 15:23:06 +02:00
Willy Tarreau
f8bff3b531 BUG/MINOR: mux-spop: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf
That's the equivalent of the mux-h2 one, except that here there's no
real risk to loop since normally we cannot feed data that bypass the
closed state check (e.g. no zero-copy forward). But it still remains
dirty to be able to leave an empty mbuf with MFULL and MROOM set, so
better clear them as well.

No backport is needed since this is only in 3.1.
2024-09-03 14:39:04 +02:00
Willy Tarreau
830e50561c BUG/MAJOR: mux-h2: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf
There exists an extremely tricky code path that was revealed in 3.0 by
the glitches feature, though it might theoretically have existed before.

TL;DR: a mux mbuf may be full after successfully sending GOAWAY, and
discard its remaining contents without clearing H2_CF_MUX_MFULL and
H2_CF_DEM_MROOM, then endlessly loop in h2_send(), until the watchdog
takes care of it.

What can happen is the following: Some data are received, h2_io_cb() is
called. h2_recv() is called to receive the incoming data. Then
h2_process() is called and in turn calls h2_process_demux() to process
input data. At some point, a glitch limit is reached and h2c_error() is
called to close the connection. The input frame was incomplete, so some
data are left in the demux buffer. Then h2_send() is called, which in
turn calls h2_process_mux(), which manages to queue the GOAWAY frame,
turning the state to H2_CS_ERROR2. The frame is sent, and h2_process()
calls h2_send() a last time (doing nothing) and leaves. The streams
are all woken up to notify about the error.

Multiple backend streams were waiting to be scheduled and are woken up
in turn, before their parents being notified, and communicate with the
h2 mux in zero-copy-forward mode, request a buffer via h2_nego_ff(),
fill it, and commit it with h2_done_ff(). At some point the mux's output
buffer is full, and gets flags H2_CF_MUX_MFULL.

The io_cb is called again to process more incoming data. h2_send() isn't
called (polled) or does nothing (e.g. TCP socket buffers full). h2_recv()
may or may not do anything (doesn't matter). h2_process() is called since
some data remain in the demux buf. It goes till the end, where it finds
st0 == H2_CS_ERROR2 and clears the mbuf. We're now in a situation where
the mbuf is empty and MFULL is still present.

Then it calls h2_send(), which doesn't call h2_process_mux() due to
MFULL, doesn't enter the for() loop since all buffers are empty, then
keeps sent=0, which doesn't allow to clear the MFULL flag, and since
"done" was not reset, it loops forever there.

Note that the glitches make the issue more reproducible but theoretically
it could happen with any other GOAWAY (e.g. PROTOCOL_ERROR). What makes
it not happen with the data produced on the parsing side is that we
process a single buffer of input at once, and there's no way to amplify
this to 30 buffers of responses (RST_STREAM, GOAWAY, SETTINGS ACK,
WINDOW_UPDATE, PING ACK etc are all quite small), and since the mbuf is
cleared upon every exit from h2_process() once the error was sent, it is
not possible to accumulate response data across multiple calls. And the
regular h2_snd_buf() path checks for st0 >= H2_CS_ERROR so it will not
produce any data there either.

Probably that h2_nego_ff() should check for H2_CS_ERROR before accepting
to deliver a buffer, but this needs to be carefully studied. In the mean
time the real problem is that the MFULL flag was kept when clearing the
buffer, making the two inconsistent.

Since it doesn't seem possible to trigger this sequence without the
zero-copy-forward mechanism, this fix needs to be backported as far as
2.9, along with previous commit "MINOR: mux-h2: try to clear DEM_MROOM
and MUX_MFULL at more places" which will strengthen the consistency
between these checks.

Many thanks to Annika Wickert for her detailed report that allowed to
diagnose this problem. CVE-2024-45506 was assigned to this problem.
2024-09-03 14:39:04 +02:00
Willy Tarreau
e9cdedb39b MINOR: mux-h2: try to clear DEM_MROOM and MUX_MFULL at more places
The code leading to H2_CF_MUX_MFULL and H2_CF_DEM_MROOM being cleared
is quite complex and assumptions about its state are extremely difficult
when reading the code. There are indeed long sequences where the mux might
possibly be empty, still having the flag set until it reaches h2_send()
which will clear it after the last send. Even then it's not obvious whether
it's always guaranteed to release the flag when invoked in multiple passes.
Let's just simplify the condition so that h2_send() does not depend on
"sent" anymore and that h2_timeout_task() doesn't leave the flags set on
the buffer on emptiness. While it doesn't seem to fix anything, it will
make the code more robust against future changes.
2024-09-03 14:39:04 +02:00
Christopher Faulet
0d4271cdae BUG/MEDIUM: mux-h1: Properly handle empty message when an error is triggered
When a 400/408/500/501 error is returned by the H1 multiplexer, we first try
to get the error message of the proxy before using the default one. This may
be configured to be mapped on /dev/null or on an empty file. In that case,
no message is emitted, as expected. But everything is handled as if the error
was successfully sent.

However, there is a bug here. In the h1_send_error() function, this case is not
properly handled. The flag H1C_F_ABRTED is not set on the H1 connection as it
should be and the h1_close() function is not called, leaving the H1 connection
in an undefined state.

It is especially an issue when an "empty" 408-Request-Time-out error is emitted
while there are data blocked in the output buffer. In that case, the connection
remains open until the client closes and a "cR--"/408 is logged repeatedly, every
time the client timeout is reached.

This patch must be backported as far as 2.8.
2024-09-03 14:28:42 +02:00
Frederic Lecaille
15a737eb5f BUG/MINOR: quic: unexploited retransmission cases for Initial pktns.
qc_prep_hdshk_fast_retrans()'s job is to pick some packets to be retransmitted
from the Initial and Handshake packet number spaces. A packet may be coalesced
with a first one into the same datagram. When a coalesced packet is inspected
for retransmission, it is skipped if its length would make the total length of
the datagram it is attached to exceed the anti-amplification limit. But in this
case, the first packet must be kept for the current retransmission. This is
tracked by this trace statement:
    TRACE_PROTO("will probe Initial packet number space", QUIC_EV_CONN_SPPKTS, qc);
This was not the case because of the wrong "goto end" statement. The latter
must be run only if the Initial packet number space must not be probed with
the first packet found coalesced to another one which must be skipped.

This bug was revealed by AWS-LC interop runner with handshakeloss and
handshakecorruption which always fail because this stack leads the server
to send more Initial packets.

Thank you to Ilya (@chipitsine) for this issue report in GH #2663.

Must be backported as far as 2.6.
2024-09-03 11:47:51 +02:00
Christopher Faulet
d4781bd5e7 BUG/MEDIUM: cli: Always release back endpoint between two commands on the mcli
When several commands are chained on the master CLI, the same client
connection is used. Because it is a TCP connection, the mux PT is used. It
means there is no stream at the mux level. It is not possible to release the
applicative stream between each command as for HTTP. So, to work around
this limitation, between two commands, the master CLI is resetting the
stream. It does exactly what was performed for HTTP to manage keep-alive
connections on old HAProxy versions.

But this part was copied from code dealing with connections only, while the
back endpoint can be an applet or a mux for the master CLI. The previous fix
on the mux PT ("BUG/MEDIUM: mux-pt: Never fully close the connection on
shutdown") revealed a bug. Between two commands, the back endpoint was only
released if the connection's XPRT was closed. This works if the back
endpoint is an applet because there is no connection. But for commands sent
to a worker, a connection is used. At this stage, this only works if the
connection's XPRT is closed. Otherwise, the old endpoint is never detached
leading to undefined behavior on the next command execution (most probably a
crash).

Without the commit above, the connection's XPRT is always closed on
shutdown. It is no longer true. At this stage, we must unconditionally
release the back endpoint by resetting the corresponding sedesc to fix the
bug.

This patch must be backported with the commit above in all stable
versions. On 2.4 and lower, it will need to be adapted.
2024-09-02 18:31:35 +02:00
Christopher Faulet
76fa71f7a8 BUG/MEDIUM: mux-pt: Never fully close the connection on shutdown
When a shutdown is reported to the mux (shutdown for reads or writes), the
connection is immediately fully closed if the mux detects the connection is
closed in both directions. Only the passthrough multiplexer is able to
perform this action at this stage because there is no stream and no internal
data. Other muxes perform a full connection close during the mux's release
stage. It had been working quite well until recently. But, in theory, the bug is
quite old.

In fact, it seems possible for the lower layer to report an error on the
connection at the same time a shutdown is performed on the mux. Depending on how
events are scheduled, the following may happen:

 1. A connection error is detected at the fd layer and a wakeup is
    scheduled on the mux to handle the event.

 2. A shutdown for writes is performed on the mux. Here the mux decides to
    fully close the connection. If the xprt is not used to log info, it is
    released.

 3. The mux is finally woken up. It tries to retrieve data from the xprt
    because it is not aware there was an error. This leads to a crash
    because of a NULL-deref.

By reading the code, it is not obvious. But it seems possible with SSL
connection when the handshake is rearmed. It happens when a
SSL_ERROR_WANT_WRITE is reported on a SSL_read() attempt or a
SSL_ERROR_WANT_READ on a SSL_write() attempt.

This bug is only visible if the XPRT is not used to log info. So it is not so
common.

This patch should fix the 2nd crash reported in the issue #2656. It must
first be backported as far as 2.9 and then slowly to all stable versions.
2024-09-02 15:50:25 +02:00
Christopher Faulet
f9adcdf039 MEDIUM: bwlim: Use a read-lock on the sticky session to apply a shared limit
There is no reason to acquire a write-lock on the sticky session when a
shared limit is applied because only the frequency is updated. The sticky
session itself is not modified. We must just take care it is not removed in
the mean time. So a read-lock may be used instead.
2024-09-02 15:50:25 +02:00
Christopher Faulet
a7f6b0ac03 MEDIUM: stick-table: Add support of a factor for IN/OUT bytes rates
Add a factor parameter to stick-tables, called "brates-factor", that is
applied to in/out bytes rates to work around the 32-bit limit of the
frequency counters. Thanks to this factor, it is possible to have byte
rates beyond 4GB/s. Instead of counting each byte, we count blocks
of bytes. Among other things, it will be useful for the bwlim filter, to be
able to configure shared limits exceeding 4GB/s.
For now, this parameter must be in the range ]0-1024].
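
As an illustration only (table parameters are arbitrary examples and the exact
placement of the keyword on the stick-table line is an assumption):

    backend app
        stick-table type ip size 100k expire 30m store bytes_out_rate(1m) brates-factor 1024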
2024-09-02 15:50:25 +02:00
Frederic Lecaille
db13df3d6e BUG/MINOR: quic: Crash from trace dumping SSL eary data status (AWS-LC)
This bug follows this patch:
     MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
where a new third variable was added to be dumped from the QUIC_EV_CONN_IO_CB
trace event. The quic_trace() code did not reveal there was already another
variable passed as the third argument but not dumped. This led to a crash when
dereferencing a pointer to an int in place of a pointer to an SSL object.

This issue was reproduced only by handshakecorruption aws-lc interop test with
s2n-quic as client.

Note that this patch must be backported with this one:
     BUG/MEDIUM: quic: always validate sender address on 0-RTT
which depends on the commit mentioned above.
2024-09-02 10:01:41 +02:00
Aperence
20efb856e1 MEDIUM: protocol: add MPTCP per address support
Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension
that enables a TCP connection to use different paths.

Multipath TCP has been used for several use cases. On smartphones, MPTCP
enables seamless handovers between cellular and Wi-Fi networks while
preserving established connections. This use-case is what pushed Apple
to use MPTCP since 2013 in multiple applications [2]. On dual-stack
hosts, Multipath TCP enables the TCP connection to automatically use the
best performing path, either IPv4 or IPv6. If one path fails, MPTCP
automatically uses the other path.

To benefit from MPTCP, both the client and the server have to support
it. Multipath TCP is a backward-compatible TCP extension that is enabled
by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...).
Multipath TCP is included in the Linux kernel since version 5.6 [3]. To
use it on Linux, an application must explicitly enable it when creating
the socket. No need to change anything else in the application.

This attached patch adds MPTCP per address support, to be used with:

  mptcp{,4,6}@<address>[:port1[-port2]]

MPTCP v4 and v6 protocols have been added: they are mainly a copy of the
TCP ones, with small differences: names, proto, and receivers lists.

These protocols are stored in __protocol_by_family, as an alternative to
TCP, similar to what has been done with QUIC. By doing that, the size of
__protocol_by_family has not been increased, and it behaves like TCP.

MPTCP is both supported for the frontend and backend sides.

Also added an example of configuration using mptcp along with a backend
allowing to experiment with it.

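For illustration only (this is not the example shipped with the patch; names
and addresses are placeholders):

    frontend fe
        bind mptcp@:8443
        default_backend be

    backend be
        server srv1 mptcp4@192.0.2.10:8080
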
Note that this is a re-implementation of Björn's work from 3 years ago
[4], when haproxy's internals were probably less ready to deal with
this, causing his work to be left pending for a while.

Currently, the TCP_MAXSEG socket option doesn't seem to be supported
with MPTCP [5]. This results in a warning when trying to set the MSS of
sockets in proto_tcp:tcp_bind_listener.

This can be resolved by adding two new variables:
sock_inet(6)_mptcp_maxseg_default, which will hold the default
value of the TCP_MAXSEG option. Note that for the moment, this
will always be -1 as the option isn't supported. However, in the
future, when support for this option is added, it should
contain the correct value for the MSS, allowing the TCP_MAXSEG
option to be set correctly.

Link: https://www.rfc-editor.org/rfc/rfc8684.html [1]
Link: https://www.tessares.net/apples-mptcp-story-so-far/ [2]
Link: https://www.mptcp.dev [3]
Link: https://github.com/haproxy/haproxy/issues/1028 [4]
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/515 [5]

Co-authored-by: Dorian Craps <dorian.craps@student.vinci.be>
Co-authored-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
2024-08-30 18:53:49 +02:00
Aperence
2f171fe36a MEDIUM: sock: use protocol when creating socket
Use the protocol configured for a connection when creating the socket,
instead of always using 0.

This change is needed to allow new protocols, such as MPTCP, to be used
when creating the sockets. Note however that this patch won't change
anything for now, as the only other value that proto->sock_prot could
hold is IPPROTO_TCP, which has the same behavior as 0 when passed to
socket().
2024-08-30 18:53:49 +02:00
Aperence
38618822e1 MINOR: server: add a alt_proto field for server
Add a new field alt_proto to the server structure that
specifies if an alternate protocol should be used for this server.

This field can be transparently passed to protocol_lookup to get
an appropriate protocol structure.

This change thus allows creating servers with different protocols,
and not only TCP anymore.
2024-08-30 18:53:49 +02:00
Aperence
a7b04e383a MINOR: tools: extend str2sa_range to add an alt parameter
Add a new parameter "alt" that will store wether this configuration
use an alternate protocol.

This alt pointer will contain a value that can be transparently
passed to protocol_lookup to obtain an appropriate protocol structure.

This change is needed to allow for example the servers to know if it
need to use an alternate protocol or not.
2024-08-30 18:53:49 +02:00
Willy Tarreau
2bc513dd31 BUILD: quic: fix build errors on FreeBSD since recent GSO changes
The following commits broke the build on FreeBSD when QUIC is enabled:

  35470d518 ("MINOR: quic: activate UDP GSO for QUIC if supported")
  448d3d388 ("MINOR: quic: add GSO parameter on quic_sock send API")

Indeed, it turns out that netinet/udp.h requires sys/types.h to be
included before. Let's just change the includes order to fix the build.
No backport is needed.
2024-08-30 18:53:49 +02:00
Frederic Lecaille
f627b9272b BUG/MEDIUM: quic: always validate sender address on 0-RTT
Wedl Michael, a student at the University of Applied Sciences St. Poelten,
reported a potential vulnerability in haproxy as described below.

An attacker could have obtained a TLS session ticket after having established
a connection to an haproxy QUIC listener, using its real IP address. The
attacker does not even have to send an application level request (HTTP3). Then
the attacker could open a 0-RTT session with a spoofed IP address
trusted by the QUIC listener to bypass IP allow/block lists and send HTTP3
requests.

To mitigate this vulnerability, it was decided to use a token which can be
provided to the client each time it successfully manages to connect to haproxy.
These tokens may be reused for future connections to validate the address/path
of the remote peer, as this is done with the Retry token which is used for the
current connection, not the next one. Such tokens are transported by NEW_TOKEN
frames which were not used until now by haproxy.

So, each time a client connects to an haproxy QUIC listener with 0-RTT
enabled, it is provided with such a token which can be reused for the
next 0-RTT session. If no such token is presented by the client,
haproxy checks if the session is a 0-RTT one, i.e. with early-data presented
by the client. Contrary to the Retry token, the decision to refuse the
connection is made only when the TLS stack has been provided with
enough early-data from the Initial ClientHello TLS message and when
these data have been accepted. Hopefully, this event arrives fast enough
to allow haproxy to kill the connection if some early-data have been accepted
without a token presented by the client.

quic_build_post_handshake_frames() has been modified to build a NEW_TOKEN
frame with this newly implemented token to be transported inside.

quic_tls_derive_retry_token_secret() was renamed to quic_do_tls_derive_token_secret()
and modified to be reused and to derive the secret for the new token implementation.

quic_token_validate() has been implemented to validate both the Retry and
the new token implemented by this patch. When this is a non-retry token
which could not be validated, the datagram received is marked as requiring
a Retry packet to be sent, and no connection is created.

When the Initial packet does not embed any non-retry token and if 0-RTT is enabled
the connection is marked with this new flag: QUIC_FL_CONN_NO_TOKEN_RCVD. As soon
as the TLS stack detects that some early-data have been provided and accepted by
the client, the connection is marked to be killed (QUIC_FL_CONN_TO_KILL) from
ha_quic_add_handshake_data(). This is done by calling the new
qc_ssl_eary_data_accepted() function. The TLS handshake is interrupted as soon
as possible, returning 0 from ha_quic_add_handshake_data(). The connection is
also marked as requiring a Retry packet to be sent (QUIC_FL_CONN_SEND_RETRY)
from ha_quic_add_handshake_data(). Then the handshake I/O handler (quic_conn_io_cb())
knows how to behave: kill the connection after having sent a Retry packet.

About TLS stack compatibility, this patch is supported by aws-lc. It is
disabled for wolfssl which does not support 0-RTT at this time thanks
to HAVE_SSL_0RTT_QUIC.

This patch depends on these commits:

     MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
     MINOR: quic: Implement qc_ssl_eary_data_accepted().
     MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct)
     BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder
     MINOR: quic: Token for future connections implementation.
     MINOR: quic: Implement quic_tls_derive_token_secret().
     MINOR: tools: Implement ipaddrcpy().

Must be backported as far as 2.6.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
8854cef036 MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
Dump the early data status from QUIC_EV_CONN_IO_CB trace event.
This is very helpful to know if the QUIC server has accepted the
early data received from clients.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
e926378375 MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct)
Modify qf_new_token structure to use a static buffer with QUIC_TOKEN_LEN
as size as defined by the token for future connections (quic_token.c).
Modify consequently the NEW_TOKEN frame parser (see quic_parse_new_token_frame()).
Also add comments to denote that the NEW_TOKEN parser function is used only by
clients and that its builder is used only by servers.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
76c80605a6 BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder
quic_build_new_token_frame() is the function which is called to build
a NEW_TOKEN frame into a buffer. The position pointer for this buffer
was not updated, leading the NEW_TOKEN frame to be malformed.

Must be backported as far as 2.6.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
f5b09dc452 MINOR: quic: Token for future connections implementation.
There exist two sorts of tokens used by QUIC. They are both used to validate
the peer address (path validation). Retry tokens are used for the current
connection the client wants to open. This patch implements the other
sort of tokens which, after having been received from a connection, may
be provided for the next connection from the same IP address to validate
it (or validate the network path between the client and the server).

The token generation is implemented by quic_generate_token(), and
the token validation by quic_token_chek(). The same method
is used as for Retry tokens to build such tokens to be reused for
future connections. The format is very simple: one byte for the format
identifier to distinguish these new tokens from the Retry token, followed
by a 32-bit timestamp. As this part is ciphered with AEAD as the cryptographic
algorithm, 16 bytes are needed for the AEAD tag. 16 more random bytes
are added to this token and serve as a salt to derive the AEAD secret used
to cipher the token. In addition to this salt, the client IP address
is also used as AAD to derive the AEAD secret. So, the length of
the token is fixed: 37 bytes.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
74caa0eece MINOR: quic: Implement quic_tls_derive_token_secret().
This function is similar to quic_tls_derive_retry_token_secret().
Its aim is to derive the secret used to cipher the token to be used
for future connections.

This patch renames quic_tls_derive_retry_token_secret() to a more
generic one, quic_do_tls_derive_token_secret(), and reuses its code.
Two arguments are added to the latter to produce both quic_tls_derive_retry_token_secret()
and the new quic_tls_derive_token_secret() function, which call
quic_do_tls_derive_token_secret().
2024-08-30 17:04:09 +02:00
Frederic Lecaille
fb7a092203 MINOR: tools: Implement ipaddrcpy().
Implement the new ipaddrcpy() function to copy only the IP address from
a sockaddr_storage struct object into a buffer.
2024-08-30 17:04:09 +02:00
Nicolas CARPi
a33407b499 CLEANUP: mqtt: fix typo in MQTT_REMAINING_LENGHT_MAX_SIZE
There was a typo in the macro name, where LENGTH was incorrectly
written. This didn't cause any issue because the typo appeared in all
occurrences in the codebase.
2024-08-30 14:58:59 +02:00
Nicolas CARPi
534e7e4598 CLEANUP: haproxy: fix typos in code comment
Use "from" instead of "form" in ha_random_boot function code comments.
2024-08-30 14:58:59 +02:00
Christopher Faulet
e4812404c5 BUG/MEDIUM: stream: Prevent mux upgrades if client connection is no longer ready
If an early error occurred on the client connection, we must prevent any
multiplexer upgrades. Indeed, it is unexpected for a mux to be initialized
with no xprt. On a normal workflow it is impossible. So it is not an
issue. But if a mux upgrade is performed at the stream level, an early error
on the connection may have already been handled by the previous mux and the
connection may be already fully closed. If the mux upgrade is still
performed, a crash can be experienced.

It is possible to have a crash with an implicit TCP>HTTP upgrade if there is no
data in the input buffer. But it is also possible to get a crash with an
explicit "switch-mode http" rule.

It must be backported to all stable versions. In 2.2, the patch must be
applied directly in stream_set_backend() function.
2024-08-28 16:38:20 +02:00
Christopher Faulet
4ef5251c44 BUG/MEDIUM: mux-h2: Set ES flag when necessary on 0-copy data forwarding
When DATA frames are sent via the 0-copy data forwarding, we must take care
to set the ES flag on the last DATA frame. It should be performed in
h2_done_ff() when IOBUF_FL_EOI flag was set by the producer. This flag is
here to know when the producer has reached the end of input. When this
happens, the h2s state is also updated. It is switched to "half-closed
local" or "closed" state depending on its previous state.

It is mainly an issue on uploads because the server may be blocked waiting
for the end of the request. A workaround is to disable the 0-copy forwarding
support for the H2 mux by setting the "tune.h2.zero-copy-fwd-send" directive
to off in your global section.

This patch should fix the issue #2665. It must be backported as far as 2.9.
2024-08-28 10:05:34 +02:00
Christopher Faulet
0d142e0756 MINOR: proxy: Add support of 429-Too-Many-Requests in retry-on status
The "429" status can now be specified on retry-on directives. PR_RE_* flags
were updated to remain sorted.

This patch should fix the issue #2687. It is quite simple so it may safely
be backported to 3.0 if necessary.
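
A hypothetical configuration sketch (retry count, retry-on events and server
address are arbitrary examples):

    backend app
        retries 3
        retry-on conn-failure 429
        server app1 192.0.2.10:8080 check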
2024-08-28 10:05:34 +02:00
William Lallemand
d2fc1ab66e MEDIUM: ssl/sample: add ssl_fc_sigalgs_bin sample fetch
This new sample fetch allows extracting the binary list contained in the
signature_algorithms (13) TLS extension.

https://datatracker.ietf.org/doc/html/rfc8446#section-4.2.3
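
As an illustration only (header name, capture size and certificate path are
placeholders; the capture must be enabled globally as described in the related
commit):

    global
        tune.ssl.capture-cipherlist-size 800

    frontend fe
        bind :8443 ssl crt /etc/haproxy/site.pem
        http-request set-header X-SSL-SigAlgs %[ssl_fc_sigalgs_bin,hex]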
2024-08-26 15:17:40 +02:00
William Lallemand
e8fecef0ff MEDIUM: ssl: capture the signature_algorithms extension from Client Hello
Activate the capture of the TLS signature_algorithms extension from the
Client Hello. This list is stored in the ssl_capture buffer when the
global option "tune.ssl.capture-cipherlist-size" is enabled.
2024-08-26 15:17:40 +02:00
William Lallemand
ac5c7158f9 MEDIUM: ssl/sample: add ssl_fc_supported_versions_bin sample fetch
This new sample fetch allows extracting the binary list contained in the
supported_versions (43) TLS extension.

https://datatracker.ietf.org/doc/html/rfc8446#section-4.2.1
2024-08-26 15:17:40 +02:00
William Lallemand
ce7fb6628e MEDIUM: ssl: capture the supported_versions extension from Client Hello
Activate the capture of the TLS supported_versions extension from the
Client Hello. This list is stored in the ssl_capture buffer when the
global option "tune.ssl.capture-cipherlist-size" is enabled.
2024-08-26 15:12:42 +02:00
William Lallemand
3c0a0f1e1b CLEANUP: ssl: cleanup the clienthello capture
In order to add more extensions, clean up the clienthello capture
function a little bit.
2024-08-26 15:12:42 +02:00
Frederic Lecaille
414e3aa6bc BUILD: quic: 32bits build broken by wrong integer conversions for printf()
Since these commits, the 32-bit build is broken due to several errors as follows:

CC      src/quic_cli.o
src/quic_cli.c: In function ‘dump_quic_full’:
src/quic_cli.c:285:94: error: format ‘%ld’ expects argument of type ‘long int’,
        but argument 5 has type ‘uint64_t’ {aka ‘long long unsigned int’} [-Werror=format=]
  285 |                         chunk_appendf(&trash, "  [initl] rx.ackrng=%-6zu tx.inflight=%-6zu(%ld%%)\n",
      |                                                                                            ~~^
      |                                                                                              |
      |                                                                                              long int
      |                                                                                            %lld
  286 |                                       pktns->rx.arngs.sz, pktns->tx.in_flight,
  287 |                                       pktns->tx.in_flight * 100 / qc->path->cwnd);
      |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                                 |
      |                                                                 uint64_t {aka long long unsigned int}

Replace several %ld by %llu with ull as printf conversion in quic_cli.c and a
%ld by %lld with (long long) as printf conversion in quic_cc_cubic.c.

Thank you to Ilya (@chipitsine) for having reported this issue in GH #2689.

Must be backported to 3.0.
2024-08-26 11:21:48 +02:00
William Lallemand
7a03ab426f BUILD: tools: environ is not defined in OS X and BSD
Add 'extern char **environ' in order to build the new functions that
manipulate the environment.

Indeed the variable environ is not required to be declared by POSIX, so
it needs to be declared manually:

"In addition, the following variable, which must be declared by the user if it is to be used directly:

extern char **environ;"

https://pubs.opengroup.org/onlinepubs/9699919799/functions/environ.html
2024-08-23 19:39:57 +02:00
Valentine Krasnobaeva
28ca7fc594 BUG/MINOR: haproxy: free init_env in deinit only if allocated
This fixes 7b78e1571 (" MINOR: mworker: restore initial env before wait
mode").

In cases when haproxy starts without any configuration, for example
'haproxy -vv', the init_env array used to back up env variables is never
allocated. So, we need to check in deinit(), when we free its memory, that
init_env is not a NULL ptr.
2024-08-23 19:08:53 +02:00
Valentine Krasnobaeva
7b78e1571b MINOR: mworker: restore initial env before wait mode
This patch is the follow-up of 1811d2a6ba (MINOR: tools: add helpers to
backup/clean/restore env).

In order to avoid unexpected behaviour in master-worker mode during a process
reload with a new configuration, when the old one contained '*env' keywords,
let's back up the initial environment before calling parse_cfg() and let's clean
and restore it in the context of the master process, just before it enters its
wait polling loop.

This will guarantee that new workers will have a new updated environment and not
the previous one inherited from the master, which does not read the configuration
when it's in wait mode.
2024-08-23 17:06:59 +02:00
Valentine Krasnobaeva
1811d2a6ba MINOR: tools: add helpers to backup/clean/restore env
'setenv', 'presetenv', 'unsetenv', 'resetenv' keywords in the configuration could
modify the process runtime environment. In master-worker mode this
creates a problem, as the configuration is read only once before forking a
worker, and then the master process does the reexec without reading any config
files, just to free the memory. So, during the reload a new worker process will
be created, but it will inherit the previous unchanged environment from the
master in wait mode, thus it won't benefit from the changes in configuration
related to '*env' keywords. This may cause unexpected behavior or some parser
errors in master-worker mode.

So, let's add a helper to back up all process env variables just before the
process reads its configuration. And let's also add helpers to clean up the
current runtime environment and to restore it to its initial state (as it was
before parsing the config).
2024-08-23 17:06:33 +02:00
Amaury Denoyelle
960d68a5af MINOR: mux-quic: correct qcc_bufwnd_full() documentation
Fix the returned value comment of qcc_bufwnd_full() which was incorrect.
2024-08-23 16:25:04 +02:00
Amaury Denoyelle
ecfedc2570 MINOR: mux-quic: add buf_in_flight to QCC debug infos
Dump <buf_in_flight> QCC field both in QUIC MUX traces and "show quic".
This could help to detect if MUX does not allocate enough buffers
compared to quic_conn current congestion window.
2024-08-22 17:48:23 +02:00
Nathan Wehrman
5c07d58e08 MINOR: config: Created env variables for http and tcp clf formats
Since we already have variables for the other formats and the
change is trivial, I thought it would be a nice addition for
completeness.
2024-08-22 09:15:58 +02:00
Willy Tarreau
9911b53d75 CLEANUP: protocol: no longer initialize .receivers nor .nb_receivers
Protocol definitions no longer need to initialize these internal fields,
as they're now properly initialized during protocol registration.
2024-08-21 17:37:46 +02:00
Willy Tarreau
1cb3b0b745 MINOR: protocol: always initialize the receivers list on registration
Till now, protocols were required to self-initialize their receivers
list head, which is not very convenient, and is quite error prone.
Indeed, it's too easy to copy-paste a protocol definition and forget
to update the .receivers field to point to itself, resulting in mixed
lists. Let's just do that in protocol_register(). And while we're at
it, let's also zero the nb_receivers entry that works with it, so that
the protocol definition isn't required to pre-initialize stuff related
to internal book-keeping.
2024-08-21 17:37:46 +02:00
Willy Tarreau
034974106f MINOR: socket: don't ban all custom families from reuseport
The test on ss_family >= AF_MAX is too strict if we want to support new
custom families, let's apply this to the real_family instead so that we
check that the underlying socket supports reuseport.
2024-08-21 17:37:46 +02:00
Willy Tarreau
2a799b64b0 MINOR: protocol: add the real address family to the protocol
For custom families, there's sometimes an underlying real address and
it would be nice to be able to directly use the real family in calls
to bind() and connect() without having to add explicit checks for
exceptions everywhere.

Let's add a .real_family field to struct proto_fam for this. For now
it's always equal to the family except for non-transferable ones such
as rhttp where it's equal to the custom one (anything else could fit).
2024-08-21 17:37:46 +02:00
Willy Tarreau
d592ebdbeb MEDIUM: socket: always properly use the sock_domain for requested families
Now we make sure to always look up the protocol's domain for an address
family. Previously we would use it as-is, which prevented from properly
using custom addresses (which is when they differ).

This removes some hard-coded tests such as in log.c where UNIX vs UDP
was explicitly checked for example. It requires a bit of care, however,
so as to properly pass value 1 in the 3rd arg of the protocol_lookup()
for DGRAM stuff. Maybe one day we'll change these for defines or enums
to limit mistakes.
2024-08-21 17:36:58 +02:00
Willy Tarreau
ba4a416c66 MINOR: protocol: add a family lookup
At plenty of places we have access to an address family which may
include some custom addresses but we cannot simply convert them to
the real families without performing some random protocol lookups.

Let's simply add a proto_fam table like we have for the protocols.
The protocols could even be indexed there, but for now it's not worth
it.
2024-08-21 16:46:15 +02:00
Willy Tarreau
732913f848 MINOR: protocol: properly assign the sock_domain and sock_family
When we finally split sock_domain from sock_family in 2.3, something
was not cleanly finished. The family is what should be stored in the
address while the domain is what is supposed to be passed to socket().
But for the custom addresses, we did the opposite, just because the
protocol_lookup() function was acting on the domain, not the family
(both of which are equal for non-custom addresses).

This is an API bug but there's no point backporting it since it does
not have visible effects. It was visible in the code since a few places
were using PF_UNIX while others were comparing the domain against AF_MAX
instead of comparing the family.

This patch clarifies this in the comments on top of proto_fam, addresses
the indexing issue and properly reconfigures the two custom families.
2024-08-21 16:46:15 +02:00
Willy Tarreau
67bf1d6c9e MINOR: quic: support a tolerance for spurious losses
Tests performed between a 1 Gbps connected server and a 100 Mbps client,
distant by 95 ms, showed that:

  - we need 1.1 MB in flight to fill the link
  - rare but inevitable losses are sufficient to make cubic's window
    collapse fast and long to recover
  - a 100 MB object takes 69s to download
  - tolerance for 1 loss between two ACKs suffices to shrink the download
    time to 20-22s
  - 2 losses go to 17-20s
  - 4 losses reach 14-17s

At 100 concurrent connections that fill the server's link:
  - 0 loss tolerance shows 2-3% losses
  - 1 loss tolerance shows 3-5% losses
  - 2 loss tolerance shows 10-13% losses
  - 4 loss tolerance shows 23-29% losses

As such while there can be a significant gain sometimes in setting this
tolerance above zero, it can also significantly waste bandwidth by sending
far more than can be received. While it's probably not a solution to real
world problems, it repeatedly proved to be a very effective troubleshooting
tool helping to figure different root causes of low transfer speeds. In
spirit it is comparable to the no-cc congestion algorithm, i.e. it must
not be used except for experimentation.
2024-08-21 08:34:30 +02:00
Willy Tarreau
fab0e99aa1 MINOR: quic: store the lost packets counter in the quic_cc_event element
Upon loss detection, qc_release_lost_pkts() notifies congestion
controllers about the event and its final time. However it does not
pass the number of lost packets, that can provide useful hints for
some controllers. Let's just pass this option.
2024-08-21 08:02:44 +02:00
Valentine Krasnobaeva
2e6e159ac4 BUG/MINOR: cfgparse-global: remove tune.fast-forward from common_kw_list
Remove tune.fast-forward from common_kw_list. It was replaced by
'tune.disable-fast-forward' and it's no longer present in "if..else if.."
parser from cfg_parse_global(). Otherwise, it may be shown as the best-match
keyword for some tune options, which is now wrong.

Should be backported in versions 2.9 and 3.0.
2024-08-20 19:16:34 +02:00
Valentine Krasnobaeva
731ef865e3 MINOR: cfgparse-global: move unsupported keywords in global list
Following the previous commits and in order to clean up cfg_parse_global(),
let's move unsupported keywords into the global list and let's add a dedicated
parser for them.
2024-08-20 19:16:33 +02:00
Valentine Krasnobaeva
55309592db MINOR: cfgparse-global: move tune options in global keywords list
In order to clean up cfg_parse_global() and to add support for the new
MODE_DISCOVERY in configuration parsing, let's move the keywords related to
tune options into the global keywords list and let's add two dedicated
parsers for them. Tune option keywords are split between the two parsers
depending on the number of parameters a given tune option needs.

The tune options parser is called by the section parser and follows the common
API, i.e. it returns -1 on failure, 0 on success and 1 on recoverable error.
In case of a recoverable error we previously returned ERR_ALERT (0x10) and
emitted an alert message at startup. The section parser treats all rc > 0 as
ERR_WARN. So, if some tune option is set twice in the global section, the tune
options parser will return 1 (in order to respect the common API), the section
parser will treat this as ERR_WARN and a warning message will be emitted during
process startup instead of an alert, as it was before.
2024-08-20 19:16:32 +02:00
Valentine Krasnobaeva
c46497f16f MINOR: cfgparse-global: move 'expose-*' in global keywords list
Following the previous commit, let's also move 'expose-*' keywords into the global
cfg_kws list and let's add a dedicated parser for them. This will simplify the
configuration parsing in the new MODE_DISCOVERY, which allows reading only the
keywords needed at the early start of the haproxy process (i.e. modes, pidfile,
chosen poller).
2024-08-20 19:16:31 +02:00
Valentine Krasnobaeva
450ce3e61b MINOR: cfgparse-global: move 'pidfile' in global keywords list
This commit cleans up cfg_parse_global() and prepares the config parser to
support MODE_DISCOVERY. This step is needed in the early starting stage, just to
figure out in which mode the process was started, to set some necessary
parameters needed for this mode and to continue the initialization
stage.

'pidfile' is part of such common keywords, which need to be parsed
very early and which are used in almost all process modes (except the
foreground, '-d').

The 'pidfile' keyword parser is called by the section parser and follows the
common API, i.e. it returns -1 on failure, 0 on success and 1 on recoverable
error. In case of a recoverable error we previously returned ERR_ALERT (0x10)
and emitted an alert message at startup. The section parser treats all rc > 0
as ERR_WARN. So, if the pidfile was already specified via the command line, the
keyword parser will return 1 (in order to respect the common API), the section
parser will treat this as ERR_WARN and a warning message will be emitted during
process startup instead of an alert, as it was before.
2024-08-20 19:16:30 +02:00
Valentine Krasnobaeva
f29be97ac7 BUG/MINOR: cfgparse-global: remove redundant goto
In the case when the given keyword is found in the global 'cfg_kws' list, we
go to the 'out' label anyway, after testing the rc returned by the keyword's
parser. So there is not much gain in performing the 'goto out' jump specifically
when rc > 0.
2024-08-20 19:16:29 +02:00
Valentine Krasnobaeva
74bc6f3d66 BUG/MINOR: cfgparse-global: clean common_kw_list
This patch fixes commits 118ac11ce
("MINOR: cfgparse-global: move mode's keywords in cfg_kw_list") and 83ff4db18
(MINOR: cfgparse-global: move no<poller_name> in cfg_kw_list).

'common_kw_list' serves to show the best-match keyword in cfg_parse_global(), if
the given keyword was not parsed in "if..else if.." cases. cfg_parse_global()
is still used as a parser for some keywords from the global section.

Mode-specific and no<poller_name> keywords now have their own parsers. They no
longer take place in the "if..else if.." from cfg_parse_global() and they are
registered in the 'cfg_kws' list. So, there is no longer need to duplicate
them in the 'common_kw_list'. Otherwise, they will be shown twice in parser
error message.
2024-08-20 19:16:28 +02:00
Valentine Krasnobaeva
4291d10b44 BUG/MINOR: cfgparse-global: fix err msg in mworker keyword parser
This patch fixes the commit 118ac11ce
("cfgparse-global: move mode's keywords in cfg_kw_list"). Error message
delivered by keyword parser in **err is always shown with ha_alert() by the
caller cfg_parse_global(). The caller always supplies these alerts with the
filename and the line number.
2024-08-20 19:16:27 +02:00
Amaury Denoyelle
0d6112b40b MINOR: mux-quic: retry after small buf alloc failure
The previous commit switched to small buffers for HTTP/3 HEADERS emission.
This ensures that several parallel streams can allocate their own buffer
without hitting the connection buffer limit based now on the congestion
window size.

However, this prevents the transmission of responses with uncommonly
large headers. Indeed, if all headers cannot be encoded in a single
buffer, an error is reported which causes the whole connection to be closed.

Adjust this by implementing a realloc API exposed by QUIC MUX. This
allows the application layer to switch from a small to a default buffer and
restart its processing. This guarantees again that headers no longer
than bufsize can be properly transferred.
2024-08-20 18:12:27 +02:00
Amaury Denoyelle
b355e89bf9 MEDIUM: h3: allocate small buffers for headers frames
A major change was recently implemented to change QUIC MUX Tx buffer
allocation limit, which is now based on the current connection
congestion window size. As this size may be smaller than the previous
static value, it is likely that the limit will be reached more
frequently.

When using HTTP/3, the majority of request streams are used for small
object exchanges. Every response starts with a HEADERS frame which
should be much smaller in size than the default buffer. But as the whole
buffer size is accounted against the congestion window, a single stream
can block others even if only emitting a single HEADERS frame, which is
suboptimal for bandwidth usage, if the congestion window is small enough.

To adapt to this new situation, rely on the newly available small
buffers to transfer the HEADERS frame of the response. This at least
guarantees that several parallel streams can allocate their own buffer for
the first part of the response, even with a small congestion window.

The situation could be further improved by using various indications of the
data size and selecting a small buffer if sufficient. This could be done
for example via the Content-Length value or the HTX extra field. However
this must be the subject of a dedicated patch.
2024-08-20 18:12:27 +02:00
Amaury Denoyelle
885e4c5cf8 MINOR: quic: support sbuf allocation in quic_stream
This patch extends the qc_stream_desc API to be able to allocate small
buffers. The QUIC MUX API is similarly updated as ultimately each application
protocol is responsible for choosing between a default or a smaller buffer.

Internally, the type of allocated buffer is remembered via qc_stream_buf
instance. This is mandatory to ensure that the buffer is released in the
correct pool, in particular as small and standard buffers can be
configured with the same size.

This commit is purely an API change. For the moment, small buffers are
not used. This will changed in a dedicated patch.
2024-08-20 18:12:27 +02:00
Amaury Denoyelle
d0d8e57d47 MINOR: quic: define sbuf pool
Define a new buffer pool reserved for allocating smaller memory areas. For
the moment, its usage is restricted to QUIC; as such it is declared
in the quic_stream module.

Add a new config option "tune.bufsize.small" to specify the size of the
allocated objects. A special check ensures that it is not greater than
the default bufsize, to avoid unexpected effects.
2024-08-20 18:12:27 +02:00
Amaury Denoyelle
1de5f718cf MINOR: quic/config: adapt settings to new conn buffer limit
The QUIC MUX buffer allocation limit is now directly based on the underlying
congestion window size. The previous static limit based on conn-tx-buffers
is now unused. As such, this commit adds a warning to inform users
that this setting is now obsolete.

Secondly, update the max-window-size setting. It is now the main entrypoint
to limit both the maximum congestion window size and the number of QUIC
MUX buffers allocated for emission. Remove its special value '0' which was
used to automatically adjust it on the now unused conn-tx-buffers.
2024-08-20 17:59:35 +02:00
Amaury Denoyelle
aeb8c1ddc3 MAJOR: mux-quic: allocate Tx buffers based on congestion window
Each QUIC MUX may allocate buffers for MUX stream emission. These
buffers are then shared with quic_conn to handle ACK reception and
retransmission. A limit on the number of concurrent buffers used per
connection has been defined statically and can be updated via a
configuration option. This commit replaces the limit to instead use the
current underlying congestion window size.

The purpose of this change is to remove the artificial static buffer
count limit, which may be difficult to choose. Indeed, if a connection
performs with a minimal loss rate, the buffer count would severely limit
its throughput. It could be increased to fix this, but it would also
impact other connections, even those with less optimal performance,
causing too much extra data buffering on the MUX layer. By using the
dynamic congestion window size, haproxy ensures that MUX buffering
corresponds roughly to the network conditions.

Using the QCC <buf_in_flight> counter, a new buffer can be allocated only
if its value is less than the current window size. If not, QCS emission is
interrupted and the haproxy stream layer will subscribe until a new buffer
is ready.
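
In simplified terms (toy structure and field names, not the real QCC), the
gating described above boils down to the following check:

  /* toy sketch: a stream may allocate a new Tx buffer only while the
   * bytes already held in allocated Tx buffers stay below the window */
  struct toy_qcc {
          unsigned long long buf_in_flight; /* bytes in allocated Tx buffers */
          unsigned long long cwnd;          /* current congestion window size */
  };

  static int can_alloc_tx_buf(const struct toy_qcc *qcc)
  {
          /* otherwise emission is interrupted and the stream layer
           * subscribes until a buffer is released or the window grows */
          return qcc->buf_in_flight < qcc->cwnd;
  }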

One of the critical parts is to ensure that the MUX layer previously
blocked on buffer allocation is properly woken up when sending can be
retried. This occurs on two occasions :

* after an already used Tx buffer is cleared on ACK reception. This case
  is already handled by qcc_notify_buf() via quic_stream layer.

* on congestion window increase. A new qcc_notify_buf() invocation is
  added into qc_notify_send().

Finally, remove <avail_bufs> QCC field which is now unused.

This commit is labelled MAJOR as it may have unexpected effects and could
cause a significant behavior change. For example, in the previous
implementation QUIC MUX would be able to buffer more data even if the
congestion window is small. With this patch, data cannot be transferred
from the stream layer, which may cause more streams to be shut down on
client timeout. Another effect may be more CPU consumption as the
connection limit would be hit more often, causing more streams to be
interrupted and woken up in cycles.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
000976af58 MINOR: mux-quic: define buf_in_flight
Define a new QCC counter named <buf_in_flight>. Its purpose is to
account for the current sum of all allocated stream buffer sizes used for
emission.

For the moment, this counter is only updated on buffer allocation and
deallocation. It will be used to replace <avail_bufs> once the congestion
window is used as the limit for buffer allocation in a future commit.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
f9777bea30 MINOR: h3: mark control stream as metadata
Work is currently in progress to change the QUIC MUX buffer allocation
limit from a configurable static value to the size of the congestion
window instead. This change may cause the buffer allocation limit to be
triggered more frequently.

To ensure HTTP/3 control emission is not perturbed by this change, mark
the stream with qcs_send_metadata(). This ensures that buffer allocation
for this stream won't be subject to the connection limit. This is
necessary to guarantee that SETTINGS and GOAWAY frames are emitted.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
4c4bf26f44 MEDIUM: mux-quic: implement API to ignore txbuf limit for some streams
Define a new qc_stream_desc flag QC_SD_FL_OOB_BUF. This is to mark
streams which are not subject to the connection limit on allocated MUX
stream buffers.

The purpose is to simplify the handling of QUIC MUX streams which do not
transfer data and as such are not driven by the haproxy layer, for example
the HTTP/3 control stream. These streams interact synchronously with the
QUIC MUX and cannot retry emission in case of a temporary failure.

This commit will be useful once the connection buffer allocation limit is
reimplemented to directly rely on the congestion window size. This will
probably cause the buffer limit to be reached more frequently, maybe
even on QUIC MUX initialization. As such, it will be possible to mark
control streams and prevent them from being subject to the buffer limit.

The QUIC MUX exposes a new function qcs_send_metadata(). It can be used by
an application protocol to specify which streams are used for control
exchanges. For the moment, no stream uses this mechanism.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
f4d1bd0b76 MINOR: mux-quic: account stream txbuf in QCC
A limit per connection is put on the number of buffers allocated by the
QUIC MUX for emission across all its streams. This ensures memory
consumption remains under control. This limit is simply expressed as a
count of buffers which can be concurrently allocated for each
connection.

As such, the quic_conn structure was used to account for currently
allocated buffers. However, a quic_conn never allocates new stream
buffers. This is only done at the QUIC MUX layer. As such, this commit
moves buffer accounting inside the QCC structure. This simplifies the
API, most notably qc_stream_buf_alloc() usage.

Note that this commit inverts the accounting. Previously, it was
initially set to 0 and incremented for each allocated buffer. Now, it is
set to the maximum value and decremented for each buf usage. This is
considered clearer to use.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
635fbaaa4a MINOR: quic: allocate stream txbuf via qc_stream_desc API
This commit simply adjusts QUIC stream buffer allocation. This operation
is conducted by the QUIC MUX using the qc_stream_desc layer. Previously,
qc_stream_buf_alloc() would return a qc_stream_buf instance and the QUIC
MUX would finalize the buffer area allocation. Change this to perform the
buffer allocation directly in qc_stream_buf_alloc().

This patch clarifies the interaction between the QUIC MUX and
qc_stream_desc. It is cleaner to allocate the buffer via qc_stream_desc
as it is already responsible for freeing the buffer.

It also ensures that connection buffer accounting is only done after the
whole qc_stream_buf and its buffer are allocated. Previously, the
increment operation was performed between the two steps. This was not an
issue, as this kind of error triggers the whole connection closure.
However, if in the future this is handled as a stream closure instead,
this commit ensures that the buffer remains valid in all cases.
2024-08-20 17:17:17 +02:00
Amaury Denoyelle
c24c8667b2 MINOR: quic: define max-window-size config setting
Define a new global keyword tune.quic.frontend.max-window-size. This
allows setting globally the maximum congestion window size for each QUIC
frontend connection.

The default value is 0. It is a special value which automatically derives
the size from the configured QUIC connection buffer limit. This is
similar to the previous "quic-cc-algo" behavior, which can be used to
override the maximum window size per bind line.
2024-08-20 17:02:29 +02:00
Amaury Denoyelle
280b61468a MINOR: quic: extract config window-size parsing
quic-cc-algo is a bind line keyword which allows selecting a QUIC
congestion algorithm. It can take an optional integer to specify the
maximum window size. This value is an integer and supports the suffixes
'k', 'm' and 'g' to specify kilobytes, megabytes and gigabytes
respectively.

Extract the maximum window size parsing into a dedicated function named
parse_window_size(). It accepts as input an integer value with an
optional suffix, 'k', 'm' or 'g'. The first invalid character is
returned by the function to the caller.
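
A minimal self-contained sketch of this parsing technique (illustrative
only, not the actual parse_window_size() code; binary units assumed):

  #include <stdlib.h>

  /* parse an integer with an optional 'k', 'm' or 'g' suffix and report
   * the first invalid character back to the caller via <end> */
  static unsigned long long parse_size_suffix(const char *str, const char **end)
  {
          char *stop;
          unsigned long long value = strtoull(str, &stop, 10);

          switch (*stop) {
          case 'g': value <<= 10; /* fall through */
          case 'm': value <<= 10; /* fall through */
          case 'k': value <<= 10; stop++; break;
          default: break;
          }

          *end = stop; /* first character not consumed by the parser */
          return value;
  }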

No functional change. This commit will allow quickly implementing a new
keyword to configure a default congestion window size in the global
section.
2024-08-20 16:07:22 +02:00
Nicolas CARPi
bba679026c BUG/MINOR: stats: add lang attribute to html tag
The "html" element of the stats page was missing a "lang" attribute.
This change specifies the "en" value, which corresponds to english
language.

It is also a required element for WCAG Success Criterion 3.1.1, which
renders the web more accessible through a set of requirements. In this
case it allows assistive technologies such as screen readers to
determine the language of the page.

MDN page: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/lang
HTML standard: https://html.spec.whatwg.org/multipage/dom.html#attr-lang
WCAG criterion: https://www.w3.org/WAI/WCAG22/Understanding/language-of-page.html
2024-08-20 15:55:45 +02:00
Nicolas CARPi
9318a624a1 CLEANUP: stats: use modern DOCTYPE tag
Switching the stats page doctype to the modern standard makes it shorter and
less complex, and it is the doctype recommended by the current HTML standard.
It makes it clear that we do not want to run in quirks mode. More information below.

Quirks mode: https://developer.mozilla.org/en-US/docs/Web/HTML/Quirks_Mode_and_Standards_Mode
HTML Standard: https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
2024-08-20 15:55:31 +02:00
Nicolas CARPi
c63d558e41 BUG/MINOR: stats: fix color of input elements in dark mode
Previously the text color was dark on a dark background; this change makes it
white, and thus readable. This is visible on the "Scope" input field.
2024-08-20 15:55:14 +02:00
Valentine Krasnobaeva
8b1dfa9def MINOR: cfgparse: limit file size loaded via /dev/stdin
load_cfg_in_mem() can continuously reallocate memory in order to load an
extremely large input from /dev/stdin, until it fails with ENOMEM, which means
that the process has consumed all available RAM. In containers and
virtualized environments this is particularly harmful.

So, in order to prevent this, let's introduce MAX_CFG_SIZE, set to 10MB, which
limits the size of input supplied via /dev/stdin.
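
A minimal sketch of such a bounded read (illustrative only, not the
actual load_cfg_in_mem() code; the 10MB cap mirrors the value above):

  #include <stdio.h>
  #include <stdlib.h>

  #define MAX_CFG_SIZE (10 * 1024 * 1024)

  /* read at most MAX_CFG_SIZE bytes from a stream of unknown size */
  static char *read_bounded(FILE *f, size_t *len)
  {
          size_t cap = 65536, used = 0;
          char *buf = malloc(cap);

          while (buf && !feof(f) && !ferror(f)) {
                  if (used == cap) {
                          char *tmp;

                          if (cap >= MAX_CFG_SIZE) { /* input too large */
                                  free(buf);
                                  return NULL;
                          }
                          cap = cap * 2 < MAX_CFG_SIZE ? cap * 2 : MAX_CFG_SIZE;
                          tmp = realloc(buf, cap);
                          if (!tmp) {
                                  free(buf);
                                  return NULL;
                          }
                          buf = tmp;
                  }
                  used += fread(buf + used, 1, cap - used, f);
          }
          *len = used;
          return buf;
  }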
2024-08-20 14:28:34 +02:00
Nathan Wehrman
fd48b28315 MINOR: Implements new log format of option tcplog clf
Some systems require logs in the CLF format, which meant that I could not
send my logs for proxies in mode tcp to those servers. This implements a
format that uses log variables that are compatible with TCP mode frontends
and replaces the traditional HTTP values in the CLF format to make them
stand out. Instead of logging the method and URI like
"GET /example HTTP/1.1", it will log "TCP ", and for the response code I
used "000" so it would be easy to separate from legitimate HTTP traffic.
Now log servers that require a CLF format can see the timings for TCP
traffic as well as HTTP.
2024-08-20 07:46:34 +02:00
Aurelien DARRAGON
f8299bc5ea MINOR: log: "drop" support for log-profile steps
It is now possible to use "drop" keyword for "on" lines under a
log-profile section to specify that no log at all should be emitted for
the specified step (setting an empty format was not sufficient to do so
because only the log payload would be empty, not the log header, thus the
log would still be emitted).

It may be useful to selectively disable logging at specific steps for a
given log target (since the log profile may be set on log directives):

log-profile myprof
  on request format "blabla" sd "custom sd"
  on response drop

New testcase was added to reg-tests/log/log_profiles.vtc
2024-08-19 18:53:01 +02:00
Aurelien DARRAGON
41ca89bc6f MEDIUM: log: relax some checks and emit diag warnings instead in lf_expr_postcheck()
With 7a21c3a ("MAJOR: log: implement proper postparsing for logformat
expressions"), which finally made postparsing checks reliable, we started
to get reports from users that couldn't start haproxy 3.0 with configs that
used to work in the past. The current situation is described in GH #2642.

While the checks are mostly relevant, it turns out they are not strictly
needed anymore from a technical point of view. Most of them were useful in
the early logformat implementation to prevent runtime bugs due to the use of
an alias or fetch at runtime from an incompatible proxy. For a few
versions already, the code handling fetches and log aliases has been robust
enough to support fetches/aliases used from the wrong context: all it
does is that the fetch/alias will silently fail if it's not available.

This can be proved by the fact that even if the postparsing checks were
partially broken in the past, it didn't cause runtime issues (at least
on recent haproxy versions).

Most of these checks can now be seen as configuration hints: when a check
triggers, it will indicate a configuration inconsistency in most cases,
but there are some corner cases where it is not possible to know at config
time if the conditions will be met for the alias/fetch to work properly.
So instead of failing with a hard error like we did so far, let's just be
more permissive and report our findings using "diag_warning": such
warnings are only emitted when haproxy is started with the '-dD' cli option.

We also took this opportunity to improve messages clarity and make them
more precise (report the offending item instead of complaining about the
whole expression because of a single element).

With this patch, configs that used to start before 7a21c3a shouldn't
trigger hard errors anymore.

This may be backported in 3.0.
2024-08-16 14:25:10 +02:00
Valentine Krasnobaeva
911f4d93d4 BUG/MINOR: pattern: pat_ref_set: return 0 if err was found
pat_ref_set_elt() returns 0 if we run out of memory or can't parse a new
map value. Any error message emitted by pat_ref_set_elt() is saved in the err
buffer, if it is provided by the caller. These error messages are accumulated
during the loop.

pat_ref_set() is used to update the values in a map that refer to the same
given key. If pat_ref_set_elt() fails during the update, let's return 0 to
the caller immediately. We have the same non-unique key and the same new
value in each loop iteration. So it seems quite odd to accumulate the same
error messages and print them in the CLI:

        > add map @1 mytest.map <<
        + 1.0.1.11 TestA
        + 1.0.1.11 TESTA
        + 1.0.1.11 test_a
        +

        > set map mytest.map 1.0.1.11 15
         unable to parse '15' unable to parse '15' unable to parse '15'.

cli_parse_set_map(), which calls pat_ref_set() to update map, will return only
one error message with this patch:

> set map mytest.map 1.0.1.11 15
 unable to parse '15'.

hlua_set_map() and http_action_set_map() don't provide an error buffer and
will just exit on the first error.

This should be backported in all stable versions.
2024-08-13 16:13:43 +02:00
Valentine Krasnobaeva
4f2493f355 BUG/MINOR: pattern: pat_ref_set: fix UAF reported by coverity
memprintf() performs a realloc and then updates the pointer to the output
buffer where it has written the data. So free() is called on the previous
buffer address, if one was provided.

pat_ref_set_elt() uses memprintf() to write its error message, as does
pat_ref_set(). So, when we re-enter the while loop for the second time and
pat_ref_set_elt() has returned, the *err ptr (the previous value of *merr) has
already been freed by memprintf() from pat_ref_set_elt().

The 'if (!found)' condition is false at this point, because we've found a
node during the first loop iteration. So, the second memprintf(), in order to
write its error message, calls free(*err) again.

This should be backported in all stable versions.
2024-08-13 16:13:41 +02:00
Willy Tarreau
0982bfd999 BUG/MINOR: tools: make fgets_from_mem() stop at the end of the input
The memchr() used to look for the LF character must consider the end of
input, not just the output buffer size.

This was found by oss-fuzz:
   https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=71096

No backport is needed.
2024-08-11 14:44:28 +02:00
William Lallemand
75944e266e CLEANUP: mworker/cli: clean up the mode handling
Clean up the mode handling by refactoring the string constants
that are written multiple times.
2024-08-09 17:47:20 +02:00
Amaury Denoyelle
48514c118c BUG/MINOR: h3: properly reject too long header responses
When encoding HTX to HTTP/3 headers on the response path, a bunch of
ABORT_NOW() were used when there was not enough buffer room. In most cases
this is safe as the output buffer has just been allocated and so is empty at
the start of the function. However, with a header list longer than a
whole buffer, this would cause an unexpected crash.

Fix this by replacing the ABORT_NOW() statements with a proper error return
path. For the moment, this causes the whole connection to be closed
rather than only the stream. This may be further improved in the future.

Also remove ABORT_NOW() when encoding the frame length at the end of headers
or trailers encoding. Buffer room is sufficient there as it was already
checked earlier in the same function.

This should be backported up to 2.6. Special care should be taken
however as this code path has changed frequently :
* for 2.9 and older, the extra following statement must be inserted
  prior each newly added goto statement :
  h3c->err = H3_INTERNAL_ERROR;
* for 2.6, trailers support is not implemented. As such, related chunks
  should just be ignored when backporting.
2024-08-09 17:41:16 +02:00
Amaury Denoyelle
8939d8e473 MINOR: mux-quic: do not trace error in qcc_send_frames() on empty list
qcc_send_frames() can be called with an empty list and returns
immediately with an error code. This is convenient so that it can be called
in a while loop.

Remove the "error" trace emitted in this case and replace it
with a less alarming "leaving on..." message. This should help debugging
when traces are active.
2024-08-09 17:41:16 +02:00
Valentine Krasnobaeva
9fc69ebc0a MINOR: proto_uxst: copy errno in errmsg for syscalls
Let's copy errno into the error messages which we emit when listen() or
connect() fail. This is helpful for debugging.
2024-08-09 17:38:42 +02:00
Valentine Krasnobaeva
16e89f6b5c BUG/MINOR: cfgparse: parse_cfg: fix null ptr dereference reported by coverity
This commit fixes potential null ptr dereferences reported by coverity; see
more details about it in issues #2676 and #2668.

The 'outline' ptr, which is explicitly initialized to NULL as a temporary
buffer to store split keywords, may in theory be implicitly dereferenced in
some corner cases (which we haven't encountered yet with real-world
configurations) in 'if (!**args)'. The parse_line() code, called before
under some conditions, assigns args[arg] = outline + outpos, and outpos's
initial value is 0.
2024-08-09 15:43:29 +02:00
Valentine Krasnobaeva
eb82358690 BUG/MINOR: proto_uxst: delete fd from fdtab if listen() fails
This patch is done mostly as a safeguard in order not to trigger the
BUG_ON(fdtab[fd].owner != NULL) check if listen() fails on a UNIX domain
socket.

In uxst_bind_listener(), pretty much the same logic of closing the socket on
the error path was kept as in tcp_bind_listener() before. The use of
fd_delete() was not generalized when support for the UNIX sock_stream
protocol was implemented. So, let's remove the fd from fdtab on failure,
instead of closing it. Otherwise, uxst_bind_listener(), which may be called
in a loop for each receiver, will obtain the same fd via socket() for the
next receiver. Then it will bind it again and will try to re-insert it in
fdtab.

This can be backported to all stable versions.
2024-08-09 15:23:28 +02:00
Amaury Denoyelle
f3c75a52df BUG/MINOR: mux-quic: do not send too big MAX_STREAMS ID
QUIC stream IDs are expressed as QUIC variable-length integers, which cover
the range from 0 to 2^62 - 1. As such, it is forbidden to send an ID in a
MAX_STREAMS flow-control frame which would allow exceeding this value.
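
For reference, the largest value encodable as a QUIC variable-length
integer is 2^62 - 1, so a compliant sender has to keep any advertised
limit within that range; a trivial sketch (not the actual fix):

  #include <stdint.h>

  #define QUIC_VARINT_MAX ((1ULL << 62) - 1)

  /* clamp a flow-control value so that it stays encodable as a varint */
  static inline uint64_t clamp_varint(uint64_t v)
  {
          return v > QUIC_VARINT_MAX ? QUIC_VARINT_MAX : v;
  }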

This patch fixes MAX_STREAMS emission to ensure the sent value is valid.
This also ensures that the peer cannot open a stream with an invalid ID,
as this would now cause a flow-control violation instead.

This must be backported up to 2.6.
2024-08-09 14:33:49 +02:00
Valentine Krasnobaeva
aae2ff7691 MINOR: startup: fix unused value reported by coverity
An unused 0 was assigned to ret, as it is overwritten by the error code of
read_cfg(). This issue was reported by coverity.
2024-08-08 19:54:12 +02:00
Valentine Krasnobaeva
da82f08055 MINOR: cfgparse: load_cfg_in_mem: fix null ptr dereference reported by coverity
This helps to optimize load_cfg_in_mem() a bit and fixes a potential null ptr
dereference in the fread() call. If (read_bytes + bytes_to_read) equals the
initial chunk_size (zero), realloc is never called and *cfg_content keeps its
NULL value.

So, let's ensure that the initial number of bytes to read
(read_bytes + bytes_to_read) is strictly positive when we enter the loop for
the first time.
2024-08-08 19:54:12 +02:00
William Lallemand
b75edf2f11 BUG/MEDIUM: mworker/cli: fix pipelined modes on master CLI
Since commit 3d93ecc ("BUG/MAJOR: cli: Restore non-interactive mode
behavior with pipelined commands") and commit 598c7f16 ("BUG/MEDIUM:
cli: Warn if pipelined commands are delimited by a \n"), the pipelined
command on the master CLI are either broken or emit warnings depending
on which version.

The reason is that the modes applied on the master CLI are saved in the
current CLI session, and then reinserted for each pipelined command;
however, these commands were inserted as new lines.

For example:

 "@1; expert-mode on; debug dev log foo; debug dev log bar"

 Would be sent as:

  "expert mode on\ndebug dev log foo"
  "expert mode on\ndebug dev log bar"

This patch fixes the issue by using the new ci_insert() function, which
inserts a string instead of a newline, and the commands are now suffixed
by ';' upon insertion, allowing a correct pipelined command chain.

This must be backported with the previous commit introducing ci_insert()
in every stable version.

This is broken since the 3.0 version, but it emits a warning in every
version below, because 598c7f164 was backported.
2024-08-08 17:29:37 +02:00
William Lallemand
b2a8e8731d MINOR: channel: implement ci_insert() function
ci_insert() is a function which allows inserting a string <str> of size
<len> at position <pos> of the input buffer. It is the equivalent of
ci_insert_line2(), but without inserting '\r\n'.
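
The underlying operation is a plain in-place insertion; a self-contained
toy sketch of the idea (flat buffer, not the real channel API):

  #include <string.h>

  /* insert <len> bytes from <str> at offset <pos> of a buffer holding
   * <used> bytes out of <size>; returns the new length, or 0 if there
   * is not enough room or <pos> is out of range */
  static size_t buf_insert(char *buf, size_t size, size_t used,
                           size_t pos, const char *str, size_t len)
  {
          if (used + len > size || pos > used)
                  return 0;
          memmove(buf + pos + len, buf + pos, used - pos); /* shift the tail */
          memcpy(buf + pos, str, len);                     /* copy new bytes */
          return used + len;
  }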
2024-08-08 17:29:37 +02:00
Valentine Krasnobaeva
46181e730a MINOR: proto_tcp: tcp_bind_listener: copy errno in errmsg
Let's copy errno into the errmsg produced by tcp_bind_listener() if it fails
in a syscall. This is helpful to debug issues while binding listeners.
2024-08-08 16:34:13 +02:00
Valentine Krasnobaeva
81f48395b3 BUG/MINOR: proto_tcp: keep error msg if listen() fails
If listen() fails, we need to keep the message about it, which is then copied
into the errmsg buffer on the error path. This buffer is properly provided by
the caller (protocol_bind_all()) and reallocated if needed by memprintf(), but
it was deleted without being returned.

This can be backported to all stable versions.
2024-08-08 16:34:06 +02:00
Valentine Krasnobaeva
308c6881c0 BUG/MINOR: proto_tcp: delete fd from fdtab if listen() fails
If listen() fails, the fd should be deleted from fdtab, not just closed.
Otherwise, sock_inet_bind_receiver(), which is called in a loop for each
receiver, will obtain the same fd via socket() for the next receiver
registered in the receivers list. Then it will bind it again and try to
re-insert it in fdtab, and fd_insert() will trigger the
BUG_ON(fdtab[fd].owner != NULL) check.

When the tcp_bind_listener() code was implemented, the use of fd_delete()
was not yet generalized and this spot remained overlooked.

This can be backported to all stable versions.
2024-08-08 16:33:53 +02:00
Valentine Krasnobaeva
c6cfa7cb4a MINOR: startup: rename readcfgfile in parse_cfg
As readcfgfile no longer opens configuration files and reads them with fgets,
but performs only the parsing of provided data, let's rename it to parse_cfg by
analogy with read_cfg in haproxy.c.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
5b52df4c4d MEDIUM: startup: load and parse configs from memory
Let's call the load_cfg_in_mem() helper for each configuration file to load
its content into some area in memory. Adapt the readcfgfile() parser function
accordingly. In order to limit changes in its scope, we give as an argument a
cfgfile structure, already filled in init_args() and in load_cfg_in_mem() with
the file metadata and content.

The parser function (readcfgfile()) now uses fgets_from_mem() instead of the
standard fgets from libc implementations.

The SPOE filter parses its own configuration file, pointed to by the 'config'
keyword, in the configuration already loaded in memory. So, let's allocate and
fill a supplementary cfgfile structure for it, which is not referenced in the
cfg_cfgfiles list. This structure and the memory holding the SPOE filter
configuration content are freed immediately in parse_spoe_flt(), when
readcfgfile() returns.

The HAProxy OpenTracing filter also uses its own configuration file. So, let's
follow the same logic as for the SPOE filter.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
2bb34edb0b MEDIUM: startup: make read_cfg() return immediately on ENOMEM
This commit prepares read_cfg() to call the load_cfg_in_mem() helper in order
to load configuration files in memory. Before, read_cfg() called the parser
for all files from the cfg_cfgfiles list and accumulated the parser's errors
and memprintf's errors in a for_each loop. memprintf's errors did not stop
this loop and were accounted for just after it.

Now, as we plan to load configuration files in memory, we stop the loop if
memprintf() fails and we show an appropriate error message with ha_alert. The
process then terminates. So not all accumulated syntax-related errors will be
shown before exiting in this case, but we have to stop because we have run out
of memory.

If we can't open the current file or we fail to allocate memory to store some
configuration line, the previous behaviour is kept: the process emits an
appropriate alert message and exits.

If the parser returns some syntax-related error on the current file, the
previous behaviour is kept as well. We accumulate such errors for all parsed
files and check them just after the loop. All syntax-related errors for all
files are then shown, as before, in ha_alert messages line by line during
startup. Then the process exits with 1.

As the cfg_cfgfiles list now contains many pointers to memory areas with
configuration file content, and this content could be big, it's better to
free the list explicitly when parsing is finished. So, let's change
read_cfg() to return an integer value to its caller init(), and let's perform
the free routine at the caller level, as the cfg_cfgfiles list was initialized
and initially filled at this level.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
007f7f2f02 MINOR: tools: add fgets_from_mem
Add an fgets_from_mem() helper to read lines from configuration files, which
are now stored as memory chunks. In order to limit changes in the first-level
parser code (readcfgfile()), it is better to reimplement the standard fgets,
i.e. to have an fgets which can read serialized data line by line from a
memory area instead of a file stream, while keeping the same behaviour as
libc's fgets implementations.
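
The idea can be sketched as follows (simplified and self-contained, not
the exact fgets_from_mem() code):

  /* fgets-like read from a memory area: copy at most <size> - 1 bytes
   * starting at <*pos>, up to and including the next '\n', NUL-terminate
   * <dst>, advance <*pos>, and return <dst>, or NULL once <*pos> has
   * reached <end> (the end of the loaded file) */
  static char *mem_fgets(char *dst, int size, const char **pos, const char *end)
  {
          const char *cur = *pos;
          int i = 0;

          if (size <= 0 || cur >= end)
                  return NULL;

          while (i < size - 1 && cur < end) {
                  dst[i++] = *cur;
                  if (*cur++ == '\n')
                          break;
          }
          dst[i] = '\0';
          *pos = cur;
          return dst;
  }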
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
03e63b98ca MINOR: cfgparse: load_cfg_in_mem: take in account file size
Let's take into account the given file size, when it is reported via stat.

It's very convenient for large configuration files, as this allows performing
only one memory allocation call for precisely the needed file size. This also
allows performing only one call to fread().

We need to provide fread() with file_stat.st_size + 1 to be able to catch EOF.
This way it sets the feof(f) flag, which allows exiting from the loop
immediately, just after the fread call.

If /dev/stdin or /dev/null is provided as a file, we continue to read the
configuration chunk by chunk, since stat doesn't report a size.
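
As a rough illustration of that single-allocation path (simplified, with a
hypothetical helper name; error handling and the chunk-by-chunk fallback
omitted):

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>

  /* read a regular file of known size with one malloc() and one fread() */
  static char *read_whole_file(const char *path, size_t *len)
  {
          struct stat st;
          FILE *f = fopen(path, "r");
          char *buf = NULL;

          *len = 0;
          if (f && stat(path, &st) == 0 && st.st_size > 0) {
                  buf = malloc(st.st_size + 1);
                  if (buf) {
                          /* asking for one extra byte makes fread() hit EOF
                           * and set feof(f) in a single call */
                          size_t n = fread(buf, 1, (size_t)st.st_size + 1, f);

                          *len = n <= (size_t)st.st_size ? n : (size_t)st.st_size;
                          buf[*len] = '\0';
                  }
          }
          if (f)
                  fclose(f);
          return buf;
  }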
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
5b9ed6e4be MINOR: cfgparse: add load_cfg_in_mem
Add the load_cfg_in_mem() helper, which allows storing the content of a given
file in memory.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
bafb0ce272 MINOR: startup: adapt list_append_word to use cfgfile
The list_append_word() helper was previously used only to chain configuration
file names in a list. As we now start to use the cfgfile structure, which
represents an entire file in memory along with its metadata, let's adapt this
helper to use this structure and rename it to list_append_cfgfile().

Adapt the functions which process configuration files and directories to use
the cfgfile structure and list_append_cfgfile() instead of a wordlist.
2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva
39f2a19620 REORG: tools: move list_append_word to cfgparse
Let's move list_append_word to cfgparse.c as it is used only to fill
cfg_cfgfiles list with configuration file names.
2024-08-07 18:41:41 +02:00
Aurelien DARRAGON
a6d1eb8f5d MINOR: server: ensure max_events_at_once > 0 in server_atomic_sync()
In 8f1fd96 ("BUG/MEDIUM: server/addr: fix tune.events.max-events-at-once
event miss and leak"), we added a comment saying that
tune.events.max-events-at-once is assumed to be strictly positive.

It is so because the keyword parser forces values between 1 and 10000:
we don't want less than 1 because it wouldn't make any sense, and 10k
max because beyond that we could create contention in server_atomic_sync()

Now as the above commit implements a do..while it heavily relies on the
fact that the budget is at least 1. Upon soft-stop, we break away from
the loop without decrementing the budget. With all that in mind, it is
safe to assume that the 'remain' counter will only fall to 0 if the task
runs out of budget while doing work, in which case the task still exists
and must be rescheduled.

As seen in GH #2667 this assumption was ambiguous, so let's make it
official by adding a pair of BUG_ON() that make it explicit that it
works because remain 'cannot' be 0 unless the entire budget was
consumed.

No backport needed.
2024-08-07 18:31:35 +02:00
Amaury Denoyelle
3ef1ee477d BUG/MINOR: quic: prevent freeze after early QCS closure
A connection freeze may occur if a QCS is released before transmitting
any data. This can happen when an error is detected early by the stream,
for example during HTTP response headers encoding, forcing the whole
connection to be closed.

In this case, a connection error is registered by the QUIC MUX to the
lower layer. The MUX is then released and the xprt layer is notified to
prepare CONNECTION_CLOSE emission. However, this is prevented because the
quic_conn streams tree is not empty, as it still contains the qc_stream_desc
previously attached to the failed QCS instance. The connection will freeze
until the QUIC idle timeout.

This situation is caused by an omission during the qc_stream_desc release
operation. In the described situation, the qc_stream_desc current buffer is
empty and can thus be removed, which is the purpose of this patch. This
unblocks the previously failing situation, with the qc_stream_desc removal
from the quic_conn tree.

This issue can be reproduced by modifying H3/QPACK code to return an
early error during HEADERS response processing.

This must be backported up to 2.6, after a period of observation.
2024-08-07 18:14:29 +02:00
Willy Tarreau
d5da87b5dc MINOR: mux-h3/trace: add a state trace on stream creation/destruction
Logging below the developer level doesn't always yield very convenient
traces as we don't know well where streams are allocated nor released.
Let's just make that more explicit by using state-level traces for these
important steps.
2024-08-07 16:02:59 +02:00
Willy Tarreau
23417ab9d4 MINOR: mux-h2/trace: add a state trace on stream creation/destruction
Logging below the developer level doesn't always yield very convenient
traces as we don't know well where streams are allocated nor released.
Let's just make that more explicit by using state-level traces for these
important steps.
2024-08-07 16:02:59 +02:00
Willy Tarreau
cc12d1b253 MINOR: mux-h1/trace: add a state trace on stream creation/upgrade
Logging below the developer level doesn't always yield very convenient
traces as we don't know well where streams are allocated nor released.
Let's just make that more explicit by using state-level traces. Note that
h1s destruction was already logged as closing connection or switching
to idle mode.
2024-08-07 16:02:59 +02:00
Willy Tarreau
6191de6aa6 MINOR: mux-quic: add a trace context filling helper
This helper is able to find a connection, a session, a stream, or a
frontend from its args.
2024-08-07 16:02:59 +02:00
Willy Tarreau
b2cede590b MINOR: mux-quic: don't leave dangling pointer after freeing qcs->sd
In qcs_free() we're calling a few other functions after releasing
qcs->sd. None of them make use of it for now but with traces that
will change. Make sure to clear qcs->sd after releasing it.
2024-08-07 16:02:59 +02:00
Willy Tarreau
adfe0a30e1 MINOR: mux-h1: add a trace context filling helper
This helper is able to find a connection, a session, a stream, a
frontend or a backend from its args.
2024-08-07 16:02:59 +02:00
Willy Tarreau
6c6ef5ae12 MINOR: mux-h2: add a trace context filling helper
This helper is able to find a connection, a session, a stream, a
frontend or a backend from its args.

Note that this required to always make sure that h2s->sess is reset on
allocation because it's normally initialized later for backend streams,
and producing traces between the two could pre-fill a bad pointer in
the trace_ctx.
2024-08-07 16:02:59 +02:00
Willy Tarreau
10c8baca44 MINOR: trace: add a per-source helper to pre-fill the context
Now sources which want to do it can provide a helper that can pre-fill
some fields in the context based on their knowledge (e.g. mux streams).
2024-08-07 16:02:59 +02:00
Willy Tarreau
7d55a70f5a MINOR: trace: move the known trace context into a dedicated struct
We now have a trace_ctx to hold the sess, conn, qc, stream and so on.
This will allow us to pass it across layers so that other helpers can
help fill them.

Ideally it should be passed as an argument to __trace_enabled() by
__trace() so that it can be passed back to the trace callback. But
it seems that trace callbacks are smart enough to figure all their
info when they need them.
2024-08-07 16:02:59 +02:00
Willy Tarreau
d465610ec3 MEDIUM: trace: implement a "follow" mechanism
With "follow" from one source to another, it becomes possible for a
source to automatically follow another source's tracked pointer. The
best example is the session:
  - the "session" source is enabled and has a "lockon session"
    -> its lockon_ptr is equal to the session when valid
  - other sources (h1,h2,h3 etc) are configured for "follow session"
    and will then automatically check if session's lockon_ptr matches
    its own session, in which case tracing will be enabled for that
    trace (no state change).

It's not necessary to start/pause/stop traces when using this, only
"follow" followed by a source with lockon enabled is needed. Some
combinations might work better than others. At the moment the session
is almost never known from the backend, but this may improve.

The meta-source "all" is supported for the follower so that all sources
will follow the tracked one.
2024-08-07 16:02:59 +02:00
Willy Tarreau
abb07af67e MINOR: session/trace: enable very minimal session tracing
By having traces at the session level, it becomes possible to start
traces on session creation and pause them on session end. Doing so
will soon open new possibilities to synchronize multiple traces.
2024-08-07 16:02:59 +02:00
Willy Tarreau
d2a49de9c7 MINOR: trace: support setting the sink and level for all sources at once
It's extremely painful to have to set "trace <src> sink buf1" for all
sources, then to do the same for "level developer" (for example). Let's
have a possibility via a meta-source "all" to apply the change to all
sources at once. This currently supports level and sink, which are not
dependent on the source, this is a good start.
2024-08-07 16:02:59 +02:00
Willy Tarreau
6bf50dfccc BUG/MINOR: quic/trace: make quic_conn_enc_level_init() emit NEW not CLOSE
The event emitted by this trace was of type CLOSE instead of NEW, which
would sometimes temporarily pause a started trace.

This can be backported to 3.0, probably 2.6.
2024-08-07 16:02:59 +02:00
Willy Tarreau
7a22fbd453 BUG/MINOR: trace/quic: make "qconn" selectable as a lockon criterion
The test was performed but there's no way to set the option! Let's
just add "qconn" to select the quic conn when the source supports it.

This can be backported at least to 3.0, probably 2.6.
2024-08-07 16:02:59 +02:00
Willy Tarreau
0406efe9ad BUG/MINOR: trace: automatically start in waiting mode with "start <evt>"
The doc clearly says that "start <evt>" should leave the trace in pause
mode until the indicated event appears. However it's not what's happening,
the state is not changed until one command uses "now", so it's typically
needed to configure the events with "start <evt>" then enable the waiting
mode using "pause now". This is counter-intuitive and does not match the
doc, so let's fix it so that "start <evt>" switches from stopped to waiting
as long as at least one event is enabled.

This can be backported to all versions.
2024-08-07 16:02:59 +02:00
Willy Tarreau
b5df6b5a31 BUG/MEDIUM: trace: fix null deref in lockon mechanism since TRACE_ENABLED()
When calling TRACE_ENABLED(), which is called by TRACE_PRINTF(), we pass
a NULL plockptr to __trace_enabled(). This argument is used when lockon
is active, and may update the pointer. This is an overlook which also
broke the lockon mechanism because now for calls from __trace(), it
dereferences a pointer pointing to NULL, and never updates it due to the
broken condition, so that trace() never sets up src->lockon_ptr.

The bug was introduced in 2.8 by commit 8f9a9704bb ("MINOR: trace: add a
TRACE_ENABLED() macro to determine if a trace is active"), so the fix must
be backported there.
2024-08-07 16:02:59 +02:00
Willy Tarreau
88a752ca78 BUG/MINOR: trace/quic: permit to lock on frontend/connect/session etc
These ones were not proposed in the list of trackable elements. Note
that this depends on previous commit:

    BUG/MINOR: trace/quic: enable conn/session pointer recovery from quic_conn

This should be backported to at least 3.0, maybe even 2.6.
2024-08-07 16:02:59 +02:00
Willy Tarreau
aa1915a9f5 BUG/MINOR: trace/quic: enable conn/session pointer recovery from quic_conn
In __trace_enabled(), a quic_conn was detected, but it was not possible
to derive the connection nor the session from it, which was quite limiting
in terms of ability to track a same instance.

This should be backported to at least 3.0, maybe even 2.6.
2024-08-07 16:02:59 +02:00
Amaury Denoyelle
9f829ea3f3 MINOR: mux-quic: measure QCS lifetime and its blocking state
Reuse the newly defined tot_time structure to measure various values related
to a QCS lifetime.

First, a timer is used to account for the total QCS lifetime. Then, two
other timers are used to account for the total time during which Tx from the
stream layer to the MUX is blocked, either due to a lack of buffer or due to
flow control.

These three timers are reported in qmux_dump_qcs_info(). Thus, they are
available in traces and for the QUIC MUX debug string sample.
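
The pattern behind those timers is a simple accumulating stopwatch; a toy
sketch (not the actual tot_time implementation), assuming the caller passes
a monotonic millisecond timestamp:

  #include <stdint.h>

  /* toy accumulating timer: total time spent between start() and stop() */
  struct acc_timer {
          uint64_t total;   /* accumulated time, in ms */
          uint64_t start;   /* timestamp of the current period */
          int running;      /* 1 while a period is being measured */
  };

  static void acc_start(struct acc_timer *t, uint64_t now_ms)
  {
          if (!t->running) {
                  t->start = now_ms;
                  t->running = 1;
          }
  }

  static void acc_stop(struct acc_timer *t, uint64_t now_ms)
  {
          if (t->running) {
                  t->total += now_ms - t->start;
                  t->running = 0;
          }
  }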
2024-08-07 15:40:52 +02:00
Amaury Denoyelle
663416b4ef MINOR: quic: dump quic_conn debug string for logs
Define a new xprt_ops callback named dump_info. This can be used to
extend the MUX debug string with info from the lower layer.

Implement dump_info for the QUIC stack. For now, only minimal info is
reported : bytes in flight and the size of the sending window. This should
allow detecting whether the congestion controller is fine. This info is
reported via the QUIC MUX debug string sample.
2024-08-07 15:40:52 +02:00
Amaury Denoyelle
630fa53c51 MINOR: mux-quic: implement debug string for logs
Implement MUX_SCTL_DBG_STR for QUIC MUX. This returns info for the
current QCS and QCC instances, reusing qmux_dump_qc{c,s}_info functions
already used for traces, and the connection flags.

This stream operation is useful for debug string sample support.
2024-08-07 15:40:52 +02:00
Amaury Denoyelle
eb4dfa3b36 MINOR: mux-quic: define dump functions for QCC and QCS
Extract trace code to dump QCC and QCS instances into dedicated
functions named qmux_dump_qc{c,s}_info(). This will allow to easily
print QCC/QCS infos outside of traces.
2024-08-07 15:40:52 +02:00
Willy Tarreau
490cb16d3a MINOR: mux-h2: implement the debug string for logs
Now it is possible to have this for a frontend and a backend:

<134>Jul 30 19:32:53 haproxy[24405]: 127.0.0.1:64860 [30/Jul/2024:19:32:53.732] test2 test2/s1 0/0/0/0/0 200 130 - - ---- 2/1/0/0/0 0/0 "GET /blah HTTP/2.0"  h2s.id=1 .st=CLO .flg=0x7003 .rxbuf=0@(nil)+0/0 .sc=0x1e03fb0(.flg=0x00034482 .app=0x1e04020) .sd=0x1e03f30(.flg=0x50405601) .subs=(nil) h2c.st0=FRH .err=0 .maxid=1 .lastid=-1 .flg=0x100e00 .nbst=0 .nbsc=1, .glitches=0 .fctl_cnt=0 .send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=1 .dsi=1 .dbuf=0@(nil)+0/0 .mbuf=[1..1|32],h=[0@(nil)+0/0],t=[0@(nil)+0/0] .task=(nil) conn.flg=0x80000300
<134>Jul 30 19:32:53 haproxy[24405]: 127.0.0.1:65246 [30/Jul/2024:19:32:53.732] test1 test1/s1 0/0/0/0/0 200 130 - - ---- 2/1/0/0/0 0/0 "GET /blah HTTP/1.1"  h2s.id=1 .st=CLO .flg=0x7003 .rxbuf=0@(nil)+0/0 .sc=0x1dfc7b0(.flg=0x0006d01b .app=0x1c65fe0) .sd=0x1dfc820(.flg=0x1040ca01) .subs=(nil) h2c.st0=FRH .err=0 .maxid=1 .lastid=-1 .flg=0x108e00 .nbst=0 .nbsc=1, .glitches=0 .fctl_cnt=0 .send_cnt=0 .tree_cnt=1 .orph_cnt=0 .sub=1 .dsi=1 .dbuf=0@(nil)+0/0 .mbuf=[1..1|32],h=[0@(nil)+0/0],t=[0@(nil)+0/0] .task=(nil) conn.flg=0x000300

Just with this in the front and back proxies respectively:
  log-format "$HAPROXY_HTTP_LOG_FMT %[bs.debug_str(15)]"
  log-format "$HAPROXY_HTTP_LOG_FMT %[fs.debug_str(15)]"

For now the mux only implements muxs, muxc, conn. Xprt is ignored.
2024-08-07 14:07:41 +02:00
Willy Tarreau
921e04bf87 MINOR: stconn: add a new pair of sf functions {bs,fs}.debug_str
These are passed to the underlying mux to retrieve debug information
at the mux level (stream/connection) as a string that's meant to be
added to logs.

The API is quite complex just because we can't pass any info to the
bottom function. So we construct a union and pass the argument as an
int, and expect the callee to fill that with its buffer in return.

Most likely the mux->ctl and ->sctl API should be reworked before
the release to simplify this.

The functions take an optional argument that is a bit mask of the
layers to dump:
  muxs=1
  muxc=2
  xprt=4
  conn=8
  sock=16

The default (0) logs everything available.
2024-08-07 14:07:41 +02:00
Amaury Denoyelle
b2282082dd MINOR: quic: enforce ACK reception is handled in order
Add a new BUG_ON() in qc_stream_desc_ack(). It ensures that
acknowledgements are always notified in order. This is because out-of-order
ACKs cannot be handled by the qc_stream_desc layer, which does not support
gaps in sent STREAM data.

Prior to this fix, out-of-order ACKs were simply ignored without any
error. This currently cannot happen thanks to careful
qc_stream_desc_ack() invocation. If this assumption is broken in the
future by inattention, this would cause a loss of ACK notifications which
would prevent qc_stream_desc release.
2024-08-07 11:08:20 +02:00
Amaury Denoyelle
e177cf341c BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM
STREAM frames have dedicated handling on retransmission. A special check
is done to remove data already acked in case of duplicated frames, thus
only unacked data is retransmitted.

This handling is faulty in the case of an empty STREAM frame with FIN set.
On retransmission, this frame does not cover any unacked range as it is
empty and is thus discarded. This may cause the transfer to freeze with
the client waiting indefinitely for the FIN notification.

To handle retransmission of empty FIN STREAM frames, the qc_stream_desc
layer has been extended. A new flag QC_SD_FL_WAIT_FOR_FIN is set by the
QUIC MUX when the FIN has been transmitted. If set, it prevents the
qc_stream_desc from being freed until the FIN is acknowledged. On the
retransmission side, qc_stream_frm_is_acked() has been updated. It now
reports false if the FIN bit is set on the frame and the qc_stream_desc
has QC_SD_FL_WAIT_FOR_FIN set.

This must be backported up to 2.6. However, this modifies heavily
critical section for ACK handling and retransmission. As such, it must
be backported only after a period of observation.

This issue can be reproduced by using the following socat command as
server to add delay between the response and connection closure :
  $ socat TCP-LISTEN:<port>,fork,reuseaddr,crlf SYSTEM:'echo "HTTP/1.1 200 OK"; echo ""; sleep 1;'

On the client side, ngtcp2 can be used to simulate packet drop. Without
this patch, connection will be interrupted on QUIC idle timeout or
haproxy client timeout with ERR_DRAINING on ngtcp2 :
  $ ngtcp2-client --exit-on-all-streams-close -r 0.3 <host> <port> "http://<host>:<port>/?s=32o"

Alternatively to ngtcp2 random loss, an extra haproxy patch can also be
used to force skipping the emission of the empty STREAM frame :

diff --git a/include/haproxy/quic_tx-t.h b/include/haproxy/quic_tx-t.h
index efbdfe687..1ff899acd 100644
--- a/include/haproxy/quic_tx-t.h
+++ b/include/haproxy/quic_tx-t.h
@@ -26,6 +26,8 @@ extern struct pool_head *pool_head_quic_cc_buf;
 /* Flag a sent packet as being probing with old data */
 #define QUIC_FL_TX_PACKET_PROBE_WITH_OLD_DATA (1UL << 5)

+#define QUIC_FL_TX_PACKET_SKIP_SENDTO (1UL << 6)
+
 /* Structure to store enough information about TX QUIC packets. */
 struct quic_tx_packet {
 	/* List entry point. */
diff --git a/src/quic_tx.c b/src/quic_tx.c
index 2f199ac3c..2702fc9b9 100644
--- a/src/quic_tx.c
+++ b/src/quic_tx.c
@@ -318,7 +318,7 @@ static int qc_send_ppkts(struct buffer *buf, struct ssl_sock_ctx *ctx)
 		tmpbuf.size = tmpbuf.data = dglen;

 		TRACE_PROTO("TX dgram", QUIC_EV_CONN_SPPKTS, qc);
-		if (!skip_sendto) {
+		if (!skip_sendto && !(first_pkt->flags & QUIC_FL_TX_PACKET_SKIP_SENDTO)) {
 			int ret = qc_snd_buf(qc, &tmpbuf, tmpbuf.data, 0, gso);
 			if (ret < 0) {
 				if (gso && ret == -EIO) {
@@ -354,6 +354,7 @@ static int qc_send_ppkts(struct buffer *buf, struct ssl_sock_ctx *ctx)
 					qc->cntrs.sent_bytes_gso += ret;
 			}
 		}
+		first_pkt->flags &= ~QUIC_FL_TX_PACKET_SKIP_SENDTO;

 		b_del(buf, dglen + QUIC_DGRAM_HEADLEN);
 		qc->bytes.tx += tmpbuf.data;
@@ -2066,6 +2067,17 @@ static int qc_do_build_pkt(unsigned char *pos, const unsigned char *end,
 				continue;
 			}

+			switch (cf->type) {
+			case QUIC_FT_STREAM_8 ... QUIC_FT_STREAM_F:
+				if (!cf->stream.len && (qc->flags & QUIC_FL_CONN_TX_MUX_CONTEXT)) {
+					TRACE_USER("artificially drop packet with empty STREAM frame", QUIC_EV_CONN_TXPKT, qc);
+					pkt->flags |= QUIC_FL_TX_PACKET_SKIP_SENDTO;
+				}
+				break;
+			default:
+				break;
+			}
+
 			quic_tx_packet_refinc(pkt);
 			cf->pkt = pkt;
 		}
2024-08-07 11:03:32 +02:00
Amaury Denoyelle
714009b7bc MINOR: quic: implement function to check if STREAM is fully acked
When a STREAM frame is retransmitted, a check is performed to remove the
range of data already acked from it. This is useful when STREAM frames
are duplicated and split to cover different data ranges. The newly
retransmitted frame contains only unacked data.

This process is performed similarly in qc_dup_pkt_frms() and
qc_build_frms(). Refactor the code into a new function named
qc_stream_frm_is_acked(). It returns true if the frame data is already
fully acked and retransmission can be avoided. If only a partial range
of data is acknowledged, the frame content is updated to only cover the
unacked data.

This patch does not introduce any functional change. However, it simplifies
retransmission for STREAM frames. Also, it will be reused to fix
retransmission for empty STREAM frames with FIN set in the following
patch :
  BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM

As such, it must be backported prior to it.
2024-08-07 10:57:10 +02:00
Amaury Denoyelle
bb9ac256a1 MINOR: quic: convert qc_stream_desc release field to flags
qc_stream_desc had a field <release> used as a boolean. Convert it into a
new <flags> field, with QC_SD_FL_RELEASE as the equivalent value.

The purpose of this patch is to be able to extend qc_stream_desc by
adding new flag values. This patch is required for the following
patch
  BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM

As such, it must be backported prior to it.
2024-08-06 18:00:17 +02:00
Aurelien DARRAGON
8f1fd96d17 BUG/MEDIUM: server/addr: fix tune.events.max-events-at-once event miss and leak
An issue has been introduced with cd99440 ("BUG/MAJOR: server/addr: fix
a race during server addr:svc_port updates").

Indeed, in the above commit we implemented the atomic_sync task which is
responsible for consuming pending server events to apply the changes
atomically. For now only server's addr updates are concerned.

To prevent the task from causing contention, a budget was assigned to it.
It can be controlled with the global tunable
'tune.events.max-events-at-once': the task may not process more than this
number of events at once.

However, a bug was introduced with this budget logic: each time the task
has to be interrupted because it runs out of budget, we reschedule the
task to finish where it left off, but the current event which was already
removed from the queue wasn't processed yet. This means that this pending
event (each tune.events.max-events-at-once) is effectively lost.

When the atomic_sync task deals with large number of concurrent events,
this bug has 2 known consequences: first a server's addr/port update
will be lost every 'tune.events.max-events-at-once'. This can of course
cause reliability issues because if the event is not republished
periodically, the server could stay in a stale state for an indefinite amount
of time. This is the case when the DNS server flaps for instance: some
servers may not come back UP after the incident as described in GH #2666.

Another issue is that the lost event was not cleaned up, resulting in a
small memory leak. So in the end, it means that the bug is likely to
cause more and more degradation over time until haproxy is restarted.

As a workaround, 'tune.events.max-events-at-once' may be set to the
maximum number of events expected per batch. Note however that this value
cannot exceed 10 000, otherwise it could cause the watchdog to trigger due
to the task being busy for too long and preventing other threads from
making any progress. Setting higher values may not be optimal for common
workloads so it should only be used to mitigate the bug while waiting for
this fix.

Since tune.events.max-events-at-once defaults to 100, this bug only
affects configs that involve more than 100 servers whose addr:port
properties are likely to be updated at the same time (batched updates
from cli, lua, dns..)

To fix the bug, we move the budget check after the current event is fully
handled. For that we went from a basic 'while' to 'do..while' loop as we
assume from the config that 'tune.events.max-events-at-once' cannot be 0.
While at it, we reschedule the task once thread isolation ends (it was not
required to perform the reschedule while under isolation) to give the hand
back faster to waiting threads.

This patch should be backported up to 2.9 with cd99440. It should fix
GH #2666.
2024-08-06 16:41:37 +02:00
Ilia Shipitsin
aaaacaaf4b BUG/MINOR: fcgi-app: handle a possible strdup() failure
This defect was found by the coccinelle script "unchecked-strdup.cocci".
It can be backported to 2.2.
2024-08-06 08:21:49 +02:00
Frederic Lecaille
eb1a097a66 BUG/MINOR: quic: Too short datagram during packet building failures (aws-lc only)
This issue was reported by Ilya (@Chipitsine) when building haproxy against
aws-lc in GH #2663 where handshakeloss and handshakecorruption interop tests could
lead haproxy to crash after having built too short datagrams:

FATAL: bug condition "first_pkt->type == QUIC_PACKET_TYPE_INITIAL && (first_pkt->flags & (1UL << 0)) && length < 1200" matched at src/quic_tx.c:163
call trace(13):
| 0x55f4ee4dcc02 [ba d9 00 00 00 48 8d 35]: main-0x195bf2
| 0x55f4ee4e3112 [83 3d 2f 16 35 00 00 0f]: qc_send+0x11f3/0x1b5d
| 0x55f4ee4e9ab4 [85 c0 0f 85 00 f6 ff ff]: quic_conn_io_cb+0xab1/0xf1c
| 0x55f4ee6efa82 [48 c7 c0 f8 55 ff ff 64]: run_tasks_from_lists+0x173/0x9c2
| 0x55f4ee6f05d3 [8b 7d a0 29 c7 85 ff 0f]: process_runnable_tasks+0x302/0x6e6
| 0x55f4ee671bb7 [83 3d 86 72 44 00 01 0f]: run_poll_loop+0x6e/0x57b
| 0x55f4ee672367 [48 8b 1d 22 d4 1d 00 48]: main-0x48d
| 0x55f4ee6755e0 [b8 00 00 00 00 e8 08 61]: main+0x2dec/0x335d

This could happen after Handshake packet building failures which follow a successful
Initial packet in the same datagram. In this case, the datagram could be emitted
with too short a length (<1200 bytes).

To fix this, store the datagram only if the first packet is not an Initial packet
or if its length is big enough (>=1200 bytes).

Must be backported as far as 2.6.
2024-08-05 13:40:51 +02:00
Frederic Lecaille
e12620a8a9 BUG/MINOR: quic: Too short datagram during O-RTT handshakes (aws-lc only)
By "aws-lc only", one means that this bug was first revealed by aws-lc stack.
This does not mean it will not appeared for new versions of other TLS stacks which
have never revealed this bug.

This bug was reported by Ilya (@chipitsine) in GH #2657 where some QUIC interop
tests (resumption, zerortt) could lead to crash with haproxy compiled against
aws-lc TLS stack. These crashed were triggered by this BUG_ON() which detects
that too short datagrams with at least one ack-eliciting Initial packet inside
could be built.

  <0>2024-07-31T15:13:42.562717+02:00 [01|quic|5|quic_tx.c:739] qc_prep_pkts():
  next encryption level : qc@0x61d000041080 idle_timer_task@0x60d000006b80 flags=0x6000058

  FATAL: bug condition "first_pkt->type == QUIC_PACKET_TYPE_INITIAL && (first_pkt->flags & (1UL << 0)) && length < 1200" matched at src/quic_tx.c:163
  call trace(12):
  | 0x563ea447bc02 [ba d9 00 00 00 48 8d 35]: main-0x1958ce
  | 0x563ea4482703 [e9 73 fe ff ff ba 03 00]: qc_send+0x17e4/0x1b5d
  | 0x563ea4488ab4 [85 c0 0f 85 00 f6 ff ff]: quic_conn_io_cb+0xab1/0xf1c
  | 0x563ea468e6f9 [48 c7 c0 f8 55 ff ff 64]: run_tasks_from_lists+0x173/0x9c2
  | 0x563ea468f24a [8b 7d a0 29 c7 85 ff 0f]: process_runnable_tasks+0x302/0x6e6
  | 0x563ea4610893 [83 3d aa 65 44 00 01 0f]: run_poll_loop+0x6e/0x57b
  | 0x563ea4611043 [48 8b 1d 46 c7 1d 00 48]: main-0x48d
  | 0x7f64d05fb609 [64 48 89 04 25 30 06 00]: libpthread:+0x8609
  | 0x7f64d0520353 [48 89 c7 b8 3c 00 00 00]: libc:clone+0x43/0x5e

That said, everything was correctly done by qc_prep_pkts() to prevent such a case.
But this relied on the hypothesis that the list of encryption levels it used
was always built in the same order as follows for 0-RTT sessions:

    initial, early-data, handshake, application

But this order is determined by the order in which the TLS stack derives the
secrets for these encryption levels. For aws-lc, this order is not the same but
as follows:

During 0-RTT sessions, the server may have to build three ack-eliciting packets
(with CRYPTO data inside) to reply to the first client packet: initial, handshake,
application. qc_prep_pkts() adds a PADDING frame to the last built packet
for the last encryption level in the list. But after the application encryption
level, there is the early-data encryption level. This prevented qc_prep_pkts()
from building a padded application level last packet to send a 1200-byte datagram.

To fix this, always insert the early-data encryption level after the initial
encryption level into the encryption levels list when initializing this encryption
level from quic_conn_enc_level_init().

Must be backported as far as 2.9.
2024-08-02 15:25:26 +02:00
Christopher Faulet
78b8b60030 BUG/MEDIUM: peer: Notify the applet won't consume data when it waits for sync
When the peer applet is waiting for a synchronisation with the global sync
task, we must notify that it won't consume data. Otherwise, if some data are
already waiting in the input buffer, the applet will be woken up in a loop and
this will trigger the watchdog. Once synchronized, the applet is woken up. In
that case, the peer applet must indicate it is going to consume data again.
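
For illustration, a simplified standalone version of that hint (the helper
names mimic haproxy's applet API but are redefined here as stand-ins):

  #include <stdbool.h>

  struct appctx { bool wants_input; };

  /* stand-ins for the applet wake-up hints */
  static void applet_wont_consume(struct appctx *a) { a->wants_input = false; }
  static void applet_will_consume(struct appctx *a) { a->wants_input = true; }

  /* While waiting for the global resync, stop being woken up for input
   * data that will not be read; resume consumption once synchronized.
   */
  static void peer_sync_consume_hint(struct appctx *appctx, bool waiting_for_resync)
  {
          if (waiting_for_resync)
                  applet_wont_consume(appctx);
          else
                  applet_will_consume(appctx);
  }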

This patch should fix the issue #2656. It must be backported to 3.0.
2024-08-02 08:42:29 +02:00
Christopher Faulet
184f16ded7 BUG/MEDIUM: mux-h2: Propagate term flags to SE on error in h2s_wake_one_stream
When a stream is explicitly woken up by the H2 connection, if an error
condition is detected, the corresponding error flag is set on the SE: either
SE_FL_ERROR or SE_FL_ERR_PENDING, depending on whether the end of stream was
reported or not.

However, there is no attempt to propagate other termination flags. We must
be sure to properly set SE_FL_EOI and SE_FL_EOS when appropriate to be able
to switch a pending error to a fatal error.
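
A standalone sketch of that propagation with simplified flags (the real code
uses the SE_FL_* flags and the se_fl_set() helpers on the stream-endpoint
descriptor):

  #include <stdint.h>

  #define SK_ERR_PENDING  0x01u   /* stand-in for SE_FL_ERR_PENDING */
  #define SK_ERROR        0x02u   /* stand-in for SE_FL_ERROR */
  #define SK_EOI          0x04u   /* stand-in for SE_FL_EOI */
  #define SK_EOS          0x08u   /* stand-in for SE_FL_EOS */

  /* On error, also propagate end-of-input/end-of-stream when known, so
   * that a pending error can be switched to a fatal one.
   */
  static void propagate_term_flags(uint32_t *flags, int eoi, int eos)
  {
          if (eoi)
                  *flags |= SK_EOI;
          if (eos)
                  *flags |= SK_EOS;
          if ((*flags & SK_ERR_PENDING) && (*flags & SK_EOS))
                  *flags = (*flags & ~SK_ERR_PENDING) | SK_ERROR;
  }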

Because of this bug, the SE remains with a pending error and no end of
stream, preventing the applicative stream from truly aborting it. This means
that in some abort scenarios, it is possible to block a stream infinitely.

This patch must be backported at least as far as 2.8. No bug was observed on
older versions while the same code is in use.
2024-08-02 08:42:28 +02:00
Christopher Faulet
6743e128f3 BUG/MEDIUM: h2: Only report early HTX EOM for tunneled streams
For regular H2 messages, the HTX EOM flag is synonymous with the end of input.
So the SE_FL_EOI flag must also be set on the stream-endpoint descriptor.
However, there is an exception. For tunneled streams, the end of message is
reported on the HTX message just after the headers. But in that case, no end
of input is reported on the SE.

But here, there is a bug. The "early" EOM is also reported on HTX messages
when there is no payload (for instance a content-length set to 0). If there
is no ES flag on the H2 HEADERS frame, it is an unexpected case. Because for
the applicative stream, and most probably for the opposite endpoint, the
message is considered as finished. It is switched to its DONE state (or the
equivalent on the endpoint). But, if an extra H2 frame with the ES flag is
received, a TRAILERS frame or an empty DATA frame, an extra EOT HTX block is
pushed to carry the HTX EOM flag. So an extra HTX block is emitted for a
regular HTX message. It is totally invalid, it must never happen.

Because it is an undefined behavior, it is difficult to predict the result.
But it definitely prevents the applicative stream from properly handling aborts
and errors because data remain blocked in the channel buffer. Indeed, the
end of the message was seen, so no more data are forwarded.

It seems to be an issue for 2.8 and above. It is harder to evaluate for older
versions.

This patch must be backported as far as 2.4.
2024-08-02 08:42:28 +02:00
Christopher Faulet
0ba6202796 BUG/MEDIUM: http-ana: Report error on write error waiting for the response
When we are waiting for the server response, if an error is pending on the
frontend side (a write error on the client), it is handled as an abort and all
regular response analyzers are removed, except the one responsible for
releasing the filters, if any. However, while it is handled as an abort, the
error is not reported, as usual, via the http_reply_and_close() function. It
is an issue because, in that case, the channel buffers are not reset.

Because of this bug, it is possible to block a stream infinitely. The
request side is waiting for the response side and the response side is
blocked because filters must be released and this cannot be done because
data remain blocked in the channel buffers.

So, in that case, calling http_reply_and_close() with no message is enough
to unblock the stream.

This patch must be backported as far as 2.8.
2024-08-02 08:42:28 +02:00
Amaury Denoyelle
7a5a30d28a BUG/MINOR: h2: reject extended connect for h2c protocol
This commit prevents forwarding of an HTTP/2 Extended CONNECT when the "h2c"
or "h2" token is set as the targeted protocol. Contrary to the previous
commit which deals with the HTTP/1 mux, this time the request is rejected
and a RESET_STREAM is reported to the client.

This must be backported up to 2.4 after a period of observation.
2024-08-01 18:23:44 +02:00
Amaury Denoyelle
7b89aa5b19 BUG/MINOR: h1: do not forward h2c upgrade header token
haproxy supports tunnel establishment through HTTP Upgrade mechanism.
Since the following commit, extended CONNECT is also supported for
HTTP/2 both on frontend and backend side.

  commit 9bf957335e
  MEDIUM: mux_h2: generate Extended CONNECT from htx upgrade

As specified by the HTTP/2 RFC, "h2c" can be used by an HTTP/1.1 client to
request an upgrade to HTTP/2. In haproxy, this is not supported, so it is
silently ignored. However, Connection and Upgrade headers are
forwarded as-is on the backend side.

If using HTTP/1 on the backend side and the server supports this upgrade
mechanism, haproxy won't be able to parse the HTTP response. If using
HTTP/2, the backend mux incorrectly tries to convert the request to an
Extended CONNECT with the h2c protocol, which may also prevent the response
from being transmitted.

To fix this, flag an HTTP/1 request carrying the "h2c" or "h2" token in an
upgrade header. On converting the header list to HTX, the upgrade header is
skipped if any of these tokens is present and the H1_MF_CONN_UPG flag is
removed.

This issue can easily be reproduced using curl --http2 argument to
connect to an HTTP/1 frontend.

This must be backported up to 2.4 after a period of observation.
2024-08-01 18:23:32 +02:00
Amaury Denoyelle
a7a2db4ad5 BUG/MINOR: quic: fix fc_lost
Control layer callback get_info has recently been implemented for QUIC.
However, fc_lost always returned 0. This is because quic_get_info() does
not use the correct input argument value to identify the lost counter.

This does not need to be backported.
2024-08-01 11:35:27 +02:00
Amaury Denoyelle
522c3bea2c BUG/MINOR: quic: fix fc_rtt/srtt values
QUIC has recently implemented the get_info callback to return RTT/sRTT values.
However, it uses milliseconds, contrary to TCP which uses microseconds.
This causes smp fetch functions to return invalid values. Fix this by
converting QUIC values to microseconds.

This does not need to be backported.
2024-08-01 11:35:27 +02:00
Frederic Lecaille
f7f76b8b0d MINOR: quic: Define ->get_info() control layer callback for QUIC
This low level callback may be called by several sample fetches for
frontend connections like "fc_rtt", "fc_rttvar" etc.
Define this callback for the QUIC protocol as a pointer to quic_get_info().
The latter supports these sample fetches:
   "fc_lost", "fc_reordering", "fc_rtt" and "fc_rttvar".

Update the documentation consequently.
2024-07-31 10:29:42 +02:00
Frederic Lecaille
1733dff42a MINOR: tcp_sample: Move TCP low level sample fetch function to control layer
Add a new ->get_info() control layer callback definition to the protocol struct
to retrieve statistical counter information at the transport layer (TCPv4/TCPv6),
identified by an integer, into a long long int.
Move the TCP specific code from get_tcp_info() to the tcp_get_info() control layer
function (src/proto_tcp.c) and define it as the ->get_info() callback for
TCPv4 and TCPv6.
Note that get_tcp_info() is called for several TCP sample fetches.
This patch is useful to support some of these sample fetches for QUIC and to
keep the code simple and easy to maintain.
2024-07-31 10:29:42 +02:00
Amaury Denoyelle
bba6baff30 BUG/MEDIUM: quic: prevent conn freeze on 0RTT undeciphered content
Received QUIC packets are stored in quic_conn Rx buffer after header
protection removal in qc_rx_pkt_handle(). These packets are then removed
after quic_conn IO handler via qc_treat_rx_pkts().

If HP cannot be removed, packets are still copied into quic_conn Rx
buffer. This can happen if encryption level TLS keys are not yet
available. The packet remains in the buffer until HP can be removed and
its content processed.

An issue occurs if client emits a 0-RTT packet but haproxy does not have
the shared secret, for example after a haproxy process restart. In this
case, the packet is copied in quic_conn Rx buffer but its HP won't ever
be removed. This prevents the buffer from being purged. After some time, if
the client has emitted enough packets, Rx buffer won't have any space
left and received packets are dropped. This will cause the connection to
freeze.

To fix this, remove any 0-RTT buffered packets on handshake completion.
At this stage, 0-RTT packets are unnecessary anymore. The client is
expected to reemit its content in 1-RTT packets which are properly
deciphered.

This can easily be reproduced with HTTP/3 POST requests or by retrieving a big
enough object, which will fill the Rx buffer with ACK frames. Here is a
picoquic command to provoke the issue on haproxy startup:

$ picoquicdemo -Q -v 00000001 -a h3 <hostname> 20443 "/?s=1g"

Note that allow-0rtt must be present on the bind line to trigger the
issue. Otherwise, haproxy will reject any 0-RTT packets.

This must be backported up to 2.6.

This could be one of the reasons for GitHub issue #2549, but it's unsure
for now.
2024-07-31 10:24:53 +02:00
William Lallemand
f76e8e50f4 BUILD: ssl: replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC
Replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC in the source code, so we
won't need to set USE_OPENSSL_AWSLC in the Makefile in the long term.
2024-07-30 18:53:08 +02:00
William Lallemand
1889b86561 BUG/MEDIUM: ssl: 0-RTT initialized at the wrong place for AWS-LC
Revert patch fcc8255 "MINOR: ssl_sock: Early data disabled during
SSL_CTX switching (aws-lc)". The patch was done in the wrong callback
which is never built for AWS-LC, and applies options on the SSL_CTX
instead of the SSL, which should never be done elsewhere than in the
configuration parsing.

This was probably triggered by successfully linking haproxy against
AWS-LC without using USE_OPENSSL_AWSLC.

The patch also reintroduced SSL_CTX_set_early_data_enabled() in
ssl_quic_initial_ctx() and ssl_sock_initial_ctx(). So the initial_ctx
does have the right setting, but it still needs to be applied to the
SSL_CTX selected in the clienthello callback.

Must be backported to 3.0. (ssl_clienthello.c part was in ssl_sock.c)
2024-07-30 18:53:08 +02:00
Willy Tarreau
376b147fff BUG/MINOR: stconn: bs.id and fs.id had their dependencies incorrect
The backend depends on the response and the frontend on the request, not
the other way around. In addition, they used to depend on L6 (hence
contents in the channel buffers) while they should only depend on L5
(permanent info known in the mux).

This came in 2.9 with commit 24059615a7 ("MINOR: Add sample fetches to
get the frontend and backend stream ID") so this can be backported there.

(cherry picked from commit 61dd0156c82ea051779e6524cad403871c31fc5a)
Signed-off-by: Willy Tarreau <w@1wt.eu>
2024-07-30 18:39:29 +02:00
Christopher Faulet
d9f41b1d6e BUILD: mux-pt: Use the right name for the sedesc variable
A typo was introduced in 760d26a86 ("BUG/MEDIUM: mux-pt/mux-h1: Release the
pipe on connection error on sending path"). The sedesc variable is 'sd', not
'se'.

This patch must be backported with the commit above.
2024-07-30 10:44:00 +02:00
Christopher Faulet
760d26a862 BUG/MEDIUM: mux-pt/mux-h1: Release the pipe on connection error on sending path
When data are sent using kernel splicing, if a connection error
occurred, the pipe must be released. Indeed, in that case, no more data can
be sent and there is no reason not to release the pipe. But it is in fact an
issue for the stream because the channel will appear as not empty. This may
prevent the stream from being released. This happens on 2.8 when a filter is
also attached to it. On 2.9 and above, there seems to be no issue. But it is
hard to be sure and the current patch remains valid in all cases. On 2.6 and
lower, the code is not the same and, AFAIK, there is no issue.

This patch must be backported to 2.8. However, on 2.8, there is no zero-copy
data forwarding, so the patch must be adapted. There are no done_ff/resume_ff
callback functions for muxes. The pipe must be released in sc_conn_send() when
an error flag is set on the SE, after the call to the snd_pipe callback
function.
2024-07-30 09:05:25 +02:00
Christopher Faulet
5dc45445ff BUG/MEDIUM: stconn: Report error on SC on send if a previous SE error was set
When a send on a connection is performed, if a SE error (or a pending error)
was already reported earlier, we leave immediately. No send is performed.
However, we must be sure to report the error at the SC level if necessary.
Indeed, the SE error may have been reported during the zero-copy data
forwarding, i.e. during a receive on the opposite side. In that case, we may
have missed the opportunity to report it at the SC level.

The patch must be backported as far as 2.8.
2024-07-30 09:05:25 +02:00
Willy Tarreau
5541d4995d BUG/MEDIUM: queue: deal with a rare TOCTOU in assign_server_and_queue()
After checking that a server or backend is full, it remains possible
to call pendconn_add() just after the last pending request finishes, so
that there's no more connection on the server for very low maxconn (typically
1), leaving new ones in the queue until the timeout.

The approach depends on where the request was queued, though:
  - when queued on a server, we can simply detect that we may dequeue
    pending requests and wake them up, it will wake our request and
    that's fine. This needs to be done in srv_redispatch_connect() when
    the server is set.

  - when queued on a backend, it means that all servers are done with
    their requests. It means that all servers were full before the
    check and all were empty after. In practice this will only concern
    configs with less servers than threads. It's where the issue was
    first spotted, and it's very hard to reproduce with more than one
    server. In this case we need to load-balance again in order to find
    a spare server (or even to fail). For this, we call the newly added
    dedicated function pendconn_must_try_again() that tells whether or
    not a blocked pending request was dequeued and needs to be retried.

This should be backported along with pendconn_must_try_again() to all
stable versions, but with extreme care because over time the queue's
locking evolved.
2024-07-29 09:27:01 +02:00
Willy Tarreau
1a8f3a368f MINOR: queue: add a function to check for TOCTOU after queueing
There's a rare TOCTOU case that happens from time to time with maxconn 1
and multiple threads. Between the moment we see the queue full and the
moment we queue a request, it's possible that the last request on the
server or proxy ended and that no other one is left to offer it its place.

Given that all this code path is performance-critical and we cannot afford
to increase the lock duration, better recheck for the condition after
queueing. For this we need to be able to check for the condition and
cleanly dequeue a request. That's what this patch provides via the new
function pendconn_must_try_again(). It will catch more requests than
strictly needed, but it will catch them all. It may find that around
1/1000 of requests are at risk, though testing shows that in practice,
it's around 1 per million that really gets stuck (the other ones benefit
from timing and from late requests finishing). Maybe in the future some
conditions might be refined but it's harmless.

What happens to such requests is that they're dequeued and their pendconn
freed, so that the caller can decide to try to LB or queue them again. For
now the function is not used, it's just added separately for easier tracking.
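
A rough standalone model of that recheck-after-queueing pattern (stand-in
types only; the real code works on the server/backend queues with
pendconn_add() and pendconn_must_try_again()):

  #include <stdbool.h>

  struct toy_queue { int served; int maxconn; bool queued; };

  /* Queue the request, then recheck whether a slot was freed in between;
   * if so, dequeue it and tell the caller to retry load-balancing instead
   * of waiting for a timeout.
   */
  static bool enqueue_then_recheck(struct toy_queue *q)
  {
          q->queued = true;                 /* like pendconn_add() */

          if (q->served < q->maxconn) {     /* TOCTOU window closed under us */
                  q->queued = false;        /* like pendconn_must_try_again() */
                  return true;              /* caller should retry the LB */
          }
          return false;                     /* stay queued, will be woken up */
  }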
2024-07-29 09:27:01 +02:00
Willy Tarreau
4316ef2eab BUILD: cfgparse-quic: fix build error on Solaris due to missing netinet/in.h
Since commit 35470d518 ("MINOR: quic: activate UDP GSO for QUIC if
supported"), Solaris build fails due to netinet/udp.h being included
without netinet/in.h. Adding it is sufficient to fix the problem. No
backport is needed.
2024-07-28 14:59:23 +02:00
Christopher Faulet
46b1fec0e9 BUG/MEDIUM: jwt: Clear SSL error queue on error when checking the signature
When the signature included in a JWT is verified, if an error occurred, one
or more SSL errors are queued and never cleared. These errors may then be
caught by the SSL stack and a fatal SSL error may be erroneously reported
during an SSL receive or send.

So we must take care to clear the SSL error queue when the signature
verification failed.
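
Illustration only (not the exact haproxy code): the OpenSSL side of such a
cleanup typically looks like this, draining the thread-local error queue after
a failed verification:

  #include <stddef.h>
  #include <openssl/err.h>
  #include <openssl/evp.h>

  static int verify_and_clear(EVP_MD_CTX *ctx, const unsigned char *sig, size_t siglen)
  {
          int ret = EVP_DigestVerifyFinal(ctx, sig, siglen);

          if (ret != 1)
                  ERR_clear_error(); /* drop errors queued by the failed check */
          return ret == 1;
  }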

This patch should fix issue #2643. It must be backported as far as 2.6.
2024-07-26 16:59:00 +02:00
Frederic Lecaille
4abaadd842 MINOR: quic: Dump TX in flight bytes vs window values ratio.
Display, per packet number space, the ratio of the number of bytes in flight
versus the current window value, as a percentage.
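
The displayed metric boils down to something like this (names are stand-ins):

  /* percentage of the congestion window currently occupied by the bytes
   * in flight of one packet number space
   */
  static unsigned int inflight_ratio_pct(unsigned long long in_flight,
                                         unsigned long long cwnd)
  {
          return cwnd ? (unsigned int)(in_flight * 100ULL / cwnd) : 0;
  }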
2024-07-26 16:42:44 +02:00
Frederic Lecaille
76ff8afa2d MINOR: quic: Add information to "show quic" for CUBIC cc.
Add a new ->state_cli() callback to the quic_cc_algo struct to define a
function called by the "show quic (cc|full)" commands to dump some information
about the congestion algorithm internal state currently in use by the QUIC
connections.

Implement this callback for CUBIC algorithm to dump its internal variables:
   - K: the time to reach the cubic curve inflexion point,
   - last_w_max: the last maximum window value reached before entering
     the last recovery period. This is also the window value at the
     inflexion point of the cubic curve,
   - wdiff: the difference between the current window value and last_w_max.
     So negative before the inflexion point, and positive after.
2024-07-26 16:42:44 +02:00
Willy Tarreau
2dab1ba84b MEDIUM: h1: allow to preserve keep-alive on T-E + C-L
In 2.5-dev9, commit 631c7e866 ("MEDIUM: h1: Force close mode for invalid
uses of T-E header") enforced a recently arrived new security rule in the
HTTP specification aiming at preventing a class of content-smuggling
attacks involving HTTP/1.0 agents. It consists in handling the very rare
T-E + C-L requests or responses in close mode.

It happens to have an impact on a rare few very old clients
(probably running insecure TLS stacks, by the way) that continue to send
both headers with their POST requests. The impact is that for each and every
request they'll have to reconnect, possibly negotiating a full TLS
handshake that becomes harmful to the machine in terms of CPU computation.

This commit adds a new option "h1-do-not-close-on-insecure-transfer-encoding"
that does exactly what it says, it just asks not to close on such messages,
even though the message continues to be sanitized and C-L dropped. It means
that the risk is only between the sender and haproxy, which is limited, and
might be the only acceptable solution for such environments having to deal
with broken implementations.
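
A hypothetical configuration sketch, assuming the keyword is used like other
h1 options in a proxy section (check the documentation for the exact
placement):

  defaults
      mode http
      # keep keep-alive for old clients sending both T-E and C-L;
      # the message is still sanitized and C-L dropped
      option h1-do-not-close-on-insecure-transfer-encoding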

The cases are so rare that it should not need to be backported, or in the
worst case, to the latest LTS if there is any demand.
2024-07-26 15:59:35 +02:00
Amaury Denoyelle
85131f91bf BUG/MEDIUM: quic: fix invalid conn reject with CONNECTION_REFUSED
quic-initial rules were implemented just recently. For some actions, a
new flags field was added in quic_dgram structure. This is used to
report the result of the rules execution.

However, this flags field was left uninitialized. Depending on its
value, it may cause the connection to be wrongly rejected via
CONNECTION_REFUSED. Fix this by properly setting the flags value to 0.

No need to backport.
2024-07-26 15:24:35 +02:00
Amaury Denoyelle
08515af9df MINOR: quic: implement send-retry quic-initial rules
Define a new quic-initial "send-retry" rule. This allows to force the
emission of a Retry packet on an initial without token instead of
instantiating a new QUIC connection.
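
A hypothetical configuration sketch (bind parameters are placeholders):

  frontend quic_in
      bind quic4@:443 ssl crt example.pem alpn h3
      # force address validation: answer tokenless Initials with a Retry
      quic-initial send-retry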
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
69d7e9f3b7 MINOR: quic: implement reject quic-initial action
Define a new quic-initial action named "reject". Contrary to dgram-drop,
the client is notified of the rejection by a CONNECTION_CLOSE with
CONNECTION_REFUSED error code.

To be able to emit the necessary CONNECTION_CLOSE frame, quic_conn is
instantiated, contrary to dgram-drop action. quic_set_connection_close()
is called immediately after qc_new_conn(), which prevents the handshake
startup.
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
f91be2657e MINOR: quic: pass quic_dgram as obj_type for quic-initial rules
To extend quic-initial rules, pass the quic_dgram instance as an argument to
the various actions. As such, quic_dgram is now supported as an obj_type
and can be used in the session origin field.
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
1259700763 MINOR: quic: support ACL for quic-initial rules
Add ACL condition support for quic-initial rules. This requires the
extension of quic_parse_quic_initial() to parse an extra if/unless
block.

Only layer4 client samples are allowed to be used with quic-initial
rules. However, due to the early execution of quic-initial rules prior
to any connection instantiation, some samples are not supported.

To be able to use the 4 described samples, a dummy session is
instantiated before quic-initial rules execution. Its src and dst fields
are set from the received datagram values.
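
A hypothetical configuration sketch using a layer4 client sample (file and
bind parameters are placeholders):

  frontend quic_in
      bind quic4@:443 ssl crt example.pem alpn h3
      # silently drop Initial datagrams coming from blocked networks
      quic-initial dgram-drop if { src -f /etc/haproxy/blocked-nets.lst }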
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
cafe596608 MEDIUM: quic: implement quic-initial rules
Implement a new set of rules labelled as quic-initial.

These rules are specific to QUIC. They are scheduled to be executed early,
on Initial packet parsing, prior to a new QUIC connection instantiation.
Contrary to tcp-request connection, this allows rejecting traffic
earlier, most notably by avoiding unnecessary QUIC SSL handshake
processing.

A new module quic_rules is created. Its main function
quic_init_exec_rules() is called on Initial packet parsing in function
quic_rx_pkt_retrieve_conn().

For the moment, only "accept" and "dgram-drop" are valid actions. Both
are final. The latter drops silently the Initial packet instead of
allocating a new QUIC connection.
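
A hypothetical configuration sketch of these first two actions (bind
parameters are placeholders; conditions only became usable with the ACL
support added separately):

  frontend quic_in
      bind quic4@:443 ssl crt example.pem alpn h3
      # final actions: either accept the Initial packet...
      quic-initial accept
      # ...or silently drop the whole datagram (no connection is allocated)
      #quic-initial dgram-drop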
2024-07-25 15:39:39 +02:00
Amaury Denoyelle
a72e82c382 MINOR: quic: delay Retry emission on quic-force-retry
Currently, quic Retry packets are emitted for two different reasons
after processing an Initial without token:
- quic-force-retry is set on the bind line
- an abnormal number of half-open connections is currently detected

Previously, these two conditions were checked separately in different
functions during datagram parsing. Uniformize this by moving the
quic-force-retry check into quic_rx_pkt_retrieve_conn() along with the second
condition check.

The purpose of this patch is to uniformize datagram parsing stages. It
is necessary to implement quic-initial rules in
quic_rx_pkt_retrieve_conn() prior to any Retry emission. This prevents
emitting an unnecessary Retry if an Initial is subject to a reject rule.
2024-07-25 15:29:50 +02:00
Aurelien DARRAGON
e328056ddc MEDIUM: sink: assume sft appctx stickiness
As mentioned in b40d804 ("MINOR: sink: add some comments about sft->appctx
usage in applet handlers"), there are few places in the code where it
looks like we assumed that the applet callbacks such as
sink_forward_session_init() or sink_forward_io_handler() could be
executing an appctx whose sft is detached from the appctx
(appctx != sft->appctx).

In practice this should not be happening since an appctx sticks to the
same thread for its entire lifetime, and the only times sft->appctx is
effectively assigned are during the session/appctx creation (in
process_sink_forward()) or release.

Thus if sft->appctx didn't point to the appctx that the sft was bound
to after appctx creation, it would probably indicate a bug rather than
an expected condition. To further emphasize that and prevent the
confusion, and since 3.1-dev4 was released, let's remove such checks and
instead add a BUG_ON to ensure this never happens.

In _sink_forward_io_handler(), the "hard_close" label was removed since
there are no more uses for it (no hard errors may be caught from the
function for now).
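
The assertion amounts to something like this (standalone sketch with a local
stand-in for haproxy's BUG_ON() macro):

  #include <stdio.h>
  #include <stdlib.h>

  #define BUG_ON_STANDIN(cond) do {                                            \
          if (cond) {                                                          \
                  fprintf(stderr, "FATAL: bug condition \"%s\" matched\n", #cond); \
                  abort();                                                     \
          }                                                                    \
  } while (0)

  struct sft { void *appctx; };

  /* an appctx is expected to stay bound to the same sft for its whole
   * lifetime; anything else is a bug, not a condition to work around
   */
  static void check_sft_binding(const struct sft *sft, void *appctx)
  {
          BUG_ON_STANDIN(sft->appctx != appctx);
  }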
2024-07-25 14:56:19 +02:00