haproxy

mirror of https://github.com/haproxy/haproxy.git synced 2026-04-28 17:49:36 -04:00

Author	SHA1	Message	Date
Willy Tarreau	eb0fe66c61	MINOR: mux-h2: create and initialize an rx offset per stream In H2, everything is accounted as budget. But if we want to moderate the rcv window that's not very convenient, and we'd rather have offsets instead so that we know where we are in the stream. Let's first add the fields to the struct and initialize them. The curr_rx_ofs indicates the position in the stream where next incoming bytes will be stored. last_adv_ofs tells what's the offset that was last advertised as the window limit, and next_max_ofs is the one that will need to be advertised, which is curr_rx_ofs plus the current window. next_max_ofs will have to cause a WINDOW_UPDATE to be emitted when it's higher than last_adv_ofs, and once the WU is sent, its value will have to be copied over last_adv_ofs. The problem is, for now wherever we emit a stream WU, we have no notion of stream (the stream might even not exist anymore, e.g. after aborting an upload), because we currently keep a counter of stream window to be acked for the current stream ID (h2c->dsi) in the connection (rcvd_s). Similarly there are a few places early in the frame header processing where rcvd_s is incremented without knowing the stream yet. Thus, lookups will be needed for that, unless such a connection-level counter remains used and poured into the stream's count once known (delicate). Thus for now this commit only creates the fields and initializes them.	2024-10-12 16:29:15 +02:00
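The offset bookkeeping described in this commit message can be sketched as follows. This is only an illustration, not the mux-h2 code: the field names come from the message, while the struct name, the helper functions, and the assumption that the initial window has already been advertised (so last_adv_ofs starts at the window size) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-stream receive offsets; field names follow the commit message. */
struct rx_ofs {
	uint64_t curr_rx_ofs;  /* stream position where the next incoming bytes go */
	uint64_t last_adv_ofs; /* offset last advertised to the peer as the window limit */
	uint64_t next_max_ofs; /* offset to advertise next: curr_rx_ofs + current window */
};

/* Assumption: the initial window is already known to the peer. */
static void rx_ofs_init(struct rx_ofs *o, uint32_t window)
{
	o->curr_rx_ofs  = 0;
	o->last_adv_ofs = window;
	o->next_max_ofs = window;
}

/* Account for <len> received bytes and recompute the limit to advertise. */
static void rx_ofs_recv(struct rx_ofs *o, uint32_t len, uint32_t window)
{
	/* the peer may not send past the last advertised limit */
	assert(o->curr_rx_ofs + len <= o->last_adv_ofs);
	o->curr_rx_ofs += len;
	o->next_max_ofs = o->curr_rx_ofs + window;
}

/* Return the WINDOW_UPDATE increment to send (0 if none is needed) and
 * copy next_max_ofs over last_adv_ofs, as if the WU had just been sent.
 */
static uint64_t rx_ofs_take_wu(struct rx_ofs *o)
{
	uint64_t inc = 0;

	if (o->next_max_ofs > o->last_adv_ofs) {
		inc = o->next_max_ofs - o->last_adv_ofs;
		o->last_adv_ofs = o->next_max_ofs;
	}
	return inc;
}
```

With offsets, the amount to acknowledge is derived from two positions instead of being carried as a separate budget counter, which is what makes it possible to moderate the advertised window.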
Willy Tarreau	560e474cdd	MINOR: mux-h2: split the amount of rx data from the amount to ack We'll need to keep track of the total amount of data received for the current stream, and the amount of data to ack for the current stream, which might soon diverge as soon as we'll have to update the stream's offset with received data, which are different from those to be ACKed. One reason is that in case a stream doesn't exist anymore (e.g. aborted an upload), the rcvd_s info might get lost after updating the stream, so we do need to have an in-connection counter for that. What's done here is that the rcvd_s count is transferred to wu_s in h2c_send_strm_wu(), to be used as the counter to send, and both are considered as sufficient when non-null to call the function.	2024-10-12 16:29:15 +02:00
Willy Tarreau	8f09bdce10	MINOR: buffer: add a buffer list type with functions The buffer ring is problematic in multiple aspects, one of which being that it is only usable by one entity. With multiplexed protocols, we need to have shared buffers used by many entities (streams and connection), and the only way to use the buffer ring model in this case is to have each entity store its own array, and keep a shared counter on allocated entries. But even with the default 32 buf and 100 streams per HTTP/2 connection, we're speaking about 32*101*32 bytes = 103424 bytes per H2 connection, just to store up to 32 shared buffers, spread randomly in these tables. Some users might want to achieve much higher than default rates over high speed links (e.g. 30-50 MB/s at 100ms), which is 3 to 5 MB storage per connection, hence 180 to 300 buffers. There it starts to cost a lot, up to 1 MB per connection, just to store buffer indexes. Instead this patch introduces a variant which we call a buffer list. That's basically just a free list encoded in an array. Each cell contains a buffer structure, a next index, and a few flags. The index could be reduced to 16 bits if needed, in order to make room for a new struct member. The design permits initializing a whole freelist at once using memset(0). The list pointer is stored at a single location (e.g. the connection) and all users (the streams) will just have indexes referencing their first and last assigned entries (head and tail). This means that with a single table we can now have all our buffers shared between multiple streams, irrelevant to the number of potential streams which would want to use them. Now the 180 to 300 entries array only costs 7.2 to 12 kB, or 80 times less. Two large functions (bl_deinit() & bl_get()) were implemented in buf.c. A basic doc was added to explain how it works.	2024-10-12 16:29:15 +02:00
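The idea of a free list encoded in an array that becomes valid after a plain memset(0) can be illustrated with the sketch below. It is not the implementation from buf.c: the fl_* names, the cell layout, and the encoding (cell 0 reserved as the list head, a stored next of 0 meaning "the following cell") are all assumptions chosen for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FL_NONE UINT32_MAX /* explicit "no next cell" marker */

/* Hypothetical cell: a buffer pointer, a next index and a few flags. */
struct fl_cell {
	void    *area;
	uint32_t next;  /* 0 = implicit "cell i+1", FL_NONE = end of list */
	uint32_t flags;
};

/* Resolve the cell following <i>; returns 0 when there is none (cell 0 is the head). */
static uint32_t fl_next(const struct fl_cell *fl, uint32_t size, uint32_t i)
{
	uint32_t n = fl[i].next;

	if (n == FL_NONE)
		return 0;
	if (n == 0)
		return (i + 1 < size) ? i + 1 : 0;
	return n;
}

/* Make <i> point to <target>, where target 0 means "nothing follows". */
static void fl_link(struct fl_cell *fl, uint32_t i, uint32_t target)
{
	fl[i].next = target ? target : FL_NONE;
}

/* A zeroed array is a complete free list: 0 -> 1 -> 2 -> ... -> size-1. */
static void fl_init(struct fl_cell *fl, uint32_t size)
{
	memset(fl, 0, size * sizeof(*fl));
}

/* Take the first free cell; returns its index, or 0 if the list is exhausted. */
static uint32_t fl_alloc(struct fl_cell *fl, uint32_t size)
{
	uint32_t idx = fl_next(fl, size, 0);

	if (!idx)
		return 0;
	fl_link(fl, 0, fl_next(fl, size, idx));
	fl[idx].next = FL_NONE; /* allocated cells start as a one-element chain */
	return idx;
}

/* Give cell <idx> back by pushing it at the front of the free list. */
static void fl_free(struct fl_cell *fl, uint32_t size, uint32_t idx)
{
	assert(idx != 0 && idx < size);
	fl_link(fl, idx, fl_next(fl, size, 0));
	fl_link(fl, 0, idx);
}
```

Because the table lives in one place, each user only needs two small indexes (head and tail of its own chain) rather than a private array of buffers, which is where the memory saving quoted in the message comes from.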
Willy Tarreau	ac66df4e2e	REORG: buffers: move some of the heavy functions from buf.h to buf.c Over time, some of the buffer management functions grew quite a bit, and were still forced to remain inlined since all defined in buf.h. Let's create buf.c and move the heaviest ones there. All those moved here were above 200 bytes.	2024-10-12 16:29:15 +02:00
Willy Tarreau	d288ddb575	CLEANUP: muxes: remove useless inclusion of ebmbtree.h Since 2.7 with commit `8522348482` ("BUG/MAJOR: conn-idle: fix hash indexing issues on idle conns"), we've been using eb64 trees and not ebmb trees anymore, and later we dropped all that to centralize the operations in the server. Let's remove the ebmbtree.h includes from the muxes that do not use them.	2024-10-12 16:29:15 +02:00
Willy Tarreau	cf3fe1eed4	MINOR: mux-h2/traces: print the size of the DATA frames DATA frames produce a special trace with the amount of transferred data in arg4, but this was not reported by h2_trace(). This commit just adds it.	2024-10-12 16:29:15 +02:00
Willy Tarreau	af064b497a	BUG/MINOR: mux-h2/traces: present the correct buffer for trailers errors traces The local "rxbuf" buffer was passed to the trace instead of h2s->rxbuf that is used when decoding trailers. The impact is essentially the impossibility to present some buffer contents in some rare cases. It may be backported but it's unlikely that anyone will ever notice the difference.	2024-10-12 16:29:15 +02:00
Willy Tarreau	0fa654ca92	BUILD: cache: silence an uninitialized warning at -Og with gcc-12.2 Building with gcc-12.2 -Og yields this incorrect warning in cache.c: In function 'release_entry_unlocked', inlined from 'http_action_store_cache' at src/cache.c:1449:4: src/cache.c:330:9: warning: 'object' may be used uninitialized [-Wmaybe-uninitialized] 330 \| release_entry(cache, entry, 1); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ src/cache.c: In function 'http_action_store_cache': src/cache.c:1200:29: note: 'object' was declared here 1200 \| struct cache_entry object, old; \| ^~~~~~ This is wrong, the only way to reach the function is with first!=NULL and the gotos that reach there are all those made with first==NULL. Let's just preset object to NULL to silence it.	2024-10-12 16:28:54 +02:00
William Lallemand	edf85a1d76	MINOR: cfgparse: simulate long configuration parsing with force-cfg-parser-pause This command is pausing the configuration parser for <timeout> milliseconds. This is useful for development or for testing timeouts of init scripts, particularly to simulate a very long reload. It requires the expose-experimental-directives to be set.	2024-10-11 17:40:37 +02:00
Amaury Denoyelle	232083c3e5	BUG/MEDIUM: mux-quic: ensure timeout server is active for short requests If a small request is received on QUIC MUX frontend, it can be transmitted directly with the FIN on attach operation. rcv_buf is skipped by the stream layer. Thus, it is necessary to ensure that there is similar behavior when FIN is reported either on attach or rcv_buf. One difference was that se_expect_data() was called only for rcv_buf but not on attach. The most obvious effect is that stream timeout was deactivated for this request: client timeout was disabled on EOI but server one not armed due to previous se_expect_no_data(). This prevents the early closure of too long requests. To fix this, add an invocation of se_expect_data() on attach operation. This bug can simply be detected using httpterm with delay request (for example /?t=10000) and using smaller client/server timeouts. The bug is present if the request is not aborted on timeout but instead continues until its proper HTTP 200 termination. This has been introduced by the following commit: `85eabfbf67` MEDIUM: mux-quic: Don't expect data from server as long as request is unfinished This must be backported up to 2.8.	2024-10-10 17:20:39 +02:00
Aurelien DARRAGON	7144e60cd2	MINOR: sample: postresolve sink names in debug() converter debug() converter used to resolve sink names during parsing time. Because of this, we were unable to specify sink names that were defined after the debug() converter was placed. Like in the previous commit, let's implement proper postparsing for the debug() converter, in order to be able to use sink names that are about to be defined later in the config file.	2024-10-10 16:55:15 +02:00
Aurelien DARRAGON	ed266589b6	MINOR: trace: postresolve sink names A previous known limitation about traces was that parsing was performed on the fly, meaning that when using "sink" keyword, only sinks that were either internal or previously defined in the config could be used. Indeed, it was not possible to use a ring section defined AFTER the traces section when using the 'sink' keyword from traces. This limitation was also mentioned in the config file. Let's get rid of that limitation by implementing proper postparsing for the sink parameter in traces section. To do this, make use of the new sink_find_early() helper to start referencing sink by their names even if they don't exist yet (if they are about to be defined later in the config) Traces commands on the cli are not concerned by this change.	2024-10-10 16:55:15 +02:00
Aurelien DARRAGON	1bdf6e884a	MEDIUM: sink: implement sink_find_early() sink_find_early() is a convenient function that can be used instead of sink_find() during parsing time in order to try to find a matching sink even if the sink is not defined yet. Indeed, if the sink is not defined, sink_find_early() will try to create it and mark it as forward-declared. It will also save information from the caller to better identify it in case of errors. If the sink happens to be found in the config, it will transition from forward-declared type to its final type. Otherwise, the sink was not found in the config, and during postresolve we raise an error to indicate that the sink was not found in the configuration. It should help solve postresolving issues with rings, because for now only log targets implement proper ring postresolving, but rings may be used at different places in the code, such as debug() converter or in "traces" section.	2024-10-10 16:55:15 +02:00
Damien Claisse	ba7c03c18e	MINOR: ssl: disable server side default CRL check with WolfSSL Patch `64a77e3ea5` disabled CRL check when no CRL file was provided, but it only did it on bind side. Add the same fix in server context initialization side. This allows to enable peer verification (verify required) on a server using TLS, without having to provide a CRL file.	2024-10-10 09:31:19 +02:00
Amaury Denoyelle	456c3997b2	BUG/MEDIUM: quic: properly decount out-of-order ACK on stream release Out-of-order STREAM ACKs are buffered in their related streambuf tree. On insertion, overlapping or contiguous ranges are merged together. The total size of buffered ack range is stored in <room> streambuf member and reported to QUIC MUX layer on streambuf release. The objective is to ensure QUIC MUX layer can allocate Tx buffers conveniently to preserve a good transfer throughput. Streamdesc is the overall container of many streambufs. It may also be released when its upper QCS instance is freed, after all stream data have been emitted. In this case, the active streambuf is also released via custom code. However, in this code path, <room> was not reported to the QUIC MUX layer. This bug caused wrong estimation for the QUIC MUX txbuf window, with bytes remaining even after all ACKs were received. This may cause transfer freeze on other connection streams, with RESET_STREAM emission on timeout client. To fix this, reuse the existing qc_stream_buf_release() function on streamdesc release. This ensures that notify_room is correctly used. No need to backport.	2024-10-09 17:47:16 +02:00
Amaury Denoyelle	f0049d0748	BUG/MINOR: quic: fix discarding of already stored out-of-order ACK To properly decount out-of-order acked data range, contiguous or overlapping ranges are first merged before their insertion in a tree. The first step ensures that a newly reported range is not completely covered by the existing tree ranges. However, one of the conditions was incorrect. Fix this to ensure that the final range tree does not contain duplicated entries. The impact of this bug is unknown. However, it may have allowed the insertion of overlapping ranges, which could in turn cause an error in QUIC MUX txbuf window, with a possible transfer freeze. No need to backport.	2024-10-09 17:32:30 +02:00
Aurelien DARRAGON	f88f162868	BUG/MEDIUM: hlua: properly handle sample func errors in hlua_run_sample_{fetch,conv}() To execute sample fetches and converters from Lua, the hlua API leverages the sample API. Prior to executing the sample func, the arg checker is called from hlua_run_sample_{fetch,conv}() to detect potential errors. However, hlua_run_sample_{fetch,conv}() both pass NULL as <err> argument, but it is wrong for two reasons. First we miss an opportunity to report precise error messages to help the user know what went wrong during the check, and more importantly, some val check functions consider that the <err> pointer is never NULL. This is the case for example with check_crypto_hmac(). Because of this, when such val check functions encounter an error, they will crash the process because they will try to de-reference NULL. This bug was discovered and reported by GH user @JB0925 on #2745. Perhaps val check functions should make sure that the provided <err> pointer is != NULL prior to de-referencing it. But since there are multiple occurrences found in the code and the API isn't clear about that, it is easier to fix the hlua part (caller) for now. To fix the issue, let's always provide a valid <err> pointer when leveraging val_arg() check function pointer, and make use of it in case of error to report relevant message to the user before freeing it. It should be backported to all stable versions.	2024-10-08 12:00:42 +02:00
Aurelien DARRAGON	d0e0105181	BUG/MEDIUM: hlua: make hlua_ctx_renew() safe hlua_ctx_renew() is called from unsafe places where the caller doesn't expect it to LJMP.. however hlua_ctx_renew() makes use of Lua library function that could potentially raise errors, such as lua_newthread(), and it does nothing to catch errors. Because of this, haproxy could unexpectedly crash. This was discovered and reported by GH user @JB0925 on #2745. To fix the issue, let's simply make hlua_ctx_renew() safe by applying the same logic implemented for hlua_ctx_init() or hlua_ctx_destroy(), which is catching Lua errors by leveraging SET_SAFE_LJMP_PARENT() helper. It should be backported to all stable versions.	2024-10-08 12:00:36 +02:00
Aurelien DARRAGON	3ba924a4da	MINOR: action: add do-log action Thanks to the two previous commits, we can now expose the do-log action on all available action contexts, including the new quic-init context. Each context is responsible for exposing the do-log action by registering the relevant log steps, saving the identifier, and then storing it in the rule's context so that do_log_action() automatically uses it to produce the log during runtime. To use the feature, it is simply needed to use "do-log" (without argument) on an action directive, example: tcp-request connection do-log As mentioned before, each context where the action is exposed has its own log step identifier. Currently known identifiers are: quic-initial: quic-init tcp-request connection: tcp-req-conn tcp-request session: tcp-req-sess tcp-request content: tcp-req-cont tcp-response content: tcp-res-cont http-request: http-req http-response: http-res http-after-response: http-after-res Thus, these "additional" logging steps can be used as-is under log-profile section (after "on" keyword). However, although the parser will accept them, it makes no sense to use them with the "log-steps" proxy keyword, since the only path for these origins to trigger a log generation is through the explicit use of "do-log" action. This need was described in GH #401, it should help to conditionally trigger logs using ACL at specific key points, and may either be used alone or combined with "log-steps" to add additional log "trackers" during transaction handling. Documentation was updated and some examples were added.	2024-10-04 21:38:14 +02:00
Aurelien DARRAGON	0e271f1d2a	MINOR: log: add do_log_parse_act() helper func Function may be used from places where per-context actions are usually registered (tcp_act.c, http_act.c, quic_rules.c.. to name a few) in order to expose the do_log() action.	2024-10-04 21:38:08 +02:00
Aurelien DARRAGON	e63c7da508	MINOR: log: add do_log() logging helper do_log() is quite similar to sess_log() or strm_log(), except that it may be called at any time during session handling in an opportunistic way as long as the session exists (the stream may or may not exist). Also, it will try to emit the log as INFO by default, unless set-log-level is used on the stream, or error origin flag is set.	2024-10-04 21:38:02 +02:00
Amaury Denoyelle	f6599cf5a6	MEDIUM: quic: decount out-of-order ACK data range for MUX txbuf window This commit is the last one of a series whose objective is to restore QUIC transfer throughput performance to the state prior to the recent QUIC MUX buffer allocator rework. This gain is obtained by reporting received out-of-order ACK data range to the QUIC MUX which can then decount room in its txbuf window. This is implemented in QUIC streamdesc layer by adding a new invocation of notify_room callback. This is done into qc_stream_buf_store_ack() which handles out-of-order ACK data range. Previous commit has introduced merging of overlapping ACK data range. As such, it's easy to only report the newly acknowledged data range. As with in-order ACKs, this new notification is only performed on released streambuf. As such, when a streambuf instance is released, notify_room notification now also reports the total length of out-of-order ACK data range currently stored. This value is stored in a new streambuf member <room> to avoid unnecessary tree lookup. This <room> member also serves on in-order ACK notification to reduce the notified room. This prevents reporting invalid values when overlapping ranges are treated first out-of-order and then in-order, which would cause an invalid QUIC MUX txbuf window value. After this change has been implemented, performance has been significantly improved, both with ngtcp2-client rate usage and on interop goodput test. These values are now similar to the rate observed on older haproxy versions before QUIC MUX buffer allocator rework.	2024-10-04 18:09:51 +02:00
Amaury Denoyelle	ae3e768d32	MEDIUM: quic: merge contiguous/overlapping buffered ack stream range Transfer throughput was deteriorated since recent rework of QUIC MUX txbuf allocator. This was partially restored with the commit to decount individual in-order ACK from the MUX buffer window. To fully retrieve the old performance level, all ACKs must be decounted when handled by QUIC streamdesc layer, even out-of-order ranges. However, this is not easily implemented as several ranges may exist in parallel with overlap on the underlying data. It would cause miscalculation for QUIC MUX buffer window if such ranges were blindly reported. The proper solution is to first implement merge of contiguous or overlapping ACK data ranges to reduce the number of stored ranges to the minimal. This is the purpose of this patch. This is implemented in a new static function named qc_stream_buf_store_ack() into streamdesc layer. The merge algorithm is simple enough. First, it ensures the newly added range is not already fully covered by a preexisting entry. Then, it checks if there is contiguity/overlap with one or several ranges starting at the same or a greater offset. If true, the newly added entry is extended to cover them all, and all contiguous/overlapped ranges are removed. Finally, if there is contiguity or overlap with an entry starting at a smaller offset, no new range is instantiated and instead the smaller offset is extended. Now that contiguous or overlapped ranges cannot exist anymore, ACK data ranges tree instantiation can use EB_ROOT_UNIQUE. Outside of the longer term objective which is to decount out-of-order ACKs from MUX txbuf window, this commit could also improve some performance and/or memory usage for connections where stream data fragmentation and packet reordering is high.	2024-10-04 18:07:52 +02:00
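The merge of contiguous or overlapping ranges described in this commit message can be sketched as below. This is a simplified illustration only: it uses a small sorted array instead of the ebtree used by the streamdesc layer, and the type and function names (range, range_set, range_add) are hypothetical. The return value is the number of bytes not covered before the insertion, which is the quantity a caller would want to report only once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_RANGES 16

/* Hypothetical acknowledged data range [off, off + len). */
struct range {
	uint64_t off;
	uint64_t len;
};

/* Ranges are kept sorted by offset, disjoint and non-contiguous. */
struct range_set {
	struct range r[MAX_RANGES];
	size_t nb;
};

/* Insert [off, off + len) and merge it with every stored range it overlaps
 * or touches. Returns the number of newly covered bytes, or -1 if the set
 * is full.
 */
static int64_t range_add(struct range_set *s, uint64_t off, uint64_t len)
{
	uint64_t start = off, end = off + len, absorbed = 0;
	size_t i = 0, j;

	assert(len > 0);

	/* skip ranges ending strictly before the new one starts */
	while (i < s->nb && s->r[i].off + s->r[i].len < start)
		i++;

	/* absorb every range overlapping or contiguous with [start, end) */
	for (j = i; j < s->nb && s->r[j].off <= end; j++) {
		if (s->r[j].off < start)
			start = s->r[j].off;
		if (s->r[j].off + s->r[j].len > end)
			end = s->r[j].off + s->r[j].len;
		absorbed += s->r[j].len;
	}

	if (j == i) {
		/* nothing to merge with: open a slot at position <i> */
		if (s->nb == MAX_RANGES)
			return -1;
		memmove(&s->r[i + 1], &s->r[i], (s->nb - i) * sizeof(s->r[0]));
		s->nb++;
	} else {
		/* r[i] becomes the merged range, r[i+1..j-1] are dropped */
		memmove(&s->r[i + 1], &s->r[j], (s->nb - j) * sizeof(s->r[0]));
		s->nb -= j - i - 1;
	}
	s->r[i].off = start;
	s->r[i].len = end - start;

	/* absorbed ranges were disjoint, so the difference is the new coverage */
	return (int64_t)(end - start - absorbed);
}
```

A range that is already fully covered yields 0, and a range that bridges two stored entries collapses them into one, so the set never holds two entries for the same data.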
Amaury Denoyelle	e7578084b0	MINOR: quic: implement dedicated type for out-of-order stream ACK QUIC streamdesc layer is responsible for handling reception of ACK for streams. It removes stream data from the underlying buffers on ACK reception. Streamdesc layer treats ACK in order at the stream level. Out of order ACKs are buffered in a tree until they can be handled on older data acknowledgement reception. Previously, qf_stream instance which comes from the quic_tx_packet was used as tree node to buffer such ranges. Introduce a new type dedicated to represent out of order stream ack data range. This type is named qc_stream_ack. It contains minimal infos only relative to the acknowledged stream data range. This allows to reduce size of frequently used quic_frame with the removal of tree node from qf_stream. Another side effect of this change is that now quic_frame are always released immediately on ACK reception, both in-order and out-of-order. This allows to also release the quic_tx_packet instance which should reduce memory consumption. The drawback of this change is that qc_stream_ack instance must be allocated on out-of-order ACK reception. As such, qc_stream_desc_ack() may fail if an error happens on allocation. For the moment, such error is silently recovered up to qc_treat_rx_pkts() with the dropping of the received packet containing the ACK frame. In the future, it may be useful to close the connection as this error may only happen under low memory conditions.	2024-10-04 17:56:45 +02:00
Amaury Denoyelle	4ff87db5fe	MEDIUM: quic: decount acknowledged data for MUX txbuf window Recently, a new allocation mechanism was implemented for Tx buffers used by QUIC MUX. Now, underlying congestion window size is used to determine if it is still possible or not to allocate a new buffer when necessary. This mechanism has rendered the QUIC stack more flexible. However, it also has brought some performance degradation, with transfer time longer in certain environments. It was first discovered on the measurement results of the interop. It can also easily be reproduced using the following ngtcp2-client example which forces a very small congestion window due to frequent loss : $ ngtcp2-client -q --no-quic-dump --no-http-dump --exit-on-all-streams-close -r 0.1 127.0.0.1 20443 "https://[::]:20443/?s=10m" This performance decrease is caused by the allocator which is now too strict. It may cause buffer underrun frequently at the MUX layer when the congestion window is too small, as new buffers cannot be allocated until the current one is fully acknowledged. This results in transfers with very bad throughput utilisation. The objective of this new series of patches is to relax some restrictions to permit QUIC MUX to allocate new buffers more quickly, while preserving the initial limitation based on congestion window size. An interesting method for this is to notify QUIC MUX about newly available room on individual ACK reception, without waiting for the full buffer acknowledgement. This is easily implemented by adding a new notify_room invocation in QUIC streamdesc layer on ACK reception. However, ACK receptions are handled in-order at the stream level. Out of order ACKs are buffered and are not decounted for now. This will be implemented in a future commit. Note that for a single buffer instance, data can in parallel be written by QUIC MUX and removed on ACK reception. This could cause room notification to QUIC MUX layer to report invalid values. As such, ACK receptions are only accounted for released buffers. This ensures that such buffers won't receive any new data. At the same time, buffer room is notified on release operation as it does not need acknowledgement. This commit has improved performance for the ngtcp2-client scenario above. However, it is not yet sufficient for interop goodput test.	2024-10-04 17:31:26 +02:00
Amaury Denoyelle	324a49ed4d	MINOR: quic: strengthen qc_release_frm() quic_frame is the type used to represent frames emitted in a QUIC Tx packet. Each frame is attached to a packet, and can also be linked to other frames from the same packet, or duplicated frames for retransmission. As such, quic_frame free operation is a tedious process. qc_release_frm() has been implemented to ensure quic_frame is always properly freed after detaching from all its list attach points. One particular point is to ensure that when a frame is released, the frame origin and all origin copies, including the current <frm> are flagged as acked and detached from the reflist. Add a BUG_ON() to ensure this loop is properly conducted when dealing with the current <frm> instance.	2024-10-04 16:00:05 +02:00
Christopher Faulet	131b877565	BUG/MINOR: stats: Fix the name for the total number of streams created Because of a copy/paste error, CurrStreams was reused by mistake. It should be "CumStreams" No backports needed.	2024-10-04 15:44:40 +02:00
Amaury Denoyelle	c1d714156e	BUG/MAJOR: mux-quic: do not crash on empty STREAM frame emission Most of the time STREAM frames emitted by QUIC MUX have some data in it. However, it is possible to use an empty frame when a delayed FIN must be transferred. Recently, QUIC MUX send callback notification has been refactored. Now, this callback is blindly called by quic_conn lower layer each time a STREAM frame is built into a newly Tx packet. QUIC MUX is responsible to ensure the notified frame corresponds to newly emitted data or retransmission. Offsets are used for this comparison, but this requires special care for empty FIN frames. Sadly, the comparison written to determine if an empty FIN frame was sent for the first time or retransmitted is not correct. This caused such frame to always be dismissed as retransmission in QUIC MUX sent callback. This prevented the related QCS instance to be removed from the send_list, causing qcc_io_send() to retry a new emission. This was finally interrupted by the BUG_ON() assertion to prevent an infinite loop. Fix this crash by updating the condition in QUIC MUX send callback. For empty STREAM frame, it is sufficient to check if QC_SF_FIN_STREAM was already removed or not to detect a retransmission. Indeed, empty STREAM frames are never used outside of delayed FIN reporting. No need to backport. This crash was introduced in the current dev branch by the following commit. `d7f4e5abf0` MEDIUM: quic: strengthen MUX send notification	2024-10-04 11:31:11 +02:00
Amaury Denoyelle	b74df9fbc9	BUG/MINOR: quic: fix trace on releasing STREAM frame after ack Fix NULL argument passed to qc_release_frm(). This allows to give more context on the traces inside it. Note that no crash occurred as QUIC traces always check validity on first arg before dereferencing it. No backport needed.	2024-10-02 17:10:51 +02:00
Amaury Denoyelle	58b7a72d07	BUG/MINOR: mux-quic: fix crash on qcc_init() early return qcc_release() may be used in case qcc_init() cannot complete. In this case, connection instance is NULL. As such, it cannot be dereferenced without testing it first. This should fix github coverity report #2739. No backport needed.	2024-10-02 17:06:31 +02:00
Christopher Faulet	cea1379cf1	BUG/MINOR: http-ana: Disable fast-fwd for unfinished req waiting for upgrade If a request is waiting for a protocol upgrade but it is not finished, the data fast-forwarding is disabled. Otherwise, the request analyzers will miss the end of the message. This case is possible since the commit 01fb1a54 ("BUG/MEDIUM: mux-h1/mux-h2: Reject upgrades with payload on H2 side only"). Indeed, before, a protocol upgrade was not allowed for request with payload. But it is now possible and this comes with a side-effect. It is not really satisfying but for now there is no other way to sync the muxes and the applicative stream. It seems to be a reasonable fix for now, waiting for a deeper refactoring. This patch must be backported with the commit above.	2024-10-02 10:31:40 +02:00
Christopher Faulet	267ba1d889	MINOR: mux-h1: Use a dedicated function to conditionally set EOI flag on SE The same conditions are evaluated in h1_process_demux() and h1_fastfwd() to know if SE_FL_EOI flag must be set or not on the sedesc. So now, a dedicated function is used.	2024-10-02 10:22:51 +02:00
Christopher Faulet	6b39e245e1	BUG/MINOR: mux-h1: Fix condition to set EOI on SE during zero-copy forwarding During zero-copy data forwarding, the producer must set the EOI flag on the SE when end of the message is reached. It is already done but there is a case where this flag is set while it should not. When a request wants to perform a protocol upgrade and it is waiting for the server response, the flag must not be set because the HTTP message is finished but some data are possibly still expected, depending on the server response. On a 101-switching-protocol, more data will be sent because the producer is switched to TUNNEL state. So, now, the right condition is used. In DONE state, SE_FL_EOI flag is set on the sedesc iff: - it is the response - it is the request and the response is also in DONE state - it is a request but not a protocol upgrade nor a CONNECT This patch must be backported as far as 2.9.	2024-10-02 10:22:51 +02:00
Christopher Faulet	27ee292731	MINOR: tcpcheck: Add support for an option host header value for httpchk option Support for headers and body hidden in the version for the "option httpchk" directive was removed. However a Host header is mandatory for HTTP/1.1 requests and some servers may return an error if it is not set. For now, to add it, an "http-check send" rule must be added. But it is not really handy to use an extra config line for this purpose. So now, it is possible to set the host header value, a log-format string, as extra argument to "option httpchk" directive. It must be the fourth argument: option httpchk GET / HTTP/1.1 www.srv.com While this patch is not a bug fix, it is simple enough to be backported if necessary. On 2.9 and older, lf_init_expr() does not exist and LIST_INIT() must be used instead.	2024-10-02 10:22:51 +02:00
Christopher Faulet	c39c351a73	MINOR: trace: Be able to chain commands for a source in one line In the configuration file or on the CLI, configuring traces for a specific source is a bit painful because this must be done in several lines. Thanks to this patch, it is now possible to fully configure traces for a source in one line. For instance, the following on the CLI: trace h1 sink stderr; trace h1 level developer; trace h1 verbosity complete; trace h1 start now can now be replaced by: trace h1 sink stderr level developer verbosity complete start now The same is true for the 'trace' directives in the configuration file.	2024-10-02 10:22:51 +02:00
Christopher Faulet	15a520d474	MINOR: config/trace: Add a 'traces' section to declare debug traces It is no longer supported to declare debug traces, via 'trace' directive, in a global section. A 'traces' section must be used instead. The syntax of the 'trace' directive in these sections remains the same. But it is no longer experimental. The main reason for this change is to avoid having a ring section defined before a global one. Indeed, for now, forward declarations of ring sections are not supported. So to configure traces, you had to add a ring section before the global one defining the traces. Most of the time, that meant to have two global sections : global [...] # global settings ring <name> [...] global [...] # trace config In addition, it will be possible to easily extend the traces section by adding some new directives.	2024-10-02 10:22:51 +02:00
Willy Tarreau	53f52e67a0	BUG/MEDIUM: queue: always dequeue the backend when redistributing the last server An interesting bug was revealed by commit `5541d4995d` ("BUG/MEDIUM: queue: deal with a rare TOCTOU in assign_server_and_queue()"). When shutting down a server to redistribute its connections, no check is made on the backend's queue. If we're turning off the last server and the backend has pending connections, these ones will wait there till the queue timeout. But worse, since the commit above, we can enter an endless loop in the following situation: - streams are present in the backend's queue - streams are purged on the last server via srv_shutdown_streams() - that one calls pendconn_redistribute(srv) which does not purge the backend's pendconns - a stream performs some load balancing and enters assign_server_and_queue() - assign_server() is called in turn - the LB algo is non-deterministic and there are entries in the backend's queue. The function notices it and returns SRV_STATUS_FULL - assign_server_and_queue() calls pendconn_add() to add the connection to the backend's queue - on return, pendconn_must_try_again() is called, it figures there's no stream served anymore on the server nor the proxy, so it removes the pendconn from the queue and returns 1 - assign_server_and_queue() loops back to the beginning to try again, while the conditions have not changed, resulting in an endless loop. Ideally a change count should be used in the queues so that it's possible to detect that some dequeuing happened and/or that a last stream has left. But that wouldn't completely solve the problem that is that we must never ever add to a queue when there's no server streams to dequeue the new entries. The current solution consists in making pendconn_redistribute() take care of the proxy after the server in case there's no more server available on the proxy. 
It at least ensures that no pending streams are left in the backend's queue when shutting streams down or when the last server goes down. The try_again loop remains necessary to deal with inevitable races during pendconn additions. It could be limited to a few rounds, though, but it should never trigger if the conditions are sufficient to permit it to converge. One way to reproduce the issue is to run a config with a single server with maxconn 1 and plenty of threads, then run in loops series of: "disable server px/s;shutdown sessions server px/s; wait 100ms server-removable px/s; show servers conn px; enable server px/s" on the CLI at ~10/s while injecting with around 40 concurrent conns at 40-100k RPS. In this case in 10s - 1mn the crash can appear with a backtrace like this one for at least 1 thread: #0 pendconn_add (strm=strm@entry=0x17f2ce0) at src/queue.c:487 #1 0x000000000064797d in assign_server_and_queue (s=s@entry=0x17f2ce0) at src/backend.c:1064 #2 0x000000000064a928 in srv_redispatch_connect (s=s@entry=0x17f2ce0) at src/backend.c:1962 #3 0x000000000064ac54 in back_handle_st_req (s=s@entry=0x17f2ce0) at src/backend.c:2287 #4 0x00000000005ae1d5 in process_stream (t=t@entry=0x17f4ab0, context=0x17f2ce0, state=<optimized out>) at src/stream.c:2336 It's worth noting that other threads may often appear waiting after the poller and one in server_atomic_sync() waiting for isolation, because the event that is processed when shutting the server down is consumed under isolation, and having less threads available to dequeue remaining requests increases the probability to trigger the problem, though it is not at all necessary (some less common traces never show them). This should carefully be backported wherever the commit above was backported.	2024-10-01 18:57:51 +02:00
Amaury Denoyelle	8d68717a41	MEDIUM: quic: refactor buffered STREAM ACK consuming For the moment, streamdesc layer can only deal with in-order ACK at the stream level. Received out-of-order ACKs are buffered in a tree attached to a streambuf instance. Previously, the caller of qc_stream_desc_ack() was responsible for implementing consumption of these buffered ACKs. Refactor this by implementing it directly at the streamdesc layer within qc_stream_desc_ack(). This simplifies quic_rx ACK handling and ensures buffered ACKs are consumed as soon as possible.	2024-10-01 16:22:23 +02:00
Amaury Denoyelle	cc4384aeb7	MEDIUM: quic: handle out-of-order ACK at streamdesc layer qc_stream_desc_ack() is the entrypoint for streamdesc layer to handle a new acknowledgement of previously emitted STREAM data. Previously, it was only able to deal with in-order ACK offset. The caller was responsible for buffering out-of-order ACKs. Change this by dealing with the latter case directly in qc_stream_desc_ack(). This notably simplifies ACK handling in quic_rx module.	2024-10-01 16:22:20 +02:00
Amaury Denoyelle	62558a9285	MINOR: quic: move buffered ACK to streambuf QUIC streamdesc layer is used to manage QUIC MUX stream txbuf data storage until acknowledgment. Currently, it only supports in-order acknowledgment at the stream level. This requires to be able to buffer out-of-order ACKs until they can be handled. Previously, these ACKs were stored in a tree to the streamdesc instance. Move this indexed storage at the streambuf instance. This commit is purely an architecture change. However, it will allow to extend ACK management in future patches, such as the ability to merge overlapping out-of-order ACKs.	2024-10-01 16:19:42 +02:00
Amaury Denoyelle	943e48dadd	MINOR: quic: store streambuf in a streamdesc tree qc_stream_desc layer is used by QUIC MUX to store emitted STREAM data until their acknowledgement. Each stream with Tx capability can allocate its own qc_stream_desc. In turn, each stream desc can have one or multiple data buffers. This is useful when a MUX stream releases a buffer and allocates a new one, to preserve bandwidth without waiting to receive all acknowledgements of the previous buffer. Each buffer is encapsulated in a qc_stream_buf structure. Previously, it was stored as a list into qc_stream_desc. Change this storage to use a tree instead. Each buffer is indexed by its offset. This commit does not introduce functional changes. However, this rearchitecture will be necessary for future commits to extend ACK management, which requires fetching individual buffer instances by their offset, not just the first or last element of a streamdesc.	2024-10-01 16:19:41 +02:00
Amaury Denoyelle	f4a83fbb14	MINOR: quic: do not remove qc_stream_desc automatically on ACK handling qc_stream_desc_ack() is used to handle ACK received for STREAM frame. It removes acknowledged data from their underlying buffer. If all data were removed after ACK handling, qc_stream_desc instance would automatically be freed at the end of qc_stream_desc_ack(). However, this renders the function complicated to use. Simplify this by removing this automatic removal. Now, caller is responsible to check after ACK handling if qc_stream_desc instance can be removed. This is easily done using qc_stream_desc_done() helper.	2024-10-01 16:19:25 +02:00
Amaury Denoyelle	db68f8ed86	MINOR: quic: refactor STREAM room notification qc_stream_desc is an intermediary layer between QUIC MUX and quic_conn. It is a facility which permits to store data to emit and keep them for retransmission until acknowledgment. This layer is responsible to notify QUIC MUX each time a buffer is freed. This is necessary as MUX buffer allocation is limited by the underlying congestion window size. Refactor this to use a mechanism similar to send notification. A new callback notify_room can now be registered to qc_stream_desc instance. This is set by QUIC MUX to qmux_ctrl_room(). On MUX QUIC free, special care is now taken to reset notify_room callback to NULL. Thanks to this refactoring, further adjustment have been made to refine the architecture. One of them is the removal of qc_stream_desc QC_SD_FL_OOB_BUF, which is now converted to a MUX layer flag QC_SF_TXBUF_OOB.	2024-10-01 16:19:25 +02:00
Amaury Denoyelle	d7f4e5abf0	MEDIUM: quic: strengthen MUX send notification Previous commit implement a refactor of MUX send notification from quic_conn layer. With this new architecture, a proper callback is defined for each qc_stream_desc instance. This architecture change allows to simplify notification from quic_conn layer. First, ensure the MUX callback to properly ignore retransmission of an already emitted frame. Luckily, this can be handled easily by comparing offsets and FIN status. Also, each QCS instance can now be unregistered from send notification just prior qc_stream_desc releasing. This ensures a QCS is never manipulated from quic_conn after its emission ending. Both these changes render the send notification more robust. As a nice effect, flag QUIC_FL_CONN_TX_MUX_CONTEXT can be removed as it is now unneeded.	2024-10-01 16:19:25 +02:00
Amaury Denoyelle	6ad99af0a9	MINOR: quic: refactor MUX send notification For STREAM emission, MUX QUIC generates one or several frames and emits them via qc_send_mux(). Lower layer may use them as-is, or split them into smaller chunks to fit in a QUIC packet. It is then responsible to notify the MUX to report the amount of data sent. Previously, this was done via a direct call from quic_conn to MUX using qcc_streams_sent_done(). Modify this to have a better isolation across layers. Define a send callback handled by the qc_stream_desc instance. This allows the MUX to register each QCS instance individually to the renamed qmux_ctrl_send() which replaces qcc_streams_sent_done(). At quic_conn layer, qc_stream_desc_send() can be used now. This is a wrapper to qc_stream_desc layer to invoke the send callback if registered. This mechanism of qc_stream_desc callback should be extended later to implement other notifications across the QUIC stack.	2024-10-01 16:19:25 +02:00
Amaury Denoyelle	4859d8e71d	MINOR: quic: remove unneeded notification of txbuf room When a stream buffer is freed, qc_stream_desc notifies MUX. This is useful if MUX is waiting for Tx buffer allocation. Remove this notification in qc_stream_desc(). This is because the function is called when all stream data have been acknowledged and thus notified. This function can also be called with some data unacknowledged, but in this case this is only true just before connection closure. As such, it is useless to notify the MUX in this condition.	2024-10-01 16:19:25 +02:00
Amaury Denoyelle	12782da020	MINOR: mux-quic: strengthen qcs_send_metadata() usage This function is reserved for QCS instance where no data was emitted. A BUG_ON() ensures this by checking that streamdesc buf_list is empty. However, this condition would not be enough if data were previously emitted but already fully acknowledged. Thus, extend the condition by also checking the streamdesc ack_offset is 0.	2024-10-01 16:17:03 +02:00
Amaury Denoyelle	fdc16c1e01	MINOR: quic: ensure txbuf realloc is only performed on empty buffer QUIC application protocol layer has the ability to either allocate a standard buffer or a smaller one. The latter is useful when only small data are transferred to prevent consuming too much of the QUIC MUX buffer window. This operation is performed using qc_stream_buf_realloc(). Add a new BUG_ON() in it to ensure no data is present in the buffer. Indeed, this would cause data loss, or even a crash when trying to acknowledge data. Note that for the moment qc_stream_buf_realloc() is only used for HTTP/3 headers transmission, and this usage conforms to the new BUG_ON(). This commit is thus not a bug fix, but only strengthens the API.	2024-10-01 11:51:51 +02:00
Amaury Denoyelle	172404a8ec	MINOR: mux-quic: complete Tx infos for QCS dump Complete debug info when a QCS instance is dumped either on traces or show quic. Display the value of Tx offset both soft and real, along with the current flow-control limit.	2024-10-01 11:51:51 +02:00
Valentine Krasnobaeva	f18b52cc80	MINOR: cfgparse-global: add dedicated parser for env keywords This commit prepares the config parser to support MODE_DISCOVERY and, thus, refactored master-worker mode. The latter implies, that master process reads only the 'DISCOVERY' tagged keywords from the global section and it must call for this an appropriate keyword parser. So, let's move the code, which parses env keywords, from the global section parser to its own keyword registered parser.	2024-10-01 10:37:29 +02:00
Valentine Krasnobaeva	df68f7ec96	BUG/MINOR: cfgparse-global: fix allowed args number for setenv Keywords setenv and presetenv take 2 arguments: variable name and value. So, the total number, that should be passed to alertif_too_many_args is 2 ("setenv <name> <value>") instead of 3. For alertif_too_many_args the first argument index is 0. This should be backported in all stable versions.	2024-10-01 10:35:09 +02:00
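The index arithmetic behind this fix can be illustrated with a minimal sketch. It assumes the parsed words are held in an array terminated by an empty string and that the limit is the index of the last allowed word, with the keyword itself at index 0. demo_too_many_args() is a hypothetical stand-in written for illustration, not HAProxy's alertif_too_many_args().

```c
#include <stdbool.h>

/* Hypothetical helper (not HAProxy's API): returns true when a non-empty
 * word is found past index <maxarg>. <args> is assumed to be terminated by
 * an empty string, with args[0] holding the keyword itself.
 */
bool demo_too_many_args(int maxarg, const char *const *args)
{
	for (int i = 0; i <= maxarg; i++) {
		if (!*args[i])
			return false; /* the line ended before reaching the limit */
	}
	return *args[maxarg + 1] != '\0';
}
```

With "setenv <name> <value>", the value sits at index 2, so a limit of 2 accepts exactly that form, while a limit of 3 would silently accept one extra word.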
Christopher Faulet	273d322b6f	MINOR: stream/stats: Expose the total number of streams ever created in stats A shared counter is added in the thread context to track the total number of streams created on the thread. This number is then reported in stats. It will be a useful information to diagnose some bugs.	2024-09-30 16:55:53 +02:00
Christopher Faulet	18ee22ff76	MINOR: stream/stats: Expose the current number of streams in stats A shared counter is added in the thread context to track the current number of streams. This number is then reported in stats. It will be a useful information to diagnose some bugs.	2024-09-30 16:55:53 +02:00
Christopher Faulet	6a94b7419e	MINOR: stream: Support dynamic changes of the number of connection retries Thanks to the previous patch, it is now possible to add an action to dynamically change the maximum number of connection retries for a stream. The "set-retries" action may now be used to do so, from a "tcp-request content" or an "http-request" rule. This action accepts an expression or an integer between 0 and 100. The integer value is checked during configuration parsing and leads to an error if it is not in the expected range. For an expression, however, the value is retrieved at runtime, so invalid values are simply ignored. Too high a value is forbidden to avoid any trouble; 100 retries already seems an amazingly high value. In addition, the option is only available in backend or listen sections. Because the max retries is limited to 100 at most, it can be stored as an unsigned short. This saves some space in the stream structure.	2024-09-30 16:55:53 +02:00
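The two validation paths described above (reject at parse time, ignore at runtime) can be sketched as follows. All names here (demo_stream, retries_value_is_valid(), demo_set_retries(), DEMO_MAX_CONN_RETRIES) are hypothetical and chosen for illustration; only the 0..100 range and the unsigned short storage come from the commit message.

```c
#include <stdbool.h>

#define DEMO_MAX_CONN_RETRIES 100 /* upper bound stated in the commit message */

/* Hypothetical reduced stream: only the per-stream retry limit is kept. */
struct demo_stream {
	unsigned short max_retries; /* fits easily since the value is <= 100 */
};

/* Range check usable at configuration time for a literal integer. */
bool retries_value_is_valid(long long v)
{
	return v >= 0 && v <= DEMO_MAX_CONN_RETRIES;
}

/* Runtime path for an expression result: an out-of-range value is ignored
 * and the stream keeps its previous limit.
 */
void demo_set_retries(struct demo_stream *s, long long v)
{
	if (retries_value_is_valid(v))
		s->max_retries = (unsigned short)v;
}
```

A configuration parser would report an error when retries_value_is_valid() fails on a literal, whereas the runtime path has no way to report one and therefore leaves the stream untouched.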
Christopher Faulet	91e785edc9	MINOR: stream: Rely on a per-stream max connection retries value Instead of directly relying on the backend parameter to limit the number of connection retries, we now use a per-stream value. This value is by default inherited from the backend value when it is set. So for now, there is no change except the stream value is used instead of the backend value. But thanks to this change, it will be possible to dynamically change this value.	2024-09-30 16:55:53 +02:00
Christopher Faulet	0d91de2be4	MINOR: action: Export release_expr_int_action() release function This function was only used by TCP actions and was private to tcp_act.c file. However, it make sense to make it public to be used by any action relying on an int-or-expression argument.	2024-09-30 16:55:53 +02:00
Christopher Faulet	688abb6f30	BUG/MINOR: mcli: Pretend the mux have more data to deliver between two commands Since the commit "OPTIM: stconn: Don't pretend mux have more data to deliver on EOI/EOS/ERROR", the SC no longer pretends its mux has more data to deliver when one of the EOI/EOS/ERROR flags is set on its sedesc. However, for the master CLI, it is an issue because any EOI/EOS at the end of a command is in fact detected on the attempt to get the next command. To do so, the stream is reset. Because of the commit above, the next receive is never performed. To fix the issue, when the stream is reset, the front SC pretends its mux has more data to deliver. This patch must only be backported if the commit above is backported.	2024-09-30 16:55:53 +02:00
Christopher Faulet	bca5e14235	OPTIM: stconn: Don't pretend mux have more data to deliver on EOI/EOS/ERROR Doing some benchmarks on 3.0, we encountered a small loss in requests/sec on small objects compared to 2.8. After bisecting the issue, it appeared that this was introduced when the mux-to-mux zero-copy data forwarding was implemented in 2.9-dev8. Extra subscribes on receives at the end of the message were responsible for the loss. A basic configuration, sending H2 requests to an H1 server returning responses without payload, is enough to observe the issue. With the following command, we can observe a huge increase of epoll_ctl calls on 2.9/3.x: h2load -c 100 -m 10 -n 100000 http://... On 2.8 we have around 3200 calls to epoll_ctl against more than 20k on 3.1. The fix seems obvious. After a receive, there is no reason to state that a mux has more data to deliver if the EOI/EOS/ERROR flag was set on the stream-endpoint descriptor. With this change, extra calls to epoll_ctl disappear. However it is a sensitive part so it is important to keep an eye on it and to not backport it. Thanks to Willy and Emeric for having spotted the issue.	2024-09-30 16:55:48 +02:00
Willy Tarreau	11051ed9c7	OPTIM: channel: speed up co_getline()'s search of the end of line Previously, co_getline() was essentially used for occasional parsing in peers's banner or Lua, so it could afford to read one character at a time. However now it's also used on the TCP log path, where it can consume up to 40% CPU as mentioned in GH issue #2731. Let's speed it up by using memchr() to look for the LF, and copying the data at once using memcpy(). Previously it would take 2.44s to consume 1 GB of log on a single thread of a Core i7-8650U, now it takes 1.56s (-36%).	2024-09-30 11:36:39 +02:00
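The memchr()/memcpy() approach can be sketched on a plain contiguous buffer. This is a simplified illustration, not the real co_getline(): copy_line() is a hypothetical helper, and it ignores channel state and any buffer wrapping that the actual function has to handle.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not HAProxy's co_getline()): looks for the first LF
 * within the first min(in_len, out_size) bytes of <in>, then copies the
 * line, LF included, into <out> in a single memcpy(). Returns the number of
 * bytes copied, or 0 when no LF was found in that range.
 */
size_t copy_line(const char *in, size_t in_len, char *out, size_t out_size)
{
	size_t max = in_len < out_size ? in_len : out_size;
	const char *lf = memchr(in, '\n', max);
	size_t len;

	if (!lf)
		return 0;

	len = (size_t)(lf - in) + 1; /* include the LF itself */
	memcpy(out, in, len);
	return len;
}
```

The gain over a byte-by-byte loop comes from letting the C library scan and copy in bulk, which is typically vectorized on modern platforms.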
Willy Tarreau	1d403caf8a	MINOR: server: make srv_shutdown_sessions() call pendconn_redistribute() When shutting down server sessions, the queue was not considered, which is a problem if some element reached the queue at the moment the server was going down, because there will be no more requests to kick them out of it. Let's always make sure we scan the queue to kick these streams out of it and that they can possibly find a more suitable server. This may make a difference in the time it takes to shut down a server on the CLI when lots of servers are in the queue. It might be interesting to backport this to 3.0 but probably not much further.	2024-09-27 19:01:38 +02:00
Willy Tarreau	1385e33eb0	BUG/MINOR: queue: make sure that maintenance redispatches server queue Turning a server to maintenance currently doesn't redispatch the server queue unless there's an explicit "option redispatch" and no "option persist", while the former has never really been the purpose of this test. Better refine this so that forced maintenance also causes the queue to be flushed, and possibly redispatched unless the proxy has option persist. This way now when turning a server to maintenance, the queue is immediately flushed and streams can decide what to do. This can be backported, though there's no need to go far since it was never directly reported and only noticed as part of debugging some rare "shutdown sessions" strangeness, which it might participate to.	2024-09-27 18:54:07 +02:00
Willy Tarreau	b8e3b0a18d	BUG/MEDIUM: stream: make stream_shutdown() async-safe The solution found in commit `b500e84e24` ("BUG/MINOR: server: shut down streams under thread isolation") to deal with inter-thread stream shutdown doesn't work fine because there exists code paths involving a server lock which can then deadlock on thread_isolate(). A better solution then consists in deferring the shutdown to the stream itself and just wake it up for that. The only thing is that TASK_WOKEN_OTHER is a bit too generic and we need to pass at least 2 types of events (SF_ERR_DOWN and SF_ERR_KILLED), so we're now leveraging the new TASK_F_UEVT1 and _UEVT2 flags on the task's state to convey these info. The caller only needs to wake the task up with these flags set, and the stream handler will then finish the job locally using stream_shutdown_self(). This needs to be carefully backported to all branches affected by the dequeuing issue and containing any of the `5541d4995d` ("BUG/MEDIUM: queue: deal with a rare TOCTOU in assign_server_and_queue()"), and/or `b11495652e` ("BUG/MEDIUM: queue: implement a flag to check for the dequeuing").	2024-09-27 12:15:41 +02:00
Willy Tarreau	d1c398b786	Revert "BUG/MINOR: server: shut down streams under thread isolation" This reverts commit `b500e84e24`. Thread isolation does not work well for this, there exists code paths which already hold the server's lock and result in a deadlock. Let's revert that and address it better without isolation.	2024-09-27 10:17:31 +02:00
Aurelien DARRAGON	e3eb6a9035	MEDIUM: log: consider log-steps proxy setting for existing log origins During tcp/http transaction processing, haproxy may produce logs at different steps during the processing (accept, connect, request, response, close). But the behavior is hardly configurable because haproxy will only emit a single log per transaction, and by default it will try to produce the log once all log aliases or fetches used in the logformat could be satisfied, which means the log is often emitted during connection teardown, unless "option logasap" is used. We were often asked to have a way to emit multiple logs for a single transaction, for instance to emit a log during accept, then request, response and close, see GH #401 for more context. Thanks to the "log-steps" keyword introduced by commit "MINOR: log: introduce "log-steps" proxy keyword", it is now possible to explicitly configure when logs should be generated by haproxy when processing a transaction. This commit adds the required checks so that the log-steps proxy option is properly considered for existing logs generated by haproxy. If "log-steps" is not specified on the proxy, the old behavior is preserved. Note: a slight cpu overhead should only be visible when the "log-steps" keyword is used, due to the implementation relying on eb32 lookup instead of a basic bitfield check as described in "MINOR: proxy: add log_steps struct member". However, the default behavior shouldn't be affected. When combining log-steps with log-profiles, the user has the ability to explicitly control how and when haproxy should generate logs during request handling.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	4189eb7aca	MINOR: log: add log_orig_proxy() helper function Function may be used on proxy where log-steps are used to check if a given log origin should be handled or not.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	c043d5d372	MINOR: log: introduce "log-steps" proxy keyword For now it is only available for proxies with frontend capability because log-steps are only evaluated under sess_log() or strm_log() which essentially focus on the frontend side when it comes to log settings so it's better to keep it this way for better consistency, at least for now. For now the setting does nothing (it is not considered during runtime), it will be implemented and documented in upcoming commits.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	9341792baf	MINOR: proxy: add log_steps struct member add proxy->conf.log_steps eb32 root tree which will be used to store the log origin identifiers that should result in haproxy emitting a log as configured by the user using upcoming "log-steps" proxy keyword. It was chosen to use eb32 tree instead of simple bitfield because despite the slight overhead it is more future-proof given that we already implemented the prerequisites for seamless custom log origins registration that will also be usable from "log-steps" proxy keyword.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	b882402a29	MINOR: log: support extra log origins for '%OG' alias Following previous commits, let's improve log_orig_to_str() so that extra log origins (registered through log_orig_register()) can be translated to string from origin ID. For that, it is required to add eb_32 tree node to log_origin struct in order to enable quick integer lookup during runtime. Slow name lookup using the list is acceptable for config parsing, but it is not the case during runtime when log_orig_to_str() is expected to be used. Also, to prevent duplicated info, get rid of ->id field and use ->tree.key instead	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	f8bb9d5c57	MINOR: log: explicitly handle extra log origins as error when relevant Thanks to previous commit, we can now check for log_orig optional flags in functions taking struct log_orig as parameter. Let's take this opportunity to add the LOG_ORIG_FL_ERROR flag and check this flag at a few places to handle the log message differently because if the flag is set then the caller expects the log to be handled as an error explicitly. e.g.: in _process_send_log_override(), if the flag is set, use the error log format instead of the dedicated one.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	3c15ee05e9	MINOR: log: introduce log_orig flags Rename 'enum log_orig' to 'enum log_orig_id', since this enum specifically contains the log origin ids. Add 'struct log_orig' which wraps 'enum log_orig' with optional flags (no flags defined for now). Add log_orig() helper func that takes id and flags as parameter and returns log_orig struct initialized with input arguments. Update functions taking log origin as parameter so they explicitly take log orig id or log orig wrapper as argument depending on the level of context expected by the function.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	6567e37680	MINOR: log: handle extra log origins in _process_send_log_override() Thanks to the previous commit, it is now possible to register additional log origins that may be used from log-profile section as 'on' steps. As such, let's make _process_send_log_override() function aware of them by trying to lookup in the tree of extra logging steps in the default switch-case catchall. If the log origin id matches with the id of the extra logging step, we use the associated log format instead of the "any" log format.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	818475c5cc	MINOR: log: introduce extra log profile steps add a way to register additional log origins using log_origin_register() that may be used as log profile steps from log profile sections. For now this does nothing as no extra origins are registered and extra log origins are not yet considered for runtime logging paths. When specifying an extra logging step for on <step> under log-profile section, the logging step is stored within a binary tree for efficient lookup during runtime. No performance impact should be expected if extra log origins are not being used, and slight performance impact if extra log origins are used. Don't forget to update the documentation when new log origins are added (both %OG log alias and on <step> log-profile keyword are concerned.	2024-09-26 16:53:07 +02:00
Aurelien DARRAGON	facf259d88	MINOR: log: fix indent in strm_log() `8f34320e15` ("MINOR: log: provide log origin in logformat expressions using '%OG'") caused wrong indent in strm_log()	2024-09-26 16:53:07 +02:00
Oliver Dala	a889413f5e	BUG/MEDIUM: cli: Deadlock when setting frontend maxconn The proxy lock state isn't passed down to relax_listener through dequeue_proxy_listeners, which causes a deadlock in relax_listener when it tries to get that lock. Backporting: Older versions didn't have relax_listener and directly called resume_listener in dequeue_proxy_listeners. lpx should just be passed directly to resume_listener then. The bug was introduced in commit `001328873c` [cf: This patch should fix the issue #2726. It must be backported as far as 2.4]	2024-09-25 17:12:11 +02:00
Christopher Faulet	14a413033c	BUG/MEDIUM: cli: Be sure to catch immediate client abort A client abort while nothing was sent is properly handled except when this happens immediately after the connection was accepted. The read0 event is caught before the CLI applet is created. In that case, the shutdown is not handled and the applet is no longer woken up, so the stream remains blocked and no timeout is armed. The bug was due to the fact that when the applet I/O handler was called for the first time, the applet context was initialized and nothing more was performed. A shutdown, if any, would be handled on the next call. In that case, it was too late. Now, after the init step, we loop to eval the first command. There is no command here but the shutdown will be tested. This patch should fix the issue #2727. It must be backported to 3.0.	2024-09-24 18:01:38 +02:00
Aurelien DARRAGON	d622f9d5b6	MEDIUM: mailers: warn about deprecated legacy mailers As mentioned in 2.8 announce on the mailing list [1] and on the wiki [2], use of legacy mailers is now deprecated and will not be supported anymore starting with version 3.3. Use of Lua script (AKA Lua mailers) is now encouraged (and fully supported since 2.8) for this purpose, as it offers more flexibility (e.g: alerts can be customized) and is more future-proof. Configurations relying on legacy mailers will now raise a warning. Users willing to keep their existing mailers config in a working state should simply add the following line to their global section: # mailers.lua file as provided in the git repository # adjust path as needed lua-load examples/lua/mailers.lua [1]: https://www.mail-archive.com/haproxy@formilux.org/msg43600.html [2]: https://github.com/haproxy/wiki/wiki/Breaking-changes	2024-09-23 20:16:27 +02:00
Willy Tarreau	fdf38ed7fc	BUG/MINOR: proxy: also make the cli and resolvers use the global name As detected by ASAN on the CI, two places still using strdup() on the proxy names were left by commit `b325453c3` ("MINOR: proxy: use the global file names for conf->file"). No backport is needed.	2024-09-21 20:08:06 +02:00
Willy Tarreau	b500e84e24	BUG/MINOR: server: shut down streams under thread isolation Since the beginning of thread support, the shutdown of streams attached to a server was run under the server's lock, but that's not sufficient. It indeed turns out that shutting down streams (either from the CLI using "shutdown sessions server XXX" or due to "on-error shutdown-sessions") iterates over all the streams to shut them down, but stream_shutdown() has no way to protect its actions against concurrent actions from the stream itself on another thread, and streams offer no such provisions anyway. The impact is some rare but possible crashes when shutting down streams from the CLI in competition with high server traffic. The probability is low enough to mark it minor, though it was observed in the field. At least since 2.4 the streams are arranged in per-thread lists, so it likely would be possible using the event subsystem to delegate these events to dedicated per-thread tasks which would address the problem. But server streams don't get killed often enough to justify such extra complexity, so better just run the loop under thread isolation. It also shows that the internal API could probably be improved to support a lighter thread exclusion instead of full isolation: various places want to only exclude one thread and here it could work. But again there's no point doing this for now. This patch should be backported to all stable branches. It's important to carefully check that this srv_shutdown_streams() function is never called itself under isolation in older versions (though at first glance it looks OK).	2024-09-21 19:35:35 +02:00
Willy Tarreau	e77c73316a	MEDIUM: cfgparse: warn about deprecated use of duplicate server names As discussed below, there are too many problems and limitations caused by still supporting duplicate server names. That's already particularly complicated and dissuasive to use since it requires these servers to have explicit IDs to be accepted. Let's now warn on any duplicate, even with explicit IDs, and remind that this will become forbidden in 3.3. Link: https://www.mail-archive.com/haproxy@formilux.org/msg45185.html	2024-09-20 17:15:11 +02:00
Willy Tarreau	029d75df1e	OPTIM: cfgparse: speed up duplicate server detection Surprisingly, the duplicate server name detection has never made use of the names tree, so lookups were still in O(N^2). It took 1 second to validate 50k servers spread into 25 backends at 2k per backend. By simply using the tree (and since the current server already is in the tree), we just have to walk using ebpt_prev_dup to visit previous servers with the same name. We can then detect which ones conflict without having an ID set and error. The config check time is now 1/4 of the previous one for 2k servers per backend, and more importantly it will make it simpler to check for any duplicates later.	2024-09-20 17:14:50 +02:00
Willy Tarreau	ccd1ecba1d	MEDIUM: cfgparse: drop duplicate named defaults sections after use It has never been permitted to explicitly reference named defaults sections for which there are duplicate names. This means that when a duplicate defaults section is found, there's no point in keeping it since it will never be used for lookups, so it can be dropped. However, some such defaults sections might have some rules in them that are implicitly referenced by proxies placed after them. In this case they cannot be removed. What is done here is that upon each new named section creation, if another one is found with the same name, its config location is stored into the new proxy's {prev_file,prev_line} pair, and the old section is either destroyed if its refcount is null, or just unindexed. The dup check when creating a new proxy now consists in checking the prev_line instead of performing a dup lookup on the defaults section. This will guarantee that we can't find duplicate defaults sections in their tree anymore, while still keeping track of what's allocated and releasing everything upon exit. Beyond the consistency gain, there are nice savings for large configs involving many defaults sections: a test with 300k sections saved about 1.9 GB of RAM, and started 25% faster likely thanks to spending less time allocating memory.	2024-09-20 16:35:32 +02:00
Willy Tarreau	c8b813771d	MINOR: proxy: add a list of orphaned defaults sections We'll soon delete unreferenced and duplicated named defaults sections from the list of proxies. The problem with this is that this list (in fact a name-based tree) is used to release all of them at the end. Let's add a list of orphaned defaults sections, typically those containing "http-check send" statements or various other rules, and that are implicitly inherited by a proxy hence have a non-zero refcount while also having a name. This now makes it possible to remove them from the name index while still keeping their memory around for the lifetime of the process, and cleaning it at the end.	2024-09-20 15:59:04 +02:00
Willy Tarreau	cb4c236fac	BUG/MINOR: cfgparse: detect another uncaught case of duplicate defaults The following sequence was not properly caught: defaults def backend back from def defaults def But this one was: defaults def defaults def backend back from def Let's check when defaults are declared that they're not already referenced. Better not backport this. While it will catch broken configs (possibly some with backends pasted after the wrong defaults), these might still work by accident. It may be reported as a diag warning though.	2024-09-20 15:58:10 +02:00
Willy Tarreau	5b221d1e41	CLEANUP: cfgparse: factor proxy vs log-forward collisions This simplifies the check added in `1a38684fbc` ("MEDIUM: cfgparse: detect collisions between defaults and log-forward"), by factoring it with the other existing one. The tests are ugly in that code because a first block tests pure proxies, a second one proxies or defaults and inside that one we have special cases for defaults. Let's just move the tests to the "any proxy type" block.	2024-09-20 14:13:14 +02:00
Willy Tarreau	b325453c36	MINOR: proxy: use the global file names for conf->file Proxy file names are assigned a bit everywhere (resolvers, peers, cli, logs, proxy). All these elements were enumerated and now use copy_file_name(). The only ha_free() call was turned to drop_file_name(). As a bonus side effect, a 300k backend config saved 14 MB of RAM.	2024-09-19 15:38:19 +02:00
Willy Tarreau	9ab21a3c2d	CLEANUP: stick-table: make the file location point to a global file name The file name used to point to the calling function's stack for stick tables, which was OK during parsing but remained dangling afterwards. At least it was already marked const so as not to accidentally free it. Let's make it point to a file_name_node now.	2024-09-19 15:38:19 +02:00
Willy Tarreau	d6c060c5ae	MINOR: tools: add minimal file name management In proxies, stick-tables, servers, etc... at plenty of places we store a file name and a line number. Some file names are the result of strdup() (e.g. in proxies), others not (e.g. stick-tables) and leave dangling pointers at the end of parsing. The risk of double-free is not null either. In order to stop this, let's first add a simple tool that allows to register short strings inside a global list, these strings happening to be file names. The strings are either duplicated and stored upon failure to find them, or just added to this storage. Since file names are not expected to disappear before the end of the process, for now we don't even implement refcounting, and we free them all at the end. There's already a drop_file_name() function to reset the pointer like ha_free() used to do, and even if not strictly needed it's a good habit to get used to doing it. The strings are returned as const so that they're stored as-is in structs, and that nasty free() calls are easily caught. The pointer points to the char[] storage inside the node itself. This way later if we want to implement refcounting, it will be trivial to just look up a string and change its associated node's refcount. If needed, comparisons can also be made on pointers. For now they're not used yet and are released on deinit().	2024-09-19 15:36:58 +02:00
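The registration scheme described in d6c060c5ae can be illustrated with a minimal sketch. This is not the HAProxy implementation: the names `intern_name()`, `drop_name()`, `free_all_names()` and `struct name_node` are hypothetical, and a plain singly-linked list is assumed for the global storage. It only shows the idea of returning a const pointer into the node's own char[] storage, without refcounting, and releasing everything at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node: the string copy lives inside the node itself. */
struct name_node {
    struct name_node *next;
    char name[];                 /* flexible array member holding the copy */
};

static struct name_node *name_list;   /* global list of registered names */

/* Returns a stable const pointer for <name>, registering a copy on first
 * use. Returns NULL on allocation failure.
 */
const char *intern_name(const char *name)
{
    struct name_node *n;
    size_t len;

    assert(name != NULL);

    /* already known: return the existing storage */
    for (n = name_list; n; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n->name;

    len = strlen(name);
    n = malloc(sizeof(*n) + len + 1);
    if (!n)
        return NULL;
    memcpy(n->name, name, len + 1);
    n->next = name_list;
    name_list = n;
    return n->name;
}

/* Only resets the caller's pointer; the storage stays registered. */
void drop_name(const char **name)
{
    *name = NULL;
}

/* Releases all registered names, e.g. when the process exits. */
void free_all_names(void)
{
    while (name_list) {
        struct name_node *n = name_list;

        name_list = n->next;
        free(n);
    }
}
```

Because equal strings map to the same node, two users of the same file name end up with the same pointer, which is what makes pointer comparisons possible in this sketch.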
Willy Tarreau	1a38684fbc	MEDIUM: cfgparse: detect collisions between defaults and log-forward Sadly, when log-forward were introduced they took great care of avoiding collision with regular proxies but defaults were missed (they need to be explicitly checked for). So now we have to move them to a warning for 3.1 instead of rejecting them.	2024-09-18 18:08:15 +02:00
Willy Tarreau	d8f4b07e40	MEDIUM: cfgparse: warn about colliding names between defaults and proxies In order to complete the checks added in `303a66573d` ("MEDIUM: cfgparse: warn about proxies having the same names"), we also need to warn about regular proxies having the same name as defaults sections as well as defaults sections having the same name as proxies, since defaults sections are inherently proxies, albeit stored in a separate list for now.	2024-09-18 18:08:06 +02:00
Amaury Denoyelle	fcd6d29acf	BUG/MINOR: mux-quic: report glitches to session Glitch counter was implemented for QUIC/HTTP3. The counter is stored in the QCC MUX connection instance. However, this is never reported at the session level which is necessary if glitch counter is tracked via a stick-table. To fix this, use session_add_glitch_ctr() in various QUIC MUX functions which may increment glitch counter. This should be backported up to 3.0.	2024-09-18 16:11:03 +02:00
Willy Tarreau	303a66573d	MEDIUM: cfgparse: warn about proxies having the same names As discussed below, there are too many problems and uncaught bugs in the parser when trying to support proxies having similar names but different types. There's specific code to detect the presence of stick-tables in a pair of such proxies for example. It's even possible that certain combinations of backend+listen that were not previously detected have some nasty side effects. According to the proposal in the discussion, this is now deprecated in 3.1 (thus we emit a warning) and will become forbidden in 3.3. A backport might be useful, but reporting a diag_warning only, not a classical warning, so as not to break setups running in zero-warning mode. It was verified with a config involving all 9 combinations of (frontend,backend,listen) followed by one of the same three that all collisions are now properly blocked and that only back+front are kept and emit a warning. Link: https://www.mail-archive.com/haproxy@formilux.org/msg45185.html	2024-09-17 19:55:00 +02:00
Willy Tarreau	c70906c8a1	BUG/MINOR: cfgparse: detect incorrect overlap of same backend names As reported below, it's possible to declare a backend then a proxy with the same name, because for the proxy we check a frontend capability (the first one to be tested): backend b listen b bind :8888 Let's check the two capabilities in this case and not just the frontend. Better not backport this, as there's a risk of breakage of existing setups that work by accident. It might make sense to report them as diag warnings though. Link: https://www.mail-archive.com/haproxy@formilux.org/msg45185.html	2024-09-17 19:55:00 +02:00
Aurelien DARRAGON	17e52c922b	BUG/MINOR: cfgparse-listen: fix option httpslog override warning message "option httpslog" override warning message used to be reported as "option httplog", probably as a result of copy paste without adjusting the context. Let's fix that to prevent emitting confusing warning messages. The issue exists since `98b930d` ("MINOR: ssl: Define a default https log format"), thus it should be backported up to 2.6	2024-09-17 15:40:02 +02:00
Aurelien DARRAGON	bc4bf5779f	BUG/MINOR: fix missing "'option httpslog' overrides previous 'option tcplog clf'..." detection Same as b85edd44db0 ("BUG/MINOR: fix missing "log-format overrides previous 'option tcplog clf'..." detection") but for "option httpslog" keyword. No backport needed unless `fd48b28` ("MINOR: Implements new log format of option tcplog clf") is.	2024-09-17 15:40:02 +02:00
Aurelien DARRAGON	607b9adc9b	BUG/MINOR: fix missing "log-format overrides previous 'option tcplog clf'..." detection In commit `fd48b28315` ("MINOR: Implements new log format of option tcplog clf") "option tcplog clf" detection was correctly added for "option tcplog" and "option httplog", but "log-format" case was overlooked. Thus, this config would report erroneous warning message: defaults option tcplog clf log-format "ok" [WARNING] (727893) : config : parsing [test.conf:3]: 'log-format' overrides previous 'log-format' in 'defaults' section. No backport needed unless `fd48b28315` is.	2024-09-17 14:41:58 +02:00
Willy Tarreau	499e057644	MEDIUM: clock: don't compute before_poll when using monotonic clock There's no point keeping both clocks up to date; if the monotonic clock is ticking, let's just refrain from updating the wall clock one before polling since we won't use it. We still do it after polling however as we need a wall clock time to communicate with outside. This saves one gettimeofday() call per loop and two timeval comparisons.	2024-09-17 09:08:10 +02:00
Willy Tarreau	24496803d1	MEDIUM: clock: use the monotonic clock for idle time calculation By just keeping a copy of the last known value before entering polling, we can apply the same algorithm as we're currently using, except that it's now applied to the monotonic clock instead of the wall clock, when it's detected that it's ticking. This improves idle time calculation accuracy by making it independent on the wall clock.	2024-09-17 09:08:10 +02:00
Willy Tarreau	4150851ce5	MEDIUM: clock: opportunistically use CLOCK_MONOTONIC for the internal time We already collect CLOCK_MONOTONIC when it's available when leaving the poller, but it's only used for profiling. The functions that return it set the value to zero when it's not available, so we can use that to detect if it works or not. The idea is that if the monotonic time is non-zero, it is ticking and usable, then we use it for now_ns, otherwise we use the corrected date. We continue to apply the now_offset to the returned value because it helps forcing an early time wrap-around. Proceeding like this presents two benefits: - on systems supporting this, the time is much more robust against time changes - when it works, it saves us from having to go through the time correction code, which is usually cheap, but better avoided anyway. Note that idle time calculation continues to rely on the wall-clock time.	2024-09-17 09:08:10 +02:00
Willy Tarreau	f793845f4a	MEDIUM: clock: collect the monotonic time in clock_local_update_date() Now we collect this clock in clock_local_update_date(), the closest to the poller, which is also used when busy-polling, and the value is set into the thread's curr_mono_time which did not exist before. Later, clock_leaving_poll() just sets the prev_mono_time value from the curr_ one instead of retrieving the time at this specific point. It also means that the monotonic time will now also cover the time needed to update the global time, which should be negligible. Note that we don't collect the CPU time in the clock_local_update_date() function even though it's tempting, because when doing busy-polling, it would be collected on each round while being useless. Doing so will make sure that the local time always knows the monotonic time when it is available.	2024-09-17 09:08:10 +02:00
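The selection rule described in 4150851ce5 (use the monotonic time when it is non-zero, otherwise fall back to the corrected date, and apply the offset in both cases) can be sketched as follows. The function names `mono_time_ns()`, `wall_time_ns()` and `pick_now_ns()` are hypothetical and not HAProxy's; the wall clock is read here with `clock_gettime(CLOCK_REALTIME)` as a simplifying assumption, and no time-jump correction is attempted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Returns CLOCK_MONOTONIC in nanoseconds, or 0 when the clock is not
 * available or the call fails, so that 0 means "not usable".
 */
uint64_t mono_time_ns(void)
{
#if defined(CLOCK_MONOTONIC)
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0)
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
    return 0;
}

/* Returns the wall-clock time in nanoseconds (uncorrected in this sketch). */
uint64_t wall_time_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_REALTIME, &ts);

    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Picks the internal time: the monotonic clock when it is ticking
 * (non-zero), otherwise the wall clock. The offset is applied either way.
 */
uint64_t pick_now_ns(uint64_t mono, uint64_t wall, uint64_t offset)
{
    return (mono ? mono : wall) + offset;
}
```

A caller would typically use `pick_now_ns(mono_time_ns(), wall_time_ns(), offset)`; keeping the choice in a pure function makes the fallback easy to check in isolation.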
Willy Tarreau	42e699903e	MINOR: clock: test all clock_gettime() return values Till now we were only using clock_gettime() for profiling, so if it would fail it was no big deal. We intend to use it as the main clock as well now, so we need to more reliably detect its absence or failure and gracefully fall back to other options. Without the test we would return anything present in the stack, which is neither clean nor easy to detect.	2024-09-17 09:08:10 +02:00
Christopher Faulet	afc50f2445	BUG/MEDIUM: cache/stats: Wait to have the request before sending the response It seems obvious. On a classical workflow, the request headers analysis is finished when these applets are woken up for the first time. So they don't take care to really have the request to start to process it and to send the response. But with a filter, it is possible to stop the request analysis after the applet creation. If this happens for the stats applet, this leads to a crash because we retrieve the request start-line without checking if it is available. For the cache applet, the response is just immediately sent. And here it is a problem if the compression is enabled. In that case too, this may lead to a crash because the compression may be enabled but not initialized. For a true server, there is no issue because the connection cannot be established. The server is chosen only after the request analysis. The issue with applets is that once created, an applet is quickly switched to the established state. So it is probably a point that must be carefully reviewed and probably reworked. In the meantime, as a fix, in the cache and the stats applet, we just take care to have the request before sending the response. This will do the trick. The patch must be backported as far as 2.6. On 2.6, the patch must be adapted.	2024-09-16 22:55:40 +02:00
Christopher Faulet	4de6632693	MINOR: proxy: Rename accept-invalid-http-* options With these options, it is possible to accept some invalid messages that may be considered as unsafe and may result in vulnerabilities. The naming is not explicit enough on this point. These options must really be considered as dangerous and only used as a temporary workaround. Unfortunately, when used, it is probably because there are some legacy and unsupported applications in place. Nevermind. The documentation warns about the use of these options. Now the name of the options itself is a warning. So now, "accept-invalid-http-request" and "accept-invalid-http-response" options are deprecated and replaced by "accept-unsafe-violations-in-http-request" and "accept-unsafe-violations-in-http-response" options.	2024-09-16 22:55:25 +02:00
Aurelien DARRAGON	1e0920f855	BUG/MINOR: peers: local entries updates may not be advertised after resync Since commit `864ac3117` ("OPTIM: stick-tables: check the stksess without taking the read lock"), when entries for a local table are learned from another peer upon resynchro, and this is the only peer haproxy speaks to, local updates on such entries are not advertised to the peer anymore, until they eventually expire and can be recreated upon local updates. This is due to the fact that ts->seen is always set to 0 when creating a new entry, and also when touch_remote is performed on the entry. Indeed, while `864ac3117` attempts to avoid useless updates, it didn't consider entries learned from a remote peer. Such entries are exclusively learned in peer_treat_updatemsg(): once the entry is created (or updated) with new data, touch_remote is used to commit the change. However, unlike touch_local, entries committed using touch_remote will not be advertised to the peer from which the entry was just learned (otherwise we would enter a looping situation). Due to the above patch, once an entry is learned from the (unique) remote peer, 'seen' will be stuck to 0 so it will never be advertised for its whole lifetime. Instead, when entries are learned from a peer, we should consider that the peer that taught us the entry has seen it. To do this, let's set seen=1 in peer_treat_updatemsg() after calling touch_remote(). This way, if we happen to perform updates on this entry, it will be properly advertised to relevant peers. This patch should not affect the performance gain documented in `864ac3117` given that the test scenario didn't involve entries learned by remote peers, but solely locally created entries advertised to remote peers upon updates. This should be backported in 3.0 with `864ac3117`.	2024-09-16 14:06:39 +02:00
Willy Tarreau	5d350d1e50	OPTIM: vars: use multiple name heads in the vars struct Given that the original list-based version was using a list head as the root of the variables, while the tree is using a single pointer, it made sense to reuse that space to place multiple roots, indexed on the lower bits of the name hash. Two roots slightly increase the performance level, but the best gain is obtained with 4 roots. The performance is now always above that of the list, even with small counts, and with 100 vars, it's 21% higher than before, or 67% higher than with the list. We keep the same lock (it could have made sense to use one lock per head), because most of the variables in large configs are attached to a stream or a session, hence are not shared between threads. Thus there's no point in sharding the pointer.	2024-09-15 23:51:51 +02:00
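The root selection described in 5d350d1e50 (several roots, indexed on the lower bits of the name hash) can be sketched as below. This is only an illustration: `NAME_ROOTS`, `struct var_node`, `struct var_heads` and the three functions are hypothetical names, and each root is a plain chained list here rather than the cebtree used by the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAME_ROOTS 4   /* number of roots; assumed to be a power of two */

static_assert((NAME_ROOTS & (NAME_ROOTS - 1)) == 0,
              "NAME_ROOTS must be a power of two");

/* Hypothetical variable node, identified by the hash of its name. */
struct var_node {
    uint64_t name_hash;
    struct var_node *next;   /* stand-in for the per-root tree linkage */
};

/* Several roots occupy the space a single head used to take. */
struct var_heads {
    struct var_node *root[NAME_ROOTS];
};

/* The lower bits of the name hash select the root. */
static inline unsigned int root_index(uint64_t name_hash)
{
    return (unsigned int)(name_hash & (NAME_ROOTS - 1));
}

/* Attaches <node> to the root designated by its hash. */
void var_insert(struct var_heads *heads, struct var_node *node)
{
    unsigned int idx = root_index(node->name_hash);

    node->next = heads->root[idx];
    heads->root[idx] = node;
}

/* Looks up <name_hash>, visiting only the root it maps to. */
struct var_node *var_lookup(const struct var_heads *heads, uint64_t name_hash)
{
    struct var_node *n;

    for (n = heads->root[root_index(name_hash)]; n; n = n->next)
        if (n->name_hash == name_hash)
            return n;
    return NULL;
}
```

With 4 roots, a lookup only walks about a quarter of the entries on average, which is the effect the commit relies on, independently of the structure used under each root.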
Willy Tarreau	47ec7c681e	OPTIM: vars: use a cebtree instead of a list for variable names Configs involving many variables can start to eat a lot of CPU in name lookups. The reason is that the names themselves are dynamic in that they are relative to dynamic objects (sessions, streams, etc), so there's no fixed index for example. The current implementation relies on a standard linked list, and in order to speed up lookups and avoid comparing strings, only a 64-bit hash of the variable's name is stored and compared everywhere. But with just 100 variables and 1000 accesses in a config, it's clearly visible that variable name lookup can reach 56% CPU with a config generated this way: for i in {0..100}; do printf "\thttp-request set-var(txn.var%04d) int(%d)" $i $i; for j in {1..10}; do [ $i -lt $j ] || printf ",add(txn.var%04d)" $((i-j)); done; echo; done The performance on a 4-core skylake 4.4 GHz reaches 85k RPS with a perf profile showing: Samples: 170K of event 'cycles', Event count (approx.): 142378815419 Overhead Shared Object Symbol 56.39% haproxy [.] var_to_smp 6.65% haproxy [.] var_set.part.0 5.76% haproxy [.] sample_process_cnv 3.23% haproxy [.] sample_conv_var2smp 2.88% haproxy [.] sample_conv_arith_add 2.33% haproxy [.] __pool_alloc 2.19% haproxy [.] action_store 2.13% haproxy [.] vars_get_by_desc 1.87% haproxy [.] smp_dup [above, var_to_smp() calls var_get() under the read lock]. By switching to a binary tree, the cost is significantly lower, the performance reaches 117k RPS (+37%) with this profile: Samples: 170K of event 'cycles', Event count (approx.): 142323631229 Overhead Shared Object Symbol 40.22% haproxy [.] cebu64_lookup 7.12% haproxy [.] sample_process_cnv 6.15% haproxy [.] var_to_smp 4.75% haproxy [.] cebu64_insert 3.79% haproxy [.] sample_conv_var2smp 3.40% haproxy [.] cebu64_delete 3.10% haproxy [.] sample_conv_arith_add 2.36% haproxy [.] action_store 2.32% haproxy [.] __pool_alloc 2.08% haproxy [.] vars_get_by_desc 1.96% haproxy [.] smp_dup 1.75% haproxy [.] var_set.part.0 1.74% haproxy [.] cebu64_first 1.07% [kernel] [k] aq_hw_read_reg 1.03% haproxy [.] pool_put_to_cache 1.00% haproxy [.] sample_process The performance lowers a bit earlier than with the list however. What can be seen is that the performance maintains a plateau till 25 vars, starts degrading a little bit for the tree while it remains stable till 28 vars for the list. Then both cross at 42 vars and the list continues to degrade doing a hyperbola while the tree resists better. The biggest loss is at around 32 variables where the list stays 10% higher. Regardless, given the extremely narrow band where the list is better, it looks relevant to switch to this in order to preserve the almost linear performance of large setups. For example at 1000 variables and 10k lookups, the tree is 18 times faster than the list. In addition this reduces the size of the struct vars by 8 bytes since there's a single pointer, though it could make sense to re-invest them into a secondary head for example.	2024-09-15 23:49:01 +02:00
Willy Tarreau	a0205f9de4	IMPORT: import cebtree (compact elastic binary trees) This is an import of the compact elastic binary trees at commit a9cd84a ("OPTIM: descent: better prefetch less and for writes when deleting") These will be used to replace certain lists (and possibly certain tree nodes as well). They're as fast (or even faster) than ebtrees for lookups, as fast for insertion and slower for deletion, and a node only uses 2 pointers (like a list). The only changes were cebtree.h where common/tools.h was replaced with ebtree.h which we already have and already provides the needed functions and macros, and the addition of a wrapper cebtree-prv.h in src/ to redirect to import/cebtree-prv.h.	2024-09-15 23:44:59 +02:00
Willy Tarreau	6e92988e20	MINOR: vars: remove the emptiness tests in callers before pruning All callers of vars_prune_* currently check the list for emptiness. Let's leave that to vars_prune() itself, it will ease some changes in the code. Thanks to the previous inlining of the vars_prune() function, there's no performance loss, and even a very tiny 0.1% gain.	2024-09-15 23:44:16 +02:00
Willy Tarreau	2c1a9c3a43	OPTIM: vars: inline vars_prune() to avoid many calls Many configs don't have variables and call it for no reason, and even configs with variables don't necessarily have some in all scopes.	2024-09-15 23:42:09 +02:00
Willy Tarreau	aad6b771dd	OPTIM: vars: remove the unneeded lock in vars_prune_* vars_prune() and vars_prune_all() take the variable lock while purging all variables from a head. However this is not needed: - proc scope variables are only purged during deinit, hence no lock is needed ; - all other scopes are attached to entities bound to a single thread so no lock is needed either. Removing the lock saves about 0.5% CPU on variables-intensive setups, but above all simplify the code, so let's do it.	2024-09-15 23:05:50 +02:00
Willy Tarreau	51ade2f1db	OPTIM: sample: don't check casts for samples of same type Originally when converters were created, they were mostly for casting types. Nowadays we have many arithmetic converters to perform operations on integers, and a number of converters operating on strings. Both of these categories most often do not need any cast since the input and output types are the same, which is visible as the cast function is c_none. However, profiling shows that when heavily using arithmetic converters, it's possible to spend up to ~7% of the time in sample_process_cnv(), a good part of which is only in accessing the sample_casts[] array. Simply avoiding this lookup when input and output types are equal saves about 2% CPU on such setups doing intensive use of converters.	2024-09-15 12:43:56 +02:00
Willy Tarreau	b11495652e	BUG/MEDIUM: queue: implement a flag to check for the dequeuing As unveiled in GH issue #2711, commit `5541d4995d` ("BUG/MEDIUM: queue: deal with a rare TOCTOU in assign_server_and_queue()") does have some side effects in that it can occasionally cause an endless loop. As Christopher analysed it, the problem is that process_srv_queue(), which uses a trylock in order to leave only one thread in charge of the dequeueing process, can lose the lock race against pendconn_add(). If this happens on the last served request, then there's no more thread to deal with the dequeuing, and assign_server_and_queue() will loop forever on a condition that was initially expected to be extremely rare (and still is, except that now it can become sticky). Previously what was happening is that such queued requests would just time out and since that was very rare, nobody would notice. The root of the problem really is that trylock. It was added so that only one thread dequeues at a time but it doesn't offer only that guarantee since it also prevents a thread from dequeuing if another one is in the process of queuing. We need a different criterion. What we're doing now is to set a flag "dequeuing" in the server, which indicates that one thread is currently in the process of dequeuing requests. This one is atomically tested, and only if no thread is in this process, then the thread grabs the queue's lock and dequeues. This way it will be serialized with pendconn_add() and no request addition will be missed. It is not certain whether the original race covered by the fix above can still happen with this change, so better keep that fix for now. Thanks to @Yenya (Jan Kasprzak) for the precise and complete report allowing to spot the problem. This patch should be backported wherever the patch above was backported.	2024-09-13 08:35:47 +02:00
Willy Tarreau	adaba6f904	BUG/MINOR: clock: validate that now_offset still applies to the current date We want to make sure that now_offset is still valid for the current date: another thread could very well have updated it by detecting a backwards jump, and at the very same moment the time got fixed again, that we retrieve and add to the new offset, which results in a larger jump. Normally, for this to happen, it would mean that before_poll was also affected by the jump and was detected before and bounded within 2 seconds, resulting in max 2 seconds perturbations. Here we try to detect this situation and fall back to re-adjusting the offset instead. It's more of a strengthening of what's done by commit `e8b1ad4c2b` ("BUG/MEDIUM: clock: also update the date offset on time jumps") than a pure fix, in that the issue was not directly observed but it's visibly possible by reading the code, so this should be backported along with the patch above. This is related to issue GH #2704. Note that this could be simplified in terms of operations by migrating the deadlines to nanoseconds, but this was the path to least intrusive changes.	2024-09-12 19:09:19 +02:00
Willy Tarreau	af48e4cc6b	BUG/MINOR: clock: make time jump corrections a bit more accurate Since commit `e8b1ad4c2b` ("BUG/MEDIUM: clock: also update the date offset on time jumps") we try to update the now_offset based on the last known valid date. But if it's off compared to the global_now_ns date shared by other threads, we'll get the time off a little bit. When this happens, we should consider the most recent of these dates so that if the global date was already known to be more recent, we should use it and stick to it. This will avoid setting too large an offset that could in turn provoke a larger jump on another thread. This is related to issue GH #2704. This can be backported to other branches having the patch above.	2024-09-12 18:27:03 +02:00
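The shortcut described in 51ade2f1db (skip the cast table lookup when input and output types are equal) can be sketched as follows. The types, the `casts[][]` table and `convert_input()` are hypothetical simplifications, not HAProxy's sample API; only two sample types are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, reduced set of sample types. */
enum smp_type { T_SINT, T_STR, T_TYPES };

struct smp {
    enum smp_type type;
    long sint;
    const char *str;
};

typedef int (*cast_fn)(struct smp *);

/* Casts a string sample to a signed integer; returns 1 on success. */
static int cast_str_to_sint(struct smp *smp)
{
    smp->sint = strtol(smp->str, NULL, 10);
    smp->type = T_SINT;
    return 1;
}

/* Cast table indexed by [input type][output type]; NULL = unsupported.
 * Same-type entries are never consulted thanks to the early return below.
 */
static const cast_fn casts[T_TYPES][T_TYPES] = {
    [T_STR][T_SINT] = cast_str_to_sint,
};

/* Brings <smp> to type <want>; returns 1 on success, 0 on failure. */
int convert_input(struct smp *smp, enum smp_type want)
{
    cast_fn fn;

    assert(want < T_TYPES);

    /* same type: nothing to cast, so the table is not even read */
    if (smp->type == want)
        return 1;

    fn = casts[smp->type][want];
    if (!fn)
        return 0;
    return fn(smp);
}
```

The early return is the whole optimization: for chains of arithmetic converters, every step hits it and the array access disappears from the hot path.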
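The "dequeuing" flag described in b11495652e can be illustrated with C11 atomics. This is a sketch under assumptions: `struct srv_queue` and `try_dequeue()` are hypothetical, the queue is reduced to counters, and the queue lock that the real code takes after winning the flag is only indicated by a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a server's queue state. */
struct srv_queue {
    atomic_bool dequeuing;   /* set while one thread is dequeuing */
    int queued;              /* requests waiting in the queue */
    int served;              /* requests handed to the server */
};

/* Tries to dequeue up to <max> requests. Returns the number dequeued, or
 * 0 if another thread is already in charge of the dequeuing.
 */
int try_dequeue(struct srv_queue *q, int max)
{
    int done = 0;

    assert(q != NULL && max >= 0);

    /* atomically test and set the flag: only one thread passes */
    if (atomic_exchange(&q->dequeuing, true))
        return 0;

    /* The real code would now take the queue's lock (a regular lock, not
     * a trylock), which serializes this loop with request additions.
     */
    while (q->queued > 0 && done < max) {
        q->queued--;
        q->served++;
        done++;
    }

    atomic_store(&q->dequeuing, false);
    return done;
}
```

Unlike a trylock on the queue itself, the flag only excludes other dequeuers: a thread that is merely adding a request cannot cause the dequeuing thread to give up.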
Willy Tarreau	ad98edd00a	BUG/MINOR: polling: fix time reporting when using busy polling Since commit `beb859abce` ("MINOR: polling: add an option to support busy polling") the time and status passed to clock_update_local_date() were incorrect. Indeed, what is considered is the before_poll date related to the configured timeout which does not correspond to what is passed to the poller. That's not correct because before_poll+the syscall's timeout will be crossed by the current date 100 ms after the start of the poller. In practice it didn't happen when the poller was limited to 1s timeout but at one minute it happens all the time. That's particularly visible when running a multi-threaded setup with busy polling and only half of the threads working (bind ... thread even). In this case, the fixup code of clock_update_local_date() is executed for each round of busy polling. The issue was made really visible starting with recent commit `e8b1ad4c2b` ("BUG/MEDIUM: clock: also update the date offset on time jumps") because upon a jump, the shared offset is reset, while it should not be in this specific case. What needs to be done instead is to pass the configured timeout of the poller (and not of the syscall), and always pass "interrupted" set so as to claim we got an event (which is sort of true as it just means the poller returned instantly). In this case we can still detect backwards/forward jumps and will use a correct boundary for the maximum date that covers the whole loop. This can be backported to all versions since the issue was introduced with busy-polling in 1.9-dev8.	2024-09-12 17:47:13 +02:00
Christopher Faulet	1900ca475f	MEDIUM: h1: Accept invalid T-E values with accept-invalid-http-response option Since 2.6, a parsing error is reported when the chunked encoding is found twice. As stated in RFC9112, a sender must not apply the chunked transfer coding more than once to a message body. It means only one chunked coding must be found. In addition, empty values are also rejected because it is forbidden by RFC9110. However, in both cases, it may be useful to relax the rules for trusted legacy servers when accept-invalid-http-response option is set. Especially because it was accepted on 2.4 and older. In addition, T-E header is now sanitized before sending it. It is not a problem because it is a hop-by-hop header. Note that it remains invalid on client side because there is no good reason to relax the parsing on this side. We can argue a server is trusted so we can decide to support some legacy behavior. It is not true on client side and it is highly suspicious if a client is sending an invalid T-E header. Note also we continue to reject unsupported T-E values (so all codings except "chunked"). Because the "TE" header is sanitized and cannot contain other value than "Trailers", there is absolutely no reason for a server to use something else. This patch should fix the issue #2677. It could probably be backported as far as 2.6 if necessary.	2024-09-12 09:21:57 +02:00
Willy Tarreau	2b95c77c08	DOC: server: document what to check for when adding new server keywords It's too easy to overlook the dynamic servers when adding new server keywords, and the fields on each keyword line are totally obscure. This commit adds a title to each column of the table and explains what is expected and what to check for when adding a keyword.	2024-09-10 18:50:12 +02:00
Damien Claisse	ce6a621ae3	MINOR: server: allow init-state for dynamic servers Commit `50322df` introduced the init-state keyword, but it didn't enable it for dynamic servers. However, this feature is perfectly desirable for virtual servers too, where someone would like a server inlived through "set server be1/srv1 state ready" to be put out of maintenance in down state until the next health check succeeds. At reading the code, it seems that it's only a matter of allowing this keyword for dynamic servers, as current code path calls srv_adm_set_ready() which incidentally triggers a call to _srv_update_status_adm().	2024-09-10 18:18:38 +02:00
Willy Tarreau	9f8d9c9e8b	BUG/MINOR: pattern: do not leave a leading comma on "set" error messages Commit `4f2493f355` ("BUG/MINOR: pattern: pat_ref_set: fix UAF reported by coverity") dropped the condition to concatenate error messages and as such introduced a leading comma in front of all of them. Then commit `911f4d93d4` ("BUG/MINOR: pattern: pat_ref_set: return 0 if err was found") changed the behavior to stop at the first error anyway, so all the mechanics dedicated to the concatenation of error messages is no longer needed and we can simply return the error as-is, without inserting any comma. This should be backported where the patches above are backported.	2024-09-10 08:55:29 +02:00
Christopher Faulet	a99d58819f	BUG/MINOR: h1-htx: Don't flag response as bodyless when a tunnel is established This reverts commit `225a4d02e1`. When a 200-OK response is replied to a CONNECT request or a 101-Switching-protocol, a tunnel is considered as established between the client and the server. However, we must not declare the response as bodyless. Of course, there is no payload, but tunneled data are expected. Because of this bug, the zero-copy forwarding is disabled on the server side. This patch must be backported as far as 2.9.	2024-09-09 19:01:47 +02:00
Christopher Faulet	f6e193f1b0	BUG/MAJOR: mux-h1: Wake SC to perform 0-copy forwarding in CLOSING state When the mux is woken up on I/O events, if the zero-copy forwarding is enabled, receives are blocked. In this case, the SC is woken up to be able to perform 0-copy forwarding to the other side. This works well, except for the H1C in CLOSING state. Indeed, in that case, in h1_process(), the SC is not woken up because only RUNNING H1 connections are considered. As a consequence, the mux will ignore the connection closure. The H1 connection remains blocked, waiting for the shutdown timeout. If no timeout is configured, the H1 connection is never closed, leading to a leak. This patch should fix the leak reported by Damien Claisse in the issue #2697. It should be backported as far as 2.8.	2024-09-09 19:01:47 +02:00
William Lallemand	021ac6a108	MEDIUM: ssl/cli: "dump ssl cert" allow to dump a certificate in PEM format The new "dump ssl cert" CLI command allows dumping a certificate stored in HAProxy memory. Until now it was only possible to dump the description of the certificate using "show ssl cert", but with this new command you can dump the PEM content on the filesystem. This command is only available on an admin stats socket. $ echo "@1 dump ssl cert cert.pem" | socat /tmp/master.sock - -----BEGIN PRIVATE KEY----- [...] -----END PRIVATE KEY----- -----BEGIN CERTIFICATE----- [...] -----END CERTIFICATE----- -----BEGIN CERTIFICATE----- [...] -----END CERTIFICATE-----	2024-09-09 16:54:48 +02:00
Aurelien DARRAGON	68cfb222b5	BUG/MEDIUM: pattern: prevent UAF on reused pattern expr Since `c5959fd` ("MEDIUM: pattern: merge same pattern"), UAF (leading to crash) can be experienced if the same pattern file (and match method) is used in two default sections and the first one is not referenced later in the config. In this case, the first default section will be cleaned up. However, due to an unhandled case in the above optimization, the original expr which the second default section relies on is mistakenly freed. This issue was discovered while trying to reproduce GH #2708. The issue was particularly tricky to reproduce given the config and sequence required to make the UAF happen. Fortunately, Github user @asmnek not only provided useful information, but since he was able to consistently trigger the crash in his environment he was able to nail down the crash to the use of a pattern file involved with 2 named default sections. Big thanks to him. To fix the issue, let's push the logic from `c5959fd` a bit further. Instead of relying on the "do_free" variable to know if the expression should be freed or not (which proved to be insufficient in our case), let's switch to a simple refcounting logic. This way, no matter who owns the expression, the last one attempting to free it will be responsible for freeing it. Refcount is implemented using a 32bit value which fills a previous 4 bytes structure gap: int mflags; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ long unsigned int lock; /* 88 8 */ (output from pahole) Even though it was not reproduced in 2.6 or below by @asmnek (the bug was revealed thanks to another bugfix), this issue theoretically affects all stable versions (up to `c5959fd`), thus it should be backported to all stable versions.	2024-09-09 16:07:05 +02:00
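The refcounting idea described in 68cfb222b5 can be illustrated with a minimal sketch. This is not the HAProxy code: `struct demo_expr` and the `demo_expr_*` helpers are hypothetical names, the counter is a plain (non-atomic) integer, and the payload is a placeholder. It only shows the rule "the last owner to release the expression frees it".

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a shared pattern expression. */
struct demo_expr {
    unsigned int refcount; /* number of owners still referencing the expression */
    char *payload;         /* placeholder for the expression's own data */
};

/* Allocate an expression held by exactly one owner; NULL on failure. */
struct demo_expr *demo_expr_new(void)
{
    struct demo_expr *e = calloc(1, sizeof(*e));

    if (e)
        e->refcount = 1;
    return e;
}

/* A second owner (for instance another defaults section) takes a reference. */
struct demo_expr *demo_expr_take(struct demo_expr *e)
{
    e->refcount++;
    return e;
}

/* Drop one reference. Returns 1 if this call freed the expression,
 * 0 if other owners still hold it.
 */
int demo_expr_drop(struct demo_expr *e)
{
    if (--e->refcount > 0)
        return 0;
    free(e->payload);
    free(e);
    return 1;
}
```

With two owners, the first `demo_expr_drop()` leaves the expression alive and only the second one releases the memory, whichever owner happens to go last.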
Aurelien DARRAGON	8157c1caf2	BUG/MEDIUM: pattern: prevent uninitialized reads in pat_match_{str,beg} Using valgrind when running map_beg or map_str, the following error is reported: ==242644== Conditional jump or move depends on uninitialised value(s) ==242644== at 0x2E4AB1: pat_match_str (pattern.c:457) ==242644== by 0x2E81ED: pattern_exec_match (pattern.c:2560) ==242644== by 0x343176: sample_conv_map (map.c:211) ==242644== by 0x27522F: sample_process_cnv (sample.c:1330) ==242644== by 0x2752DB: sample_process (sample.c:1373) ==242644== by 0x319917: action_store (vars.c:814) ==242644== by 0x24D451: http_req_get_intercept_rule (http_ana.c:2697) In fact, the error is legit, because in pat_match_{beg,str}, we dereference the buffer on len+1 to check if a value was previously set, and then decide to force NULL-byte if it wasn't set. But the approach is no longer compatible with current architecture: data past str.data is not guaranteed to be initialized in the buffer. Thus we cannot dereference the value, else we expose us to uninitialized read errors. Moreover, the check is useless, because we systematically set the ending byte to 0 when the conditions are met. Finally, restoring the older value after the lookup is not relevant: indeed, either the sample is marked as const and in such case it is already duplicated, or the sample is not const and we forcefully add a terminating NULL byte outside from the actual string bytes (since we're past str.data), so as we didn't alter effective string data and that data past str.data cannot be dereferenced anyway as it isn't guaranteed to be initialized, there's no point in restoring previous uninitialized data. It could be backported in all stable versions. But since this was only detected by valgrind and isn't known to cause issues in existing deployments, it's probably better to wait a bit before backporting it to avoid any breakage.. although the fix should be theoretically harmless.	2024-09-09 15:57:30 +02:00
Aurelien DARRAGON	3449525a02	BUG/MINOR: pattern: prevent const sample from being tampered in pat_match_beg() This is a complementary patch to `a68affeaa` ("BUG/MINOR: pattern: a sample marked as const could be written"). Indeed the same logic from pat_match_str() is used there, but we lack the check to ensure that the sample is not const before writing data to it. It could be backported to all stable versions.	2024-09-09 15:57:23 +02:00
Willy Tarreau	ef8d8215de	BUG/MEDIUM: clock: detect and cover jumps during execution After commit `e8b1ad4c2` ("BUG/MEDIUM: clock: also update the date offset on time jumps"), @firexinghe mentioned that the issue was still present in their case. In fact it depends on the load, which affects the probability that the time changes between two poll() calls vs that it changes during poll(). The time correction code used to only deal with the latter. But under load if it changes between two poll() calls, what happens then is that before_poll is off, and after returning from poll(), the date is within bounds defined by before_poll, so no correction is applied. After many tests, it turns out that the most reliable solution without using CLOCK_MONOTONIC is to prevent before_poll from being earlier than the previous after_poll (trivial), and to cover forward jumps, we need to enforce a margin. Given that the watchdog kills a looping task within 2 seconds and that no sane setup triggers it, it seems that 2 seconds remains a safe enough margin. This means that in the worst case, some forward jumps of up to 2 seconds will not be corrected, leading to an apparent fast time and low rates. But this is supposed to be an exceptional event anyway (typically an admin or crontab running ntpdate). For future versions, given that we now opportunistically call now_mono_time() before and after poll(), that returns zero if not supported, we could imagine relying on this one for the thread's local time when it's non-null.	2024-09-08 19:15:38 +02:00
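One possible reading of the bounding rule described in ef8d8215de is sketched below. It is an illustration under assumptions, not the actual clock code: `demo_bound_before_poll()` and `DEMO_JUMP_MARGIN_NS` are hypothetical names, and times are plain nanosecond counters. The value used as before_poll is kept no earlier than the previous after_poll and no later than that date plus the 2-second margin.

```c
#include <stdint.h>

/* Margin taken from the commit text: the watchdog kills a looping task
 * within 2 seconds, so 2 seconds is used as the forward-jump bound.
 */
#define DEMO_JUMP_MARGIN_NS 2000000000ULL

/* Return the date to use as "before_poll" given the wall-clock reading
 * <wall_ns> and the date recorded when the previous poll() returned.
 */
uint64_t demo_bound_before_poll(uint64_t wall_ns, uint64_t prev_after_poll_ns)
{
    /* backward jump between two polls: never go before the last after_poll */
    if (wall_ns < prev_after_poll_ns)
        return prev_after_poll_ns;

    /* forward jump larger than the margin: cap it */
    if (wall_ns - prev_after_poll_ns > DEMO_JUMP_MARGIN_NS)
        return prev_after_poll_ns + DEMO_JUMP_MARGIN_NS;

    /* within bounds: trust the wall clock */
    return wall_ns;
}
```

In this sketch a forward jump smaller than the margin passes through unchanged, which matches the worst case the commit mentions: forward jumps of up to 2 seconds are not corrected.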
Christopher Faulet	001fb1a548	BUG/MEDIUM: mux-h1/mux-h2: Reject upgrades with payload on H2 side only Since `1d2d77b27` ("MEDIUM: mux-h1: Return a 501-not-implemented for upgrade requests with a body"), it is no longer possible to perform a protocol upgrade for requests with a payload. The main reason was to be able to support protocol upgrade for an H1 client requesting an H2 server. In that case, the upgrade request is converted to a CONNECT request. So, it is not possible to convey a payload in that case. But it is a problem for anyone wanting to perform upgrades on H1 servers using requests with a payload. It is uncommon but valid. So, now, it is the H2 multiplexer's responsibility to reject upgrade requests, on server side, if there is a payload. An INTERNAL_ERROR is returned for the H2S in that case. On H1 side, the upgrade is now allowed, but only if the server waits for the end of the request to return the 101-Switching-protocol response. Indeed, it is quite hard to synchronise the frontend side and the backend side in that case. Asking servers to fully consume the request payload before returning the response seems reasonable. This patch should fix the issue #2684. It could be backported after a period of observation, as far as 2.4 if possible. But only if it is not too hard. It depends on "MINOR: mux-h1: Set EOI on SE during demux when both side are in DONE state".	2024-09-06 09:16:18 +02:00
Christopher Faulet	ad1ef94612	MINOR: mux-h1: Set EOI on SE during demux when both side are in DONE state For now, this case is already handled for all requests except for those waiting for a tunnel establishment (CONNECT and protocol upgrades). It is not an issue because only bodyless requests are supported in these cases. So the request is always finished at the end of headers and therefore before the response. However, to relax conditions for full H1 protocol upgrades (H1 client and server), this case will be necessary. Indeed, the idea is to be able to perform protocol upgrades for requests with a payload. Today, the "Upgrade:" header is removed before sending the request to the server. But to support this case, this patch is required to properly finish transaction when the server does not perform the upgrade.	2024-09-06 09:00:13 +02:00
Aaron Kuehler	50322dff81	MEDIUM: server: add init-state Allow the user to set the "initial state" of a server. Context: Servers are always set in an UP status by default. In some cases, further checks are required to determine if the server is ready to receive client traffic. This introduces the "init-state {up|down}" configuration parameter to the server. - when set to 'fully-up', the server is considered immediately available and can turn to the DOWN state when ALL health checks fail. - when set to 'up' (the default), the server is considered immediately available and will initiate a health check that can turn it to the DOWN state immediately if it fails. - when set to 'down', the server initially is considered unavailable and will initiate a health check that can turn it to the UP state immediately if it succeeds. - when set to 'fully-down', the server is initially considered unavailable and can turn to the UP state when ALL health checks succeed. The server's init-state is considered when the HAProxy instance is (re)started, a new server is detected (for example via service discovery / DNS resolution), a server exits maintenance, etc. Link: https://github.com/haproxy/haproxy/issues/51	2024-09-05 11:13:10 +02:00
Willy Tarreau	e8b1ad4c2b	BUG/MEDIUM: clock: also update the date offset on time jumps In GH issue #2704, @swimlessbird and @xanoxes reported problems handling time jumps. Indeed, since 2.7 with commit `4eaf85f5d9` ("MINOR: clock: do not update the global date too often") we refrain from updating the global offset in case it didn't change. But there's a catch: in case of a large time jump, if the poller was interrupted, the local time remains the same and we return immediately from there without updating the offset. It then becomes incorrect regarding the "date" value, and upon subsequent call to the poller, there's no way to detect a jump anymore so we apply the old, incorrect offset and the date becomes wrong. Worse, going back to the original time (then in the past), global_now_ns remains higher than the local time and neither get updated anymore. What is missing in practice is to immediately update the offset when detecting a time jump. In an ideal world, the offset would be updated upon every call, that's what was being done prior to commit above but it's extremely CPU intensive on large systems. However we can perfectly afford to update the offset every time we detect a time jump, as it's not as common. This needs to be backported as far as 2.8. Thanks to both participants above for providing very helpful details.	2024-09-04 16:55:43 +02:00
Ilya Shipitsin	1f6e5f7a61	CLEANUP: assorted typo fixes in the code and comments This is 43rd iteration of typo fixes	2024-09-03 17:49:21 +02:00
Christopher Faulet	e1cae42879	BUG/MEDIUM: mux-pt: Fix condition to perform a shutdown for writes in mux_pt_shut() A regression was introduced in the commit `76fa71f7a` ("BUG/MEDIUM: mux-pt: Never fully close the connection on shutdown") because of a typo on the connection flags. The CO_FL_SOCK_WR_SH flag must be tested to prevent a call to conn_sock_shutw(), and not CO_FL_SOCK_RD_SH. Concretely, most of the time, it is harmless because the shutdown for writes is always performed before any shutdown for reads, except in the case described by the commit above. But it is not clear if it has an impact or not. This patch must be backported with the commit above, so as far as 2.9.	2024-09-03 15:25:05 +02:00
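The flag mix-up fixed by e1cae42879 can be pictured with a small sketch. The flag names and values below are hypothetical stand-ins (they are not the real CO_FL_* definitions), and `demo_needs_shutw()` is an illustrative helper: the decision to shut the write side must look at the "write already shut" flag, not at the "read already shut" one.

```c
/* Hypothetical stand-ins for the two connection flags; the values are
 * arbitrary and only need to be distinct bits.
 */
#define DEMO_FL_SOCK_RD_SH 0x1u /* read side already shut */
#define DEMO_FL_SOCK_WR_SH 0x2u /* write side already shut */

/* Return 1 if a shutdown for writes still has to be performed. Testing
 * DEMO_FL_SOCK_RD_SH here instead would be the kind of typo the commit
 * describes.
 */
int demo_needs_shutw(unsigned int flags)
{
    return !(flags & DEMO_FL_SOCK_WR_SH);
}
```

With only the read side shut, the helper still asks for a write shutdown; a check made against the read flag would have skipped it.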
Frederic Lecaille	7e19432fd4	BUG/MINOR: Crash on O-RTT RX packet after dropping Initial pktns This bug arrived with this naive commit: BUG/MINOR: quic: Too shord datagram during O-RTT handshakes (aws-lc only) which omitted to consider the case where the Initial packet number space could be discarded before receiving 0-RTT packets. To fix this, append/insert the 0-RTT (early-data) packet number space into the encryption level list depending on the presence or not of the Initial packet number space. This issue was revealed when using aws-lc as TLS stack in GH issue #2701. Thank you to @Tristan971 for having reported this issue. Must be backported where the commit mentioned above is supposed to be backported: as far as 2.9.	2024-09-03 15:23:06 +02:00
Willy Tarreau	f8bff3b531	BUG/MINOR: mux-spop: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf That's the equivalent of the mux-h2 one, except that here there's no real risk to loop since normally we cannot feed data that bypass the closed state check (e.g. no zero-copy forward). But it still remains dirty to be able to leave an empty mbuf with MFULL and MROOM set, so better clear them as well. No backport is needed since this is only in 3.1.	2024-09-03 14:39:04 +02:00
Willy Tarreau	830e50561c	BUG/MAJOR: mux-h2: always clear MUX_MFULL and DEM_MROOM when clearing the mbuf There exists an extremely tricky code path that was revealed in 3.0 by the glitches feature, though it might theoretically have existed before. TL;DR: a mux mbuf may be full after successfully sending GOAWAY, and discard its remaining contents without clearing H2_CF_MUX_MFULL and H2_CF_DEM_MROOM, then endlessly loop in h2_send(), until the watchdog takes care of it. What can happen is the following: Some data are received, h2_io_cb() is called. h2_recv() is called to receive the incoming data. Then h2_process() is called and in turn calls h2_process_demux() to process input data. At some point, a glitch limit is reached and h2c_error() is called to close the connection. The input frame was incomplete, so some data are left in the demux buffer. Then h2_send() is called, which in turn calls h2_process_mux(), which manages to queue the GOAWAY frame, turning the state to H2_CS_ERROR2. The frame is sent, and h2_process() calls h2_send() a last time (doing nothing) and leaves. The streams are all woken up to notify about the error. Multiple backend streams were waiting to be scheduled and are woken up in turn, before their parents being notified, and communicate with the h2 mux in zero-copy-forward mode, request a buffer via h2_nego_ff(), fill it, and commit it with h2_done_ff(). At some point the mux's output buffer is full, and gets flags H2_CF_MUX_MFULL. The io_cb is called again to process more incoming data. h2_send() isn't called (polled) or does nothing (e.g. TCP socket buffers full). h2_recv() may or may not do anything (doesn't matter). h2_process() is called since some data remain in the demux buf. It goes till the end, where it finds st0 == H2_CS_ERROR2 and clears the mbuf. We're now in a situation where the mbuf is empty and MFULL is still present. 
Then it calls h2_send(), which doesn't call h2_process_mux() due to MFULL, doesn't enter the for() loop since all buffers are empty, then keeps sent=0, which doesn't allow to clear the MFULL flag, and since "done" was not reset, it loops forever there. Note that the glitches make the issue more reproducible but theoretically it could happen with any other GOAWAY (e.g. PROTOCOL_ERROR). What makes it not happen with the data produced on the parsing side is that we process a single buffer of input at once, and there's no way to amplify this to 30 buffers of responses (RST_STREAM, GOAWAY, SETTINGS ACK, WINDOW_UPDATE, PING ACK etc are all quite small), and since the mbuf is cleared upon every exit from h2_process() once the error was sent, it is not possible to accumulate response data across multiple calls. And the regular h2_snd_buf() path checks for st0 >= H2_CS_ERROR so it will not produce any data there either. Probably that h2_nego_ff() should check for H2_CS_ERROR before accepting to deliver a buffer, but this needs to be carefully studied. In the mean time the real problem is that the MFULL flag was kept when clearing the buffer, making the two inconsistent. Since it doesn't seem possible to trigger this sequence without the zero-copy-forward mechanism, this fix needs to be backported as far as 2.9, along with previous commit "MINOR: mux-h2: try to clear DEM_MROOM and MUX_MFULL at more places" which will strengthen the consistency between these checks. Many thanks to Annika Wickert for her detailed report that allowed to diagnose this problem. CVE-2024-45506 was assigned to this problem.	2024-09-03 14:39:04 +02:00
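The invariant restored by 830e50561c (an emptied mbuf must not keep MFULL or MROOM set) can be sketched as follows. The structure, the flag values and `demo_clear_mbuf()` are hypothetical simplifications, not the mux-h2 definitions: the point is only that the buffer reset and the flag reset happen in the same place.

```c
#include <stddef.h>

/* Hypothetical flag bits standing in for H2_CF_MUX_MFULL / H2_CF_DEM_MROOM. */
#define DEMO_CF_MUX_MFULL 0x1u /* mux blocked: output buffer full */
#define DEMO_CF_DEM_MROOM 0x2u /* demux blocked: waiting for room in output buffer */

/* Hypothetical, heavily simplified connection state. */
struct demo_conn {
    unsigned int flags; /* DEMO_CF_* bits */
    size_t mbuf_len;    /* bytes pending in the output buffer */
};

/* Discard pending output and clear the "buffer full" conditions at the
 * same time, so that flags and buffer contents cannot disagree.
 */
void demo_clear_mbuf(struct demo_conn *c)
{
    c->mbuf_len = 0;
    c->flags &= ~(DEMO_CF_MUX_MFULL | DEMO_CF_DEM_MROOM);
}
```

A send loop that refuses to produce data while the "full" bit is set can only make progress again if that bit is dropped whenever the buffer is emptied, which is what keeping both resets together guarantees in this sketch.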
Willy Tarreau	e9cdedb39b	MINOR: mux-h2: try to clear DEM_MROOM and MUX_MFULL at more places The code leading to H2_CF_MUX_MFULL and H2_CF_DEM_MROOM being cleared is quite complex and assumptions about its state are extremely difficult when reading the code. There are indeed long sequences where the mux might possibly be empty, still having the flag set until it reaches h2_send() which will clear it after the last send. Even then it's not obvious whether it's always guaranteed to release the flag when invoked in multiple passes. Let's just simplify the condition so that h2_send() does not depend on "sent" anymore and that h2_timeout_task() doesn't leave the flags set on the buffer on emptiness. While it doesn't seem to fix anything, it will make the code more robust against future changes.	2024-09-03 14:39:04 +02:00
Christopher Faulet	0d4271cdae	BUG/MEDIUM: mux-h1: Properly handle empty message when an error is triggered When a 400/408/500/501 error is returned by the H1 multiplexer, we first try to get the error message of the proxy before using the default one. This may be configured to be mapped on /dev/null or on an empty file. In that case, no message is emitted, as expected, but everything is handled as if the error was successfully sent. However, there is a bug here. In the h1_send_error() function, this case is not properly handled. The flag H1C_F_ABRTED is not set on the H1 connection as it should be and the h1_close() function is not called, leaving the H1 connection in an undefined state. It is especially an issue when an "empty" 408-Request-Time-out error is emitted while there are data blocked in the output buffer. In that case, the connection remains open until the client closes and a "cR--"/408 is logged repeatedly, every time the client timeout is reached. This patch must be backported as far as 2.8.	2024-09-03 14:28:42 +02:00
Frederic Lecaille	15a737eb5f	BUG/MINOR: quic: unexploited retransmission cases for Initial pktns. qc_prep_hdshk_fast_retrans() job is to pick some packets to be retransmitted from Initial and Handshake packet number spaces. A packet may be coalesced to a first one into the same datagram. When a coalesced packet is inspected for retransmission, it is skipped if its length would make the total length of the datagram it is attached to exceed the anti-amplification limit. But in this case, the first packet must be kept for the current retransmission. This is tracked by this trace statement: TRACE_PROTO("will probe Initial packet number space", QUIC_EV_CONN_SPPKTS, qc); This was not the case because of the wrong "goto end" statement. The latter must be run only if the Initial packet number space must not be probed with the first packet found as coalesced to another one which must be skipped. This bug was revealed by AWS-LC interop runner with handshakeloss and handshakecorruption which always fail because this stack leads the server to send more Initial packets. Thank you to Ilya (@chipitsine) for this issue report in GH #2663. Must be backported as far as 2.6.	2024-09-03 11:47:51 +02:00
Christopher Faulet	d4781bd5e7	BUG/MEDIUM: cli: Always release back endpoint between two commands on the mcli When several commands are chained on the master CLI, the same client connection is used. Because it is a TCP connection, the mux PT is used. It means there is no stream at the mux level. It is not possible to release the applicative stream between commands as for HTTP. So, to work around this limitation, between two commands, the master CLI is resetting the stream. It does exactly what was performed on HTTP to manage keep-alive connections on old HAProxy versions. But this part was copied from code dealing with connections only, while the back endpoint can be an applet or a mux for the master CLI. The previous fix on the mux PT ("BUG/MEDIUM: mux-pt: Never fully close the connection on shutdown") revealed a bug. Between two commands, the back endpoint was only released if the connection's XPRT was closed. This works if the back endpoint is an applet because there is no connection. But for commands sent to a worker, a connection is used. At this stage, this only works if the connection's XPRT is closed. Otherwise, the old endpoint is never detached, leading to undefined behavior on the next command execution (most probably a crash). Without the commit above, the connection's XPRT is always closed on shutdown. It is no longer true. At this stage, we must unconditionally release the back endpoint by resetting the corresponding sedesc to fix the bug. This patch must be backported with the commit above in all stable versions. On 2.4 and lower, it will need to be adapted.	2024-09-02 18:31:35 +02:00
Christopher Faulet	76fa71f7a8	BUG/MEDIUM: mux-pt: Never fully close the connection on shutdown When a shutdown is reported to the mux (shutdown for reads or writes), the connection is immediately fully closed if the mux detects the connection is closed in both directions. Only the passthrough multiplexer is able to perform this action at this stage because there is no stream and no internal data. Other muxes perform a full connection close during the mux's release stage. It was working quite well until recently. But, in theory, the bug is quite old. In fact, it seems possible for the lower layer to report an error on the connection at the same time a shutdown is performed on the mux. Depending on how events are scheduled, the following may happen: 1. A connection error is detected at the fd layer and a wakeup is scheduled on the mux to handle the event. 2. A shutdown for writes is performed on the mux. Here the mux decides to fully close the connection. If the xprt is not used to log info, it is released. 3. The mux is finally woken up. It tries to retrieve data from the xprt because it is not aware there was an error. This leads to a crash because of a NULL-deref. By reading the code, it is not obvious. But it seems possible with an SSL connection when the handshake is rearmed. It happens when an SSL_ERROR_WANT_WRITE is reported on an SSL_read() attempt or an SSL_ERROR_WANT_READ on an SSL_write() attempt. This bug is only visible if the XPRT is not used to log info. So it is not so common. This patch should fix the 2nd crash reported in the issue #2656. It must first be backported as far as 2.9 and then slowly to all stable versions.	2024-09-02 15:50:25 +02:00
Christopher Faulet	f9adcdf039	MEDIUM: bwlim: Use a read-lock on the sticky session to apply a shared limit There is no reason to acquire a write-lock on the sticky session when a shared limit is applied because only the frequency is updated. The sticky session itself is not modified. We must just take care it is not removed in the mean time. So a read-lock may be used instead.	2024-09-02 15:50:25 +02:00
Christopher Faulet	a7f6b0ac03	MEDIUM: stick-table: Add support of a factor for IN/OUT bytes rates Add a factor parameter to stick-tables, called "brates-factor", that is applied to in/out bytes rates to work around the 32-bit limit of the frequency counters. Thanks to this factor, it is possible to have bytes rates beyond 4GB. Instead of counting each byte, we count blocks of bytes. Among other things, it will be useful for the bwlim filter, to be able to configure shared limits exceeding 4GB/s. For now, this parameter must be in the range ]0-1024].	2024-09-02 15:50:25 +02:00
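The arithmetic behind "counting blocks of bytes" in a7f6b0ac03 can be sketched as below. The helper names are hypothetical, and how the real code rounds, saturates or carries remainders is not stated in the commit message, so the truncating division and the saturation shown here are assumptions made for illustration.

```c
#include <stdint.h>

/* Convert a byte count into the number of <factor>-byte blocks that a
 * 32-bit counter would store. <factor> must be at least 1 (the commit
 * gives the range ]0-1024]). Truncation drops any partial block, and the
 * result saturates at UINT32_MAX: both are assumptions of this sketch.
 */
uint32_t demo_bytes_to_blocks(uint64_t bytes, uint32_t factor)
{
    uint64_t blocks = bytes / factor;

    return blocks > UINT32_MAX ? UINT32_MAX : (uint32_t)blocks;
}

/* Convert a stored block count back into bytes for reporting. */
uint64_t demo_blocks_to_bytes(uint32_t blocks, uint32_t factor)
{
    return (uint64_t)blocks * factor;
}
```

With a factor of 1024, a rate of 8 GiB per period is stored as 8388608 blocks, which fits easily in 32 bits, whereas the raw byte count would not.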
Frederic Lecaille	db13df3d6e	BUG/MINOR: quic: Crash from trace dumping SSL eary data status (AWS-LC) This bug follows this patch: MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event. where a new third variable was added to be dumped from QUIC_EV_CONN_IO_CB trace event. The quic_trace() code did not reveal there was already another variable passed as third argument but not dumped. This led to a crash when dereferencing a pointer to an int in place of a pointer to an SSL object. This issue was reproduced only by the handshakecorruption aws-lc interop test with s2n-quic as client. Note that this patch must be backported with this one: BUG/MEDIUM: quic: always validate sender address on 0-RTT which depends on the commit mentioned above.	2024-09-02 10:01:41 +02:00
Aperence	20efb856e1	MEDIUM: protocol: add MPTCP per address support Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension that enables a TCP connection to use different paths. Multipath TCP has been used for several use cases. On smartphones, MPTCP enables seamless handovers between cellular and Wi-Fi networks while preserving established connections. This use-case is what pushed Apple to use MPTCP since 2013 in multiple applications [2]. On dual-stack hosts, Multipath TCP enables the TCP connection to automatically use the best performing path, either IPv4 or IPv6. If one path fails, MPTCP automatically uses the other path. To benefit from MPTCP, both the client and the server have to support it. Multipath TCP is a backward-compatible TCP extension that is enabled by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...). Multipath TCP is included in the Linux kernel since version 5.6 [3]. To use it on Linux, an application must explicitly enable it when creating the socket. No need to change anything else in the application. This attached patch adds MPTCP per address support, to be used with: mptcp{,4,6}@<address>[:port1[-port2]] MPTCP v4 and v6 protocols have been added: they are mainly a copy of the TCP ones, with small differences: names, proto, and receivers lists. These protocols are stored in __protocol_by_family, as an alternative to TCP, similar to what has been done with QUIC. By doing that, the size of __protocol_by_family has not been increased, and it behaves like TCP. MPTCP is both supported for the frontend and backend sides. Also added an example of configuration using mptcp along with a backend allowing to experiment with it. Note that this is a re-implementation of Bj�rn's work from 3 years ago [4], when haproxy's internals were probably less ready to deal with this, causing his work to be left pending for a while. Currently, the TCP_MAXSEG socket option doesn't seem to be supported with MPTCP [5]. 
This results in a warning when trying to set the MSS of sockets in proto_tcp:tcp_bind_listener. This can be resolved by adding two new variables: sock_inet(6)_mptcp_maxseg_default that will hold the default value of the TCP_MAXSEG option. Note that for the moment, this will always be -1 as the option isn't supported. However, in the future, when the support for this option will be added, it should contain the correct value for the MSS, allowing to correctly set the TCP_MAXSEG option. Link: https://www.rfc-editor.org/rfc/rfc8684.html [1] Link: https://www.tessares.net/apples-mptcp-story-so-far/ [2] Link: https://www.mptcp.dev [3] Link: https://github.com/haproxy/haproxy/issues/1028 [4] Link: https://github.com/multipath-tcp/mptcp_net-next/issues/515 [5] Co-authored-by: Dorian Craps <dorian.craps@student.vinci.be> Co-authored-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>	2024-08-30 18:53:49 +02:00
Aperence	2f171fe36a	MEDIUM: sock: use protocol when creating socket Use the protocol configured for a connection when creating the socket, instead of always using 0. This change is needed to allow new protocol to be used when creating the sockets, such as MPTCP. Note however that this patch won't change anything for now, as the only other value that proto->sock_prot could hold is IPPROTO_TCP, which has the same behavior as 0 when passed to socket.	2024-08-30 18:53:49 +02:00
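The change in 2f171fe36a amounts to passing the configured protocol as the third argument of socket() instead of a hard-coded 0. A minimal sketch, assuming a POSIX system and using the hypothetical helper name `demo_open_stream()`:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Create a stream socket for <family> using the protocol number carried
 * by the caller (what the commit calls proto->sock_prot) rather than 0.
 * Returns the file descriptor, or -1 on error as socket() does.
 */
int demo_open_stream(int family, int sock_prot)
{
    return socket(family, SOCK_STREAM, sock_prot);
}
```

For an AF_INET stream socket, passing 0 or IPPROTO_TCP selects the same protocol, which is why the commit notes that nothing changes until another protocol value is configured.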
Aperence	38618822e1	MINOR: server: add a alt_proto field for server Add a new field alt_proto to the server structures that specifies if an alternate protocol should be used for this server. This field can be transparently passed to protocol_lookup to get an appropriate protocol structure. This change thus allows creating servers with different protocols, and not only TCP anymore.	2024-08-30 18:53:49 +02:00
Aperence	a7b04e383a	MINOR: tools: extend str2sa_range to add an alt parameter Add a new parameter "alt" that will store whether this configuration uses an alternate protocol. This alt pointer will contain a value that can be transparently passed to protocol_lookup to obtain an appropriate protocol structure. This change is needed to allow, for example, the servers to know if they need to use an alternate protocol or not.	2024-08-30 18:53:49 +02:00
Willy Tarreau	2bc513dd31	BUILD: quic: fix build errors on FreeBSD since recent GSO changes The following commits broke the build on FreeBSD when QUIC is enabled: `35470d518` ("MINOR: quic: activate UDP GSO for QUIC if supported") `448d3d388` ("MINOR: quic: add GSO parameter on quic_sock send API") Indeed, it turns out that netinet/udp.h requires sys/types.h to be included before. Let's just change the includes order to fix the build. No backport is needed.	2024-08-30 18:53:49 +02:00
Frederic Lecaille	f627b9272b	BUG/MEDIUM: quic: always validate sender address on 0-RTT It has been reported by Wedl Michael, a student at the University of Applied Sciences St. Poelten, a potential vulnerability into haproxy as described below. An attacker could have obtained a TLS session ticket after having established a connection to an haproxy QUIC listener, using its real IP address. The attacker has not even to send a application level request (HTTP3). Then the attacker could open a 0-RTT session with a spoofed IP address trusted by the QUIC listen to bypass IP allow/block list and send HTTP3 requests. To mitigate this vulnerability, one decided to use a token which can be provided to the client each time it successfully managed to connect to haproxy. These tokens may be reused for future connections to validate the address/path of the remote peer as this is done with the Retry token which is used for the current connection, not the next one. Such tokens are transported by NEW_TOKEN frames which was not used at this time by haproxy. So, each time a client connect to an haproxy QUIC listener with 0-RTT enabled, it is provided with such a token which can be reused for the next 0-RTT session. If no such a token is presented by the client, haproxy checks if the session is a 0-RTT one, so with early-data presented by the client. Contrary to the Retry token, the decision to refuse the connection is made only when the TLS stack has been provided with enough early-data from the Initial ClientHello TLS message and when these data have been accepted. Hopefully, this event arrives fast enough to allow haproxy to kill the connection if some early-data have been accepted without token presented by the client. quic_build_post_handshake_frames() has been modified to build a NEW_TOKEN frame with this newly implemented token to be transported inside. 
quic_tls_derive_retry_token_secret() was renamed to quic_do_tls_derive_token_secret() and modified to be reused and derive the secret for the new token implementation. quic_token_validate() has been implemented to validate both the Retry token and the new token implemented by this patch. When a non-retry token cannot be validated, the received datagram is marked as requiring a Retry packet to be sent, and no connection is created. When the Initial packet does not embed any non-retry token and 0-RTT is enabled, the connection is marked with this new flag: QUIC_FL_CONN_NO_TOKEN_RCVD. As soon as the TLS stack detects that some early-data have been provided and accepted by the client, the connection is marked to be killed (QUIC_FL_CONN_TO_KILL) from ha_quic_add_handshake_data(). This is done by calling the new qc_ssl_eary_data_accepted() function. The secret TLS handshake is interrupted as soon as possible by returning 0 from ha_quic_add_handshake_data(). The connection is also marked as requiring a Retry packet to be sent (QUIC_FL_CONN_SEND_RETRY) from ha_quic_add_handshake_data(). Then the handshake I/O handler (quic_conn_io_cb()) knows how to behave: kill the connection after having sent a Retry packet. About TLS stack compatibility, this patch is supported by aws-lc. It is disabled for wolfssl, which does not support 0-RTT at this time, thanks to HAVE_SSL_0RTT_QUIC. This patch depends on these commits: MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event. MINOR: quic: Implement qc_ssl_eary_data_accepted(). MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct) BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder MINOR: quic: Token for future connections implementation. MINOR: quic: Implement quic_tls_derive_token_secret(). MINOR: tools: Implement ipaddrcpy(). Must be backported as far as 2.6.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	8854cef036	MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event. Dump the early data status from QUIC_EV_CONN_IO_CB trace event. This is very helpful to know if the QUIC server has accepted the early data received from clients.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	e926378375	MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct) Modify qf_new_token structure to use a static buffer with QUIC_TOKEN_LEN as size as defined by the token for future connections (quic_token.c). Modify the NEW_TOKEN frame parser accordingly (see quic_parse_new_token_frame()). Also add comments to denote that the NEW_TOKEN parser function is used only by clients and that its builder is used only by servers.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	76c80605a6	BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder quic_build_new_token_frame() is the function which is called to build a NEW_TOKEN frame into a buffer. The position pointer for this buffer was not updated, leading the NEW_TOKEN frame to be malformed. Must be backported as far as 2.6.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	f5b09dc452	MINOR: quic: Token for future connections implementation. There exist two sorts of tokens used by QUIC. They are both used to validate the peer address (path validation). Retry tokens are used for the current connection the client wants to open. This patch implements the other sort of token which, after having been received from a connection, may be provided for the next connection from the same IP address to validate it (or validate the network path between the client and the server). The token generation is implemented by quic_generate_token(), and the token validation by quic_token_chek(). The same method is used as for Retry tokens to build such tokens to be reused for future connections. The format is very simple: one byte for the format identifier to distinguish these new tokens from the Retry token, followed by a 32-bit timestamp. As this part is ciphered with AEAD as cryptographic algorithm, 16 bytes are needed for the AEAD tag. 16 more random bytes are added to this token as a salt to derive the AEAD secret used to cipher the token. In addition to this salt, the client IP address is also used as AAD to derive the AEAD secret. So, the length of the token is fixed: 1 + 4 + 16 + 16 = 37 bytes.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	74caa0eece	MINOR: quic: Implement quic_tls_derive_token_secret(). This function is similar to quic_tls_derive_retry_token_secret(). Its aim is to derive the secret used to cipher the token to be used for future connections. This patch renames quic_tls_derive_retry_token_secret() to the more generic quic_do_tls_derive_token_secret() and reuses its code. Two arguments are added to the latter so that both quic_tls_derive_retry_token_secret() and the new quic_tls_derive_token_secret() function can be implemented as calls to quic_do_tls_derive_token_secret().	2024-08-30 17:04:09 +02:00
Frederic Lecaille	fb7a092203	MINOR: tools: Implement ipaddrcpy(). Implement the new ipaddrcpy() function to copy only the IP address from a sockaddr_storage struct object into a buffer.	2024-08-30 17:04:09 +02:00
Nicolas CARPi	a33407b499	CLEANUP: mqtt: fix typo in MQTT_REMAINING_LENGHT_MAX_SIZE There was a typo in the macro name, where LENGTH was incorrectly written. This didn't cause any issue because the typo appeared in all occurrences in the codebase.	2024-08-30 14:58:59 +02:00
Nicolas CARPi	534e7e4598	CLEANUP: haproxy: fix typos in code comment Use "from" instead of "form" in ha_random_boot function code comments.	2024-08-30 14:58:59 +02:00
Christopher Faulet	e4812404c5	BUG/MEDIUM: stream: Prevent mux upgrades if client connection is no longer ready If an early error occurred on the client connection, we must prevent any multiplexer upgrades. Indeed, it is unexpected for a mux to be initialized with no xprt. On a normal workflow it is impossible. So it is not an issue. But if a mux upgrade is performed at the stream level, an early error on the connection may have already been handled by the previous mux and the connection may be already fully closed. If the mux upgrade is still performed, a crash can be experienced. It is possible to have a crash with an implicit TCP>HTTP upgrade if there is no data in the input buffer. But it is also possible to get a crash with an explicit "switch-mode http" rule. It must be backported to all stable versions. In 2.2, the patch must be applied directly in stream_set_backend() function.	2024-08-28 16:38:20 +02:00
Christopher Faulet	4ef5251c44	BUG/MEDIUM: mux-h2: Set ES flag when necessary on 0-copy data forwarding When DATA frames are sent via the 0-copy data forwarding, we must take care to set the ES flag on the last DATA frame. It should be performed in h2_done_ff() when IOBUF_FL_EOI flag was set by the producer. This flag is here to know when the producer has reached the end of input. When this happens, the h2s state is also updated. It is switched to "half-closed local" or "closed" state depending on its previous state. It is mainly an issue on uploads because the server may be blocked waiting for the end of the request. A workaround is to disable the 0-copy forwarding support for H2 by setting the "tune.h2.zero-copy-fwd-send" directive to off in your global section. This patch should fix the issue #2665. It must be backported as far as 2.9.	2024-08-28 10:05:34 +02:00
Christopher Faulet	0d142e0756	MINOR: proxy: Add support of 429-Too-Many-Requests in retry-on status The "429" status can now be specified on retry-on directives. PR_RE_* flags were updated to remain sorted. This patch should fix the issue #2687. It is quite simple so it may safely be backported to 3.0 if necessary.	2024-08-28 10:05:34 +02:00
William Lallemand	d2fc1ab66e	MEDIUM: ssl/sample: add ssl_fc_sigalgs_bin sample fetch This new sample fetch allows extracting the binary list contained in the signature_algorithms (13) TLS extension. https://datatracker.ietf.org/doc/html/rfc8446#section-4.2.3	2024-08-26 15:17:40 +02:00
William Lallemand	e8fecef0ff	MEDIUM: ssl: capture the signature_algorithms extension from Client Hello Activate the capture of the TLS signature_algorithms extension from the Client Hello. This list is stored in the ssl_capture buffer when the global option "tune.ssl.capture-cipherlist-size" is enabled.	2024-08-26 15:17:40 +02:00
William Lallemand	ac5c7158f9	MEDIUM: ssl/sample: add ssl_fc_supported_versions_bin sample fetch This new sample fetch allows extracting the binary list contained in the supported_versions (43) TLS extension. https://datatracker.ietf.org/doc/html/rfc8446#section-4.2.1	2024-08-26 15:17:40 +02:00
William Lallemand	ce7fb6628e	MEDIUM: ssl: capture the supported_versions extension from Client Hello Activate the capture of the TLS supported_versions extension from the Client Hello. This list is stored in the ssl_capture buffer when the global option "tune.ssl.capture-cipherlist-size" is enabled.	2024-08-26 15:12:42 +02:00
William Lallemand	3c0a0f1e1b	CLEANUP: ssl: cleanup the clienthello capture In order to add more extensions, clean up the clienthello capture function a little bit.	2024-08-26 15:12:42 +02:00
Frederic Lecaille	414e3aa6bc	BUILD: quic: 32bits build broken by wrong integer conversions for printf() Since these commits the 32bits build is broken due to several errors as follows: CC src/quic_cli.o src/quic_cli.c: In function ‘dump_quic_full’: src/quic_cli.c:285:94: error: format ‘%ld’ expects argument of type ‘long int’, but argument 5 has type ‘uint64_t’ {aka ‘long long unsigned int’} [-Werror=format=] 285 | chunk_appendf(&trash, " [initl] rx.ackrng=%-6zu tx.inflight=%-6zu(%ld%%)\n", 286 | pktns->rx.arngs.sz, pktns->tx.in_flight, 287 | pktns->tx.in_flight * 100 / qc->path->cwnd); Replace several %ld by %llu with ull as printf conversion in quic_cli.c and a %ld by %lld with (long long) as printf conversion in quic_cc_cubic.c. Thank you to Ilya (@chipitsine) for having reported this issue in GH #2689. Must be backported to 3.0.	2024-08-26 11:21:48 +02:00
William Lallemand	7a03ab426f	BUILD: tools: environ is not defined in OS X and BSD Add extern char **environ in order to build the new functions to manipulate the environment. Indeed the variable environ is not required to be declared by POSIX, so it needs to be declared manually: "In addition, the following variable, which must be declared by the user if it is to be used directly: extern char **environ;" https://pubs.opengroup.org/onlinepubs/9699919799/functions/environ.html	2024-08-23 19:39:57 +02:00
Valentine Krasnobaeva	28ca7fc594	BUG/MINOR: haproxy: free init_env in deinit only if allocated This fixes `7b78e1571` (" MINOR: mworker: restore initial env before wait mode"). In cases, when haproxy starts without any configuration, for example: 'haproxy -vv', init_env array to backup env variables is never allocated. So, we need to check in deinit(), when we free its memory, that init_env is not a NULL ptr.	2024-08-23 19:08:53 +02:00
Valentine Krasnobaeva	7b78e1571b	MINOR: mworker: restore initial env before wait mode This patch is the follow-up of `1811d2a6ba` (MINOR: tools: add helpers to backup/clean/restore env). In order to avoid unexpected behaviour in master-worker mode during the process reload with a new configuration, when the old one contained '*env' keywords, let's back up its initial environment before calling parse_cfg() and let's clean and restore it in the context of the master process, just before it enters a wait polling loop. This will guarantee that new workers will have a new updated environment and not the previous one inherited from the master, which does not read the configuration when it's in wait mode.	2024-08-23 17:06:59 +02:00
Valentine Krasnobaeva	1811d2a6ba	MINOR: tools: add helpers to backup/clean/restore env 'setenv', 'presetenv', 'unsetenv', 'resetenv' keywords in configuration could modify the process runtime environment. In case of master-worker mode this creates a problem, as the configuration is read only once before forking a worker and then the master process does the reexec without reading any config files, just to free the memory. So, during the reload a new worker process will be created, but it will inherit the previous unchanged environment from the master in wait mode, thus it won't benefit from the changes in configuration related to '*env' keywords. This may cause unexpected behavior or some parser errors in master-worker mode. So, let's add a helper to back up all process env variables just before it reads its configuration. And let's also add helpers to clean up the current runtime environment and to restore it to its initial state (as it was before parsing the config).	2024-08-23 17:06:33 +02:00
Amaury Denoyelle	960d68a5af	MINOR: mux-quic: correct qcc_bufwnd_full() documentation Fix the return value comment of qcc_bufwnd_full(), which was incorrect.	2024-08-23 16:25:04 +02:00
Amaury Denoyelle	ecfedc2570	MINOR: mux-quic: add buf_in_flight to QCC debug infos Dump <buf_in_flight> QCC field both in QUIC MUX traces and "show quic". This could help to detect if MUX does not allocate enough buffers compared to quic_conn current congestion window.	2024-08-22 17:48:23 +02:00
Nathan Wehrman	5c07d58e08	MINOR: config: Created env variables for http and tcp clf formats Since we already have variables for the other formats and the change is trivial I thought it would be a nice addition for completeness	2024-08-22 09:15:58 +02:00
Willy Tarreau	9911b53d75	CLEANUP: protocol: no longer initialize .receivers nor .nb_receivers Protocol definitions no longer need to initialize these internal fields, as they're now properly initialized during protocol registration.	2024-08-21 17:37:46 +02:00
Willy Tarreau	1cb3b0b745	MINOR: protocol: always initialize the receivers list on registration Till now, protocols were required to self-initialize their receivers list head, which is not very convenient, and is quite error prone. Indeed, it's too easy to copy-paste a protocol definition and forget to update the .receivers field to point to itself, resulting in mixed lists. Let's just do that in protocol_register(). And while we're at it, let's also zero the nb_receivers entry that works with it, so that the protocol definition isn't required to pre-initialize stuff related to internal book-keeping.	2024-08-21 17:37:46 +02:00
Willy Tarreau	034974106f	MINOR: socket: don't ban all custom families from reuseport The test on ss_family >= AF_MAX is too strict if we want to support new custom families, let's apply this to the real_family instead so that we check that the underlying socket supports reuseport.	2024-08-21 17:37:46 +02:00
Willy Tarreau	2a799b64b0	MINOR: protocol: add the real address family to the protocol For custom families, there's sometimes an underlying real address and it would be nice to be able to directly use the real family in calls to bind() and connect() without having to add explicit checks for exceptions everywhere. Let's add a .real_family field to struct proto_fam for this. For now it's always equal to the family except for non-transferable ones such as rhttp where it's equal to the custom one (anything else could fit).	2024-08-21 17:37:46 +02:00
Willy Tarreau	d592ebdbeb	MEDIUM: socket: always properly use the sock_domain for requested families Now we make sure to always look up the protocol's domain for an address family. Previously we would use it as-is, which prevented from properly using custom addresses (which is when they differ). This removes some hard-coded tests such as in log.c where UNIX vs UDP was explicitly checked for example. It requires a bit of care, however, so as to properly pass value 1 in the 3rd arg of the protocol_lookup() for DGRAM stuff. Maybe one day we'll change these for defines or enums to limit mistakes.	2024-08-21 17:36:58 +02:00
Willy Tarreau	ba4a416c66	MINOR: protocol: add a family lookup At plenty of places we have access to an address family which may include some custom addresses but we cannot simply convert them to the real families without performing some random protocol lookups. Let's simply add a proto_fam table like we have for the protocols. The protocols could even be indexed there, but for now it's not worth it.	2024-08-21 16:46:15 +02:00
Willy Tarreau	732913f848	MINOR: protocol: properly assign the sock_domain and sock_family When we finally split sock_domain from sock_family in 2.3, something was not cleanly finished. The family is what should be stored in the address while the domain is what is supposed to be passed to socket(). But for the custom addresses, we did the opposite, just because the protocol_lookup() function was acting on the domain, not the family (both of which are equal for non-custom addresses). This is an API bug but there's no point backporting it since it does not have visible effects. It was visible in the code since a few places were using PF_UNIX while others were comparing the domain against AF_MAX instead of comparing the family. This patch clarifies this in the comments on top of proto_fam, addresses the indexing issue and properly reconfigures the two custom families.	2024-08-21 16:46:15 +02:00
Willy Tarreau	67bf1d6c9e	MINOR: quic: support a tolerance for spurious losses Tests performed between a 1 Gbps connected server and a 100 Mbps client, distant by 95ms showed that: - we need 1.1 MB in flight to fill the link - rare but inevitable losses are sufficient to make cubic's window collapse fast and long to recover - a 100 MB object takes 69s to download - tolerance for 1 loss between two ACKs suffices to shrink the download time to 20-22s - 2 losses go to 17-20s - 4 losses reach 14-17s At 100 concurrent connections that fill the server's link: - 0 loss tolerance shows 2-3% losses - 1 loss tolerance shows 3-5% losses - 2 loss tolerance shows 10-13% losses - 4 loss tolerance shows 23-29% losses As such while there can be a significant gain sometimes in setting this tolerance above zero, it can also significantly waste bandwidth by sending far more than can be received. While it's probably not a solution to real world problems, it repeatedly proved to be a very effective troubleshooting tool helping to figure different root causes of low transfer speeds. In spirit it is comparable to the no-cc congestion algorithm, i.e. it must not be used except for experimentation.	2024-08-21 08:34:30 +02:00
Willy Tarreau	fab0e99aa1	MINOR: quic: store the lost packets counter in the quic_cc_event element Upon loss detection, qc_release_lost_pkts() notifies congestion controllers about the event and its final time. However it does not pass the number of lost packets, that can provide useful hints for some controllers. Let's just pass this option.	2024-08-21 08:02:44 +02:00
Valentine Krasnobaeva	2e6e159ac4	BUG/MINOR: cfgparse-global: remove tune.fast-forward from common_kw_list Remove tune.fast-forward from common_kw_list. It was replaced by 'tune.disable-fast-forward' and it's no longer present in "if..else if.." parser from cfg_parse_global(). Otherwise, it may be shown as the best-match keyword for some tune options, which is now wrong. Should be backported in versions 2.9 and 3.0.	2024-08-20 19:16:34 +02:00
Valentine Krasnobaeva	731ef865e3	MINOR: cfgparse-global: move unsupported keywords in global list Following the previous commits and in order to clean up cfg_parse_global let's move unsupported keywords in the global list and let's add for them a dedicated parser.	2024-08-20 19:16:33 +02:00
Valentine Krasnobaeva	55309592db	MINOR: cfgparse-global: move tune options in global keywords list In order to clean up cfg_parse_global() and to add the support of the new MODE_DISCOVERY in configuration parsing, let's move the keywords related to tune options into the global keywords list and let's add for them two dedicated parsers. Tune option keywords are split between two parsers depending on the number of parameters a given tune option needs. The tune options parser is called by the section parser and follows the common API, i.e. it returns -1 on failure, 0 on success and 1 on recoverable error. In case of recoverable error we've previously returned ERR_ALERT (0x10) and we have emitted an alert message at startup. The section parser treats all rc > 0 as ERR_WARN. So if some tune option was set twice in the global section, the tune options parser will return 1 (in order to respect the common API), the section parser will treat this as ERR_WARN and a warning message will be emitted during process startup instead of an alert, as it was before.	2024-08-20 19:16:32 +02:00
Valentine Krasnobaeva	c46497f16f	MINOR: cfgparse-global: move 'expose-' in global keywords list Following the previous commit let's also move 'expose-' keywords in the global cfg_kws list and let's add for them a dedicated parser. This will simplify the configuration parsing in the new MODE_DISCOVERY, which allows to read only the keywords, needed at the early start of haproxy process (i.e. modes, pidfile, chosen poller).	2024-08-20 19:16:31 +02:00
Valentine Krasnobaeva	450ce3e61b	MINOR: cfgparse-global: move 'pidfile' in global keywords list This commit cleans up cfg_parse_global() and prepares the config parser to support MODE_DISCOVERY. This step is needed in the early starting stage, just to figure out in which mode the process was started, to set some necessary parameters needed for this mode and to continue the initialization stage. 'pidfile' is one of these common keywords, which need to be parsed very early and which are used in almost all process modes (except the foreground, '-d'). The 'pidfile' keyword parser is called by the section parser and follows the common API, i.e. it returns -1 on failure, 0 on success and 1 on recoverable error. In case of recoverable error we've previously returned ERR_ALERT (0x10) and we have emitted an alert message at startup. The section parser treats all rc > 0 as ERR_WARN. So if pidfile was already specified via command line, the keyword parser will return 1 (in order to respect the common API), the section parser will treat this as ERR_WARN and a warning message will be emitted during process startup instead of an alert, as it was before.	2024-08-20 19:16:30 +02:00
Valentine Krasnobaeva	f29be97ac7	BUG/MINOR: cfgparse-global: remove redundant goto When the given keyword is found in the global 'cfg_kws' list, we go to the 'out' label anyway, after testing the rc returned by the keyword's parser. So there is not much gain in performing a 'goto out' jump specifically when rc > 0.	2024-08-20 19:16:29 +02:00
Valentine Krasnobaeva	74bc6f3d66	BUG/MINOR: cfgparse-global: clean common_kw_list This patch fixes commits `118ac11ce` ("MINOR: cfgparse-global: move mode's keywords in cfg_kw_list") and `83ff4db18` (MINOR: cfgparse-global: move no<poller_name> in cfg_kw_list). 'common_kw_list' serves to show the best-match keyword in cfg_parse_global(), if the given keyword was not parsed in "if..else if.." cases. cfg_parse_global() is still used as a parser for some keywords from the global section. Mode-specific and no<poller_name> keywords now have their own parsers. They no longer take place in the "if..else if.." from cfg_parse_global() and they are registered in the 'cfg_kws' list. So, there is no longer need to duplicate them in the 'common_kw_list'. Otherwise, they will be shown twice in parser error message.	2024-08-20 19:16:28 +02:00
Valentine Krasnobaeva	4291d10b44	BUG/MINOR: cfgparse-global: fix err msg in mworker keyword parser This patch fixes the commit `118ac11ce` ("cfgparse-global: move mode's keywords in cfg_kw_list"). Error message delivered by keyword parser in **err is always shown with ha_alert() by the caller cfg_parse_global(). The caller always supplies these alerts with the filename and the line number.	2024-08-20 19:16:27 +02:00
Amaury Denoyelle	0d6112b40b	MINOR: mux-quic: retry after small buf alloc failure The previous commit switched to small buffers for HTTP/3 HEADERS emission. This ensures that several parallel streams can allocate their own buffer without hitting the connection buffer limit, which is now based on the congestion window size. However, this prevents the transmission of responses with uncommonly large headers. Indeed, if all headers cannot be encoded in a single buffer, an error is reported which causes the closure of the whole connection. Adjust this by implementing a realloc API exposed by QUIC MUX. This allows the application layer to switch from a small to a default buffer and restart its processing. This guarantees that headers no longer than bufsize can again be properly transferred.	2024-08-20 18:12:27 +02:00
Amaury Denoyelle	b355e89bf9	MEDIUM: h3: allocate small buffers for headers frames A major change was recently implemented to change QUIC MUX Tx buffer allocation limit, which is now based on the current connection congestion window size. As this size may be smaller than the previous static value, it is likely that the limit will be reached more frequently. When using HTTP/3, the majority of request streams are used for small object exchanges. Every response starts with a HEADERS frame which should be much smaller in size than the default buffer. But as the whole buffer size is accounted against the congestion window, a single stream can block others even if only emitting a single HEADERS frame, which is suboptimal for bandwidth usage if the congestion window is small enough. To adapt to this new situation, rely on the newly available small buffers to transfer the HEADERS frame response. This at least guarantees that several parallel streams can allocate their own buffer for the first part of the response, even with a small congestion window. The situation could be further improved by using various indications on the data size and selecting a small buffer if sufficient. This could be done for example via the Content-length value or HTX extra field. However this must be the subject of a dedicated patch.	2024-08-20 18:12:27 +02:00
Amaury Denoyelle	885e4c5cf8	MINOR: quic: support sbuf allocation in quic_stream This patch extends qc_stream_desc API to be able to allocate small buffers. QUIC MUX API is similarly updated as ultimately each application protocol is responsible for choosing between a default or a smaller buffer. Internally, the type of allocated buffer is remembered via the qc_stream_buf instance. This is mandatory to ensure that the buffer is released in the correct pool, in particular as small and standard buffers can be configured with the same size. This commit is purely an API change. For the moment, small buffers are not used. This will be changed in a dedicated patch.	2024-08-20 18:12:27 +02:00
Amaury Denoyelle	d0d8e57d47	MINOR: quic: define sbuf pool Define a new buffer pool reserved to allocate smaller memory area. For the moment, its usage will be restricted to QUIC, as such it is declared in quic_stream module. Add a new config option "tune.bufsize.small" to specify the size of the allocated objects. A special check ensures that it is not greater than the default bufsize to avoid unexpected effects.	2024-08-20 18:12:27 +02:00
Amaury Denoyelle	1de5f718cf	MINOR: quic/config: adapt settings to new conn buffer limit QUIC MUX buffer allocation limit is now directly based on the underlying congestion window size. The previous static limit based on conn-tx-buffers is now unused. As such, this commit adds a warning to inform users that it is now obsolete. Secondly, update the max-window-size setting. It is now the main entrypoint to limit both the maximum congestion window size and the number of QUIC MUX allocated buffers on emission. Remove its special value '0' which was used to automatically adjust it on the now unused conn-tx-buffers.	2024-08-20 17:59:35 +02:00
Amaury Denoyelle	aeb8c1ddc3	MAJOR: mux-quic: allocate Tx buffers based on congestion window Each QUIC MUX may allocate buffers for MUX stream emission. These buffers are then shared with quic_conn to handle ACK reception and retransmission. A limit on the number of concurrent buffers used per connection has been defined statically and can be updated via a configuration option. This commit replaces the limit to instead use the current underlying congestion window size. The purpose of this change is to remove the artificial static buffer count limit, which may be difficult to choose. Indeed, if a connection performs with minimal loss rate, the buffer count would severely limit its throughput. It could be increased to fix this, but this also impacts other connections, even those with less optimal performance, causing too much extra data buffering on the MUX layer. By using the dynamic congestion window size, haproxy ensures that MUX buffering corresponds roughly to the network conditions. Using QCC <buf_in_flight>, a new buffer can be allocated if it is less than the current window size. If not, QCS emission is interrupted and haproxy stream layer will subscribe until a new buffer is ready. One of the critical parts is to ensure that a MUX layer previously blocked on buffer allocation is properly woken up when sending can be retried. This occurs on two occasions: * after an already used Tx buffer is cleared on ACK reception. This case is already handled by qcc_notify_buf() via quic_stream layer. * on congestion window increase. A new qcc_notify_buf() invocation is added into qc_notify_send(). Finally, remove <avail_bufs> QCC field which is now unused. This commit is labelled MAJOR as it may have unexpected effects and could cause significant behavior change. For example, in the previous implementation QUIC MUX would be able to buffer more data even if the congestion window is small. 
With this patch, data cannot be transferred from the stream layer which may cause more streams to be shut down on client timeout. Another effect may be more CPU consumption as the connection limit would be hit more often, causing more streams to be interrupted and woken up in cycle.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	000976af58	MINOR: mux-quic: define buf_in_flight Define a new QCC counter named <buf_in_flight>. Its purpose is to account the current sum of all allocated stream buffer sizes used on emission. For the moment, this counter is updated on buffer allocation and deallocation. It will be used to replace <avail_bufs> once the congestion window is used as limit for buffer allocation in a future commit.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	f9777bea30	MINOR: h3: mark control stream as metadata A current work is performed to change QUIC MUX buffer allocation limit from a configurable static value to use the size of the congestion window instead. This change may cause the buffer allocation limit to be triggered more frequently. To ensure HTTP/3 control emission is not perturbed by this change, mark the stream with qcc_send_metadata(). This ensures that buffer allocation for this stream won't be subject to the connection limit. This is necessary to guarantee that SETTINGS and GOAWAY frames are emitted.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	4c4bf26f44	MEDIUM: mux-quic: implement API to ignore txbuf limit for some streams Define a new qc_stream_desc flag QC_SD_FL_OOB_BUF. This is to mark streams which are not subject to the connection limit on allocated MUX stream buffers. The purpose is to simplify handling of QUIC MUX streams which do not transfer data and as such are not driven by haproxy layer, for example the HTTP/3 control stream. These streams interact synchronously with QUIC MUX and cannot retry emission in case of temporary failure. This commit will be useful once the connection buffer allocation limit is reimplemented to directly rely on the congestion window size. This will probably cause the buffer limit to be reached more frequently, maybe even on QUIC MUX initialization. As such, it will be possible to mark control streams and prevent them from being subject to the buffer limit. QUIC MUX exposes a new function qcs_send_metadata(). It can be used by an application protocol to specify which streams are used for control exchanges. For the moment, no such stream uses this mechanism.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	f4d1bd0b76	MINOR: mux-quic: account stream txbuf in QCC A limit per connection is put on the number of buffers allocated by QUIC MUX for emission across all its streams. This ensures memory consumption remains under control. This limit is simply expressed as a count of buffers which can be concurrently allocated for each connection. As such, the quic_conn structure was used to account currently allocated buffers. However, a quic_conn never allocates new stream buffers. This is only done at the QUIC MUX layer. As such, this commit moves buffer accounting inside the QCC structure. This simplifies the API, most notably qc_stream_buf_alloc() usage. Note that this commit inverts the accounting. Previously, it was initially set to 0 and incremented for each allocated buffer. Now, it is set to the maximum value and decremented for each buffer usage. This is considered clearer to use.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	635fbaaa4a	MINOR: quic: allocate stream txbuf via qc_stream_desc API This commit simply adjusts QUIC stream buffer allocation. This operation is conducted by QUIC MUX using the qc_stream_desc layer. Previously, qc_stream_buf_alloc() would return a qc_stream_buf instance and QUIC MUX would finalize the buffer area allocation. Change this to perform the buffer allocation directly in qc_stream_buf_alloc(). This patch clarifies the interaction between QUIC MUX and qc_stream_desc. It is cleaner to allocate the buffer via qc_stream_desc as it is already responsible for freeing the buffer. It also ensures that connection buffer accounting is only done after the whole qc_stream_buf and its buffer are allocated. Previously, the increment operation was performed between the two steps. This was not an issue, as this kind of error triggers the whole connection closure. However, if in the future this is handled as a stream closure instead, this commit ensures that the buffer remains valid in all cases.	2024-08-20 17:17:17 +02:00
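
The allocation rule described in commits aeb8c1ddc3 and 000976af58 above (a new Tx buffer may be allocated only while `buf_in_flight` is below the current congestion window) can be illustrated with a minimal sketch. This is not haproxy code: the `sketch_qcc` struct and the `sketch_*` functions are hypothetical names introduced here, and the congestion window is simply passed in as a byte count. Only the comparison itself is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-connection MUX context. Only the
 * <buf_in_flight> counter named in the commit messages is modelled. */
struct sketch_qcc {
    size_t buf_in_flight; /* sum of the sizes of all allocated Tx buffers */
};

/* Try to account one new Tx buffer of <bufsize> bytes. Allocation is
 * allowed only while buf_in_flight is lower than the window <cwnd>.
 * Returns true on success, false if the caller has to wait. */
bool sketch_buf_alloc(struct sketch_qcc *qcc, size_t cwnd, size_t bufsize)
{
    if (qcc->buf_in_flight >= cwnd)
        return false; /* emission interrupted until a buffer is released */
    qcc->buf_in_flight += bufsize;
    return true;
}

/* Release one Tx buffer of <bufsize> bytes, e.g. once its data is ACKed. */
void sketch_buf_free(struct sketch_qcc *qcc, size_t bufsize)
{
    assert(qcc->buf_in_flight >= bufsize); /* guard against underflow */
    qcc->buf_in_flight -= bufsize;
}
```

With a 32768-byte window and 16384-byte buffers, two allocations succeed and a third is refused until one buffer is released. A larger window raises the bound in the same way, which corresponds to the second wake-up case listed in the commit message.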

18279 commits