When an htx DATA block is partially transfer, we must take care to remove
exactly the copied size. To do so, we must save the size of the last block
value copied and not rely on the last data block after the copy. Indeed,
data can be merged with an existing DATA block, so the last block size can
be larger than the last part copied.
Because of this issue, it is possible to remove more data than
expected. Worse, this could lead to a crash by performing an integer
overflow on the block size.
No backport needed.
Fix a leak of the task object in acme_start_task() when one of the
condition in the function failed.
Fix issue #3308.
Must be backported to 3.2 and later.
There are two settings to control idle connection sharing across
threads.
tune.idle-pool.shared, that enables or disables it, and then
tune.takeover-other-tg-connections, which lets you or not get idle
connections from other thread groups.
Add a new keyword for tune.idle-pool.shared, "full", that lets you get
connections from other thread groups (equivalent to "full" keyword for
tune.takeover-other-tg-connections). The "on" keyword now will be
equivalent to the "restrict" one, which allowed getting connection from
other thread groups only when not doing it would result in a connection
failure (when reverse-http or when strict-macxonn are used).
tune.takeover-other-tg-connections will be deprecated.
If server returns an auth with status valid it seems that client
needs to always skip it, CA can recycle authorizations, without
this change haproxy fails to obtain certificates in that case.
It is also something that is explicitly allowed and stated
in the dns-persist-01 draft RFC.
Note that it would be better to change how haproxy does status polling,
and implements the state machine, but that will take some thought
and time, this patch is a quick fix of the problem.
See:
https://github.com/letsencrypt/boulder/issues/2125https://github.com/letsencrypt/pebble/issues/133
This must be backported to 3.2 and later.
When abortonclose option is enabled (by default since 3.3), the HTTP rules
can no longer yield if the client aborts. However, stream aborts were also
considered. So it was possible to interrupt yielding rules, especially on
the response processing, while the client was still waiting for the
response.
So now, when abortonclose option is enabled, we now take care to only
consider client aborts to prevent HTTP rules to yield.
Many thanks to @DirkyJerky for his detailed analysis.
This patch should fix the issue #3306. It should be backported as far as
2.8.
warnif_misplaced_* functions return 1 when a warning is reported and 0
otherwise. So the caller must properly handle the return value.
When parsing a proxy, ERR_WARN code must be added to the error code instead
of the return value. When a warning was reported, ERR_RETRYABLE (1) was
added instead of ERR_WARN.
And when tcp rules were parsed, warnings were ignored. Message were emitted
but the return values were ignored.
This patch should be backported to all stable versions.
When warnif_cond_conflicts() is called, we must take care to emit a warning
only when a conflict is reported. We cannot rely on the err_code variable
because some warnings may have been already reported. We now rely on the
errmsg variable. If it contains something, a warning is emitted. It is good
enough becasue warnif_cond_conflicts() only reports warnings.
This patch should fix the issue #3305. It is a 3.4-dev specific issue. No
backport needed.
When picking a mux, pay attention to its MX_FL_FRAMED. If it is set,
then it means we explicitely want QUIC, so don't use that mux for any
protocol that is not QUIC.
When parsing the check address, store the associated proto too.
That way we can use the notation like quic4@address, and the right
protocol will be used. It is possible for checks to use a different
protocol than the server, ie we can have a QUIC server but want to run
TCP checks, so we can't just reuse whatever the server uses.
WIP: store the protocol in checks
Don't assume the check will reuse the server's xprt. It may not be true
if some settings such as the ALPN has been set, and it differs from the
server's one. If the server is QUIC, and we want to use TCP for checks,
we certainly don't want to reuse its XPRT.
Permission checks on the CLI for ACME are missing.
This patch adds a check on the ACME commands
so they can only be run in admin mode.
ACME is stil a feature in experimental-mode.
Initial report by Cameron Brown.
Must be backported to 3.2 and later.
Permission checks on the CLI for ECH are missing.
This patch adds a check for "(add|set|del|show) ssl ech" commands
so they can only be run in admin mode.
ECH is stil a feature in experimental-mode and is not compiled by
default.
Initial report by Cameron Brown.
Must be backported to 3.3.
This patch fixes a warning that can be reproduced with gcc-8.5 on RHEL8
(gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-28)).
This should fix issue #3303.
Must be backported everywhere 917e82f283 ("MINOR: debug: copy debug
symbols from /usr/lib/debug when present") was backported, which is
to branch 3.2 for now.
Fix the check or arguments of the 'acme challenge_ready' command which
was checking if all arguments are NULL instead of one of the argument.
Must be backported to 3.2 and later.
Replace atol() by _strl2uic() in cases the input are ISTs when parsing
the retry-after header. There's no risk of an error since it will stop
at the first non-digit.
Must be backported to 3.2 and later.
In acme_req_finalize() the data buffer is only freed when a2base64url
succeed. This patch moves the allocation so it free() the DER buffer in
every cases.
Must be backported to 3.2 and later.
In the documentation of 'http-check expect' directive, the parameter
'status-code' was missing. Let's add it.
This patch could be backported to all stable versions.
Thanks to previous commits, it is possible to use small buffers at different
places: to store the request when a connection is queued or when L7 retries
are enabled, or for health-checks requests. However, there was no
configuration parameter to fine tune small buffer use.
It is now possible, thanks to the proxy option "use-small-buffers".
Documentation was updated accordingly.
When healthchecks were configured for a proxy, an enum-like was used to
sepcify the check's type. The idea was to reserve some values for futur
types of healthcheck. But it is overkill. I doubt we will ever have
something else than tcp and external checks. So corresponding PR_O2 flags
were slightly reviewed and a hole was filled.
Thanks to this change, some bits were released in options2 bitfield.
If support for small buffers is enabled, we now try to use them for
healthcheck requests. First, we take care the tcpcheck ruleset may use small
buffers. Send rules using LF strings or too large data are excluded. The
ability to use small buffers or not are set on the ruleset. All send rules
of the ruleset must be compatible. This info is then transfer to server's
healthchecks relying on this ruleset.
Then, when a healthcheck is running, when a send rule is evaluated, if
possible, we try to use small buffers. On error, the ability to use small
buffers is removed and we retry with a regular buffer. It means on the first
error, the support is disabled for the healthcheck and all other runs will
use regular buffers.
In h2_rcv_buf(), HTX flags are transfer with data when htx_xfer() is
called. There is no reason to continue to deal with them in the H2 mux. In
addition, there is no reason to set SE_FL_EOI flag when a parsing error was
reported. This part was added before the stconn era. Nowadays, when an HTX
parsing error is reported, an error on the sedesc should also be reported.
This reverts commit 44932b6c41.
The patch above was only necessary to handle partial headers or trailers
parsing. There was nothing to prevent the H2 multiplexer to start to add
headers or trailers in an HTX message and to stop the processing on error,
leaving the HTX message with no EOH/EOT block.
From the HTX API point of view, it is unexepected. And this was fixed thanks
to the commit ba7dc46a9 ("BUG/MINOR: h2/h3: Never insert partial
headers/trailers in an HTX message").
So this patch can be reverted. It is important to not report a parsign error
too early, when there are still data to transfer to the upper layer.
This patch must be backport where 44932b6c4 was backported but only after
backporting ba7dc46a9 first.
When a HTX stream is queued, if the request is small enough, it is moved
into a small buffer. This should save memory on instances intensively using
queues.
Applet and connection receive function were update to block receive when a
small buffer is in use.
In the same way support for large chunks was added to properly work with
large buffers, we are now adding supports for small chunks because it is
possible to process small buffers.
So a dedicated memory pool is added to allocate small
chunks. alloc_small_trash_chunk() must be used to allocate a small
chunk. alloc_trash_chunk_sz() and free_trash_chunk() were uppdated to
support small chunks.
In addition, small trash buffers are also created, using the same mechanism
than for regular trash buffers. So three thread-local trash buffers are
created. get_small_trash_chunk() must be used to get a small trash buffer.
And get_trash_chunk_sz() was updated to also deal with small buffers.
htx_move_to_small_buffer()/htx_move_to_large_buffer() and
htx_copy_to_small_buffer()/htx_copy_to_large_buffer() functions can now be
used to move or copy blocks from a default buffer to a small or large
buffer. The destination buffer is allocated and then each blocks are
transferred into it.
These funtions relies in htx_xfer() function.
htx_xfer() function should replace htx_xfer_blks(). It will be a bit easier to
maintain and to use. The behavior of htx_xfer() can be changed by calling it
with specific flags:
* HTX_XFER_KEEP_SRC_BLKS: Blocks from the source message are just copied
* HTX_XFER_PARTIAL_HDRS_COPY: It is allowed to partially xfer headers or trailers
* HTX_XFER_HDRS_ONLY: only headers are xferred
By default (HTX_XFER_DEFAULT or 0), all blocks from the source message are moved
into to the destination mesage. So copied in the destination messageand removed
from the source message.
The caller must still define the maximum amount of data (including meta-data)
that can be xferred.
It is no longer necessary to specify a block type to stop the copy. Most of
time, with htx_xfer_blks(), this parameter was set to HTX_BLK_UNUSED. And
otherwise it was only specified to transfer headers.
It is important to not that the caller is responsible to verify the original
HTX message is well-formated. Especially, it must be sure headers part and
trailers part are complete (finished by EOH/EOT block).
For now, htx_xfer_blks() is not removed for compatiblity reason. But it is
deprecated.
When small buffer size was greater than the default buffer size, an error
was triggered. We now do the same than for large buffer. A warning is
emitted and the small buffer size is set to 0 do disable small buffer
allocation.
Because small buffers were only used by QUIC streams, the pool used to alloc
these buffers was located in the quic code. However, their usage will be
extended to other parts. So, the small buffers pool was moved into the
dynbuf part.
http-errors parsing has been refactored in a recent serie of patches.
However, a null deref was introduced by the following patch in case a
non-existent http-errors section is referenced by an "errorfiles"
directive.
commit 2ca7601c2d
MINOR/OPTIM: http_htx: lookup once http_errors section on check/init
Fix this by delaying ha_free() so that it is called after ha_alert().
No need to backport.
The cfg_parse_acme() function checks if an 'acme' section is already
existing in the configuration with cur_acme->linenum > 0. But the wrong
filename and line number are displayed in the commit message.
Must be backported to 3.2 and later.
This patch fixes a leak of the ext_san structure when
sk_X509_EXTENSION_push() failed. sk_X509_EXTENSION_pop_free() is already
suppose to free it, so ext_san must be set to NULL upon success to avoid
a double-free.
Must be backported to 3.2 and later.
Use proxy_check_http_errors() on defaults proxy instances. This will
emit alert messages for errorfiles directives referencing a non-existing
http-errors section, or a warning if an explicitely listed status code
is not present in the target section.
This is a small behavior changes, as previouly this was only performed
for regular proxies. Thus, errorfile/errorfiles directives in an unused
defaults were never checked.
This may prevent startup of haproxy with a configuration file previously
considered as valid. However, this change is considered as necessary to
be able to use http-errors with dynamic backends. Any invalid defaults
will be detected on startup, rather than having to discover it at
runtime via "add backend" invokation.
Thus, any restriction on http-errors usage is now lifted for the
creation of dynamic backends.
The previous patch has splitted the original proxy_check_errors()
function in two, so that check and init steps are performed separately.
However, this renders the code inefficient for "errorfiles" directive as
tree lookup on http-errors section is performed twice.
Optimize this by adding a reference to the section in conf_errors
structure. This is resolved during proxy_check_http_errors() and
proxy_finalize_http_errors() can reuse it.
No need to backport.
Function proxy_check_errors() is used when configuration parsing is
over. This patch splits it in two newly named ones.
The first function is named proxy_check_http_errors(). It is responsible
to check for the validity of any "errorfiles" directive which could
reference non-existent http-errors section or code not defined in such
section. This function is now called via proxy_finalize().
The second function is named proxy_finalize_http_errors(). It converts
each conf_errors type used during parsing in a proper http_reply type
for runtime usage. This function is still called via post-proxy-check,
after proxy_finalize().
This patch does not bring any functional change. However, it will become
necessary to ensure http-errors can be used as expected with dynamic
backends.
This patch is the second part of the refactoring for http-errors
parsing. It renames some fields in <conf_errors> structure to clarify
their usage. In particular, union variants are renamed "inl"/"section",
which better highlight the link with the newly defined enum
http_err_directive.
In conf_errors struct, arbitrary integer values were used for both
<type> field and <status> array. This renders the code difficult to
follow.
Replaces these values with proper enums type. Two new types are defined
for each of these fields. The first one represents the directive type,
derived from the keyword used (errorfile vs errorfiles). This directly
represents which part of <info> union should be manipulated.
The second enum is used for errorfiles directive with a reference on a
http-errors section. It indicates whether or not if a status code should
be imported from this section, and if this import is explicit or
implicit.
Several resources were leaked on both success and error paths:
- X509_NAME *nm was never freed. X509_REQ_set_subject_name() makes
an internal copy, so nm must be freed separately by the caller.
- str_san allocated via my_strndup() was never freed on either path.
- On error paths after allocation, x (X509_REQ) and exts
(STACK_OF(X509_EXTENSION)) were also leaked.
Fix this by adding proper cleanup of all allocated resources in both
the success and error paths. Also move sk_X509_EXTENSION_pop_free()
after X509_REQ_sign() so it is not skipped when sign fails, and
initialize nm to NULL to make early error paths safe.
Must be backported as far as 3.2.
There was a leftover of "activity[tid].ctr1++" in commit 7d40b3134
("MEDIUM: sched: do not run a same task multiple times in series")
that unfortunately only builds in development mode :-(
This introduces 3 new settings: tune.h2.be.max-frames-at-once and
tune.h2.fe.max-frames-at-once, which limit the number of frames that
will be processed at once for backend and frontend side respectively,
and tune.h2.fe.max-rst-at-once which limits the number of RST_STREAM
frames processed at once on the frontend.
We can now yield when reading too many frames at once, which allows to
limit the latency caused by processing too many frames in large buffers.
However if we stop due to the RST budget being depleted, it's most likely
the sign of a protocol abuse, so we make the tasklet go to BULK since
the goal is to punish it.
By limiting the number of RST per loop to 1, the SSL response time drops
from 95ms to 1.6ms during an H2 RST flood attack, and the maximum SSL
connection rate drops from 35.5k to 28.0k instead of 11.8k. A moderate
SSL load that shows 1ms response time and 23kcps increases to 2ms with
15kcps versus 95ms and 800cps before. The average loop time goes down
from 270-280us to 160us, while still doubling the attack absorption
rate with the same CPU capacity.
This patch may usefully be backported to 3.3 and 3.2. Note that to be
effective, this relies on the following patches:
MEDIUM: sched: do not run a same task multiple times in series
MINOR: sched: do not requeue a tasklet into the current queue
MINOR: sched: do not punish self-waking tasklets anymore
MEDIUM: sched: do not punish self-waking tasklets if TASK_WOKEN_ANY
MEDIUM: sched: change scheduler budgets to lower TL_BULK
Having less yielding tasks in TL_BULK and more in TL_NORMAL, we need
to rebalance these queues' priorities. Tests have shown that raising
TL_NORMAL to 40% and lowering TL_BULK to 3% seems to give about the
best tradeoffs.
Self-waking tasklets are currently punished and go to the BULK list.
However it's a problem with muxes or the stick-table purge that just
yield and wake themselves up to limit the latency they cause to the
rest of the process, because by doing so to help others, they punish
themselves. Let's check if any TASK_WOKEN_ANY flag is present on
the tasklet and stop sending tasks presenting such a flag to TL_BULK.
Since tasklet_wakeup() by default passes TASK_WOKEN_OTHER, it means
that such tasklets will no longer be punished. However, tasks which
only want a best-effort wakeup can simply pass 0.
It's worth noting that a comparison was made between going into
TL_BULK at all and only setting the TASK_SELF_WAKING flag, and
it shows that the average latencies are ~10% better when entirely
avoiding TL_BULK in this case.
Nowadays due to yield etc, it's counter-productive to permanently
punish self-waking tasklets, let's abandon this principle as it prevent
finer task priority handling.
We continue to check for the TASK_SELF_WAKING flag to place a task
into TL_BULK in case some code wants to make use of it in the future
(similarly to TASK_HEAVY), but no code sets it anymore. It could
possible make sense in the future to replace this flag with a one-shot
variant requesting low-priority.
As found by Christopher, the concept of waking a tasklet up into the
current queue is totally flawed, because if a task is in TL_BULK or
TL_HEAVY, all the tasklets it will wake up will end up in the same
queue. Not only this will clobber such queues, but it will also
reduce their quality of service, and this can contaminate other
tasklets due to the numerous wakeups there are now with the subsribe
mechanism between layers.
There's always a risk that some tasks run multiple times if they wake
each other up. Now we include the loop counter in the task struct and
stop processing the queue it's in when meeting a task that has already
run. We only pick 16 bits since that's only what remains free in the
task common part, so from time to time (once every 65536) it will be
possible to wrongly match a task as having already run and stop evaluating
its queue, but it's rare enough that we don't care, because this will
be OK on the next iteration.
This patch improves the robustness of the QPACK varint decoder and fixes
potential 1-byte out-of-bounds reads in qpack_decode_fs().
In qpack_decode_fs(), two 1-byte OOB reads were possible on truncated
streams between two varint decoding. These occurred when trying to read
the byte containing the Huffman bit <h> and the Value Length prefix
immediately following an Index or a Name Length.
Note that these OOB are limited to a single byte because
qpack_get_varint() already ensures that its input length is non-zero
before consuming any data.
The fixes in qpack_decode_fs() are:
- When decoding an index, we now verify that at least one byte remains
to safely access the following <h> bit and value length.
- When decoding a literal, we now check len < name_len + 1 to ensure
the byte starting the header value is reachable.
In qpack_get_varint(), the maximum value is now strictly capped at 2^62-1
as per RFC. This is enforced using a budget-based check:
(v & 127) > (limit - ret) >> shift
This prevents values from overflowing into the 63rd or 64th bits, which
would otherwise break subsequent signed comparisons (e.g., if (len < name_len))
by interpreting the length as a negative value, leading to false positive
tests.
Thank you to @jming912 for having reported this issue in GH #3302.
Must be backported as far as 2.6
In the ENFILE and ENOMEM cases, when accept() fails, an irrelevant
global.maxsock value was printed that doesn't reflect system limits.
Now the actconn is printed that gives a hint about the failure reasons.
Should be backported in all stable branches.
We anticipated that the do-log action should be expanded with optional
arguments at some point. Now that we heard of multiple use-cases
that could be achieved with do-log action, but that are limitated by the
fact that all do-log statements inherit from the implicit log-profile
defined on the logger, we need to provide a way for the user to specify
that custom log-profile that could be used per do-log actions individually
This is what we try to achieve in this commit, by leveraging the
prerequisite work performed by the last 2 commits.
In process_send_log(), now also consider the ctx if ctx->profile != NULL
In that case, we do as if logger->prof was set, but we consider
ctx->profile in priority over the logger one. What this means is that
it will become possible to pass ctx.profile to a profile that will be
used no matter what to generate the log payload.
This is a pre-requisite to implement optional "profile" argument for
do-log action
do_log() is just a wrapper to use do_log_ctx() with pre-filled ctx, but
we now have the low-level do_log_ctx() variant which can be used to
pass specific ctx parameters instead.
Released version 3.4-dev7 with the following main changes :
- BUG/MINOR: stconn: Increase SC bytes_out value in se_done_ff()
- BUG/MINOR: ssl-sample: Fix sample_conv_sha2() by checking EVP_Digest* failures
- BUG/MINOR: backend: Don't get proto to use for webscoket if there is no server
- BUG/MINOR: jwt: Missing 'jwt_tokenize' return value check
- MINOR: flt_http_comp: define and use proxy_get_comp() helper function
- MEDIUM: flt_http_comp: split "compression" filter in 2 distinct filters
- CLEANUP: flt_http_comp: comp_state doesn't bother about the direction anymore
- BUG/MINOR: admin: haproxy-reload use explicit socat address type
- MEDIUM: admin: haproxy-reload conversion to POSIX sh
- BUG/MINOR: admin: haproxy-reload rename -vv long option
- SCRIPTS: git-show-backports: hide the common ancestor warning in quiet mode
- SCRIPTS: git-show-backports: add a restart-from-last option
- MINOR: mworker: add a BUG_ON() on mproxy_li in _send_status
- BUG/MINOR: mworker: don't set the PROC_O_LEAVING flag on master process
- Revert "BUG/MINOR: jwt: Missing 'jwt_tokenize' return value check"
- MINOR: jwt: Improve 'jwt_tokenize' function
- MINOR: jwt: Convert EC JWK to EVP_PKEY
- MINOR: jwt: Parse ec-specific fields in jose header
- MINOR: jwt: Manage ECDH-ES algorithm in jwt_decrypt_jwk function
- MINOR: jwt: Add ecdh-es+axxxkw support in jwt_decrypt_jwk converter
- MINOR: jwt: Manage ec certificates in jwt_decrypt_cert
- DOC: jwt: Add ECDH support in jwt_decrypt converters
- MINOR: stconn: Call sc_conn_process from the I/O callback if TASK_WOKEN_MSG state was set
- MINOR: mux-h2: Rely on h2s_notify_send() when resuming h2s for sending
- MINOR: mux-spop: Rely on spop_strm_notify_send() when resuming streams for sending
- MINOR: muxes: Wakup the data layer from a mux stream with TASK_WOKEN_IO state
- MAJOR: muxes: No longer use app_ops .wake() callback function from muxes
- MINOR: applet: Call sc_applet_process() instead of .wake() callback function
- MINOR: connection: Call sc_conn_process() instead of .wake() callback function
- MEDIUM: stconn: Remove .wake() callback function from app_ops
- MINOR: check: Remove wake_srv_chk() function
- MINOR: haterm: Remove hstream_wake() function
- MINOR: stconn: Wakup the SC with TASK_WOKEN_IO state from opposite side
- MEDIUM: stconn: Merge all .chk_rcv() callback functions in sc_chk_rcv()
- MINOR: stconn: Remove .chk_rcv() callback functions
- MEDIUM: stconn: Merge all .chk_snd() callback functions in sc_chk_snd()
- MINOR: stconn: Remove .chk_snd() callback functions
- MEDIUM: stconn: Merge all .abort() callback functions in sc_abort()
- MINOR: stconn: Remove .abort() callback functions
- MEDIUM: stconn: Merge all .shutdown() callback functions in sc_shutdown()
- MINOR: stconn: Remove .shutdown() callback functions
- MINOR: stconn: Totally app_ops from the stconns
- MINOR: stconn: Simplify sc_abort/sc_shutdown by merging calls to se_shutdown
- DEBUG: stconn: Add a CHECK_IF() when I/O are performed on a orphan SC
- MEDIUM: mworker: exiting when couldn't find the master mworker_proc element
- BUILD: ssl: use ASN1_STRING accessors for OpenSSL 4.0 compatibility
- BUILD: ssl: make X509_NAME usage OpenSSL 4.0 ready
- BUG/MINOR: tcpcheck: Fix typo in error error message for `http-check expect`
- BUG/MINOR: jws: fix memory leak in jws_b64_signature
- DOC: configuration: http-check expect example typo
- DOC/CLEANUP: config: update mentions of the old "Global parameters" section
- BUG/MEDIUM: ssl: Handle receiving early data with BoringSSL/AWS-LC
- BUG/MINOR: mworker: always stop the receiving listener
- BUG/MEDIUM: ssl: Don't report read data as early data with AWS-LC
- BUILD: makefile: fix range build without test command
- BUG/MINOR: memprof: avoid a small memory leak in "show profiling"
- BUG/MINOR: proxy: do not forget to validate quic-initial rules
- MINOR: activity: use dynamic allocation for "show profiling" entries
- MINOR: tools: extend the pointer hashing code to ease manipulations
- MINOR: tools: add a new pointer hash function that also takes an argument
- MINOR: memprof: attempt different retry slots for different hashes on collision
- MINOR: tinfo: start to add basic thread_exec_ctx
- MINOR: memprof: prepare to consider exec_ctx in reporting
- MINOR: memprof: also permit to sort output by calling context
- MINOR: tools: add a function to write a thread execution context.
- MINOR: debug: report the execution context on thread dumps
- MINOR: memprof: report the execution context on profiling output
- MINOR: initcall: record the file and line declaration of an INITCALL
- MINOR: tools: decode execution context TH_EX_CTX_INITCALL
- MINOR: tools: support decoding ha_caller type exec context
- MINOR: sample: store location for fetch/conv via initcalls
- MINOR: sample: also report contexts registered directly
- MINOR: tools: support an execution context that is just a function
- MINOR: actions: store the location of keywords registered via initcalls
- MINOR: actions: also report execution contexts registered directly
- MINOR: filters: set the exec context to the current filter config
- MINOR: ssl: set the thread execution context during message callbacks
- MINOR: connection: track mux calls to report their allocation context
- MINOR: task: set execution context on task/tasklet calls
- MINOR: applet: set execution context on applet calls
- MINOR: cli: keep the info of the current keyword being processed in the appctx
- MINOR: cli: keep track of the initcall context since kw registration
- MINOR: cli: implement execution context for manually registered keywords
- MINOR: activity: support aggregating by caller also for memprofile
- MINOR: activity: raise the default number of memprofile buckets to 4k
- DOC: internals: short explanation on how thread_exec_ctx works
- BUG/MINOR: mworker: only match worker processes when looking for unspawned proc
- MINOR: traces: defer processing of "-dt" options
- BUG/MINOR: mworker: fix typo &= instead of & in proc list serialization
- BUG/MINOR: mworker: set a timeout on the worker socketpair read at startup
- BUG/MINOR: mworker: avoid passing NULL version in proc list serialization
- BUG/MINOR: sockpair: set FD_CLOEXEC on fd received via SCM_RIGHTS
- BUG/MEDIUM: stconn: Don't forget to wakeup applets on shutdown
- BUG/MINOR: spoe: Properly switch SPOE filter to WAITING_ACK state
- BUG/MEDIUM: spoe: Properly abort processing on client abort
- BUG/MEDIUM: stconn: Fix abort on close when a large buffer is used
- BUG/MEDIUM: stconn: Don't perform L7 retries with large buffer
- BUG/MINOR: h2/h3: Only test number of trailers inserted in HTX message
- MINOR: htx: Add function to truncate all blocks after a specific block
- BUG/MINOR: h2/h3: Never insert partial headers/trailers in an HTX message
- BUG/MINOR: http-ana: Swap L7 buffer with request buffer by hand
- BUG/MINOR: stream: Fix crash in stream dump if the current rule has no keyword
- BUG/MINOR: mjson: make mystrtod() length-aware to prevent out-of-bounds reads
- MEDIUM: stats-file/clock: automatically update now_offset based on shared clock
- MINOR: promex: export "haproxy_sticktable_local_updates" metric
- BUG/MINOR: spoe: Fix condition to abort processing on client abort
- BUILD: spoe: Remove unsused variable
- MINOR: tools: add a function to create a tar file header
- MINOR: tools: add a function to load a file into a tar archive
- MINOR: config: support explicit "on" and "off" for "set-dumpable"
- MINOR: debug: read all libs in memory when set-dumpable=libs
- DEV: gdb: add a new utility to extract libs from a core dump: libs-from-core
- MINOR: debug: copy debug symbols from /usr/lib/debug when present
- MINOR: debug: opportunistically load libthread_db.so.1 with set-dumpable=libs
- BUG/MINOR: mworker: don't try to access an initializing process
- BUG/MEDIUM: peers: enforce check on incoming table key type
- BUG/MINOR: mux-h2: properly ignore R bit in GOAWAY stream ID
- BUG/MINOR: mux-h2: properly ignore R bit in WINDOW_UPDATE increments
- OPTIM: haterm: use chunk builders for generated response headers
- BUG/MAJOR: h3: check body size with content-length on empty FIN
- BUG/MEDIUM: h3: reject unaligned frames except DATA
- BUG/MINOR: mworker/cli: fix show proc pagination losing entries on resume
- CI: github: treat vX.Y.Z release tags as stable like haproxy-* branches
- MINOR: freq_ctr: add a function to add values with a peak
- MINOR: task: maintain a per-thread indicator of the peak run-queue size
- MINOR: mux-h2: store the concurrent streams hard limit in the h2c
- MINOR: mux-h2: permit to moderate the advertised streams limit depending on load
- MINOR: mux-h2: permit to fix a minimum value for the advertised streams limit
- BUG/MINOR: mworker: fix sort order of mworker_proc in 'show proc'
- CLEANUP: mworker: fix tab/space mess in mworker_env_to_proc_list()
Since version 3.1, the display order of old workers in 'show proc' was
accidentally reversed. The oldest worker was shown first and the newest
last, which was not the intended behavior. This regression was introduced
during the master-worker rework.
Fix this by sorting the list during deserialization in
mworker_env_to_proc_list().
An alternative fix would have been to iterate the list in reverse order
in the show proc function, but that approach risks introducing
inconsistencies when backporting to older versions.
Must be backported to 3.1 and later.
When using rq-load on tune.h2.fe.max-concurrent-streams, it's easy to
reach a situation where only one stream is allowed. There's nothing
wrong with this but it turns out that slightly higher values do not
necessarily cause significantly higher loads and will improve the user
experience. For this reason the keyword now also supports "min" to
specify a value. Experimentation shows that values from 5 to 15 remain
very effective at protecting the run queue while allowing a great level
of parallelism that keeps a site fluid.
Global setting tune.h2.fe.max-concurrent-streams now supports an optional
"rq-load" option to pass either a target load, or a keyword among "auto"
and "ignore". These are used to quadratically reduce the advertised streams
limit when the thread's run queue size goes beyong the configured value,
and automatically reduce the load on the process from new connections.
With "auto", instead of taking an explicit value, it uses as a target the
"tune.runqueue-depth" setting (which might be automatic). Tests have shown
that values between 50 and 100 are already very effective at reducing the
loads during attacks from 100000 to around 1500. By default, "ignore"
is in effect, which means that the dynamic tuning is not enabled.
The hard limit on the number of concurrent streams is currently
determined only by configuration and returned by
h2c_max_concurrent_streams(). However this doesn't permit to
change such settings on the fly without risking to break connections,
and it doesn't allow a connection to pick a different value, which
could be desirable for example to try to slow abuse down.
Let's store a copy of h2c_max_concurrent_streams() at connection
creation time into the h2c as streams_hard_limit. This inflates
the h2c size from 1324 to 1328 (0.3%) which is acceptable for the
expected benefits.
The new field th_ctx->rq_tot_peak contains the computed peak run queue
length averaged over the last 512 calls. This is computed when entering
process_runnable_tasks. It will not take into account new tasks that are
created or woken up during this round nor those which are evicted, which
is the reason why we're using a peak measurement to increase chances to
observe transient high values. Tests have shown that 512 samples are good
to provide a relatively smooth average measurement while still fading
away in a matter of milliseconds at high loads. Since this value is
only updated once per round, it cannot be used as a statistic and
shouldn't be exposed, it's only for internal use (self-regulation).
Sometimes it's desirable to observe fading away peak values, where a new
value that is higher than the historical one instantly replaces it,
otherwise contributes to it. It is convenient when trying to observe
certain phenomenons like peak queue sizes. The new function
swrate_add_peak_local() does that to a private variable (no atomic ops
involved as it's not worth the cost since such use cases are typically
local).
Add detection of release tags matching the vX.Y.Z pattern so they use
the same stable CI configuration as haproxy-* branches, rather than the
development one.
It prevents stable tag to trigger the CI with docker images and SSL
libraries only used for development.
Must be backported in stable releases.
After commit 594408cd61 ("BUG/MINOR: mworker/cli: fix show proc
pagination using reload counter"), the old-workers pagination stores
ctx->next_reload = child->reloads on flush failure, then skips entries
with child->reloads >= ctx->next_reload on resume.
The >= comparison is direction-dependent: it assumes the list is in
descending reload order (newest first). On current master, proc_list
is in ascending order (oldest first) because mworker_env_to_proc_list()
appends deserialized entries before mworker_prepare_master() appends
the new worker. This means the skip logic is inverted and can miss
entries or loop incorrectly depending on the version.
We fix this by renaming the context field to resume_reload and changing its
semantics: it now tracks the reload count of the last *successfully
flushed* row rather than the failed one. On flush failure, resume_reload
is left unchanged so the failed row is replayed on the next call. On
resume, entries are skipped by walking the list until the marker entry is
found (exact == match), which works regardless of list direction.
Additionally, we have to handle the unlikely case where the marker entry
is deleted from proc_list between handler calls (e.g. the process exits and
SIGCHLD processing removes it). Detect this by tracking the previous
LEAVING entry's reload count during the skip phase: if two consecutive
entries straddle the skip value (one > skip, the other < skip), the
deleted entry's former position has been crossed, so skipping stops and
the current entry is emitted.
This should be backported to all stable branches. On branches where
proc_list is in descending order (2.9, 3.0), the fix applies the
same way since the skip logic is now direction-agnostic.
HTTP/3 parser cannot deal with unaligned frames, except for DATA. As it
was expected that such case would not occur, a simple BUG_ON() was
written to protect HEADERS parsing.
First, this BUG_ON() was incorrectly written due an incorrect operator
'>=' vs '>' when checking if data wraps. Thus this patch correct it.
However this correction is not sufficient as it still possible to handle
a large unaligned HEADERS frame, which would trigger this BUG_ON(). This
is very unlikely as HEADERS is the first received frame on a request
stream, but not completely impossible. As HTTP/3 frame header (type +
length) is parsed first and removed, this leaves a small gap at the
buffer beginning. If this small gap is then filled with the remaining
frame payload, it would result in unaligned data. Also, trailers are
also sensitive here as in this case a HEADERS frame is handled after
other frames.
The objective of this patch is to ensure that an unaligned frame is now
handled in a safe way. This is extend to all HTTP/3 frames (except DATA)
and not only to HEADERS type. Parsing is interrupted if frame payload is
wrapping in the buffer. This should never happen except maybe with some
weird clients, so the connection is closed with H3_EXCESSIVE_LOAD error.
This approach is considered the safest one, in particular for backport
purpose. In the future, realign operation via copy may be implemented
instead if considered as useful.
This must be backported up to 2.6.
In QUIC, a STREAM frame may be received with no data but with FIN bit
set. This situation is tedious to handle and haproxy parsing code has
changed several times to deal with this situation. Now, H3 and H09
layers parsing code are skipped in favor of the shared function
qcs_http_handle_standalone_fin() used to handle the HTX EOM emission.
However, this shortcut bypasses an important HTTP/3 validation check on
the received body size vs the announced content-length header. Under
some conditions, this could cause a desynchronization with the backend
server which could be exploited for request smuggling.
Fix HTTP/3 parsing code by adding a call to h3_check_body_size() prior
to qcs_http_handle_standalone_fin() if content-length header has been
found. If the body size is incorrect, the stream is immediately resetted
with H3_MESSAGE_ERROR code and the error is forwarded to the stream
layer.
Thanks to Martino Spagnuolo for his detailed report on this issue and
for having contacting us about it via the security mailing list.
This must be backported up to 2.6.
hstream_build_http_resp() currently uses snprintf() to build the
status code and the generated X-req/X-rsp header values.
These strings are short and are fully derived from already parsed request
state, so they can be assembled directly in the HAProxy trash buffer using
`chunk_strcat()` and `ultoa_o()`.
This keeps the generated output unchanged while removing the remaining
`snprintf()` calls from the response-building path.
No functional change is expected.
Signed-off-by: Aleksandar Lazic <al-haproxy@none.at>
The window size increments are 31 bits and the topmost bit is reserved
and should be ignored, however it was not masked, so a peer sending it
set would emit a negative value which could actually reduce the current
window instead of increasing it. Note that the window cannot reach zero
as there's already a test for this, but transfers could slow down to
the same speed as if an initial window of just a few bytes had been
advertised. Let's just mask the reserved bit before processing.
This should be backported to all stable versions.
The stream ID indicated in GOAWAY frames must have its bit 31 (R) ignored
and this wasn't the case. The effect is that if this bit was present, the
GOAWAY frame would mark the last acceptable stream as negative, which is
the default situation (unlimited), thus would basically result in this
GOAWAY frame to be ignored since it would replace a negative last_sid
with another negative one. The impact is thus basically that if a peer
would emit anything non-zero in the R bit, the GOAWAY frame would be
ignored and new streams would still be initiated on the backend, before
being rejected by the server.
Thanks to Haruto Kimura (Stella) for finding and reporting this bug.
This fix needs to be backported to all stable versions.
The key type received over the peers protocol is not checked for
validity and as a result can crash the process when passed through
peer_int_key_type[] in peer_treat_definemsg(). The risk remains
very low since only trusted peers may exchange tables, however it
represents a risk the day haproxy supports new key types, because
mixing old and new versions could then cause the old ones to crash.
Let's add the required check in peer_treat_definemsg().
It is also worth noting that in this function a few protocol identifiers
of type int read directly from a var_int via intdecode() and that some
protocol aliasing may occur (e.g. table_id, table_id_len etc). This is
not supposed to be a problem but it could hide implementation bugs and
cause interoperability issues once fixed, so these should be addressed
in a future commit that will not be marked for backporting.
Thanks to Haruto Kimura (Stella) for finding and reporting this bug.
This fix needs to be backported to all stable versions.
In pcli_prefix_to_pid(), when resolving a worker by absolute pid
(@!<pid>) or by relative pid (@1), a worker that still has PROC_O_INIT
set (i.e. not yet ready, still initializing) could be returned as a
valid target.
During a reload, if a client connects to the master CLI and sends a
command targeting a worker (e.g. @@1 or @@!<pid>), the master resolves
the target pid and attempts to forward the command by transferring a fd
over the worker's sockpair. If the worker is still initializing and has
not yet sent its READY signal, its end of the sockpair is not usable,
causing send_fd_uxst() to fail with EPIPE. This results in the
following alert being repeated in a loop:
[ALERT] (550032) : socketpair: Cannot transfer the fd 13 over sockpair@5. Giving up.
The situation is even worse if the initializing worker has already
exited (e.g. due to a bind failure) but has not yet been removed from
the process list: in that case the sockpair's remote end is already
closed, making the failure immediate and unrecoverable until the dead
worker is cleaned up.
This was not possible before 3.1 because the master's polling loop only
started once all workers were fully ready, making it impossible to
receive CLI connections while a worker was still initializing.
Fix this by skipping workers with PROC_O_INIT set in both the absolute
and relative pid resolution paths of pcli_prefix_to_pid(), so that
only fully initialized workers can be targeted.
Must be backported to 3.1 and later.
When loading libs into the core dump, let's also try to load
libthread_db.so.1 that gdb usually requires. It can significantly help
decoding the threads for systems which require it, and the file is quite
small. It can appear at a few different locations and is generally next
to libpthread.so, or alternately libc, so we first look where we found
them, and fall back to a few other common places. The file is really
small, a few tens of kB usually.
When set-dumpable=libs, let's also pick the debug symbols for the libs
we're loading. For now we only try /usr/lib/debug/<path>, which is quite
common and easy to guess. Build IDs could also be used but are more
complex to deal with, so let's stay simple for now.
This utility takes in argument the path to a core dump, and it looks
for the archive signature of libraries embedded with "set-dumpable libs",
and either emits the offset and size of stdout, or directly dumps the
contents so that the tar file can be extracted directly by piping the
output to tar xf.
When "set-dumpable" is set to "libs", in addition to marking the process
dumpable, haproxy also reads the binary and shared objects into memory as
a tar archive in a page-aligned location so that these files are easily
extractable from a future core dump. The goal here is to always have
access to the exact same binary and libs as those which caused the core
to happen. It's indeed very frequent to miss some of these, or to get
mismatching files due to a local update that didn't experience a reload,
or to get those of a host system instead of the container.
The in-memory tar file presents everything under a directory called
"core-%d" where %d corresponds to the PID of the worker process. In
order to ease the finding of these data in the core dump, the memory
area is contiguous and surrounded by PROT_NONE pages so that it appears
in its own segment in the core file. The total size used by this is a
few tens of MB, which is not a problem on large systems.
The global "set-dumpable" keyword currently is only positional. Let's
extend its syntax to support arguments. For now we support both "on"
and "off" to explicitly enable or disable it.
New function load_file_into_tar() concatenates a file into an in-memory
tar archive and grows its size. Only the base name and a provided prefix
are used to name the faile. If the file cannot be loaded, it's added as
size zero and permissions 0 to show that it failed to load. This will
be used to load post-mortem information so it needs to remain simple.
The purpose here is to create a tar file header in memory from a known
file name, prefix, size and mode. It will be used to prepare archives
of libs in use for improved debugging, but may probably be useful for
other purposes due to its simplicity.
Since 7a1382da7 ("BUG/MINOR: spoe: Fix condition to abort processing on
client abort"), the chn variable is no longer used in
spoe_process_event(). Let's remove it
This patch must be backported with the commit above, as far as 3.1.
The test to detect client aborts in the SPOE, introduced by commit b3be3b94a
("BUG/MEDIUM: spoe: Properly abort processing on client abort"), was no
correct. Producer flags must not be tested. Only the frontend SC must be
tested when the abortonclose option is set.
Because of this bug, when a client aborted, the SPOE processing was aborted
too, regardless the abortonclose option.
This patch must be backpoeted with the commit above, so as far as 3.1.
haproxy_sticktable_local_updates corresponds to the table->localupdate
counter, which is used internally by the peers protocol to identify
update messages in order to send and ack them among peers.
Here we decide to expose this information, as it is already the case in
"show peers" output, because it turns out that this value, which is
cumulative and grows in sync with the number of updates triggered on the
table due to changes initiated by the current process, can be used to
compute the update rate of the table. Computing the update rate of the
table (from the process point of view, ie: updates sent by the process and
not those received by the process), can be a great load indicator in order
to properly scale the infrastructure that is intended to handle the
table updates.
Note that there is a pitfall, which is that the value will eventually
wrap since it is stored using unsigned 32bits integer. Scripts or system
making use of this value must take wrapping into account between two
readings to properly compute the effective number of updates that were
performed between two readings. Also, they must ensure that the "polling"
rate between readings is small enough so that the value cannot wrap behind
their back.
We no longer rely on now_offset stored in the shm-stats-file. Instead
haproxy automatically computes the now_offset relative to the monotonic
clock and the shared global clock.
Indeed, the previous model based on static now_offset when monotonic
clock is available proved to be insufficient when used in
combination with shm-stats-file (that is when monotonic clock is shared
between multiple co-processes). In ideal situation co-processes would
correctly apply the offset to their local monotonic clock and end up
with consistent now_ns. But when restarting from an existing
shm-stats-file from a previous session (ie: prior to reboot), then the
local monotonic clock would no longer be consistent with the one used
to update the file previously, so applying a static offset would fail
to restore clock consistency.
For this specific issue, a workaround was brought by 09bf116
("BUG/MEDIUM: stats-file: detect and fix inconsistent shared clock when resuming from shm-stats-file")
but the solution implemented there was deemed too fragile, because there
is a 60sec window where the fix would fail to detect inconsistent clock
and would leave haproxy with a broken clock ranging from 0 to 60 seconds,
which can be huge..
By simply recomputing the now_offset each time we learn from another
process (through the shared map by reading global_now_ns), we simply
recompute our local offset (difference between OUR monotonic clock
and the SHARED one). Also, in clock_update_global_date(), we make
sure we always recompute the now_offset as now_ms may have been
updated from shared clock if shared clock was ahead of us.
Thanks to that new logic, interrupted processes, resumed processes,
processed started with shm-stats-file from previous session now
correctly recover from those various situations and multiple
co-processes with diverting clocks on startup end up converging to
the same values.
Since it is no longer relevant to save now_offset in the map, it was
removed but to prevent shm-stats-file incompatibility with previous
versions, 8-byte hole was forced, and we didn't bump the shm-stats-file
version on purpose.
This patch may be backported in 3.3 after a solid period of observation
to ensure we didn't break things.
mystrtod() was not length-aware and relied on null-termination or a
non-numeric character to stop. The fix adds a length parameter as a
strict upper bound for all pointer accesses.
The practical impact in haproxy is essentially null: all callers embed
the JSON payload inside a large haproxy buffer, so the speculative read
past the last digit lands on memory that is still within the same
allocation. ASAN cannot detect it in a normal haproxy run for the same
reason — the overread never escapes the enclosing buffer. Triggering a
detectable fault requires placing the JSON payload at the exact end of
an allocation.
Note: the 'path' buffer was using a null-terminated string so the result
of strlen is passed to it, this part was not at risk.
Thanks to Kamil Frankowicz for the original bug report.
This patch must be backported to all maintained versions.
The commit 9f1e9ee0e ("DEBUG: stream: Display the currently running rule in
stream dump") revealed a bug. When a stream is dumped, if it is blocked on a
rule, we must take care the rule has a keyword to display its name.
Indeed, some action parsings are inlined with the rule parser. In that case,
there is no keyword attached to the rule.
Because of this bug, crashes can be experienced when a stream is
dumped. Now, when there is no keyword, "?" is display instead.
This patch must be backported as far as 2.6.
When a L7 retry is performed, we should not rely on b_xfer() to swap the L7
buffer with the request buffer. When it is performed the request buffer is
not allocated. b_xfer() must not be called with an unallocated destination
buffer. The swap remains an optim. For instance, It is not performed on
buffers of different size. So the caller is responsible to provide an
allocated destination buffer with enough free space to transfer data.
However, when a L7 retry is performed, we cannot allocate a request buffer,
because we cannot yield. An error was reported, if we wait for a buffer, the
error will be handled by process_stream(). But we can swap the buffers by
hand. At this stage, we know there is no request buffer, so we can easily
swap it with the L7 buffer.
Note there is no real bug for now.
This patch could be backported to all stable versions.
In HTX, headers and trailers parts must always be complete. It is unexpected
to found header blocks without the EOH block or trailer blocks without the
EOT block. So, during H2/H3 message parsing, we must take care to remove any
HEADER/TRAILER block inserted when an error is encountered. It is mandatory
to be sure to properly report parsing error to upper layer.x
It is now performed by calling htx_truncat_blk() function on the error
path. The tail block is saved before converting any HEADERS/TRAILERS frame
to HTX. It is used to remove all inserted block on error.
This patch rely on the following one:
"MINOR: htx: Add function to truncate all blocks after a specific block"
It should be backported with the commit above to all stable versions for
the H2 part and as far as 2.8 for h3 one.
When H2 or H3 trailers are inserted in an HTX message, we must take care to
not exceed the maximum number of trailers allowed in a message (same than
the maximum number of headers, i.e tune.http.maxhdr). However, all HTX
blocks in the HTX message were considered. Only TRAILERS HTX blocks must be
considered.
To fix the issue, in h2_make_htx_trailers(), we rely on the "idx" variable
at the end of the for loop. In h3_trailers_to_htx(), we rely on the
"hdr_idx" variable.
This patch must be backported to all stables versions for the H2 part and as
far as 2.8 for the H3 one.
pouet
L7 retries are buggy when a large buffer is used on the request channel. A
memcpy is used to copy data from the request buffer into the L7 buffer. The
L7 buffer is for now always a standard buffer. So if a larger buffer is
used, this leads to a buffer overflow and crash the process.
The Best way to fix the issue is to disable L7 retries when a large buffer
was allocated for the request channel. In that case, we don't want to
allocate an extra large buffer.
No backport needed.
When a large buffer is used on a channel, once we've started to send data to
the opposite side, receives are blocked temporarily to be sure to flush the
large buffer ASAP to be able to fall back on regular buffers. This was
performed by skipping call to the endpoint (connection or applet). Howerver,
doing so, this broken the abortonclose and more generally this masked any
shut or error events reported by the lower layer.
To fix the issue, instead of skipping receives, we now try a receive but
with a requested size set to 0.
No backport needed
Client abort when abortonclose is configured was ignored when messges were
sent on event while it works properly when messages are sent via an
"send-spoe-group" action.
To fix the issue, when the SPOE filter is waiting for the SPOE applet
response, it must check if a client abort was reported and if so, must
interrupt its processing.
This patch should be backported as far as 3.1.
When the SPOE applet is created, the SPOE filter is set in SENDING_MSGS
state. When the applet has transferred data, it should switch the filter to
WAITING_ACK state. Concretly, there is no bug. At best, it could save some
useless applet wakeups.
This patch should be backported as far as 3.1
When SC's shudown callback functions were merged, a regression was
introduced. The applet was no longer woken up. Because of this bug, an
applet could remain blocked, waiting for an I/O event or a timeout.
This patch should fix the issue #3301.
No backport needed.
FDs received through recv_fd_uxst() do not have FD_CLOEXEC set.
The equivalent sock_accept_conn() already handles this correctly:
any FD accepted or received in the master must be marked close-on-exec
to avoid leaking it across the execvp() performed on soft-reload.
This is currently triggering a leak in the master since 3.1: the worker
sends a socketpair fd to the master to issue the _send_status CLI
command, and recv_fd_uxst() receive it without setting FD_CLOEXEC. If a
re-exec is emitted before the master had the chance to close that fd, it
survives execvp() and appears as an untracked unnamed AF_UNIX socket in
the new master generation.
This must be backported to all maintained branches.
Add a NULL guard for the version field. This has no functional impact
since the master process never uses this field for its own mworker_proc
element, and should be the only one impacted. This avoid seeing "(null)"
in the version field when debugging.
Must be backported to 3.1 and later.
During a soft reload, a starting worker sends sock_pair[0] to the master
via send_fd_uxst(), then reads on sock_pair[1] waiting for the master to
acknowledge receipt. Because of a documented macOS sendmsg(2) bug, the
worker must keep sock_pair[0] open until the master confirms the fd was
received by the CLI applet. This means the read() on sock_pair[1] will
never return 0 (EOF), since the worker itself still holds a reference to
sock_pair[0]. The worker can only unblock when the master actively sends
a byte back. If the master crashes before doing so, the worker blocks
indefinitely in read().
Fix this by setting a 2-second SO_RCVTIMEO on sock_pair[1] before the
read(), so the worker can unblock and continue regardless of the master's
state.
This was introduced by d7f6819161 ("BUG/MEDIUM: mworker: fix startup
and reload on macOS").
This should be backported to 3.1 and later.
In mworker_proc_list_to_env(), a typo used '&=' instead of '&' when
checking PROC_O_TYPE_WORKER in child->options. This would corrupt the
options field by clearing all bits except PROC_O_TYPE_WORKER, but since
the function is called right before the master re-execs itself during a
reload, the corruption has no actual effect: the in-memory proc_list is
discarded by the exec, and the options field is not serialized to the
environment anyway.
This should be backported to all maintained versions.
We defer processing of the "-dt" options until after the configuration
file has been read. This will be useful if we ever allow trace sources
to be registered later, for instance with LUA.
No backport needed.
In master-worker mode, when a freshly forked worker looks up its own
entry in proc_list to send its "READY" status to the master, the loop
was breaking on the first process with pid == -1 regardless of its
type. If a non-worker process (e.g. a master or program) also had
pid == -1, the wrong entry could be selected, causing send_fd_uxst()
to use an invalid ipc_fd.
Fix this by adding a PROC_O_TYPE_WORKER check to the loop condition,
and add a BUG_ON() assertion to catch any case where the loop exits
without finding a valid worker entry.
Must be backported to 3.1.
"show profiling" supports "aggr" for tasks but it was ignored for
memory. Now that we're having many more entries, it makes sense to
have it to ignore the call path and merge similar operations.
Keywords registered out of an initcall will have a TH_EX_CTX_CLI_KWL
execution context pointing to the keyword list. The report will indicate
the 5 first words of the first command of the list, e.g.:
exec_ctx: cli kwl starting with 'debug counters '
This should also work for CLI keywords registered in Lua.
Now CLI keywords registered via an initcall will be tracked during
execution, by keeping a link to their initcall location. "show threads"
now shows "exec_ctx: kw registered at @debug.c:3093" which indeed
corresponds to the initcall for the debugging commands.
Till now the CLI didn't know what keyword was being processed after it
was parsed. In order to report the execution context, we'll need to
store it. And this may even help for post-mortem analysis to know the
exact keyword being processed, so let's store the pointer in the cli_ctx
part of the appctx.
It allows to know when a thread is currnetly running inside an applet.
For example now "show threads" will show "applet '<CLI>'" for the thread
issuing this command.
It now appears almost everywhere due to callbacks (e.g. ssl_sock_io_cb).
Muxes also become visible now on memory profiling. A small test on h1+ssl
yields 838 lines of statistics. The number of buckets should definitely
be increased, and more grouping criteria should be added.
A performance test was conducted to observe the possible effect of
setting the execution context on each task switch, and it didn't change
at all, remaining at about 1.01 billion ctxsw/s on a 128-thread EPYC.
Most calls to mux ops were instrumented with a CALL_MUX_WITH_RET() or
CALL_MUX_NO_RET() macro in order to make the current thread's context
point to the called mux and be able to track its allocations. Only
a bunch of harmless mux_ctl() and ->subscribe/unsubscribe calls were
left untouched since useless. But destroy/detach/shut/init/snd_buf
and rcv_buf are now tracked.
It will not show allocations performed in IO callback via tasklet
wakeups however.
In order to ease reading of the output, cmp_memprof_ctx() knows about
muxes and sorts based on the .subscribe function address instead of
the mux_ops address so as to keep various callers grouped.
In order to be able to track memory allocation performed from message
callbacks, let's set the thread execution context to a generic function
pointing to them during their call. This allows for example to observe
the share of SSL allocations caused by ssl_sock_parse_clienthello() when
SSL captures are enabled.
The release calls are automatic from the SSL library for these, and are
registered directly via SSL_get_ex_new_index(). Maybe we should improve
the internal API to wrap that function and systematically track free
calls as well. In this case, maybe even registering the message callback
registration could take both the callback and the release function.
There are few such users however, essentially capture and keylog.
Doing this allows to report the allocations/releases performed by filters
when running with memory profiling enabled. The flt_conf pointer is kept
and the report shows the filter name.
A bit similar to what was done for sample fetch functions and converters,
we now store with each action keyword the location of the initcall when
they're registered this way. Since there are many functions only calling
a LIST_APPEND() (one per ruleset), we now implement a dedicated function
to store the context in all keywords before doing the append.
However that's not sufficient, because keywords are not mandatory for
actions, so we cannot safely rely on rule->kw. Thus we then set the
exec_ctx per rule when they are all scanned in check_action_rules(),
based on the keyword if it exists, otherwise we make a context from
the action_ptr function if it is set (it should).
Finally at all call points we now check rule->exec_ctx.
The purpose here is to be able to spot certain callbacks, such as the
SSL message callbacks, which are difficult to associate to anything.
Thus we introduce a new context type, TH_EX_CTX_FUNC, for which the
context is just the function pointed to by the void *pointer. One
difficulty with callbacks is that the allocation and release contexts
will likely be different, so the code should be properly structured
to allow proper tracking, either by instrumenting all calls, or by
making sure that the free calls are easy to spot in a report.
With the two new context types TH_EX_CTX_SMPF/CONV, we can now also
report contexts corresponding to direct calls to sample_register_fetches()
and sample_register_convs(). In this case, the first word of the keyword
list is reported.
Now keywords are registered with an exec_ctx and this one is passed
when calling ->process. The ctx is of type INITCALL when passed via
an initcall where we know the file name and line number.
This was tested with and extra "malloc(15)" added in smp_fetch_path()
which shows that it works:
$ socat /tmp/sock1 - <<< "show profiling memory"|grep via
Calls | Tot Bytes | Caller and method [via]
1893399 0 60592592 0| 0x78b2ec task_run_applet+0x3339c malloc(32) [via initcall @http_fetch.c:2416]
When the execution context is set to TH_EX_CTX_INITCALL, the pointer
points to a valid initcall, and the decoder will show "kw registered
at %s:%d" with file and line number of the initcall declaration. It's
up to the caller to make the initcall pointer point to the one that was
set during the initcall. The purpose here is to be able to preserve and
pass that knowledge of an initcall down the chain so that future calls
to functions registered via the initcall are still assigned to it.
The INITCALL macros will now store the file and line number where they
are declared into the initcall struct, and RUN_INITCALLS() will assign
them to the global caller_file and caller_line variables, and will even
set caller_initcall to the current initall so that at any instant such
functions know where their caller declared them. This will help with
error messages and traces where a bit of context will be welcome.
Now we have one extra line saying "exec_ctx: something" in thread dumps
when it's known. It may help with warnings and panics to figure what
is ongoing.
The new function chunk_append_thread_ctx() appends to a buffer the given
execution context based on its type and pointer. The goal is to easily
use it in profiling output and thread dumps. For now it only handles
TH_EX_CTX_NONE (which prints nothing) and TH_EX_CTX_OTHER (which indicates
"other ctx" followed by the pointer). It will be extended by new types as
they arrive.
By passing "byctx" to "show profiling memory", it's possible to sort by
the calling context first, which could help group certain calls by
subsystem and ease the interpretation of the output.
This now allows to report the same function in multiple bins based on the
th_ctx's exec_ctx discriminant. It's also worth noting that the context is
not atomically committed, but this shouldn't be a problem since a single
entry can get it. In the worst case, a second thread trying to create the
same context in parallel would create a different bin just for this call,
which is harmless. The same situation already exists with the caller
pointer.
We have the struct made of a type and a pointer in the th_ctx and a
function to switch it for the current thread. Two macros are provided
to enclose a callee within a temporary context. For now only type OTHER
is supported (only a generic pointer).
When two pointer hash to the same memprofile bin, we currently try again
with the same bin until we find a spare one or we reach the limit of 16.
Olivier suggested to try with a different step for different pointers so
as to limit the number of bins to visit in such a case, so let's split
the pointer hash calculation so that we keep the raw hash before reduction
and use its lowest bits as the retry step. We force lowest bit to 1 to
avoid integral multiples that would oscillate between only a few positions.
Quick tests with h1+h2 requests show that for ~744 distinct entries, we
used to have 1.17 retries per lookup before and 0.6 now so we're halving
the cost of hash collisions. A heavier workload that used to produce 920
entries with 2.01 retries per lookup now reaches 966 entries (94.3% usage
vs 89.8% before) with only 1.44 retries per lookup.
This should be safe to backport, but depends on this previous commit:
MINOR: tools: extend the pointer hashing code to ease manipulations
The purpose here is to combine two pointers and a long argument instead
of having the caller perform the mixing. Also it's cleaner and more
efficient this was as the arg is mixed after the multiplications, and
modern processors are efficient at multiplying then adding.
We'll need to further extend the pointer hashing code to pass extra
parameters and to retrieve the dropped bits, so let's first split the
part that hashes the pointer from the part that reduces the hash to
the desired size.
Historically, the data manipulated by "show profiling" were copied
onto the stack for sorting and aggregating, but not only this limits
the number of entries we can keep, but it also has an impact on CPU
usage (having to redo the whole copy+sort upon each resume) and the
output accuracy (if sorting changes lines, resume may happen from an
incorrect one).
Instead, let's dynamically allocate the work buffer and place it into
the service context. We only allocate it immediately before needing it
and release it immediately afterwards so that it doesn't stay long. It
also requires a release handler to release those allocates by interrupted
dumps, but that's all. The overall result is now much cleaner, more
accurate, faster and safer.
This patch may be backported to older LTS releases.
In check_config_validity() and proxy_finalize() we check the consistency
of all rule sets, but the quic_initial rules were not placed there. This
currently has little to no impact, however we're going to use that to
also finalize certain debugging info so better call the function. This
can be backported to 3.1 (proxy_finalize is 3.4-only).
In 3.1, per-DSO statistics were added to the memprofile output by
commit 401fb0e87a ("MINOR: activity/memprofile: show per-DSO stats").
However an strdup() is performed there on the .info field, that is
never freed when leaving the function. Let's do it each time we leave
it. Ironically, this was found thanks to "show profiling" showing
itself as an unbalanced caller of strdup().
This needs to be backported to 3.0 since that commit was backported
there.
In 3.3, the "make range" target adopted a test command via the TEST_CMD
variable, with commit 90b70b61b1 ("BUILD: makefile: implement support
for running a command in range"). However now it breaks the script when
TEST_CMD is not set due to the shell expansion leaving two '||' operators
side by side. Let's fix this by passing the contents of the makefile
variable in positional arguments before executing them.
To read early data with AWS-LC (and BoringSSL), we have to use
SSL_read(). But SSL_read() will also try to do the handshake if it
hasn't been done yet, and at some point will do the handshake and will
return data that are actually not early data. So use SSL_in_early_data()
to make sure that the data we received are actually early data, and only
if so add the CO_FL_EARLY_DATA flag. Otherwise any data first received will be
considered early, and a Early-data header will be added.
As this bug was introduced by 76ba026548,
it should be backported with it.
Upon _send_status, always stop the listener from which the request
was received, rather than looking it up from the proc_list entry via
fdtab[proc->ipc_fd[0]].owner.
A BUG_ON is added to verify that the listener which received the
request is the one expected for the reported PID.
This means it is no longer possible to send "_send_status READY XXX"
manually through the master CLI for testing, as that would trigger
the BUG_ON.
Must be backported as far as 3.1.
The API for early data is a bit different with BoringSSL and AWS-LC than
it is for OpenSSL. As it was implemented, early data would be accepted,
but would not be processed until the handshake is done. Change that by
doing something similar to what OpenSSL does, and, if 0RTT has been
enabled on the listener, use SSL_read() to try to get early data before
starting the handshake, and if there's any, provide them to the mux the
same way it is done for OpenSSL.
That replaces a bunch of #ifdef SSL_READ_EARLY_DATA_SUCCESS by
something specific to OpenSSL has to be done.
This should be backported to 3.3.
The name of "Global section" was changed only in the summary, not in the
text itself. The names of some related refs were also updated.
Should be backported as far as 3.2.
EVP_MD_CTX is allocated using EVP_MD_CTX_new() but was never freed.
ctx should be initialized to NULL otherwise EVP_MD_CTX_free(ctx) could
segfault.
Must be backported as far as 3.2.
With a config:
backend bk_app
http-check expect status 200 string "status: ok"
This now correctly emits the error:
config : parsing [./patch.cfg:2] : 'http-check expect' : only one pattern expected.
This line containing the typo is unchanged since at least HAProxy 2.2, the
patch should be backported into all supported branches.
Starting with OpenSSL 4.0, X509_get_subject_name(), X509_get_issuer_name(),
and X509_CRL_get_issuer() return a const-qualified X509_NAME pointer.
Similarly, X509_NAME_get_entry() returns a const X509_NAME_ENTRY *, and
X509_NAME_ENTRY_get_data() returns a const ASN1_STRING *.
Introduce the __X509_NAME_CONST__ macro (defined to 'const' for OpenSSL
>= 4.0.0, empty for WolfSSL and older OpenSSL version which lacks const
on these APIs) and use it to qualify X509_NAME * variables and the
parameters of the three DN helper functions ssl_sock_get_dn_entry(),
ssl_sock_get_dn_formatted(), and ssl_sock_get_dn_oneline(). This avoids
both const-qualifier warnings on OpenSSL 4.0 and discarded-qualifier
warnings on WolfSSL, without needing explicit casts at call sites.
In ssl_sock.c (ssl_get_client_ca_file) and ssl_gencert.c
(ssl_sock_do_create_cert), a __X509_NAME_CONST__ X509_NAME * variable was
being reused to store the result of X509_NAME_dup() and then passed to
mutating functions (X509_NAME_add_entry_by_txt, X509_NAME_free). Introduce
separate X509_NAME * variables (xn_dup, subject) to hold the mutable
duplicate.
Original patch from Alexandr Nedvedicky <sashan@openssl.org>:
https://www.mail-archive.com/haproxy@formilux.org/msg46696.html
In OpenSSL 4.0, the ASN1_STRING struct was made opaque and direct access
to its members (->data, ->length, ->type) no longer compiles. Replace
these accesses in ssl_sock_get_serial(), ssl_sock_get_time(), and
asn1_generalizedtime_to_epoch() with the proper accessor functions
ASN1_STRING_get0_data(), ASN1_STRING_length(), and ASN1_STRING_type().
The old direct access is preserved under USE_OPENSSL_WOLFSSL since
WolfSSL does not provide these accessor functions.
Original patch from Alexandr Nedvedicky <sashan@openssl.org>:
https://www.mail-archive.com/haproxy@formilux.org/msg46696.html
When a master process is reloading, the HAPROXY_PROCESSES variable is
deserialized. In older version of the master-worker (< 1.9), no master
element was existing in this variable.
This is not suppose to happen anymore, and could have provoked problems
in the master anyway.
This patch changes the behavior by exiting the master with an alert if
mp master element was found in this variable.
When no endpoint is attached to a SC, it is unexpected to have I/O (receive
or send). But we honestly don't know if it happens or not. So a CHECK_IF()
is added to be able to track such calls.
Calls to se_shutdown were no the same between applets and mux endpoints.
Only the SHUTW flag was not the same. However, on the multiplexers are
sensitive to the true SHUTW flag. The applets handle all of them the same
way. So calls to se_shutdown() from sc_abort() and sc_shutdown() can be
merged to always use the multiplexer version.
wake_srv_chk() function is now only used by srv_chk_io_cb(), the
health-checl I/O callback function. So let's remove it. The code of the
function was moved in srv_chk_io_cb().
At we fail to create a mux, in conn_create_mux(), instead of calling the
app_ops .wake() callback function, we can directly call sc_conn_process().
At this stage, we know we are using an connection, so it is safe to do so.
At the end of task_run_applet() and task_process_applet(), instead of
calling the app_ops .wake() callback function, we can directly call
sc_applet_process(). At this stage, we know we are using an applet, so it is
safe to do so.
Thanks to previous commits, it is now possible to wake the data layer up,
via a tasklet_wakeup, instead of using the app_ops .wake() callback
function.
When a data layer must be notified of a mux event (an error for instance),
we now always perform a tasklet_wakeup(). TASK_WOKEN_MSG state is used by
default. TASK_WOKEN_IO is eventually added if the data layer was subscribed
to receives or sends.
Changes are not trivial at all. We replaced a synchronous call to the
sc_conn_process() function by a tasklet_wakeup().
Now, when a mux stream is waking its data layer up for receives or sends, it
uses the TASK_WOKEN_IO state. The state is not used by the stconn I/O
callback function for now.
In spop_resume_each_sending_spop_strm(), there was exactly the same code
than spop_strm_notify_send(). So let's use spop_strm_notify_send() instead
of duplicating code.
It is the first commit of a series to refactor the SC app_ops. The first
step is to remove the .wake() callback function from the app_ops to replace
all uses by a wakeup of the SC tasklet.
Here, when the SC is woken up, the state is now tested and if TASK_WOKEN_MSG
is set, sc_conn_process() is called.
When ECDH-ES algorithm is used in a JWE token, no cek is provided and
one must be built in order to decrypt the contents of the token. The
decrypting key is built by deriving a temporary key out of a public key
provided in the token and the private key provided by the user and
performing a concatKDF operation.
When the encoding is of the ECDH family, the optional "apu" and "apv"
fields of the JOSE header must be parsed, as well as the mandatory "epk"
field that contains an EC public key used to derive a key that allows
either to decrypt the contents of the token (in case of ECDH-ES) or to
decrypt the content encoding key (cek) when using ECDH-ES+AES Key Wrap.
Convert a JWK with the "EC" key type ("kty") into an EVP_PKEY. The JWK
can either represent a public key if it only contains the "x" and "y"
fields, or a private key if it also contains the "d" field.
The 'jwt_tokenize' function that can be used to split a JWT token into
its subparts can either fully process the token (from beginning to end)
when we need to check its signature, or only partially when using the
jwt_header_query or jwt_member_query converters. In this case we relied
on the fact that the return value of the 'jwt_tokenize' function was not
checked because a '-1' was returned (which was not actually an error).
In order to make this logic more explicit, the 'jwt_tokenize' function
now has a way to warn the caller that the token was invalid (less
subparts than the specified 'item_num') or that the token was not
processed in full (enough subparts found without parsing the token all
the way).
The function will now only return 0 if we found strictly the same number
of subparts as 'item_num'.
The master process in the proc_list mustn't set the PROC_O_LEAVING flag
since the reload doesn't mean the master will leave.
Could be backported as far as 3.1.
mproxy_li is supposed to be used in _send_status to stop the sockpair FD
between the master and the new worker, being a listener.
This can only work if the listener has been stored in the fdtab owner,
and there's no reason it shouldn't be here.
It's always a bit tricky to avoid already backported patches when they
just got a different ID (e.g. a critical fix in a topic branch). Most
often with stable topic branches we just want to pick all stable commits
since the last backported one. New option -L instead of -m does exactly
this: it enumerates only commits that were added to the reference branch
after its most recent backport.
The -vv option used --verbose as its long form, which was identical to
the long form of -v. Since the case statement matches top-to-bottom,
--verbose would always trigger -v (VERBOSE=2), making -vv unreachable
via its long option. The long form is renamed to --verbose=all to avoid
the conflict, and the usage string is updated accordingly.
Must be backported to 3.3.
The script relied on a bash-specific process substitution (< <(...)) to
feed socat's output into the read loop. This is replaced with a standard
POSIX pipe into a command group.
The response parsing is also simplified: instead of iterating over each
line with a while loop and echoing them individually, the status line is
read first, the "--" separator consumed, and the remaining output is
streamed to stderr or discarded as a whole depending on the verbosity
level.
Could be backported to 3.3 as it makes it more portable, but introduce a
slight change in the error format.
socat was used with the ${MASTER_SOCKET} variable directly, letting it
auto-detect the network protocol. However, when given a plain filename
that does not point to a UNIX socket, socat would create a file at that
path instead of reporting an error.
To fix this, the address type is now determined explicitly: if
MASTER_SOCKET points to an existing UNIX socket file (checked with -S),
UNIX-CONNECT: is used; if it matches a <host>:<port> pattern, TCP: is
used; otherwise an error is reported. The socat_addr variable is also
properly scoped as local to the reload() function.
Could be backported in 3.3.
no need to have duplicated comp_ctx and comp_algo for request vs response
in comp_state struct, because thanks to previous commit compression filter
is either oriented on the request or the response, and 2 distinct filters
are instanciated when we need to handle both requests and responses
compression.
Thus we can save us from duplicated struct members and related operations.
Existing "compression" filter is a multi-purpose filter that will try
to compress both requests and responses according to "compression"
settings, such as "compression direction".
One of the pre-requisite work identified to implement decompression
filter is that we needed a way to manually define the sequence of
enabled filters to chain them in the proper order to make
compression and decompression chains work as expected in regard
to the intended use-case.
Due to the current nature of the "compression" filter this was not
possible, because the filter has a combined action as it will try
to compress both requests and responses, and as we are about to
implement "filter-sequence" directive, we will not be able to
change the order of execution of the compression filter between
requests and responses.
A possible solution we identified to solve this issue is to split the
existing "compression" filter into 2 distinct filters, one which is
request-oriented, "comp-req", and another one which is response-oriented
"comp-res". This is what we are doing in this commit. Compression logic
in itself is unchanged, "comp-req" will only aim to compress the request
while "comp-res" will try to compress the response. Both filters will
still be invoked on request and responses hooks, but they only do their
part of the job.
From now on, to compress both requests and responses, both filters have
to be enabled on the proxy. To preserve original behavior, the "compression"
filter is still supported, what it does is that it instantiates both
"comp-req" and "comp-res" filters implicitly, as the compression filter is
now effectively split into 2 separate filters under the hood.
When using "comp-res" and "comp-req" filters explicitly, the use of the
"compression direction" setting is not relevant anymore. Indeed, the
compression direction is assumed as soon as one or both filters are
enabled. Thus "compression direction" is kept as a legacy option in
order to configure the "compression" generic filter.
Documentation was updated.
proxy_get_comp() function can be used to retrieve proxy->comp options or
allocate and initialize it if missing
For now, it is solely used by parse_compression_options(), but the goal is
to be able to use this helper from multiple origins.
There was a "jwt_tokenize" call whose return value was not checked.
This was found by coverity and raised in GitHub #3277.
This patch can be backported to all stable branches.
In connect_server(), it is possible to have no server defined (dispatch mode
or transparent backend). In that case, we must be carefull to check the srv
variable in all calls involving the server. It was not perform at one place,
when the protocol to use for websocket is retrieved. This must not be done
when there is no server.
This patch should fix the first report in #3144. It must be backported to
all stable version.
In sample_conv_sha2(), calls to EVP_Digest* can fail. So we must check
return value of each call and report a error on failure and release the
digest context.
This patch should fix the issue #3274. It should be backported as far as
2.6.
Released version 3.4-dev6 with the following main changes :
- CLEANUP: acme: remove duplicate includes
- BUG/MINOR: proxy: detect strdup error on server auto SNI
- BUG/MINOR: server: set auto SNI for dynamic servers
- BUG/MINOR: server: enable no-check-sni-auto for dynamic servers
- MINOR: haterm: provide -b and -c options (RSA key size, ECDSA curves)
- MINOR: haterm: add long options for QUIC and TCP "bind" settings
- BUG/MINOR: haterm: missing allocation check in copy_argv()
- BUG/MINOR: quic: fix counters used on BE side
- MINOR: quic: add BUG_ON() on half_open_conn counter access from BE
- BUG/MINOR: quic/h3: display QUIC/H3 backend module on HTML stats
- BUG/MINOR: acme: acme_ctx_destroy() leaks auth->dns
- BUG/MINOR: acme: wrong labels logic always memprintf errmsg
- MINOR: ssl: clarify error reporting for unsupported keywords
- BUG/MINOR: acme: fix incorrect number of arguments allowed in config
- CLEANUP: haterm: remove unreachable labels hstream_add_data()
- CLEANUP: haterm: avoid static analyzer warnings about rand() use
- CLEANUP: ssl: Remove a useless variable from ssl_gen_x509()
- CI: use the latest docker for QUIC Interop
- CI: remove redundant "halog" compilation
- CLENAUP: cfgparse: accept-invalid-http-* does not support "no"/"defaults"
- BUG/MEDIUM: spoe: Acquire context buffer in applet before consuming a frame
- MINOR: traces: always mark trace_source as thread-aligned
- MINOR: ncbmbuf: improve itbmap_next() code
- MINOR: proxy: improve code when checking server name conflicts
- MINOR: quic: add a new metric for ncbuf failures
- BUG/MINOR: haterm: cannot reset default "haterm" mode
- BUG/MEDIUM: cpu-topo: Distribute CPUs fairly across groups
- BUG/MINOR: quic: missing app ops init during backend 0-RTT sessions
- CLEANUP: ssl: remove outdated comments
- MINOR: mux-h2: also count glitches on invalid trailers
- MINOR: mux-h2: add a new setting, "tune.h2.log-errors" to tweak error logging
- BUG/MEDIUM: mux-h2: make sure to always report pending errors to the stream
- BUG/MINOR: server: adjust initialization order for dynamic servers
- CLEANUP: tree-wide: drop a few useless null-checks before free()
- CLEANUP: quic-stats: include counters from quic_stats
- REORG: stats/counters: move extra_counters to counters not stats
- CLEANUP: stats: drop stats.h / stats-t.h where not needed
- MEDIUM: counters: change the fill_stats() API to pass the module and extra_counters
- CLEANUP: counters: only retrieve zeroes for unallocated extra_counters
- MEDIUM: counters: add a dedicated storage for extra_counters in various structs
- MINOR: counters: store a tgroup step for extra_counters to access multiple tgroups
- MEDIUM: counters: store the number of thread groups accessing extra_counters
- MINOR: counters: add EXTRA_COUNTERS_BASE() to retrieve extra_counters base storage
- MEDIUM: counters: return aggregate extra counters in ->fill_stats()
- MEDIUM: counters: make EXTRA_COUNTERS_GET() consider tgid
- BUG/MINOR: call EXTRA_COUNTERS_FREE() before srv_free_params() in srv_drop()
- MINOR: promex: test applet resume in stress mode
- BUG/MINOR: promex: fix server iteration when last server is deleted
- BUG/MINOR: proxy: add dynamic backend into ID tree
- MINOR: proxy: convert proxy flags to uint
- MINOR: server: refactor srv_detach()
- MINOR: proxy: define a basic "del backend" CLI
- MINOR: proxy: define proxy watcher member
- MINOR: stats: protect proxy iteration via watcher
- MINOR: promex: use watcher to iterate over backend instances
- MINOR: lua: use watcher for proxies iterator
- MINOR: proxy: add refcount to proxies
- MINOR: proxy: rename default refcount to avoid confusion
- MINOR: server: take proxy refcount when deleting a server
- MINOR: lua: handle proxy refcount
- MINOR: proxy: prevent backend removal when unsupported
- MINOR: proxy: prevent deletion of backend referenced by config elements
- MINOR: proxy: prevent backend deletion if server still exists in it
- MINOR: server: mark backend removal as forbidden if QUIC was used
- MINOR: cli: implement wait on be-removable
- MINOR: proxy: add comment for defaults_px_ref/unref_all()
- MEDIUM: proxy: add lock for global accesses during proxy free
- MEDIUM: proxy: add lock for global accesses during default free
- MINOR: proxy: use atomic ops for default proxy refcount
- MEDIUM: proxy: implement backend deletion
- REGTESTS: add a test on "del backend"
- REGTESTS: complete "del backend" with unnamed defaults ref free
- BUG/MINOR: hlua: fix return with push nil on proxy check
- BUG/MEDIUM: stream: Handle TASK_WOKEN_RES as a stream event
- MINOR: quic: use signed char type for ALPN manipulation
- MINOR: quic/h3: reorganize stream reject after MUX closure
- MINOR: mux-quic: add function for ALPN to app-ops conversion
- MEDIUM: quic/mux-quic: adjust app-ops install
- MINOR: quic: use server cache for ALPN on BE side
- BUG/MEDIUM: hpack: correctly deal with too large decoded numbers
- BUG/MAJOR: qpack: unchecked length passed to huffman decoder
- BUG/MINOR: qpack: fix 1-byte OOB read in qpack_decode_fs_pfx()
- BUG/MINOR: quic: fix OOB read in preferred_address transport parameter
- BUG/MEDIUM: qpack: correctly deal with too large decoded numbers
- BUG/MINOR: hlua: Properly enable/disable line receives from HTTP applet
- BUG/MEDIUM: hlua: Fix end of request detection when retrieving payload
- BUG/MINOR: hlua: Properly enable/disable receives for TCP applets
- MINOR: htx: Add a function to retrieve the HTTP version from a start-line
- MINOR: h1-htx: Reports non-HTTP version via dedicated flags
- BUG/MINOR: h1-htx: Be sure that H1 response version starts by "HTTP/"
- MINOR: http-ana: Save the message version in the http_msg structure
- MEDIUM: http-fetch: Rework how HTTP message version is retrieved
- MEDIUM: http-ana: Use the version of the opposite side for internal messages
- DEBUG: stream: Display the currently running rule in stream dump
- MINOR: filters: Use filter API as far as poissible to break loops on filters
- MINOR: filters: Set last_entity when a filter fails on stream_start callback
- MINOR: stream: Display the currently running filter per channel in stream dump
- DOC: config: Use the right alias for %B
- BUG/MINOR: channel: Increase the stconn bytes_in value in channel_add_input()
- BUG/MINOR: sample: Fix sample to retrieve the number of bytes received and sent
- BUG/MINOR: http-ana: Increment scf bytes_out value if an haproxy error is sent
- BUG/MAJOR: fcgi: Fix param decoding by properly checking its size
- BUG/MAJOR: resolvers: Properly lowered the names found in DNS response
- BUG/MEDIUM: mux-fcgi: Use a safe loop to resume each stream eligible for sending
- MINOR: mux-fcgi: Use a dedicated function to resume streams eligible for sending
- CLEANUP: qpack: simplify length checks in qpack_decode_fs()
- MINOR: counters: Introduce COUNTERS_UPDATE_MAX()
- MINOR: listeners: Update the frequency counters separately when needed
- MINOR: proxies: Update beconn separately
- MINOR: stats: Add an option to disable the calculation of max counters
Add a new option, "stats calculate-max-counters [on|off]".
It makes it possible to disable the calculation of max counters, as they
can have a performance cost.
Update beconn separately from the call to COUNTERS_UPDATE_MAX(), as soon
there will be an option to get COUNTERS_UPDATE_MAX() to do nothing, and
we still want beconn to be properly updated, as it is used for other
purposes.
Update the frequency counters that are exported to the stats page
outside of the call to COUNTERS_UPDATE_MAX(), so that they will
happen even if COUNTERS_UPDATE_MAX() ends up doing nothing.
Introduce COUNTERS_UPDATE_MAX(), and use it instead of using
HA_ATOMIC_UPDATE_MAX() directly.
For now it just calls HA_ATOMIC_UPDATE_MAX(), but will later be modified
so that we can disable max calculation.
This can be backported up to 2.8 if the usage of COUNTERS_UPDATE_MAX()
generates too many conflicts.
This patch simplifies the decoding loop by merging the variable-length
integer truncation check (len == -1) with the subsequent buffer
availability check (len < length).
This removes redundant code blocks and improves readability without
changing the decoding logic.
Note that the second removal is correct, as the check was duplicate and
unnecessary."
At the end of fcgi_send(), if the connection is not full anymore, we loop on
the send list to resume FCGI stream for sending. But a streams may be
removed from the this list during the loop. So a safe loop must be used.
This patch should be backported to all stable versions.
Names found in DNS responses are lowered to be compared. A name is composed
of several labels, strings precedeed by their length on one byte. For
instance:
3www7haproxy3org
There is an bug when labels are lowered. The label length is not skipped and
tolower() function is called on it. So for label length in the range [65-90]
(uppercase char), 32 is added to the label length due to the conversion of a
uppercase char to lowercase. This bugs can lead to OOB read later in the
resolvers code.
The fix is quite obvious, the label length must be skipped when the label is
lowered.
Thank you to Kamil Frankowicz for having reported this.
This patch must be backported to all stable versions.
In functions used to decode a FCGI parameter, the test on the data length
before reading the parameter's name and value did not consider the offset
value used to skip already parsed data. So it was possible to read more data
than available (OOB read). To do so, a malicious FCGI server must send a
forged GET_VALUES_RESULT record containing a parameter with wrong name/value
length.
Thank you to Kamil Frankowicz for having reported this.
This patch must be backported to all stable versions.
There was an issue in the if/else statement in smp_fetch_bytes() function.
When req.bytes_in or req.bytes_out was requested, res.bytes_in was always
returned. It is now fixed.
This patch must be backported to 3.3.
This function is no longer used. So it is not really an bug. But it is still
available and could be used by legacy applets. In that case, we must take
care to increment the stconn bytes_in value accordingly when input data are
inserted.
This patch must be backported to 3.3.
In custom log format part, %[req.bytes_in] was erroneously documented as the
alias of %B. The good alias is %[res.bytes_in]. It is now fixed.
This patch must be backported to 3.3.
Since the 3.1, when stream's info are dump, it is possible to print the
yielding filter on each channel, if any. It was useful to detect buggy
filter on spinning loop. But it is not possible to detect a filter consuming
too much CPU per-execution. We can see a filter was executing in the
backtrace reported by the watchdog, but we are unable to spot the specific
one.
Thanks to this patch, it is now possible. When a dump is emitted, the
running or yield filters on each channel are now displayed with their
current state (RUNNING or YIELDING).
This patch could be backported as far as 3.2 because it could be useful to
spot issues. But the filter API was slightly refactored in 3.4, so this
patch should be adapted.
On the stream, the last_entity should reference the last rule or the last
filter evaluated during the stream processing. However, this info was not
saved when a filter failed on strem_start callback function. It is now
fixed.
This patch could be backported as far as 3.1.
When the filters API was refactored to improve loops on filters, some places
were not updated (or not fully updated). Some loops were not relying on
resume_filter_list_break() while it was possible. So let's do so with this
patch.
Since the 2.5, when stream's info are dump, it is possible to print the
yielding rule, if any. It was useful to detect buggy rules on spinning
loop. But it is not possible to detect a rule consuming too much CPU
per-execution. We can see a rule was executing in the backtrace reported by
the watchdog, but we are unable to spot the specific rule.
Thanks to this patch, it is now possible. When a dump is emitted, the
running or yield rule is now displayed with its current state (RUNNING or
YIELDING).
This patch could be backported as far as 3.2 because it could be useful to
spot issues.
When the response is send by HAProxy, from a applet or for the analyzers,
The request version is used for the response. The main reason is that there
is not real version for the response when this happens. "HTTP/1.1" is used,
but it is in fact just an HTX response. So the version of the request is
used.
In the same manner, when the request is sent from an applet (httpclient),
the response version is used, once available.
The purpose of this change is to return the most accurate version from the
user point of view.
Thanks to previous patches, we can now rely on the version stored in the
http_msg structure to get the request or the response version.
"req.ver" and "res.ver" sample fetch functions returns the string
representation of the version, without the prefix, so "<major>.<minor>", but
only if the version is valid. For the response, "res.ver" may be added from
a health-check context, in that case, the HTX message is used.
"capture.req.ver" and "capture.res.ver" does the same but the "HTTP/" prefix
is added to the result. And "capture.res.ver" cannot be called from a
health-check.
To ease the version formatting and avoid code duplication, an helper
function was added. So these samples are now relying on "get_msg_version()".
When the request or the response is received, the numerical value of the
message version is now saved. To do so, the field "vsn" was added in the
http_msg structure. It is an unsigned char. The 4 MSB bits are used for the
major digit and the 4 LSB bits for the minor one.
Of couse, the version must be valid. the HTX_SL_F_NOT_HTTP flag of the
start-line is used to be sure the version is valid. But because this flag is
quite new, we also take care the string representation of the version is 8
bytes length. 0 means the version is not valid.
When the response is parsed, we test the version to be sure it is
valid. However, the protocol was not tested. Now we take care that the
response version starts by "HTTP/", otherwise an error is returned.
Of course, it is still possible to by-pass this test with
"accept-unsafe-violations-in-http-response" option.
This patch could be backported to all stable versions.
Now, when the HTTP version format is not strictly valid, flags are set on
the h1 parser and the HTX start-line. H1_MF_NOT_HTTP is set on the H1 parser
and HTX_SL_F_NOT_HTTP is set on the HTX start-line. These flags were
introduced to avoid parsing again and again the version to know if it is a
valid version or not, escpecially because it is most of time valid.
htx_sl_vsn() function can now be used to retrieve the ist string
representing the HTTP version from a start-line passed as parameter. This
function takes care to return the right part of the start-line, depending on
its type (request or response).
From a lua TCP applet, in functions used to retrieve data (receive,
try_receive and getline), we must take care to disable receives when data
are returned (or on failure) and to restart receives when these functions
are called again. In addition, when an applet execution is finished, we must
restart receives to properly drain the request.
This patch should be backported to 3.3. On older version, no bug was
reported so we can wait a report first.
When the lua HTTP applet was refactored to use its own buffers, a bug was
introduced in receive() and getline() function. We rely on HTX_FL_EOM flag
to detect the end of the request. But there is nothing preventing extra
calls to these function, when the whole request was consumed. If this
happens, the call will yield waiting for more data with no way to stop it.
To fix the issue, APPLET_REQ_RECV flag was added to know the whole request
was received.
This patch should fix#3293. It must be backported to 3.3.
From a lua HTTP applet, in the getline() function, we must take care to
disable receives when a line is retrieved and to restart receives when the
function is called again. In addition, when an applet execution is finished,
we must restart receives to properly drain the request.
This patch could help to fix#3293. It must be backported to 3.3. On older
version, no bug was reported so we can wait a report first. But in that
case, hlua_applet_http_recv() should also be fixed (on 3.3 it was fixed
during the applets refactoring).
Same fix as this one for hpack:
7315428615 ("BUG/MEDIUM: hpack: correctly deal with too large decoded numbers")
Indeed, the encoding of integers for QPACK is the same as for HPACK but for 64 bits
integers.
Must be backported as far as 2.6.
This bug impacts only the QUIC backend. A QUIC server does receive
a server preferred address transport parameter.
In quic_transport_param_dec_pref_addr(), the boundary check for the
connection ID was inverted and incorrect. This could lead to an
out-of-bounds read during the following memcpy.
This patch fixes the comparison to ensure the buffer has enough input data
for both the CID and the mandatory Stateless Reset Token.
Thank you to Kamil Frankowicz for having reported this.
Must be backported to 3.3.
In qpack_decode_fs_pfx(), if the first qpack_get_varint() call
consumes the entire buffer, the code would perform a 1-byte
out-of-bounds read when accessing the sign bit via **raw.
This patch adds an explicit length check at the beginning of
qpack_get_varint(), which systematically secures all other callers
against empty inputs. It also adds a necessary check before the
second varint call in qpack_decode_fs_pfx() to ensure data is still
available before dereferencing the pointer to extract the sign bit,
returning QPACK_RET_TRUNCATED if the buffer is exhausted.
Thank you to Kamil Frankowicz for having reported this.
Must be backported as far as 2.6.
A call to huffman decoder function (huff_dec()) is made from qpack_decode_fs()
without checking the buffer length passed to this function, leading to OOB read
which can crash the process.
Thank you to Kamil Frankowicz for having reported this.
Must be backport as far as 2.6.
The varint hpack decoder supports unbounded numbers but returns 32-bit
results. This means that possible truncation my happen on some field
lengths or indexes that would be emitted as quantities that do not fit
in a 32-bit number. The final value will also depend on how the left
shift operation behaves on the target architecture (e.g. whether bits
are lost or used modulo 31). This could lead to a desynchronization of
the HPACK stream decoding compared to what an external observer would
see (e.g. from a network traffic capture). However, there isn't any
impact between streams, HPACK is performed at the connection level,
not at the stream level, so no stream may try to leverage this
limitation to have any effect on another one.
For the fix, instead of adding checks everywhere in the loop and for
the final stage, let's rewrite the decoder to compare the read value
to a max value that is shifted by 7 bits for every 7 bits read. This
allows a sender to continue to emit zeroes for higher bits without
being blocked, while detecting that a received value would overflow.
The loop is now simpler as it deals both with values with the higher
bit set and the final ones, and stops once the final value was recorded.
A test on non-zero before performing the shift was added to please
ubsan, though in practice zero shifted by any quantity remains zero.
But the test is cheap so that's OK.
Thanks to Guillaume Meunier, Head of Vulnerability Operations Center
France at Orange Cyberdefense, for reporting this bug.
This should be backported to all stable versions.
On the backend side, QUIC MUX may be started preemptively before the
ALPN negotiation. This is useful notably for 0-RTT implementation.
However, this was a source of crashes. ALPN was expected to be retrieved
from the server cache, however QUIC MUX still used the ALPN from the
transport layer. This could cause a crash, especially when several
connections runs in parallel as the server cache is shared among
threads.
Thanks to the previous patch which reworks QUIC MUX init, this solution
can now be fixed. Indeed, if conn_get_alpn() is not successful, MUX can
look at the server cache again to use the expected value.
Note that this could still prevent the MUX to work as expected if the
server cache is resetted between connect_server() and MUX init. Thus,
the ultimate solution would be to copy the cached ALPN into the
connection. This problem is not specific to QUIC though, and must be
fixed in a separate patch.
This patch reworks the installation of app-ops layer by QUIC MUX.
Previously, app_ops field was stored directly into the quic_conn
structure. Then the MUX reused it directly during its qmux_init().
This patch removes app_ops field from quic_conn and replaces it with a
copy of the negotiated ALPN. By using quic_alpn_to_app_ops(), it ensures
it remains compatible with a known application layer.
On the MUX layer, qcc_install_app_ops() now uses the standard
conn_get_alpn() to retrieve the ALPN from the transport layer. This is
done via the newly defined <get_alpn> QUIC xprt callback.
This new architecture should be cleaner as it better highlights the
responsibility of each layers in the ALPN/app negotiation.
Extract the conversion from ALPN to qcc_app_ops type from quic_conn
source file into QUIC MUX. The newly created function is named
quic_alpn_to_app_ops(). This will serve as a central point to identify
which ALPNs are currently supported in our QUIC stack.
This patch is purely a small refactoring. It will be useful for the next
one which rework MUX app-ops layer init. The current cleanup allows
notably to remove H3/hq-interop headers from quic_conn source file.
The QUIC MUX layer is closed after its transport counterpart. This may
be necessary then to reject any new streams opened by the remote peer.
This operation is dependent however from the application protocol.
Previously, a function qc_h3_request_reject() was directly implemented
in quic_conn source file for use when HTTP/3 was previously negotiated.
However, this solution was not evolutive and broke layering.
This patch introduces a new proper separation with a <strm_reject>
callback defined in quic_conn structure. When set, it will be used to
preemptively close any new stream. QUIC MUX is responsible to set it
just before its closure.
No functional change. This patch is purely a refactoring with a better
architecture design. Especially, H3 specific code from transport layer
is now completely removed.
In most of haproxy code, ALPN is used as a signed char pointer. In QUIC
code instead, it is manipulated as unsigned.
Unifies this by using signed type in QUIC code. This allows to remove a
bunch of unnecessary casts.
The conversion of TASK_WOKEN_RES to a stream event was missing. Among other
things, this wakeup reason is used when a stream is dequeued. So it was
possible to skip the connection establishment if the stream was also woken
up for a timer reason. When this happened, the stream was blocked till the
queue timeout expiration.
Converting TASK_WOKEN_RES to STRM_EVT_RES fixes the issue.
This patch should fix the issue #3290. It must be backported as far as 3.2.
hlua_check_proxy() may now return NULL if the target proxy instance has
been flagged for deletion. Thus, proxies method have been adjusted and
may push nil to report such case.
This patch fixes these error paths. When nil is pushed, 1 must be
returned instead of 0. This represents the count of pushed values on the
stack which can be retrieved by the caller.
No need to backport.
Complete delete backend regtests by checking deletion of a proxy with a
reference on an unnamed defaults instance. This operation is sensible as
the defaults refcount is decremented, and when the last backend is
removed, the defaults is also freed.
Add a reg-tests to test "del backend" CLI command. First, checks are
performed to ensure a backend cannot be deleted if not in the expected
state.
Then, a "del backend" success is tested. Stats are dumped to ensure the
backend instance is indeed removed.
This patch finalizes "del backend" handler by implementing the proper
proxy deletion.
After ensuring backend deletion can be performed, several steps are
executed. First, any watcher elements are updated to point on the next
proxy instance. The backend is then removed from ID and name global
trees and is finally detached from proxies_list.
Once the backend instance is removed from proxies_list, the backend
cannot be found by new elements. Thread isolation is lifted and
proxy_drop() is called, which will purge the proxy if its refcount is
null. Thanks to recently introduced PROXIES_DEL_LOCK, proxy_drop() is
thread safe.
Default proxy refcount <def_ref> is used to comptabilize reference on a
default proxy instance by standard proxies. Currently, this is necessary
when a default proxy defines TCP/HTTP rules or a tcpcheck ruleset.
Transform every access on <def_ref> so that atomic operations are now
used. Currently, this is not strictly needed as default proxies
references are only manipulated at init or deinit in single thread mode.
However, when dynamic backends deletion will be implemented, <def_ref>
will be decremented at runtime also.
This patch is similar to the previous one, but this time it deals with
functions related to defaults proxies instances. Lock PROXIES_DEL_LOCK
is used to protect accesses on global collections.
This patch will be necessary to implement dynamic backend deletion, even
if defaults won't be use as direct target of a "del backend" CLI.
However, a backend may have a reference on a default instance. When the
backend is freed, this references is released, which can in turn cause
the freeing of the default proxy instance. All of this will occur at
runtime, outside of thread isolation.
Define a new lock with label PROXIES_DEL_LOCK. Its purpose is to protect
operations performed on global lists or trees while a proxy is freed.
Currently, this lock is unneeded as proxies are only freed on
single-thread init or deinit. However, with the incoming dynamic backend
deletion, this operation will be also performed at runtime, outside of
thread isolation.
Implement be-removable argument to CLI wait. This is implemented via
be_check_for_deletion() invokation, also used by "del backend" handler.
The objective is to test whether a backend instance can be removed. If
this is not the case, the command may returns immediately if the target
proxy is incompatible with dynamic removal or if a user action is
required. Else, the command will wait until the temporary restriction is
lifted.
Currenly, quic_conn on the backend side may access their parent proxy
instance during their lifetime. In particular, this is the case for
counters update, with <prx_counters> field directly referencing a proxy
memory zone.
As such, this prevents safe backend removal. One solution would be to
check if the upper connection instance is still alive, as a proxy cannot
be removed if connection are still active. However, this would
completely prevent proxy counters update via
quic_conn_prx_cntrs_update(), as this is performed on quic_conn release.
Another solution would be to use refcount, or a dedicated counter on the
which account for QUIC connections on a backend instance. However,
refcount is currently only used by short-term references, and it could
also have a negative impact on performance.
Thus, the simplest solution for now is to disable a backend removal if a
QUIC server is/was used in it. This is considered acceptable for now as
QUIC on the backend side is experimental.
Ensure a backend instance cannot be removed if there is still server in
it. This is checked via be_check_for_deletion() to ensure "del backend"
cannot be executed. The only solution is to use "del server" to remove
on the servers instances.
This check only covers servers not yet targetted via "del server". For
deleted servers not yet purged (due to their refcount), the proxy
refcount is incremented but this does not block "del backend"
invokation.
Define a new proxy flag PR_FL_NON_PURGEABLE. This is used to mark every
proxy instance explicitely referenced in the config. Such instances
cannot be deleted at runtime.
Static use_backend/default_backend rules are handled in
proxy_finalize(). Also, sample expression proxy references are protected
via smp_resolve_args().
Note that this last case also incidentally protects any proxies
referenced via a CLI "set var" expression. This should not be the case
as in this case variable value is instantly resolved so the proxy
reference is not needed anymore. This also affects dynamic servers.
Prevent removal of a backend which relies on features not compatible
with dynamic backends. This is the case if either dispatch or
transparent option is used, or if a stick-table is declared.
These limitations are similar to the "add backend" ones.
Implement proxy refcount for Lua proxy class. This is similar to the
server class.
In summary, proxy_take() is used to increment refcount when a Lua proxy
is instantiated. proxy_drop() is called via Lua garbage collector. To
ensure a deleted backend is released asap, hlua_check_proxy() now
returns NULL if PR_FL_DELETED is set.
This approach is directly dependable on Lua GC execution. As such, it
probably suffers from the same limitations as the ones already described
in the previous commit. With the current patch, "del backend" is not
directly impacted though. However, the final proxy deinit may happen
after a long period of time, which could cause memory pressure increase.
One final observations regarding deinit : it is necessary to delay a
BUG_ON() which checks that defaults proxies list is empty. Now this must
be executed after Lua deinit (called via post_deinit_list). This should
guarantee that all proxies and their defaults refcount are null.
When a server is deleted via "del server", increment refcount of its
parent backend. This is necessary as the server is not referenced
anymore in the backend, but can still access it via its own <proxy>
member. Thus, backend removal must not happen until the complete purge
of the server.
The proxy refcount is released in srv_drop() if the flag SRV_F_DELETED
is set, which indicates that "del server" was used. This operation is
performed after the complete release of the server instance to ensure no
access will be performed on the proxy via itself. The refcount must not
be decremented if a server is freed without "del server" invokation.
Another solution could be for servers to always increment the refcount.
However, for now in haproxy refcount usage is limited, so the current
approach is preferred. It should also ensure that if the refcount is
still incremented, it may indicate that some servers are not completely
purged themselves.
Note that this patch may cause issues if "del backend" are used in
parallel with LUA scripts referencing servers. Currently, any servers
referenced by LUA must be released by its garbage collector to ensure it
can be finally freed. However, it appeas that in some case the gc does
not run for several minutes. At least this has been observed with Lua
version 5.4.8. In the end, this will result in indefinitely blocking of
"del backend" commands.
Rename proxy conf <refcount> to <def_ref>. This field only serves for
defaults proxy instances. The objective is to avoid confusion with the
newly introduced <refcount> field used for dynamic backends.
As an optimization, it could be possible to remove <def_ref> and only
use <refcount> also for defaults proxies usage. However for now the
simplest solution is implemented.
This patch does not bring any functional change.
Implement refcount notion into proxy structure. The objective is to be
able to increment refcount on proxy to prevent its deletion temporarily.
This is similar to the server refcount : "del backend" is not blocked
and will remove the targetted instance from the global proxies_list.
However, the final free operation is delayed until the refcount is null.
As stated above, the API is similar to servers. Proxies are initialized
with a refcount of 1. Refcount can be incremented via proxy_take(). When
no longer useful, refcount is decremented via proxy_drop() which
replaces the older free_proxy(). Deinit is only performed once refcount
is null.
This commit also defines flag PR_FL_DELETED. It is set when a proxy
instance has been removed via a "del backend" CLI command. This should
serve as indication to modules which may still have a refcount on the
target proxy so that they can release it as soon as possible.
Note that this new refcount is completely ignored for a default proxy
instance. For them, proxy_take() is pure noop. Free is immediately
performed on first proxy_drop() invokation.
Ensures proxies iteration in promex applet is safe via a new watcher
member. The principle is similar to the one already used for servers
iteration.
Note that ctx.p[0] is not updated anymore at the end of a function, as
this is automatically done via the watcher itself.
Define a new <px_watch> watcher member in stats applet context. It is
used to register the applet on a proxy when iterating over the proxies
list. <obj1> is automatically updated via the watcher interaction.
Watcher is first initialized prior to stats_dump_proxies() invocation.
This guarantees that stats dump is safe even if applet yields and a
backend is removed in parallel.
Define a new member watcher_list in proxy. It will be used to register
modules which iterate over the proxies list. This will ensure that the
operation is safe even if a backend is removed in parallel.
Add "del backend" handler which is restricted to admin level. Along with
it, a new function be_check_for_deletion() is used to test if the
backend is removable.
Correct documentation for srv_detach() which previously stated that this
function could be called for a server even if not stored in its proxy
list. In fact there is a BUG_ON() which detects this case.
Proxy flags member were of type char. This will soon enough not be
sufficient as new flags will be defined. As such, convert flags member
to unsigned int type.
Add missing proxy_index_id() call in "add backend" handler. This step is
responsible to store the newly created proxy instance in the
used_proxy_id global tree.
No need to backport.
Servers iteration via promex is now resilient to server runtime deletion
thanks to the watcher mechanism. However, the watcher was not correctly
initialized which could cause duplicate metrics reporting.
This issue happens when promex dump yielded when manipulating the last
server of a proxy. If this server is removed in parallel, <sv> pointer
will be set to NULL when promex resumes. Instead of switching to another
proxy, the code would reuse the same one and iterate again on the same
server list.
To fix this issue, <sv> pointer must not be reinitialized just after a
resumption point. Instead, this is now performed before
promex_dump_srv_metrics(), or just after switching to another proxy
instance. Thus, on resumption, if promex_dump_srv_metrics() is started
with <sv> as NULL, it means that the server was deleted and the end of
the current proxy list is reached, hence iteration is restarted on the
next proxy instance.
Note that ctx.p[1] does not need to be manually updated at the end of
promex_dump_srv_metrics() as srv_watch already does that.
This patch must be backported up to 3.0.
Implement a stress mode with force yield for promex applet each time a
metric is displayed. This is implemented by returning 0 in
promex_dump_ts() each time the output buffer is not empty.
To test this, haproxy must be compiled with DEBUG_STRESS and the
following configuration must be used :
global
stress-level 1
As seen with the last changes to counters allocation, the move of the
counters storage to the thread group as operated in commit 04a9f86a85
("MEDIUM: counters: add a dedicated storage for extra_counters in various
structs") causes some random errors when using ASAN, because the extra
counters are freed in srv_drop() after calling srv_free_params(), which
is responsible for freeing the per-thread group storage.
For the proxies however it's OK because free calls are made before the
call to deinit_proxy() which frees the per_tgrp area.
No backport is needed, this is purely 3.4-dev.
Now we store and retrieve only counters for the current tgid when more
than one is supported. This allows to significantly reduce contention
on shared stats. The haterm utility saw its performance increase from
4.9 to 5.8M req/s in H1, and 6.0 to 7.6M for H2, both with 5 groups of
16 threads, showing that we don't necessarily need insane amounts of
groups.
Now thanks to new macro EXTRA_COUNTERS_AGGR() we can iterate over all
thread groups storages when returning the data for a given metric. This
remains convenient and mostly transparent. The caller continues to pass
the pointer to the metric in the first group, and offsets are calculated
for all other groups and data summed. For now all groups except the
first one contain only zeroes but reported values are nevertheless
correct.
The goal is to always retrieve the storage address of the first thread
group for the given module. This will be used to iterate over all thread
groups. For now it returns the same value as EXTRA_COUNTERS_GET().
In order to be able to properly allocate all storage and retrieve data
from there, we'll need to know how many thread groups are supposed to
access it. Let's store the number of thread groups at init time. If the
tgrp_step is zero, there's always only one tg though.
Now EXTRA_COUNTERS_ALLOC() takes this number of thread groups in argument
and stores it in the structure. It also allocates as many areas as needed,
incrementing the datap pointer by the step for each of them.
EXTRA_COUNTERS_FREE() uses this info to free all allocated areas.
EXTRA_COUNTERS_INIT() initializes all allocated areas, this is used
elsewhere to clear/preset counters, e.g. in proxy_stats_clear_counters().
It involves a memcpy() call for each array, which is normally preset to
something empty but might also be used to preset certain non-scalar
fields such as an instance name.
We'll need to permit any user to update its own tgroup's extra counters
instead of the global ones. For this we now store the per-tgroup step
between two consecutive data storages, for when they're stored in a
tgroup array. When shared (e.g. resolvers or listeners), we just store
zero to indicate that it doesn't scale with tgroups. For now only the
registration was handled, it's not used yet.
Servers, proxies, listeners and resolvers all use extra_counters. We'll
need to move the storage to per-tgroup for those where it matters. Now
we're relying on an external storage, and the data member of the struct
was replaced with a pointer to that pointer to data called datap. When
the counters are registered, these datap are set to point to relevant
locations. In the case of proxies and servers, it points to the first
tgrp's storage. For listeners and resolvers, it points to a local
storage. The rationale here is that listeners are limited to a single
group anyway, and that resolvers have a low enough load so that we do
not care about contention there.
Nothing should change for the user at this point.
Since version 2.4 with commit 7f8f6cb926 ("BUG/MEDIUM: stats: prevent
crash if counters not alloc with dummy one") we can afford to always
update extra_counters because we know they're always either allocated
or linked to a dedicated trash. However, the ->fill_stats() callbacks
continue to access such values, making it technically possible to
retrieve random counters from this trash, which is not really clean.
Let's implement an explicit test in the ->fill_stats() functions to
only return 0 for the metric when not allocated like this. It's much
cleaner because it guarantees that we're returning an empty counter
in this case rather than random values.
The situation currently happens for dummy servers like the ones used
in Lua proxies as well as those used by rings (e.g. used for logging
or traces). Normally, none of the objects retrieved via stats or
Prometheus is concerned by this unallocated extra_counters situation,
so this is more about a cleanup than a real fix.
We'll soon need to iterate over thread groups in the fill_stats() functions,
so let's first pass the extra_counters and stats_module pointers to the
fill_stats functions. They now call EXTRA_COUNTERS_GET() themselves with
these elements in order to retrieve the required pointer. Nothing else
changed, and it's getting even a bit more transparent for callers.
This doesn't change anything visible however.
A number of C files include stats.h or stats-t.h, many of which were
just to access the counters. Now those which really need counters rely
on counters.h or counters-t.h, which already reduces the amount of
preprocessed code to be built (~3000 lines or about 0.05%).
It was always difficult to find extra_counters when the rest of the
counters are now in counters-t.h. Let's move the types to counters-t.h
and the macros to counters.h. Stats include them since they're used
there. But some users could be cleaned from the stats definitions now.
There's something a bit awkward in the way stats counters are inherited
through the QUIC modules: quic_conn-t includes quic_stats-t.h, which
declares quic_stats_module as extern from a type that's not known from
this file. And anyway externs should not be exported from type defintions
since they're not part of the ABI itself.
This commit moves the declaration to quic_stats.h which now takes care
to include stats-t.h to get the definition of struct stats_module. The
few users who used to learn it through quic_conn-t.h now include it
explicitly. As a bonus this reduces the number of preprocessed lines
by 5000 (~0.1%).
By the way, it looks like struct stats_module could benefit from being
moved off stats-t.h since it's only used at places where the rest of
the stats is not needed. Maybe something to consider for a future
cleanup.
We only support platforms where free(NULL) is a NOP so that
null checks are useless before free(). Let's drop them to keep
the code clean. There were a few in cfgparse-global, flt_trace,
ssl_sock and stats.
It appears that in cli_parse_add_server(), we're calling srv_alloc_lb()
and stats_allocate_proxy_counters_internal() before srv_preinit() which
allocates the thread groups. LB algos can make use of the per_tgrp part
which is initialized by srv_preinit(). Fortunately for now no algo uses
both tgrp and ->server_init() so this explains why this remained
unnoticed to date. Also, extra counters will soon require per_tgrp to
already be initialized. So let's move these between srv_preinit() and
srv_postinit(). It's possible that other parts will have to be moved
in between.
This could be backported to recent versions for the sake of safety but
it looks like the current code cannot tell the difference.
Some stream parsing errors that do not affect the connection result in
the parsed block not being transferred from the rx buffer to the channel
and not being reported upstream in rcv_buf(), causing the stconn to time
out. Let's detect this condition, and propagate term flags anyway since
no more progress will be made otherwise.
This should be backported at least till 3.2, probably even 2.8.
The H2 mux currently logs whenever some decoding fails. Most of the errors
happen at the connection level, but some are even at the stream level,
meaning that multiple logs can be emitted for a given connection, which
can quickly use some resource for little value. This new setting allows
to tweak this and decide to only log errors that affect the connection,
or even none at all.
This should be backported at least as far as 3.2.
Two cases were not causing glitches to be incremented:
- invalid trailers
- trailers on closed streams
This patch addresses this. It could be backported, at least to 3.2.
ssl_sock_srv_try_reuse_sess() was modified by this commit to no longer
fail (it now returns void), but the related comments remained:
BUG/MINOR: quic: missing app ops init during backend 0-RTT sessions
This patch cleans them up.
The QUIC mux requires "application operations" (app ops), which are a list
of callbacks associated with the application level (i.e., h3, h0.9) and
derived from the ALPN. For 0-RTT, when the session cache cannot be reused
before activation, the current code fails to reach the initialization of
these app ops, causing the mux to crash during its initialization.
To fix this, this patch restores the behavior of
ssl_sock_srv_try_reuse_sess(), whose purpose was to reuse sessions stored
in the session cache regardless of whether 0-RTT was enabled, prior to
this commit:
MEDIUM: quic-be: modify ssl_sock_srv_try_reuse_sess() to reuse backend
sessions (0-RTT)
With this patch, this function now does only one thing: attempt to reuse a
session, and that's it!
This patch allows ignoring whether a session was successfully reused from
the cache or not. This directly fixes the issue where app ops
initialization was skipped upon a session cache reuse failure. From a
functional standpoint, starting a mux without reusing the session cache
has no negative impact; the mux will start, but with no early data to
send.
Finally, there is the case where the ALPN is reset when the backend is
stopped. It is critical to continue locking read access to the ALPN to
secure shared access, which this patch does. It is indeed possible for the
server to be stopped between the call to connect_server() and
quic_reuse_srv_params(). But this cannot prevent the mux to start
without app ops. This is why a 'TODO' section was added, as a reminder that a
race condition regarding the ALPN reset still needs to be fixed.
Must be backported to 3.3
Make sure CPUs are distributed fairly across groups, in case the number
of groups to generate is not a divider of the number of CPUs, otherwise
we may end up with a few groups that will have no CPU bound to them.
This was introduced in 3.4-dev2 with commit 56fd0c1a5c ("MEDIUM: cpu-topo:
Add an optional directive for per-group affinity"). No backport is
needed unless this commit is backported.
When "mode haterm" was set in a "defaults" section, it could not be
overridden in subsequent sections using the "mode" keyword. This is because
the proxy stream instantiation callback was not being reset to the
default stream_new() value.
This could break the stats URI with a configuration such as:
defaults
mode haterm
# ...
frontend stats
bind :8181
mode http
stats uri /
This patch ensures the ->stream_new_from_sc() proxy callback is reset
to stream_new() when the "mode" keyword is parsed for any mode other
than "haterm".
No need to backport.
During proxy_finalize(), a lookup is performed over the servers by name
tree to detect any collision. Only the first conflict for each server
instance is reported to avoid a combinatory explosion with too many
alerts shown.
Previously, this was written using a for loop without any iteration.
Replace this by a simple if statement as this is cleaner.
This should fix github issue #3276.
itbmap_next() advances an iterator over a ncbmbuf buffer storage. When
reaching the end of the buffer, <b> field is set to NULL, and the caller
is expected to stop working with the iterator.
Complete this part to ensure that itbmap type is fully initialized in
case null iterator value is returned. This is not strictly required
given the above description, but this is better to avoid any possible
future mistake.
This should fix coverity issue from github #3273.
This could be backported up to 2.8.
Some perf profiles occasionally show that reading the trace source's
state can take some time, which is not expected at all. It just happens
that the trace_source is not cache-aligned so depending on linkage, it
may share a cache line with a more active variable, thereby inducing a
slow down to all threads trying to read the variable.
Let's always mark it aligned to avoid this. For now the problem was not
observed again.
Changes brought to support large buffers revealed a bug in the SPOE applet
when a frame is copied in the SPOE context buffer. A b_xfer() was performed
without allocating the SPOE context buffer. It is not expected. As stated in
the function documentation, the caller is responsible for ensuring there is
enough space in the destination buffer. So first of all, it must ensure this
buffer was allocated.
With recent changes, we are able to hit a BUG_ON() because the swap is no
longer possible if source and destination buffers size are not the same.
This patch should fix the issue #3286. It could be backported as far as 3.1.
Some options do not support "no" nor "defaults" and they're placed after
the check for their absence. However, "accept-invalid-http-request" and
"accept-invalid-http-response" still used to check for the flags that
come with these prefixes, but Coverity noticed this was dead code in
github issue #3272. Let's just drop the test.
No backport needed as it's just dead code.
This function was recently created by moving code from acme_gen_tmp_x509()
(in acme.c) to ssl_gencrt.c (ssl_gen_x509()). The <ctmp> variable was
initialized and then freed without ever being used. This was already the
case in the original acme_gen_tmp_x509() function.
This patch removes these useless statements.
Reported in GH #3284
Avoid such a warnings from coverity:
CID 1645121: (#1 of 1): Calling risky function (DC.WEAK_CRYPTO)
dont_call: random should not be used for security-related applications,
because linear congruential algorithms are too easy to break.
Reported in GH #3283 and #3285
This patch changes the registration of the following keywords to be
unconditional:
- ssl-dh-param-file
- ssl-engine
- ssl-propquery, ssl-provider, ssl-provider-path
- ssl-default-bind-curves, ssl-default-server-curves
- ssl-default-bind-sigalgs, ssl-default-server-sigalgs
- ssl-default-bind-client-sigalgs, ssl-default-server-client-sigalgs
Instead of excluding them at compile time via #ifdef guards in the keyword
registration table, their parsing functions now check feature availability
at runtime and return a descriptive error when the feature is missing.
For features controlled by the SSL library (providers, curves, sigalgs,
DH), the error message includes the actual OpenSSL version string via
OpenSSL_version(OPENSSL_VERSION), so users can immediately identify which
library they are running rather than seeing cryptic internal macro names.
For ssl-dh-param-file, the message also includes "(no DH support)" as a
hint, since OPENSSL_NO_DH can be set either by an OpenSSL build or by
HAProxy itself in certain configurations.
For ssl-engine, which depends on a HAProxy build-time flag (USE_ENGINE),
the message retains the flag name as it is more actionable for the user.
This addresses issue https://github.com/haproxy/haproxy/issues/3246.
In acme_req_finalize(), acme_req_challenge(), acme_req_neworder(),
acme_req_account(), and acme_post_as_get(), the success path always
calls unconditionally memprintf(errmsg, ...).
This may result in a leak of errmsg.
Additionally, acme_res_chkorder(), acme_res_finalize(), acme_res_auth(),
and acme_res_neworder() had unused 'out:' labels that were removed.
Must be backported as far as 3.2.
365a696 ("MINOR: acme: emit a log for DNS-01 challenge response")
introduces the auth->dns member which is istdup(). But this member is
never free, instead auth->token was freed twice by mistake.
Must be backported to 3.2.
QUIC is now implemented on the backend side. Complete definitions for
QUIC/H3 stats module to add STATS_PX_CAP_BE capability.
This change is necessary to display QUIC/H3 counters on backend lines
for HTML stats page.
This should be backported up to 3.3.
half_open_conn is a proxy counter used to account for quic_conn in
half-open state : this represents a connection whose address is not yet
validated (handshake successful, or via token validation).
This counter only has sense for the frontend side. Currently, code is
safe as access is only performed if quic_conn is not yet flagged with
QUIC_FL_CONN_PEER_VALIDATED_ADDR, which is always set for backend
connections.
To better reflect this, add a BUG_ON() when half_open_conn is
incremented/decremented to ensure this never occurs for backend
connections.
quic_conn is initialized with a pointer to its proxy counters. These
counters are then updated during the connection lifetime.
Counters pointer was incorrect for backend quic_conn, as it always
referenced frontend counters. For pure backend, no stats would be
updated. For listen instances, this resulted in incorrect stats
reporting.
Fix this by correctly set proxy counters based on the connection side.
This must be backported up to 3.3.
This is a very minor bug with a very low probability of occurring.
However, it could be flagged by a static analyzer or result in a small
contribution, which is always time-consuming for very little gain.
Add the --quic-bind-opts and --tcp-bind-opts long options to append
settings to all QUIC and TCP bind lines. This requires modifying the argv
parser to first process these new options, ensuring they are available
during the second argv pass to be added to each relevant "bind" line.
Add -b and -c options to the haterm argv parser. Use -b to specify the RSA
private key size (in bits) and -c to define the ECDSA certificate curves.
These self-signed certificates are required for haterm SSL bindings.
Allows server keyword "no-check-sni-auto" for dynamic servers. This may
be necessary to users who do not want to benefit from auto SNI for
checks.
Keyword "check-sni-auto" is still deactivated for dynamic servers, for
the same reason as "sni-auto" (cf the previous patch for a complete
explanation).
This must be backported up to 3.3.
Auto SNI configuration is configured during check config validity.
However, nothing was implemented for dynamic servers.
Fix this by implementing auto SNI configuration during "add server" CLI
handler. Auto SNI configuration code is moved in a dedicated function
srv_configure_auto_sni() called both for static and dynamic servers.
Along with this, allows the keyword "no-sni-auto" on dynamic servers, so
that this process can be deactivated if wanted. Note that "sni-auto"
remains unavailable as it only makes sense with default-servers which
are never used for dynamic server creation.
This must be backported up to 3.3.
There was no check on the result of strdup() used to setup auto SNI on a
server instance during check config validity. In case of failure, the
error would be silently ignored as the following server_parse_exprs()
does nothing when <sni_expr> server field is NULL. Hence, no SNI would
be used on the server, without any error nor warning reported.
Fix this by adding a check on strdup() return value. On error, ERR_ABORT
is reported along with an alert, parsing should be interrupted as soon
as possible.
This must be backported up to 3.3. Note that the related code in this
case is present in cfgparse.c source file.
expect resp.http.x-decrypted == "Sed ut perspiciatis unde omnis iste natus error sit voluptatem doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. porro quisquam est, qui dolorem ipsum quia dolor sit amet, adipisci velit, sed quia non numquam eius modi tempora incidunt ut dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in voluptate velit esse quam nihil molestiae consequatur, vel illum qui eum fugiat quo voluptas nulla pariatur?"
{{"acme","renew",NULL},"acme renew <certfile> : renew a certificate using the ACME protocol",cli_acme_renew_parse,NULL,NULL,NULL,0},
{{"acme","status",NULL},"acme status : show status of certificates configured with ACME",cli_acme_ps,cli_acme_status_io_handler,NULL,NULL,0},
{{"acme","status",NULL},"acme status : show status of certificates configured with ACME",cli_acme_parse_status,cli_acme_status_io_handler,NULL,NULL,0},
{{"acme","challenge_ready",NULL},"acme challenge_ready <certfile> domain <domain> : notify HAProxy that the ACME challenge is ready",cli_acme_chall_ready_parse,NULL,NULL,NULL,0},
{{"show","activity",NULL},"show activity [-1|0|thread_num] : show per-thread activity stats (for support/developers)",cli_parse_show_activity,cli_io_handler_show_activity,NULL},
{{"show","profiling",NULL},"show profiling [<what>|<#lines>|<opts>]*: show profiling state (all,status,tasks,memory)",cli_parse_show_profiling,cli_io_handler_show_profiling,NULL},
{{"show","profiling",NULL},"show profiling [<what>|<#lines>|<opts>]*: show profiling state (all,status,tasks,memory)",cli_parse_show_profiling,cli_io_handler_show_profiling,cli_release_show_profiling},
{{"show","tasks",NULL},"show tasks : show running tasks",NULL,cli_io_handler_show_tasks,NULL},
ha_alert("parsing [%s:%d] : '%s' only supports 'on' and 'off' as an argument, found '%s'.\n",file,linenum,args[0],args[1]);
err_code|=ERR_ALERT|ERR_FATAL;
gotoout;
}
}
elseif(strcmp(args[0],"h2-workaround-bogus-websocket-clients")==0){/* "no h2-workaround-bogus-websocket-clients" or "h2-workaround-bogus-websocket-clients" */