This patch improves the robustness of the QPACK varint decoder and fixes
potential 1-byte out-of-bounds reads in qpack_decode_fs().
In qpack_decode_fs(), two 1-byte OOB reads were possible on truncated
streams between two varint decodings. These occurred when trying to read
the byte containing the Huffman bit <h> and the Value Length prefix
immediately following an Index or a Name Length.
Note that these OOB reads are limited to a single byte because
qpack_get_varint() already ensures that its input length is non-zero
before consuming any data.
The fixes in qpack_decode_fs() are:
- When decoding an index, we now verify that at least one byte remains
to safely access the following <h> bit and value length.
- When decoding a literal, we now check len < name_len + 1 to ensure
the byte starting the header value is reachable.
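As a rough illustration of the second check, the sketch below guards the
read of the byte carrying <h> and the value length prefix. The function
name and its signature are hypothetical and only mirror the condition
quoted above; this is not the actual qpack_decode_fs() code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not HAProxy code: <raw> points to the first byte
 * of a literal name of <name_len> bytes, with <len> bytes left in the
 * stream. Returns 0 and sets <huff> to the <h> bit of the byte following
 * the name, or -1 if that byte is not reachable (truncated stream).
 */
int sketch_peek_value_prefix(const uint8_t *raw, uint64_t len,
                             uint64_t name_len, int *huff)
{
    /* name_len is assumed to be capped well below 2^64 by the varint
     * decoder, so name_len + 1 cannot wrap here.
     */
    if (len < name_len + 1)
        return -1;

    *huff = !!(raw[name_len] & 0x80);
    return 0;
}
```

With a 2-byte name and only 2 bytes left, the helper reports a truncated
stream instead of reading one byte past the end.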
In qpack_get_varint(), the maximum value is now strictly capped at 2^62-1
as per RFC. This is enforced using a budget-based check:
(v & 127) > (limit - ret) >> shift
This prevents values from overflowing into the 63rd or 64th bits, which
would otherwise break subsequent signed comparisons (e.g., if (len < name_len))
by interpreting the length as a negative value, leading to false positive
tests.
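The budget-based check can be sketched as follows, assuming the
prefix-integer encoding of RFC 7541 section 5.1. The function name, the
return convention and the extra shift guard are assumptions of this
sketch, not the actual qpack_get_varint() implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_VARINT_MAX ((1ULL << 62) - 1)

/* Decodes a prefix integer using <prefix_bits> bits of the first byte.
 * Returns the number of bytes consumed, or 0 on truncated input or when
 * the value would exceed 2^62-1.
 */
size_t sketch_get_varint(const uint8_t *buf, size_t len, int prefix_bits,
                         uint64_t *out)
{
    const uint64_t mask = (1ULL << prefix_bits) - 1;
    uint64_t ret;
    unsigned int shift = 0;
    size_t pos = 0;
    uint8_t v;

    if (!len)
        return 0;

    ret = buf[pos++] & mask;
    if (ret != mask) {
        *out = ret;
        return pos;
    }

    do {
        if (pos == len)
            return 0; /* truncated */
        v = buf[pos++];

        /* guard added by this sketch so that the shifts below
         * always remain well defined
         */
        if (shift >= 64)
            return 0;

        /* budget check: what this byte adds must fit in what is
         * left below the cap
         */
        if ((uint64_t)(v & 127) > ((SKETCH_VARINT_MAX - ret) >> shift))
            return 0;

        ret += (uint64_t)(v & 127) << shift;
        shift += 7;
    } while (v & 128);

    *out = ret;
    return pos;
}
```

Comparing the 7 payload bits against the remaining budget shifted right
avoids ever computing a value that could overflow before it is tested.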
Thank you to @jming912 for having reported this issue in GH #3302.
Must be backported as far as 2.6.
In the ENFILE and ENOMEM cases, when accept() fails, an irrelevant
global.maxsock value was printed, which doesn't reflect system limits.
Now actconn is printed instead, which gives a hint about the reason for
the failure.
Should be backported in all stable branches.
We anticipated that the do-log action would need to be expanded with
optional arguments at some point. We have now heard of multiple use-cases
that could be achieved with the do-log action, but that are limited by the
fact that all do-log statements inherit from the implicit log-profile
defined on the logger. We thus need to provide a way for the user to
specify a custom log-profile that can be used by each do-log action
individually.
This is what we try to achieve in this commit, by leveraging the
prerequisite work performed by the last 2 commits.
In process_send_log(), now also consider the ctx if ctx->profile != NULL.
In that case, we do as if logger->prof was set, but ctx->profile takes
priority over the logger one. What this means is that it becomes possible
to set ctx.profile to a profile that will be used no matter what to
generate the log payload.
This is a pre-requisite to implement the optional "profile" argument for
the do-log action.
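The priority rule can be pictured with the minimal sketch below. The
struct and function names are hypothetical and reduced to the only
fields mentioned here; the real process_send_log() does much more.

```c
#include <stddef.h>

/* Hypothetical, simplified types: only the fields needed here. */
struct sketch_log_profile { const char *name; };
struct sketch_logger      { struct sketch_log_profile *prof; };
struct sketch_log_ctx     { struct sketch_log_profile *profile; };

/* Returns the profile to use: the ctx one when set, otherwise the
 * logger one (which may be NULL).
 */
struct sketch_log_profile *
sketch_pick_profile(const struct sketch_logger *logger,
                    const struct sketch_log_ctx *ctx)
{
    if (ctx && ctx->profile)
        return ctx->profile;
    return logger->prof;
}
```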
do_log() is just a wrapper to use do_log_ctx() with pre-filled ctx, but
we now have the low-level do_log_ctx() variant which can be used to
pass specific ctx parameters instead.
Released version 3.4-dev7 with the following main changes :
- BUG/MINOR: stconn: Increase SC bytes_out value in se_done_ff()
- BUG/MINOR: ssl-sample: Fix sample_conv_sha2() by checking EVP_Digest* failures
- BUG/MINOR: backend: Don't get proto to use for webscoket if there is no server
- BUG/MINOR: jwt: Missing 'jwt_tokenize' return value check
- MINOR: flt_http_comp: define and use proxy_get_comp() helper function
- MEDIUM: flt_http_comp: split "compression" filter in 2 distinct filters
- CLEANUP: flt_http_comp: comp_state doesn't bother about the direction anymore
- BUG/MINOR: admin: haproxy-reload use explicit socat address type
- MEDIUM: admin: haproxy-reload conversion to POSIX sh
- BUG/MINOR: admin: haproxy-reload rename -vv long option
- SCRIPTS: git-show-backports: hide the common ancestor warning in quiet mode
- SCRIPTS: git-show-backports: add a restart-from-last option
- MINOR: mworker: add a BUG_ON() on mproxy_li in _send_status
- BUG/MINOR: mworker: don't set the PROC_O_LEAVING flag on master process
- Revert "BUG/MINOR: jwt: Missing 'jwt_tokenize' return value check"
- MINOR: jwt: Improve 'jwt_tokenize' function
- MINOR: jwt: Convert EC JWK to EVP_PKEY
- MINOR: jwt: Parse ec-specific fields in jose header
- MINOR: jwt: Manage ECDH-ES algorithm in jwt_decrypt_jwk function
- MINOR: jwt: Add ecdh-es+axxxkw support in jwt_decrypt_jwk converter
- MINOR: jwt: Manage ec certificates in jwt_decrypt_cert
- DOC: jwt: Add ECDH support in jwt_decrypt converters
- MINOR: stconn: Call sc_conn_process from the I/O callback if TASK_WOKEN_MSG state was set
- MINOR: mux-h2: Rely on h2s_notify_send() when resuming h2s for sending
- MINOR: mux-spop: Rely on spop_strm_notify_send() when resuming streams for sending
- MINOR: muxes: Wakup the data layer from a mux stream with TASK_WOKEN_IO state
- MAJOR: muxes: No longer use app_ops .wake() callback function from muxes
- MINOR: applet: Call sc_applet_process() instead of .wake() callback function
- MINOR: connection: Call sc_conn_process() instead of .wake() callback function
- MEDIUM: stconn: Remove .wake() callback function from app_ops
- MINOR: check: Remove wake_srv_chk() function
- MINOR: haterm: Remove hstream_wake() function
- MINOR: stconn: Wakup the SC with TASK_WOKEN_IO state from opposite side
- MEDIUM: stconn: Merge all .chk_rcv() callback functions in sc_chk_rcv()
- MINOR: stconn: Remove .chk_rcv() callback functions
- MEDIUM: stconn: Merge all .chk_snd() callback functions in sc_chk_snd()
- MINOR: stconn: Remove .chk_snd() callback functions
- MEDIUM: stconn: Merge all .abort() callback functions in sc_abort()
- MINOR: stconn: Remove .abort() callback functions
- MEDIUM: stconn: Merge all .shutdown() callback functions in sc_shutdown()
- MINOR: stconn: Remove .shutdown() callback functions
- MINOR: stconn: Totally app_ops from the stconns
- MINOR: stconn: Simplify sc_abort/sc_shutdown by merging calls to se_shutdown
- DEBUG: stconn: Add a CHECK_IF() when I/O are performed on a orphan SC
- MEDIUM: mworker: exiting when couldn't find the master mworker_proc element
- BUILD: ssl: use ASN1_STRING accessors for OpenSSL 4.0 compatibility
- BUILD: ssl: make X509_NAME usage OpenSSL 4.0 ready
- BUG/MINOR: tcpcheck: Fix typo in error error message for `http-check expect`
- BUG/MINOR: jws: fix memory leak in jws_b64_signature
- DOC: configuration: http-check expect example typo
- DOC/CLEANUP: config: update mentions of the old "Global parameters" section
- BUG/MEDIUM: ssl: Handle receiving early data with BoringSSL/AWS-LC
- BUG/MINOR: mworker: always stop the receiving listener
- BUG/MEDIUM: ssl: Don't report read data as early data with AWS-LC
- BUILD: makefile: fix range build without test command
- BUG/MINOR: memprof: avoid a small memory leak in "show profiling"
- BUG/MINOR: proxy: do not forget to validate quic-initial rules
- MINOR: activity: use dynamic allocation for "show profiling" entries
- MINOR: tools: extend the pointer hashing code to ease manipulations
- MINOR: tools: add a new pointer hash function that also takes an argument
- MINOR: memprof: attempt different retry slots for different hashes on collision
- MINOR: tinfo: start to add basic thread_exec_ctx
- MINOR: memprof: prepare to consider exec_ctx in reporting
- MINOR: memprof: also permit to sort output by calling context
- MINOR: tools: add a function to write a thread execution context.
- MINOR: debug: report the execution context on thread dumps
- MINOR: memprof: report the execution context on profiling output
- MINOR: initcall: record the file and line declaration of an INITCALL
- MINOR: tools: decode execution context TH_EX_CTX_INITCALL
- MINOR: tools: support decoding ha_caller type exec context
- MINOR: sample: store location for fetch/conv via initcalls
- MINOR: sample: also report contexts registered directly
- MINOR: tools: support an execution context that is just a function
- MINOR: actions: store the location of keywords registered via initcalls
- MINOR: actions: also report execution contexts registered directly
- MINOR: filters: set the exec context to the current filter config
- MINOR: ssl: set the thread execution context during message callbacks
- MINOR: connection: track mux calls to report their allocation context
- MINOR: task: set execution context on task/tasklet calls
- MINOR: applet: set execution context on applet calls
- MINOR: cli: keep the info of the current keyword being processed in the appctx
- MINOR: cli: keep track of the initcall context since kw registration
- MINOR: cli: implement execution context for manually registered keywords
- MINOR: activity: support aggregating by caller also for memprofile
- MINOR: activity: raise the default number of memprofile buckets to 4k
- DOC: internals: short explanation on how thread_exec_ctx works
- BUG/MINOR: mworker: only match worker processes when looking for unspawned proc
- MINOR: traces: defer processing of "-dt" options
- BUG/MINOR: mworker: fix typo &= instead of & in proc list serialization
- BUG/MINOR: mworker: set a timeout on the worker socketpair read at startup
- BUG/MINOR: mworker: avoid passing NULL version in proc list serialization
- BUG/MINOR: sockpair: set FD_CLOEXEC on fd received via SCM_RIGHTS
- BUG/MEDIUM: stconn: Don't forget to wakeup applets on shutdown
- BUG/MINOR: spoe: Properly switch SPOE filter to WAITING_ACK state
- BUG/MEDIUM: spoe: Properly abort processing on client abort
- BUG/MEDIUM: stconn: Fix abort on close when a large buffer is used
- BUG/MEDIUM: stconn: Don't perform L7 retries with large buffer
- BUG/MINOR: h2/h3: Only test number of trailers inserted in HTX message
- MINOR: htx: Add function to truncate all blocks after a specific block
- BUG/MINOR: h2/h3: Never insert partial headers/trailers in an HTX message
- BUG/MINOR: http-ana: Swap L7 buffer with request buffer by hand
- BUG/MINOR: stream: Fix crash in stream dump if the current rule has no keyword
- BUG/MINOR: mjson: make mystrtod() length-aware to prevent out-of-bounds reads
- MEDIUM: stats-file/clock: automatically update now_offset based on shared clock
- MINOR: promex: export "haproxy_sticktable_local_updates" metric
- BUG/MINOR: spoe: Fix condition to abort processing on client abort
- BUILD: spoe: Remove unsused variable
- MINOR: tools: add a function to create a tar file header
- MINOR: tools: add a function to load a file into a tar archive
- MINOR: config: support explicit "on" and "off" for "set-dumpable"
- MINOR: debug: read all libs in memory when set-dumpable=libs
- DEV: gdb: add a new utility to extract libs from a core dump: libs-from-core
- MINOR: debug: copy debug symbols from /usr/lib/debug when present
- MINOR: debug: opportunistically load libthread_db.so.1 with set-dumpable=libs
- BUG/MINOR: mworker: don't try to access an initializing process
- BUG/MEDIUM: peers: enforce check on incoming table key type
- BUG/MINOR: mux-h2: properly ignore R bit in GOAWAY stream ID
- BUG/MINOR: mux-h2: properly ignore R bit in WINDOW_UPDATE increments
- OPTIM: haterm: use chunk builders for generated response headers
- BUG/MAJOR: h3: check body size with content-length on empty FIN
- BUG/MEDIUM: h3: reject unaligned frames except DATA
- BUG/MINOR: mworker/cli: fix show proc pagination losing entries on resume
- CI: github: treat vX.Y.Z release tags as stable like haproxy-* branches
- MINOR: freq_ctr: add a function to add values with a peak
- MINOR: task: maintain a per-thread indicator of the peak run-queue size
- MINOR: mux-h2: store the concurrent streams hard limit in the h2c
- MINOR: mux-h2: permit to moderate the advertised streams limit depending on load
- MINOR: mux-h2: permit to fix a minimum value for the advertised streams limit
- BUG/MINOR: mworker: fix sort order of mworker_proc in 'show proc'
- CLEANUP: mworker: fix tab/space mess in mworker_env_to_proc_list()
Since version 3.1, the display order of old workers in 'show proc' was
accidentally reversed. The oldest worker was shown first and the newest
last, which was not the intended behavior. This regression was introduced
during the master-worker rework.
Fix this by sorting the list during deserialization in
mworker_env_to_proc_list().
An alternative fix would have been to iterate the list in reverse order
in the show proc function, but that approach risks introducing
inconsistencies when backporting to older versions.
Must be backported to 3.1 and later.
When using rq-load on tune.h2.fe.max-concurrent-streams, it's easy to
reach a situation where only one stream is allowed. There's nothing
wrong with this but it turns out that slightly higher values do not
necessarily cause significantly higher loads and will improve the user
experience. For this reason the keyword now also supports "min" to
specify a value. Experimentation shows that values from 5 to 15 remain
very effective at protecting the run queue while allowing a great level
of parallelism that keeps a site fluid.
Global setting tune.h2.fe.max-concurrent-streams now supports an optional
"rq-load" option to pass either a target load, or a keyword among "auto"
and "ignore". These are used to quadratically reduce the advertised streams
limit when the thread's run queue size goes beyond the configured value,
and automatically reduce the load on the process from new connections.
With "auto", instead of taking an explicit value, it uses as a target the
"tune.runqueue-depth" setting (which might be automatic). Tests have shown
that values between 50 and 100 are already very effective at reducing the
loads during attacks from 100000 to around 1500. By default, "ignore"
is in effect, which means that the dynamic tuning is not enabled.
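To give an idea of what a quadratic reduction combined with a "min"
floor may look like, here is a sketch. The exact formula used by haproxy
is not described here, so the (target / rq)^2 scaling below, as well as
all names, are assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical formula: once the run queue <rq> exceeds <target>, scale
 * <hard_limit> by (target / rq)^2, never going below <min_limit> nor
 * above <hard_limit>. A zero <target> is taken as "ignore".
 */
uint32_t sketch_advertised_streams(uint32_t hard_limit, uint32_t min_limit,
                                   uint32_t target, uint32_t rq)
{
    uint64_t limit = hard_limit;

    if (target && rq > target) {
        /* two successive scalings avoid overflowing 64 bits */
        limit = limit * target / rq;
        limit = limit * target / rq;
    }

    if (limit < min_limit)
        limit = min_limit;
    if (limit > hard_limit)
        limit = hard_limit;
    return (uint32_t)limit;
}
```

With a hard limit of 100 and a target of 50, a run queue of 100 entries
would yield 25 streams in this sketch, and a run queue of 1000 entries
would fall back to the configured minimum.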
The hard limit on the number of concurrent streams is currently
determined only by configuration and returned by
h2c_max_concurrent_streams(). However this doesn't make it possible to
change such settings on the fly without risking breaking connections,
and it doesn't allow a connection to pick a different value, which
could be desirable for example to try to slow abuse down.
Let's store a copy of h2c_max_concurrent_streams() at connection
creation time into the h2c as streams_hard_limit. This inflates
the h2c size from 1324 to 1328 (0.3%) which is acceptable for the
expected benefits.
The new field th_ctx->rq_tot_peak contains the computed peak run queue
length averaged over the last 512 calls. This is computed when entering
process_runnable_tasks. It will not take into account new tasks that are
created or woken up during this round nor those which are evicted, which
is the reason why we're using a peak measurement to increase chances to
observe transient high values. Tests have shown that 512 samples are good
to provide a relatively smooth average measurement while still fading
away in a matter of milliseconds at high loads. Since this value is
only updated once per round, it cannot be used as a statistic and
shouldn't be exposed, it's only for internal use (self-regulation).
Sometimes it's desirable to observe fading away peak values, where a new
value that is higher than the historical one instantly replaces it,
otherwise contributes to it. It is convenient when trying to observe
certain phenomena like peak queue sizes. The new function
swrate_add_peak_local() does that to a private variable (no atomic ops
involved as it's not worth the cost since such use cases are typically
local).
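The idea can be sketched as below, assuming the usual sliding-window sum
representation where the stored sum is the average scaled by the window
size. The function name and its exact rounding are assumptions, not the
actual swrate_add_peak_local() code.

```c
#include <stdint.h>

/* <sum> holds the average scaled by the window size <n> (n > 0). A new
 * value higher than the current average replaces it at once, otherwise
 * it contributes to it while the history fades away. Returns the new
 * average, rounded up.
 */
uint32_t sketch_add_peak(uint64_t *sum, uint32_t n, uint32_t v)
{
    uint64_t scaled = (uint64_t)v * n;

    if (scaled > *sum)
        *sum = scaled;                        /* new peak */
    else
        *sum = *sum - (*sum + n - 1) / n + v; /* fade and contribute */

    return (uint32_t)((*sum + n - 1) / n);
}
```

No atomic operation is used, which matches the private-variable use case
described above.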
Add detection of release tags matching the vX.Y.Z pattern so they use
the same stable CI configuration as haproxy-* branches, rather than the
development one.
This prevents stable tags from triggering the CI with Docker images and
SSL libraries that are only used for development.
Must be backported in stable releases.
After commit 594408cd61 ("BUG/MINOR: mworker/cli: fix show proc
pagination using reload counter"), the old-workers pagination stores
ctx->next_reload = child->reloads on flush failure, then skips entries
with child->reloads >= ctx->next_reload on resume.
The >= comparison is direction-dependent: it assumes the list is in
descending reload order (newest first). On current master, proc_list
is in ascending order (oldest first) because mworker_env_to_proc_list()
appends deserialized entries before mworker_prepare_master() appends
the new worker. This means the skip logic is inverted and can miss
entries or loop incorrectly depending on the version.
We fix this by renaming the context field to resume_reload and changing its
semantics: it now tracks the reload count of the last *successfully
flushed* row rather than the failed one. On flush failure, resume_reload
is left unchanged so the failed row is replayed on the next call. On
resume, entries are skipped by walking the list until the marker entry is
found (exact == match), which works regardless of list direction.
Additionally, we have to handle the unlikely case where the marker entry
is deleted from proc_list between handler calls (e.g. the process exits and
SIGCHLD processing removes it). Detect this by tracking the previous
LEAVING entry's reload count during the skip phase: if two consecutive
entries straddle the skip value (one > skip, the other < skip), the
deleted entry's former position has been crossed, so skipping stops and
the current entry is emitted.
This should be backported to all stable branches. On branches where
proc_list is in descending order (2.9, 3.0), the fix applies the
same way since the skip logic is now direction-agnostic.
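The direction-agnostic skip phase can be sketched on a plain array of
reload counts. This is a simplified model with hypothetical names: the
real code walks proc_list and only considers LEAVING entries.

```c
#include <stddef.h>

/* Returns the index of the first entry to emit on resume, given the
 * reload count <marker> of the last successfully flushed row. Entries
 * are skipped until the marker is found (exact match). If two
 * consecutive entries straddle the marker, the marker entry was deleted
 * in between and the current entry must be emitted.
 */
size_t sketch_resume_index(const int *reloads, size_t n, int marker)
{
    int have_prev = 0;
    int prev = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        int cur = reloads[i];

        if (cur == marker)
            return i + 1; /* resume right after the marker */

        if (have_prev &&
            ((prev > marker && cur < marker) ||
             (prev < marker && cur > marker)))
            return i;     /* marker vanished: emit this one */

        prev = cur;
        have_prev = 1;
    }
    return n;
}
```

The same function gives the expected position whether the list is sorted
in ascending or descending reload order.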
The HTTP/3 parser cannot deal with unaligned frames, except for DATA. As
it was expected that such a case would not occur, a simple BUG_ON() was
written to protect HEADERS parsing.
First, this BUG_ON() was incorrectly written, due to an incorrect operator
('>=' vs '>') when checking if data wraps. This patch thus corrects it.
However this correction is not sufficient as it is still possible to handle
a large unaligned HEADERS frame, which would trigger this BUG_ON(). This
is very unlikely as HEADERS is the first received frame on a request
stream, but not completely impossible. As HTTP/3 frame header (type +
length) is parsed first and removed, this leaves a small gap at the
buffer beginning. If this small gap is then filled with the remaining
frame payload, it would result in unaligned data. Trailers are also
sensitive here, as in this case a HEADERS frame is handled after other
frames.
The objective of this patch is to ensure that an unaligned frame is now
handled in a safe way. This is extended to all HTTP/3 frames (except DATA)
and not only to HEADERS type. Parsing is interrupted if frame payload is
wrapping in the buffer. This should never happen except maybe with some
weird clients, so the connection is closed with H3_EXCESSIVE_LOAD error.
This approach is considered the safest one, in particular for backport
purpose. In the future, a realign operation via copy may be implemented
instead if considered useful.
This must be backported up to 2.6.
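The wrapping test itself can be sketched for a generic ring buffer as
follows. The helper and its parameters are hypothetical and do not
reflect haproxy's buffer API; the sketch only shows why a payload that
exactly reaches the end of the storage area does not wrap.

```c
#include <stddef.h>

/* Returns non-zero if <len> bytes starting at offset <head> wrap in a
 * ring buffer of <size> bytes (head <= size is assumed). A payload
 * ending exactly at the end of the area is contiguous.
 */
int sketch_payload_wraps(size_t size, size_t head, size_t len)
{
    return len > size - head;
}
```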
In QUIC, a STREAM frame may be received with no data but with FIN bit
set. This situation is tedious to handle and haproxy parsing code has
changed several times to deal with this situation. Now, H3 and H09
layers parsing code are skipped in favor of the shared function
qcs_http_handle_standalone_fin() used to handle the HTX EOM emission.
However, this shortcut bypasses an important HTTP/3 validation check on
the received body size vs the announced content-length header. Under
some conditions, this could cause a desynchronization with the backend
server which could be exploited for request smuggling.
Fix HTTP/3 parsing code by adding a call to h3_check_body_size() prior
to qcs_http_handle_standalone_fin() if content-length header has been
found. If the body size is incorrect, the stream is immediately reset
with H3_MESSAGE_ERROR code and the error is forwarded to the stream
layer.
Thanks to Martino Spagnuolo for his detailed report on this issue and
for having contacted us about it via the security mailing list.
This must be backported up to 2.6.
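The validation that must not be bypassed boils down to the comparison
sketched below. The function name and arguments are hypothetical and
this is not the actual h3_check_body_size() code.

```c
#include <stdint.h>

/* <announced> is the content-length value, <received> the body bytes
 * seen so far, <fin> is non-zero once the stream is finished. Returns 0
 * if the sizes are consistent, -1 otherwise.
 */
int sketch_check_body_size(uint64_t announced, uint64_t received, int fin)
{
    if (received > announced)
        return -1; /* more data than announced */
    if (fin && received != announced)
        return -1; /* stream ended too early */
    return 0;
}
```

A standalone FIN received after only part of the announced body thus has
to be reported as an error rather than turned into a regular end of
message.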
hstream_build_http_resp() currently uses snprintf() to build the
status code and the generated X-req/X-rsp header values.
These strings are short and are fully derived from already parsed request
state, so they can be assembled directly in the HAProxy trash buffer using
`chunk_strcat()` and `ultoa_o()`.
This keeps the generated output unchanged while removing the remaining
`snprintf()` calls from the response-building path.
No functional change is expected.
Signed-off-by: Aleksandar Lazic <al-haproxy@none.at>
The window size increments are 31 bits and the topmost bit is reserved
and should be ignored, however it was not masked, so a peer sending it
set would emit a negative value which could actually reduce the current
window instead of increasing it. Note that the window cannot reach zero
as there's already a test for this, but transfers could slow down to
the same speed as if an initial window of just a few bytes had been
advertised. Let's just mask the reserved bit before processing.
This should be backported to all stable versions.
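Masking the reserved bit can be sketched as follows. The helper name is
hypothetical; it only shows a 32-bit big-endian read followed by the
mask that keeps the lower 31 bits.

```c
#include <stdint.h>

/* Reads a 32-bit big-endian field from <p> and drops the reserved
 * topmost bit, leaving a 31-bit value.
 */
uint32_t sketch_read_u31(const uint8_t *p)
{
    uint32_t v = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                 ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];

    return v & 0x7fffffffU;
}
```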
The stream ID indicated in GOAWAY frames must have its bit 31 (R) ignored
and this wasn't the case. The effect is that if this bit was present, the
GOAWAY frame would mark the last acceptable stream as negative, which is
the default situation (unlimited), thus would basically result in this
GOAWAY frame being ignored since it would replace a negative last_sid
with another negative one. The impact is thus basically that if a peer
would emit anything non-zero in the R bit, the GOAWAY frame would be
ignored and new streams would still be initiated on the backend, before
being rejected by the server.
Thanks to Haruto Kimura (Stella) for finding and reporting this bug.
This fix needs to be backported to all stable versions.
The key type received over the peers protocol is not checked for
validity and as a result can crash the process when passed through
peer_int_key_type[] in peer_treat_definemsg(). The risk remains
very low since only trusted peers may exchange tables, however it
represents a risk the day haproxy supports new key types, because
mixing old and new versions could then cause the old ones to crash.
Let's add the required check in peer_treat_definemsg().
It is also worth noting that in this function a few protocol identifiers
of type int are read directly from a var_int via intdecode() and that some
protocol aliasing may occur (e.g. table_id, table_id_len etc). This is
not supposed to be a problem but it could hide implementation bugs and
cause interoperability issues once fixed, so these should be addressed
in a future commit that will not be marked for backporting.
Thanks to Haruto Kimura (Stella) for finding and reporting this bug.
This fix needs to be backported to all stable versions.
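The kind of check that was missing can be sketched as below. The table
contents and names are made up for the example and do not match the real
peer_int_key_type[] array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping from wire key types to internal ones; -1 marks
 * types that are not supported.
 */
static const int sketch_key_map[] = { -1, -1, 10, 11, 12 };

/* Returns the internal key type for <wire_type>, or -1 if the received
 * value is out of range or unsupported.
 */
int sketch_map_key_type(uint64_t wire_type)
{
    if (wire_type >= sizeof(sketch_key_map) / sizeof(sketch_key_map[0]))
        return -1;
    return sketch_key_map[wire_type];
}
```

The range test happens before the array is indexed, so an unknown type
sent by a newer peer is rejected instead of reading out of bounds.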
In pcli_prefix_to_pid(), when resolving a worker by absolute pid
(@!<pid>) or by relative pid (@1), a worker that still has PROC_O_INIT
set (i.e. not yet ready, still initializing) could be returned as a
valid target.
During a reload, if a client connects to the master CLI and sends a
command targeting a worker (e.g. @@1 or @@!<pid>), the master resolves
the target pid and attempts to forward the command by transferring a fd
over the worker's sockpair. If the worker is still initializing and has
not yet sent its READY signal, its end of the sockpair is not usable,
causing send_fd_uxst() to fail with EPIPE. This results in the
following alert being repeated in a loop:
[ALERT] (550032) : socketpair: Cannot transfer the fd 13 over sockpair@5. Giving up.
The situation is even worse if the initializing worker has already
exited (e.g. due to a bind failure) but has not yet been removed from
the process list: in that case the sockpair's remote end is already
closed, making the failure immediate and unrecoverable until the dead
worker is cleaned up.
This was not possible before 3.1 because the master's polling loop only
started once all workers were fully ready, making it impossible to
receive CLI connections while a worker was still initializing.
Fix this by skipping workers with PROC_O_INIT set in both the absolute
and relative pid resolution paths of pcli_prefix_to_pid(), so that
only fully initialized workers can be targeted.
Must be backported to 3.1 and later.
When loading libs into the core dump, let's also try to load
libthread_db.so.1 that gdb usually requires. It can significantly help
decoding the threads for systems which require it, and the file is quite
small. It can appear at a few different locations and is generally next
to libpthread.so, or alternately libc, so we first look where we found
them, and fall back to a few other common places. The file is really
small, a few tens of kB usually.
When set-dumpable=libs, let's also pick the debug symbols for the libs
we're loading. For now we only try /usr/lib/debug/<path>, which is quite
common and easy to guess. Build IDs could also be used but are more
complex to deal with, so let's stay simple for now.
This utility takes as argument the path to a core dump, and it looks
for the archive signature of libraries embedded with "set-dumpable libs",
and either emits the offset and size on stdout, or directly dumps the
contents so that the tar file can be extracted directly by piping the
output to tar xf.
When "set-dumpable" is set to "libs", in addition to marking the process
dumpable, haproxy also reads the binary and shared objects into memory as
a tar archive in a page-aligned location so that these files are easily
extractable from a future core dump. The goal here is to always have
access to the exact same binary and libs as those which caused the core
to happen. It's indeed very frequent to miss some of these, or to get
mismatching files due to a local update that didn't experience a reload,
or to get those of a host system instead of the container.
The in-memory tar file presents everything under a directory called
"core-%d" where %d corresponds to the PID of the worker process. In
order to ease the finding of these data in the core dump, the memory
area is contiguous and surrounded by PROT_NONE pages so that it appears
in its own segment in the core file. The total size used by this is a
few tens of MB, which is not a problem on large systems.
The global "set-dumpable" keyword currently is only positional. Let's
extend its syntax to support arguments. For now we support both "on"
and "off" to explicitly enable or disable it.
New function load_file_into_tar() concatenates a file into an in-memory
tar archive and grows its size. Only the base name and a provided prefix
are used to name the file. If the file cannot be loaded, it's added as
size zero and permissions 0 to show that it failed to load. This will
be used to load post-mortem information so it needs to remain simple.
The purpose here is to create a tar file header in memory from a known
file name, prefix, size and mode. It will be used to prepare archives
of libs in use for improved debugging, but may probably be useful for
other purposes due to its simplicity.
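For reference, building a minimal ustar header in memory can be sketched
as follows. This is not the function added by this commit: the name is
hypothetical, the prefix handling is left out, and uid, gid and mtime
are simply set to zero.

```c
#include <stdio.h>
#include <string.h>

/* Fills the 512-byte ustar header <hdr> for a regular file. Returns 0
 * on success, -1 if <name> or <size> do not fit in their fields.
 */
int sketch_tar_header(unsigned char hdr[512], const char *name,
                      unsigned long long size, unsigned int mode)
{
    size_t nlen = strlen(name);
    unsigned int sum = 0;
    size_t i;

    if (nlen > 99 || size > 077777777777ULL)
        return -1;

    memset(hdr, 0, 512);
    memcpy(hdr, name, nlen);                              /* name */
    snprintf((char *)hdr + 100, 8, "%07o", mode & 07777); /* mode */
    snprintf((char *)hdr + 108, 8, "%07o", 0U);           /* uid */
    snprintf((char *)hdr + 116, 8, "%07o", 0U);           /* gid */
    snprintf((char *)hdr + 124, 12, "%011llo", size);     /* size */
    snprintf((char *)hdr + 136, 12, "%011o", 0U);         /* mtime */
    hdr[156] = '0';                                       /* regular file */
    memcpy(hdr + 257, "ustar", 6);                        /* magic + NUL */
    memcpy(hdr + 263, "00", 2);                           /* version */

    /* the checksum is computed with its own field set to spaces */
    memset(hdr + 148, ' ', 8);
    for (i = 0; i < 512; i++)
        sum += hdr[i];
    snprintf((char *)hdr + 148, 7, "%06o", sum);
    hdr[155] = ' ';
    return 0;
}
```

The file contents then follow the header, padded to a multiple of 512
bytes.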
Since 7a1382da7 ("BUG/MINOR: spoe: Fix condition to abort processing on
client abort"), the chn variable is no longer used in
spoe_process_event(). Let's remove it.
This patch must be backported with the commit above, as far as 3.1.
The test to detect client aborts in the SPOE, introduced by commit b3be3b94a
("BUG/MEDIUM: spoe: Properly abort processing on client abort"), was not
correct. Producer flags must not be tested. Only the frontend SC must be
tested when the abortonclose option is set.
Because of this bug, when a client aborted, the SPOE processing was aborted
too, regardless of the abortonclose option.
This patch must be backported with the commit above, so as far as 3.1.
haproxy_sticktable_local_updates corresponds to the table->localupdate
counter, which is used internally by the peers protocol to identify
update messages in order to send and ack them among peers.
Here we decide to expose this information, as it is already the case in
"show peers" output, because it turns out that this value, which is
cumulative and grows in sync with the number of updates triggered on the
table due to changes initiated by the current process, can be used to
compute the update rate of the table. Computing the update rate of the
table (from the process point of view, ie: updates sent by the process and
not those received by the process), can be a great load indicator in order
to properly scale the infrastructure that is intended to handle the
table updates.
Note that there is a pitfall, which is that the value will eventually
wrap since it is stored as an unsigned 32-bit integer. Scripts or systems
making use of this value must take wrapping into account between two
readings to properly compute the effective number of updates that were
performed between two readings. Also, they must ensure that the "polling"
rate between readings is small enough so that the value cannot wrap behind
their back.
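A consumer can handle the wrapping with modular arithmetic, as in this
sketch. The function name is hypothetical, and the result is only valid
if fewer than 2^32 updates happened between the two readings.

```c
#include <stdint.h>

/* Number of updates between two readings of a 32-bit counter that may
 * have wrapped once in between.
 */
uint32_t sketch_updates_between(uint32_t prev, uint32_t cur)
{
    return (uint32_t)(cur - prev);
}
```

Dividing the result by the time elapsed between the two readings gives
the update rate.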
We no longer rely on now_offset stored in the shm-stats-file. Instead
haproxy automatically computes the now_offset relative to the monotonic
clock and the shared global clock.
Indeed, the previous model based on static now_offset when monotonic
clock is available proved to be insufficient when used in
combination with shm-stats-file (that is when monotonic clock is shared
between multiple co-processes). In an ideal situation, co-processes would
correctly apply the offset to their local monotonic clock and end up
with consistent now_ns. But when restarting from an existing
shm-stats-file from a previous session (ie: prior to reboot), then the
local monotonic clock would no longer be consistent with the one used
to update the file previously, so applying a static offset would fail
to restore clock consistency.
For this specific issue, a workaround was brought by 09bf116
("BUG/MEDIUM: stats-file: detect and fix inconsistent shared clock when resuming from shm-stats-file")
but the solution implemented there was deemed too fragile, because there
is a 60sec window where the fix would fail to detect an inconsistent clock
and would leave haproxy with a clock that is off by 0 to 60 seconds,
which can be huge.
We now simply recompute our local offset (difference between OUR
monotonic clock and the SHARED one) each time we learn from another
process (through the shared map by reading global_now_ns). Also, in
clock_update_global_date(), we make sure we always recompute the
now_offset as now_ms may have been updated from the shared clock if the
shared clock was ahead of us.
Thanks to that new logic, interrupted processes, resumed processes, and
processes started with a shm-stats-file from a previous session now
correctly recover from those various situations, and multiple
co-processes with diverging clocks on startup end up converging to
the same values.
Since it is no longer relevant to save now_offset in the map, it was
removed, but to prevent shm-stats-file incompatibility with previous
versions, an 8-byte hole was forced, and we didn't bump the
shm-stats-file version on purpose.
This patch may be backported in 3.3 after a solid period of observation
to ensure we didn't break things.
mystrtod() was not length-aware and relied on null-termination or a
non-numeric character to stop. The fix adds a length parameter as a
strict upper bound for all pointer accesses.
The practical impact in haproxy is essentially null: all callers embed
the JSON payload inside a large haproxy buffer, so the speculative read
past the last digit lands on memory that is still within the same
allocation. ASAN cannot detect it in a normal haproxy run for the same
reason — the overread never escapes the enclosing buffer. Triggering a
detectable fault requires placing the JSON payload at the exact end of
an allocation.
Note: the 'path' buffer was using a null-terminated string so the result
of strlen is passed to it, this part was not at risk.
Thanks to Kamil Frankowicz for the original bug report.
This patch must be backported to all maintained versions.
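The principle of a length-aware scan can be sketched on a much simpler
parser than mystrtod(). The function below is hypothetical: it only
reads decimal digits and ignores overflow, but it shows the bound being
tested before each byte is accessed.

```c
#include <stddef.h>

/* Parses leading decimal digits from <s> without ever reading past
 * s[len - 1]. Stores the value in <out> and returns the number of bytes
 * consumed. Overflow is not handled in this sketch.
 */
size_t sketch_parse_digits(const char *s, size_t len,
                           unsigned long long *out)
{
    unsigned long long v = 0;
    size_t i = 0;

    /* the length test comes first so s[i] is only read when valid */
    while (i < len && s[i] >= '0' && s[i] <= '9') {
        v = v * 10 + (unsigned long long)(s[i] - '0');
        i++;
    }

    *out = v;
    return i;
}
```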
The commit 9f1e9ee0e ("DEBUG: stream: Display the currently running rule in
stream dump") revealed a bug. When a stream is dumped, if it is blocked on a
rule, we must make sure the rule has a keyword before displaying its name.
Indeed, some action parsings are inlined with the rule parser. In that case,
there is no keyword attached to the rule.
Because of this bug, crashes can be experienced when a stream is
dumped. Now, when there is no keyword, "?" is displayed instead.
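The guard can be sketched as follows. This is only an illustration under
assumptions: the two structures and the rule_name() helper are
hypothetical stand-ins, not haproxy's real rule and keyword types.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a rule and its optional keyword. */
struct action_kw_sketch {
	const char *kw;                     /* keyword name */
};

struct act_rule_sketch {
	const struct action_kw_sketch *kw;  /* NULL for inlined actions */
};

/* Returns a printable name for <rule>, falling back to "?" when no
 * keyword is attached instead of dereferencing a NULL pointer.
 */
static const char *rule_name(const struct act_rule_sketch *rule)
{
	return (rule && rule->kw) ? rule->kw->kw : "?";
}
```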
This patch must be backported as far as 2.6.
When a L7 retry is performed, we should not rely on b_xfer() to swap the L7
buffer with the request buffer. At that point, the request buffer is not
allocated, and b_xfer() must not be called with an unallocated destination
buffer. The swap remains an optimization. For instance, it is not performed
on buffers of different sizes. So the caller is responsible for providing an
allocated destination buffer with enough free space to transfer data.
However, when a L7 retry is performed, we cannot allocate a request buffer,
because we cannot yield: an error was reported and, if we wait for a buffer,
the error will be handled by process_stream(). But we can swap the buffers
by hand. At this stage, we know there is no request buffer, so we can easily
swap it with the L7 buffer.
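A manual swap with an unallocated destination boils down to moving the
buffer descriptor. The sketch below assumes a simplified, hypothetical
buffer structure and a hypothetical take_buffer() helper; it is not the
code of the patch.

```c
#include <stddef.h>

/* Hypothetical, simplified buffer descriptor for illustration only. */
struct buf_sketch {
	char *area;   /* storage, NULL when not allocated */
	size_t size;  /* storage size */
	size_t head;  /* offset of the first byte of data */
	size_t data;  /* amount of data */
};

/* Moves <src> into <dst>, which the caller guarantees to be unallocated.
 * Since <dst> holds nothing, swapping the two descriptors amounts to
 * copying <src> and resetting it: no data is copied, no allocation
 * is needed.
 */
static void take_buffer(struct buf_sketch *dst, struct buf_sketch *src)
{
	*dst = *src;
	src->area = NULL;
	src->size = src->head = src->data = 0;
}
```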
Note there is no real bug for now.
This patch could be backported to all stable versions.
In HTX, headers and trailers parts must always be complete. It is unexpected
to find header blocks without the EOH block or trailer blocks without the
EOT block. So, during H2/H3 message parsing, we must take care to remove any
HEADER/TRAILER block inserted when an error is encountered. This is
mandatory to properly report parsing errors to the upper layer.
It is now performed by calling the htx_truncat_blk() function on the error
path. The tail block is saved before converting any HEADERS/TRAILERS frame
to HTX. It is used to remove all inserted blocks on error.
This patch relies on the following one:
"MINOR: htx: Add function to truncate all blocks after a specific block"
It should be backported with the commit above to all stable versions for
the H2 part and as far as 2.8 for h3 one.
When H2 or H3 trailers are inserted in an HTX message, we must take care to
not exceed the maximum number of trailers allowed in a message (same as
the maximum number of headers, i.e. tune.http.maxhdr). However, all HTX
blocks in the HTX message were considered. Only TRAILERS HTX blocks must be
considered.
To fix the issue, in h2_make_htx_trailers(), we rely on the "idx" variable
at the end of the for loop. In h3_trailers_to_htx(), we rely on the
"hdr_idx" variable.
This patch must be backported to all stable versions for the H2 part and as
far as 2.8 for the H3 one.
L7 retries are buggy when a large buffer is used on the request channel. A
memcpy is used to copy data from the request buffer into the L7 buffer. The
L7 buffer is for now always a standard buffer. So if a larger buffer is
used, this leads to a buffer overflow and crashes the process.
The best way to fix the issue is to disable L7 retries when a large buffer
was allocated for the request channel. In that case, we don't want to
allocate an extra large buffer.
No backport needed.
When a large buffer is used on a channel, once we've started to send data to
the opposite side, receives are blocked temporarily to be sure to flush the
large buffer ASAP to be able to fall back on regular buffers. This was
performed by skipping the call to the endpoint (connection or applet).
However, doing so broke abortonclose and, more generally, masked any
shut or error events reported by the lower layer.
To fix the issue, instead of skipping receives, we now try a receive but
with a requested size set to 0.
No backport needed
When abortonclose is configured, a client abort was ignored when messages
were sent on event, while it works properly when messages are sent via a
"send-spoe-group" action.
To fix the issue, when the SPOE filter is waiting for the SPOE applet
response, it must check if a client abort was reported and if so, must
interrupt its processing.
This patch should be backported as far as 3.1.
When the SPOE applet is created, the SPOE filter is set in SENDING_MSGS
state. When the applet has transferred data, it should switch the filter to
WAITING_ACK state. Concretely, there is no bug. At best, it could save some
useless applet wakeups.
This patch should be backported as far as 3.1
When SC's shutdown callback functions were merged, a regression was
introduced. The applet was no longer woken up. Because of this bug, an
applet could remain blocked, waiting for an I/O event or a timeout.
This patch should fix issue #3301.
No backport needed.
FDs received through recv_fd_uxst() do not have FD_CLOEXEC set.
The equivalent sock_accept_conn() already handles this correctly:
any FD accepted or received in the master must be marked close-on-exec
to avoid leaking it across the execvp() performed on soft-reload.
This is currently triggering a leak in the master since 3.1: the worker
sends a socketpair fd to the master to issue the _send_status CLI
command, and recv_fd_uxst() receives it without setting FD_CLOEXEC. If a
re-exec is emitted before the master had the chance to close that fd, it
survives execvp() and appears as an untracked unnamed AF_UNIX socket in
the new master generation.
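Marking a received FD close-on-exec can be sketched with the standard
fcntl() calls. The set_cloexec() helper name is hypothetical and this is
not necessarily how the patch is written; it only shows the mechanism that
prevents the FD from surviving execvp().

```c
#include <fcntl.h>

/* Hypothetical helper: marks <fd> close-on-exec while preserving its
 * other descriptor flags. Returns 0 on success, -1 on error with errno
 * set by fcntl().
 */
static int set_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags == -1)
		return -1;

	if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1)
		return -1;

	return 0;
}
```

Calling such a helper right after the FD is received leaves no window
during which a re-exec could inherit it.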
This must be backported to all maintained branches.
Add a NULL guard for the version field. This has no functional impact
since the master process never uses this field for its own mworker_proc
element, and should be the only one impacted. This avoids seeing "(null)"
in the version field when debugging.
Must be backported to 3.1 and later.
During a soft reload, a starting worker sends sock_pair[0] to the master
via send_fd_uxst(), then reads on sock_pair[1] waiting for the master to
acknowledge receipt. Because of a documented macOS sendmsg(2) bug, the
worker must keep sock_pair[0] open until the master confirms the fd was
received by the CLI applet. This means the read() on sock_pair[1] will
never return 0 (EOF), since the worker itself still holds a reference to
sock_pair[0]. The worker can only unblock when the master actively sends
a byte back. If the master crashes before doing so, the worker blocks
indefinitely in read().
Fix this by setting a 2-second SO_RCVTIMEO on sock_pair[1] before the
read(), so the worker can unblock and continue regardless of the master's
state.
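The mechanism can be sketched as below, assuming a hypothetical wait_ack()
helper that takes the timeout as an argument (the patch uses 2 seconds).
Note that a zero timeout means "no timeout" for SO_RCVTIMEO, so the value
must be non-zero for the read to be bounded.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: waits at most <sec> seconds for one byte on the
 * socket <fd>. Returns 1 if a byte was received, 0 on EOF, -1 on timeout
 * or error (errno is EAGAIN or EWOULDBLOCK on timeout).
 */
static int wait_ack(int fd, long sec)
{
	struct timeval tv = { .tv_sec = sec, .tv_usec = 0 };
	ssize_t ret;
	char c;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == -1)
		return -1;

	/* retry only when interrupted by a signal */
	do {
		ret = read(fd, &c, 1);
	} while (ret == -1 && errno == EINTR);

	return ret > 0 ? 1 : (int)ret;
}
```

Even when the peer never answers and the local process still holds the
other end of the socketpair, the read() now fails after the timeout
instead of blocking forever.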
This was introduced by d7f6819161 ("BUG/MEDIUM: mworker: fix startup
and reload on macOS").
This should be backported to 3.1 and later.
In mworker_proc_list_to_env(), a typo used '&=' instead of '&' when
checking PROC_O_TYPE_WORKER in child->options. This would corrupt the
options field by clearing all bits except PROC_O_TYPE_WORKER, but since
the function is called right before the master re-execs itself during a
reload, the corruption has no actual effect: the in-memory proc_list is
discarded by the exec, and the options field is not serialized to the
environment anyway.
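The difference between the two operators can be shown with a small sketch.
The flag values, the SKETCH_OTHER_FLAG name and the structure below are
illustrative assumptions, not the real definitions.

```c
/* Illustrative values only, not necessarily haproxy's real ones. */
#define PROC_O_TYPE_WORKER 0x2
#define SKETCH_OTHER_FLAG  0x8   /* hypothetical unrelated option bit */

struct proc_sketch {
	int options;
};

/* Buggy form: '&=' tests the bit but also overwrites <options>,
 * clearing every other bit as a side effect.
 */
static int is_worker_buggy(struct proc_sketch *p)
{
	return (p->options &= PROC_O_TYPE_WORKER) != 0;
}

/* Fixed form: '&' only tests the bit and leaves <options> untouched. */
static int is_worker_fixed(const struct proc_sketch *p)
{
	return (p->options & PROC_O_TYPE_WORKER) != 0;
}
```

Both forms return the same result for the test itself, which is why the
typo went unnoticed: only the state left behind differs.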
This should be backported to all maintained versions.
We defer processing of the "-dt" options until after the configuration
file has been read. This will be useful if we ever allow trace sources
to be registered later, for instance with Lua.
No backport needed.
In master-worker mode, when a freshly forked worker looks up its own
entry in proc_list to send its "READY" status to the master, the loop
was breaking on the first process with pid == -1 regardless of its
type. If a non-worker process (e.g. a master or program) also had
pid == -1, the wrong entry could be selected, causing send_fd_uxst()
to use an invalid ipc_fd.
Fix this by adding a PROC_O_TYPE_WORKER check to the loop condition,
and add a BUG_ON() assertion to catch any case where the loop exits
without finding a valid worker entry.
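The corrected lookup can be sketched as follows. The structure, the flag
values and the find_starting_worker() helper are hypothetical and use an
array instead of the real list; the caller is expected to assert that the
result is not NULL, as the BUG_ON() does.

```c
#include <stddef.h>

/* Illustrative flag values and structure, not the real definitions. */
#define SK_TYPE_MASTER 0x1
#define SK_TYPE_WORKER 0x2

struct mproc_sketch {
	int options;  /* process type flags */
	int pid;      /* -1 while the process has no known pid yet */
	int ipc_fd;   /* socket used to talk to the master */
};

/* Returns the first entry that is BOTH a worker and still has pid == -1,
 * or NULL if there is none. Checking the type prevents a master or
 * program entry with pid == -1 from being selected by mistake.
 */
static const struct mproc_sketch *
find_starting_worker(const struct mproc_sketch *procs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if ((procs[i].options & SK_TYPE_WORKER) && procs[i].pid == -1)
			return &procs[i];
	}
	return NULL;
}
```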
Must be backported to 3.1.