New function load_file_into_tar() concatenates a file into an in-memory
tar archive and grows its size. Only the base name and a provided prefix
are used to name the faile. If the file cannot be loaded, it's added as
size zero and permissions 0 to show that it failed to load. This will
be used to load post-mortem information so it needs to remain simple.
The purpose here is to create a tar file header in memory from a known
file name, prefix, size and mode. It will be used to prepare archives
of libs in use for improved debugging, but may probably be useful for
other purposes due to its simplicity.
Since 7a1382da7 ("BUG/MINOR: spoe: Fix condition to abort processing on
client abort"), the chn variable is no longer used in
spoe_process_event(). Let's remove it
This patch must be backported with the commit above, as far as 3.1.
The test to detect client aborts in the SPOE, introduced by commit b3be3b94a
("BUG/MEDIUM: spoe: Properly abort processing on client abort"), was no
correct. Producer flags must not be tested. Only the frontend SC must be
tested when the abortonclose option is set.
Because of this bug, when a client aborted, the SPOE processing was aborted
too, regardless the abortonclose option.
This patch must be backpoeted with the commit above, so as far as 3.1.
haproxy_sticktable_local_updates corresponds to the table->localupdate
counter, which is used internally by the peers protocol to identify
update messages in order to send and ack them among peers.
Here we decide to expose this information, as it is already the case in
"show peers" output, because it turns out that this value, which is
cumulative and grows in sync with the number of updates triggered on the
table due to changes initiated by the current process, can be used to
compute the update rate of the table. Computing the update rate of the
table (from the process point of view, ie: updates sent by the process and
not those received by the process), can be a great load indicator in order
to properly scale the infrastructure that is intended to handle the
table updates.
Note that there is a pitfall, which is that the value will eventually
wrap since it is stored using unsigned 32bits integer. Scripts or system
making use of this value must take wrapping into account between two
readings to properly compute the effective number of updates that were
performed between two readings. Also, they must ensure that the "polling"
rate between readings is small enough so that the value cannot wrap behind
their back.
We no longer rely on now_offset stored in the shm-stats-file. Instead
haproxy automatically computes the now_offset relative to the monotonic
clock and the shared global clock.
Indeed, the previous model based on static now_offset when monotonic
clock is available proved to be insufficient when used in
combination with shm-stats-file (that is when monotonic clock is shared
between multiple co-processes). In ideal situation co-processes would
correctly apply the offset to their local monotonic clock and end up
with consistent now_ns. But when restarting from an existing
shm-stats-file from a previous session (ie: prior to reboot), then the
local monotonic clock would no longer be consistent with the one used
to update the file previously, so applying a static offset would fail
to restore clock consistency.
For this specific issue, a workaround was brought by 09bf116
("BUG/MEDIUM: stats-file: detect and fix inconsistent shared clock when resuming from shm-stats-file")
but the solution implemented there was deemed too fragile, because there
is a 60sec window where the fix would fail to detect inconsistent clock
and would leave haproxy with a broken clock ranging from 0 to 60 seconds,
which can be huge..
By simply recomputing the now_offset each time we learn from another
process (through the shared map by reading global_now_ns), we simply
recompute our local offset (difference between OUR monotonic clock
and the SHARED one). Also, in clock_update_global_date(), we make
sure we always recompute the now_offset as now_ms may have been
updated from shared clock if shared clock was ahead of us.
Thanks to that new logic, interrupted processes, resumed processes,
processed started with shm-stats-file from previous session now
correctly recover from those various situations and multiple
co-processes with diverting clocks on startup end up converging to
the same values.
Since it is no longer relevant to save now_offset in the map, it was
removed but to prevent shm-stats-file incompatibility with previous
versions, 8-byte hole was forced, and we didn't bump the shm-stats-file
version on purpose.
This patch may be backported in 3.3 after a solid period of observation
to ensure we didn't break things.
mystrtod() was not length-aware and relied on null-termination or a
non-numeric character to stop. The fix adds a length parameter as a
strict upper bound for all pointer accesses.
The practical impact in haproxy is essentially null: all callers embed
the JSON payload inside a large haproxy buffer, so the speculative read
past the last digit lands on memory that is still within the same
allocation. ASAN cannot detect it in a normal haproxy run for the same
reason — the overread never escapes the enclosing buffer. Triggering a
detectable fault requires placing the JSON payload at the exact end of
an allocation.
Note: the 'path' buffer was using a null-terminated string so the result
of strlen is passed to it, this part was not at risk.
Thanks to Kamil Frankowicz for the original bug report.
This patch must be backported to all maintained versions.
The commit 9f1e9ee0e ("DEBUG: stream: Display the currently running rule in
stream dump") revealed a bug. When a stream is dumped, if it is blocked on a
rule, we must take care the rule has a keyword to display its name.
Indeed, some action parsings are inlined with the rule parser. In that case,
there is no keyword attached to the rule.
Because of this bug, crashes can be experienced when a stream is
dumped. Now, when there is no keyword, "?" is display instead.
This patch must be backported as far as 2.6.
When a L7 retry is performed, we should not rely on b_xfer() to swap the L7
buffer with the request buffer. When it is performed the request buffer is
not allocated. b_xfer() must not be called with an unallocated destination
buffer. The swap remains an optim. For instance, It is not performed on
buffers of different size. So the caller is responsible to provide an
allocated destination buffer with enough free space to transfer data.
However, when a L7 retry is performed, we cannot allocate a request buffer,
because we cannot yield. An error was reported, if we wait for a buffer, the
error will be handled by process_stream(). But we can swap the buffers by
hand. At this stage, we know there is no request buffer, so we can easily
swap it with the L7 buffer.
Note there is no real bug for now.
This patch could be backported to all stable versions.
In HTX, headers and trailers parts must always be complete. It is unexpected
to found header blocks without the EOH block or trailer blocks without the
EOT block. So, during H2/H3 message parsing, we must take care to remove any
HEADER/TRAILER block inserted when an error is encountered. It is mandatory
to be sure to properly report parsing error to upper layer.x
It is now performed by calling htx_truncat_blk() function on the error
path. The tail block is saved before converting any HEADERS/TRAILERS frame
to HTX. It is used to remove all inserted block on error.
This patch rely on the following one:
"MINOR: htx: Add function to truncate all blocks after a specific block"
It should be backported with the commit above to all stable versions for
the H2 part and as far as 2.8 for h3 one.
When H2 or H3 trailers are inserted in an HTX message, we must take care to
not exceed the maximum number of trailers allowed in a message (same than
the maximum number of headers, i.e tune.http.maxhdr). However, all HTX
blocks in the HTX message were considered. Only TRAILERS HTX blocks must be
considered.
To fix the issue, in h2_make_htx_trailers(), we rely on the "idx" variable
at the end of the for loop. In h3_trailers_to_htx(), we rely on the
"hdr_idx" variable.
This patch must be backported to all stables versions for the H2 part and as
far as 2.8 for the H3 one.
pouet
L7 retries are buggy when a large buffer is used on the request channel. A
memcpy is used to copy data from the request buffer into the L7 buffer. The
L7 buffer is for now always a standard buffer. So if a larger buffer is
used, this leads to a buffer overflow and crash the process.
The Best way to fix the issue is to disable L7 retries when a large buffer
was allocated for the request channel. In that case, we don't want to
allocate an extra large buffer.
No backport needed.
When a large buffer is used on a channel, once we've started to send data to
the opposite side, receives are blocked temporarily to be sure to flush the
large buffer ASAP to be able to fall back on regular buffers. This was
performed by skipping call to the endpoint (connection or applet). Howerver,
doing so, this broken the abortonclose and more generally this masked any
shut or error events reported by the lower layer.
To fix the issue, instead of skipping receives, we now try a receive but
with a requested size set to 0.
No backport needed
Client abort when abortonclose is configured was ignored when messges were
sent on event while it works properly when messages are sent via an
"send-spoe-group" action.
To fix the issue, when the SPOE filter is waiting for the SPOE applet
response, it must check if a client abort was reported and if so, must
interrupt its processing.
This patch should be backported as far as 3.1.
When the SPOE applet is created, the SPOE filter is set in SENDING_MSGS
state. When the applet has transferred data, it should switch the filter to
WAITING_ACK state. Concretly, there is no bug. At best, it could save some
useless applet wakeups.
This patch should be backported as far as 3.1
When SC's shudown callback functions were merged, a regression was
introduced. The applet was no longer woken up. Because of this bug, an
applet could remain blocked, waiting for an I/O event or a timeout.
This patch should fix the issue #3301.
No backport needed.
FDs received through recv_fd_uxst() do not have FD_CLOEXEC set.
The equivalent sock_accept_conn() already handles this correctly:
any FD accepted or received in the master must be marked close-on-exec
to avoid leaking it across the execvp() performed on soft-reload.
This is currently triggering a leak in the master since 3.1: the worker
sends a socketpair fd to the master to issue the _send_status CLI
command, and recv_fd_uxst() receive it without setting FD_CLOEXEC. If a
re-exec is emitted before the master had the chance to close that fd, it
survives execvp() and appears as an untracked unnamed AF_UNIX socket in
the new master generation.
This must be backported to all maintained branches.
Add a NULL guard for the version field. This has no functional impact
since the master process never uses this field for its own mworker_proc
element, and should be the only one impacted. This avoid seeing "(null)"
in the version field when debugging.
Must be backported to 3.1 and later.
During a soft reload, a starting worker sends sock_pair[0] to the master
via send_fd_uxst(), then reads on sock_pair[1] waiting for the master to
acknowledge receipt. Because of a documented macOS sendmsg(2) bug, the
worker must keep sock_pair[0] open until the master confirms the fd was
received by the CLI applet. This means the read() on sock_pair[1] will
never return 0 (EOF), since the worker itself still holds a reference to
sock_pair[0]. The worker can only unblock when the master actively sends
a byte back. If the master crashes before doing so, the worker blocks
indefinitely in read().
Fix this by setting a 2-second SO_RCVTIMEO on sock_pair[1] before the
read(), so the worker can unblock and continue regardless of the master's
state.
This was introduced by d7f6819161 ("BUG/MEDIUM: mworker: fix startup
and reload on macOS").
This should be backported to 3.1 and later.
In mworker_proc_list_to_env(), a typo used '&=' instead of '&' when
checking PROC_O_TYPE_WORKER in child->options. This would corrupt the
options field by clearing all bits except PROC_O_TYPE_WORKER, but since
the function is called right before the master re-execs itself during a
reload, the corruption has no actual effect: the in-memory proc_list is
discarded by the exec, and the options field is not serialized to the
environment anyway.
This should be backported to all maintained versions.
We defer processing of the "-dt" options until after the configuration
file has been read. This will be useful if we ever allow trace sources
to be registered later, for instance with LUA.
No backport needed.
In master-worker mode, when a freshly forked worker looks up its own
entry in proc_list to send its "READY" status to the master, the loop
was breaking on the first process with pid == -1 regardless of its
type. If a non-worker process (e.g. a master or program) also had
pid == -1, the wrong entry could be selected, causing send_fd_uxst()
to use an invalid ipc_fd.
Fix this by adding a PROC_O_TYPE_WORKER check to the loop condition,
and add a BUG_ON() assertion to catch any case where the loop exits
without finding a valid worker entry.
Must be backported to 3.1.
"show profiling" supports "aggr" for tasks but it was ignored for
memory. Now that we're having many more entries, it makes sense to
have it to ignore the call path and merge similar operations.
Keywords registered out of an initcall will have a TH_EX_CTX_CLI_KWL
execution context pointing to the keyword list. The report will indicate
the 5 first words of the first command of the list, e.g.:
exec_ctx: cli kwl starting with 'debug counters '
This should also work for CLI keywords registered in Lua.
Now CLI keywords registered via an initcall will be tracked during
execution, by keeping a link to their initcall location. "show threads"
now shows "exec_ctx: kw registered at @debug.c:3093" which indeed
corresponds to the initcall for the debugging commands.
Till now the CLI didn't know what keyword was being processed after it
was parsed. In order to report the execution context, we'll need to
store it. And this may even help for post-mortem analysis to know the
exact keyword being processed, so let's store the pointer in the cli_ctx
part of the appctx.
It allows to know when a thread is currnetly running inside an applet.
For example now "show threads" will show "applet '<CLI>'" for the thread
issuing this command.
It now appears almost everywhere due to callbacks (e.g. ssl_sock_io_cb).
Muxes also become visible now on memory profiling. A small test on h1+ssl
yields 838 lines of statistics. The number of buckets should definitely
be increased, and more grouping criteria should be added.
A performance test was conducted to observe the possible effect of
setting the execution context on each task switch, and it didn't change
at all, remaining at about 1.01 billion ctxsw/s on a 128-thread EPYC.
Most calls to mux ops were instrumented with a CALL_MUX_WITH_RET() or
CALL_MUX_NO_RET() macro in order to make the current thread's context
point to the called mux and be able to track its allocations. Only
a bunch of harmless mux_ctl() and ->subscribe/unsubscribe calls were
left untouched since useless. But destroy/detach/shut/init/snd_buf
and rcv_buf are now tracked.
It will not show allocations performed in IO callback via tasklet
wakeups however.
In order to ease reading of the output, cmp_memprof_ctx() knows about
muxes and sorts based on the .subscribe function address instead of
the mux_ops address so as to keep various callers grouped.
In order to be able to track memory allocation performed from message
callbacks, let's set the thread execution context to a generic function
pointing to them during their call. This allows for example to observe
the share of SSL allocations caused by ssl_sock_parse_clienthello() when
SSL captures are enabled.
The release calls are automatic from the SSL library for these, and are
registered directly via SSL_get_ex_new_index(). Maybe we should improve
the internal API to wrap that function and systematically track free
calls as well. In this case, maybe even registering the message callback
registration could take both the callback and the release function.
There are few such users however, essentially capture and keylog.
Doing this allows to report the allocations/releases performed by filters
when running with memory profiling enabled. The flt_conf pointer is kept
and the report shows the filter name.
A bit similar to what was done for sample fetch functions and converters,
we now store with each action keyword the location of the initcall when
they're registered this way. Since there are many functions only calling
a LIST_APPEND() (one per ruleset), we now implement a dedicated function
to store the context in all keywords before doing the append.
However that's not sufficient, because keywords are not mandatory for
actions, so we cannot safely rely on rule->kw. Thus we then set the
exec_ctx per rule when they are all scanned in check_action_rules(),
based on the keyword if it exists, otherwise we make a context from
the action_ptr function if it is set (it should).
Finally at all call points we now check rule->exec_ctx.
The purpose here is to be able to spot certain callbacks, such as the
SSL message callbacks, which are difficult to associate to anything.
Thus we introduce a new context type, TH_EX_CTX_FUNC, for which the
context is just the function pointed to by the void *pointer. One
difficulty with callbacks is that the allocation and release contexts
will likely be different, so the code should be properly structured
to allow proper tracking, either by instrumenting all calls, or by
making sure that the free calls are easy to spot in a report.
With the two new context types TH_EX_CTX_SMPF/CONV, we can now also
report contexts corresponding to direct calls to sample_register_fetches()
and sample_register_convs(). In this case, the first word of the keyword
list is reported.
Now keywords are registered with an exec_ctx and this one is passed
when calling ->process. The ctx is of type INITCALL when passed via
an initcall where we know the file name and line number.
This was tested with and extra "malloc(15)" added in smp_fetch_path()
which shows that it works:
$ socat /tmp/sock1 - <<< "show profiling memory"|grep via
Calls | Tot Bytes | Caller and method [via]
1893399 0 60592592 0| 0x78b2ec task_run_applet+0x3339c malloc(32) [via initcall @http_fetch.c:2416]
When the execution context is set to TH_EX_CTX_INITCALL, the pointer
points to a valid initcall, and the decoder will show "kw registered
at %s:%d" with file and line number of the initcall declaration. It's
up to the caller to make the initcall pointer point to the one that was
set during the initcall. The purpose here is to be able to preserve and
pass that knowledge of an initcall down the chain so that future calls
to functions registered via the initcall are still assigned to it.
The INITCALL macros will now store the file and line number where they
are declared into the initcall struct, and RUN_INITCALLS() will assign
them to the global caller_file and caller_line variables, and will even
set caller_initcall to the current initall so that at any instant such
functions know where their caller declared them. This will help with
error messages and traces where a bit of context will be welcome.
Now we have one extra line saying "exec_ctx: something" in thread dumps
when it's known. It may help with warnings and panics to figure what
is ongoing.
The new function chunk_append_thread_ctx() appends to a buffer the given
execution context based on its type and pointer. The goal is to easily
use it in profiling output and thread dumps. For now it only handles
TH_EX_CTX_NONE (which prints nothing) and TH_EX_CTX_OTHER (which indicates
"other ctx" followed by the pointer). It will be extended by new types as
they arrive.
By passing "byctx" to "show profiling memory", it's possible to sort by
the calling context first, which could help group certain calls by
subsystem and ease the interpretation of the output.
This now allows to report the same function in multiple bins based on the
th_ctx's exec_ctx discriminant. It's also worth noting that the context is
not atomically committed, but this shouldn't be a problem since a single
entry can get it. In the worst case, a second thread trying to create the
same context in parallel would create a different bin just for this call,
which is harmless. The same situation already exists with the caller
pointer.
We have the struct made of a type and a pointer in the th_ctx and a
function to switch it for the current thread. Two macros are provided
to enclose a callee within a temporary context. For now only type OTHER
is supported (only a generic pointer).
When two pointer hash to the same memprofile bin, we currently try again
with the same bin until we find a spare one or we reach the limit of 16.
Olivier suggested to try with a different step for different pointers so
as to limit the number of bins to visit in such a case, so let's split
the pointer hash calculation so that we keep the raw hash before reduction
and use its lowest bits as the retry step. We force lowest bit to 1 to
avoid integral multiples that would oscillate between only a few positions.
Quick tests with h1+h2 requests show that for ~744 distinct entries, we
used to have 1.17 retries per lookup before and 0.6 now so we're halving
the cost of hash collisions. A heavier workload that used to produce 920
entries with 2.01 retries per lookup now reaches 966 entries (94.3% usage
vs 89.8% before) with only 1.44 retries per lookup.
This should be safe to backport, but depends on this previous commit:
MINOR: tools: extend the pointer hashing code to ease manipulations
The purpose here is to combine two pointers and a long argument instead
of having the caller perform the mixing. Also it's cleaner and more
efficient this was as the arg is mixed after the multiplications, and
modern processors are efficient at multiplying then adding.