New connections created by tcpcheck are marked as private, making
them ineligible for insertion into the server-side connection pool, even
when check-reuse-pool is activated. Thus, connection reuse for health
checks would only work when the pool had already been populated by
regular (non-check) traffic.
Change this behavior so that a new check connection is not flagged as
private anymore when check-reuse-pool is requested. As a result, on
detach, instead of being freed, the connection will be inserted in the
idle pool and will be eligible for reuse, both for regular traffic and
checks.
This change can be useful to ensure that a server idle pool is never
completely empty when check-reuse-pool is active. Additionally, it is
also necessary to ensure that check reuse is really effective when
connection parameters are different between checks and regular traffic,
resulting in a different reuse hash.
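As an illustration, a minimal configuration sketch where this applies,
assuming check-reuse-pool is the server keyword referenced above
(addresses are placeholders):

    backend app
        server srv1 192.0.2.10:8080 check check-reuse-pool

With this patch, the connection opened by the first check can be
inserted into srv1's idle pool and reused by later checks or by regular
traffic.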
The previous behavior could be considered a bug to a certain extent.
The current patch should be harmless for the default configuration, but
it can be a significant improvement for users who want to perform reuse
for checks. Thus, it should be backported up to 3.2.
QMux implements a record layer which is used to encapsulate QUIC frames.
This patch implements reception of an incomplete record in
qcc_qstrm_recv(). BUG_ON() failures are removed and now reading will
continue until the whole record is received or a fatal error occurs.
Several adjustments were made to the read logic. Previously, the read
syscall was only performed if the data buffer was empty or the current
record was incomplete. An extra condition is added to perform a read
when there is data in the buffer but not enough to decode a record
header. Another change is that buffer realignment is also performed in
this latter case and when the buffer wrapping position has been reached.
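For illustration only, the extended read condition could be sketched as
follows; rec_incomplete, rec_hdr_min and do_read() are hypothetical
names standing in for the actual logic, only b_data() is the real
buffer API:

    /* read when the buffer is empty, when the current record is not
     * fully received yet, or when too few bytes are present to decode
     * a record header
     */
    if (!b_data(buf) || rec_incomplete || b_data(buf) < rec_hdr_min)
            do_read();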
Remove BUG_ON() related to connection errors when invoking XPRT
snd_buf/rcv_buf in QMux operations. Such errors are now converted into
the QC_CF_ERR_CONN flag, which will disable any I/O operations and close the
connection as soon as possible.
Note that this error management is pretty crude. In particular, it could
lead to truncated data when dealing with unidirectional connection
closure from the remote peer. However, it is considered sufficient for
now to continue interop testing without being disturbed by BUG_ON()
assertion crashes.
Support reception via QMux of the flow control MAX_STREAMS frame for
bidirectional streams. This is handled similarly to QUIC, via the shared
qcc_recv_max_streams() function.
When the xprt_qstrm layer has completed, the MUX layer is started. The
Rx buffer from the XPRT layer is transferred to the MUX so that it can
handle any extra data following the first transport parameters frame.
Since the previous commit, the QCC Rx buffer is dynamically allocated
only when needed. However, qmux_init() must still allocate it when there
is data to be transferred from the XPRT layer. As a result, the code had
been overextended to continue supporting this case.
This patch simplifies xprt_qstrm API for the Rx buffer transfer. Buffer
content and remaining record length can now be retrieved via the single
function xprt_qstrm_xfer_rxbuf(). If the buffer is empty, nothing is
done and the XPRT layer will release it. If not empty, the MUX takes
ownership of the buffer from the XPRT layer.
Allocate and release as needed the QCC buffers used for QMux protocol.
This should reduce the memory consumption of QMux. This is performed
both for send and receive buffers. Along with this, always free these
buffers in qcc_release() to prevent a memory leak.
Improve QMux memory usage at the QCS level in accordance with the
haproxy model. The Tx buffer is now allocated only when used and
released as soon as it is empty.
This change requires extending qcc_get_stream_txbuf() for QMux. The
code part related to qc_stream_desc is protected via conn_is_quic(). A
dedicated QMux block is added. Similarly to QUIC, a small buf can be
allocated first.
This also requires adapting qcc_realloc_stream_txbuf() in a similar
fashion.
Clean up qcc_qstrm_send_frames(). The main change is that the return
value is now clearly specified at the end of the function, depending on
whether everything was sent or not.
Implement close callback for xprt_qstrm layer. This is called when a
connection is prematurely closed following a connect failure. Its
purpose is to clean up all xprt resources.
Special care is required on the frontend side. Indeed,
xprt_qstrm_io_cb() can call the close callback via conn_create_mux()
when the latter fails. The tasklet should then be stopped immediately,
as the whole xprt layer has been freed as well.
Function conn_create_mux() has different behavior for frontend and
backend connections. In particular, on FE side, there is a risk that the
connection is freed.
Write a comment to explain these differences clearly.
Recently, an extra check was added so that a dead connection is
immediately released at the end of the qcc_recv() operation. This is
useful when a GOAWAY frame is received from a server, so that the
backend connection is released if idle.
This step is in fact only necessary for QUIC, as qcc_recv() is called
directly from the lower transport layer. It causes issues with QMux, as
in this case qcc_recv() is called via qcc_io_recv(). A crash will occur
in this context because qcc_recv() does not indicate that a release has
been performed.
To fix this, simply disable the extra check at the end of qcc_recv() for
QMux. This is fine because in this case the receive operation is always
followed by qcc_io_process(), which is able to release the connection in
a safe way.
No need to backport.
When the QMux XPRT has successfully processed the transport parameters
exchange, the MUX is initialized and immediately woken up to start
transfers. However, if the connection is in an unusable state, the
latter operation will instead release the connection and all of its
network stack.
A crash would occur in case of release when finalizing the XPRT tasklet
completion. To fix this, every XPRT resource is freed first. The MUX
wake-up is now conducted in a safe way, as the last operation before the
tasklet is completely released.
No need to backport.
Complete the initialization of the xprt_qstrm layer by setting local
parameters to zero. This should prevent emitting random values to the
peer.
No backport needed.
qc_frm_free() is a helper used to clean up a QUIC frame object. It is
used by MUX layer both for QUIC and QMux protocols.
This function takes a pointer to the underlying quic_conn, used only for
tracing purposes. This patch fixes its usage for QMux to ensure that in
this case a NULL value is used.
No need to backport.
When dealing with the EOH block, we must be sure to force the close
mode for messages with no payload but announcing a non-null
content-length.
It is mainly an issue on the server side but it could be encountered on
client side too. Without this fix, a request can be switched to the DONE
state while the server is still expecting the payload. In an ideal world,
this case should not happen. But in conjunction with other bugs, it may lead
to a desynchronization between haproxy and the server.
Now, when a non-null content-length is announced but we know we reached the
end of the message, we force the close mode. The only exception is for
bodyless responses (204, 304 and responses to HEAD requests).
Thanks to Martino Spagnuolo (r3verii) for his detailed report on this issue.
This patch must be backported to all stable versions.
Checks are already made on H2 to detect inconsistencies between
advertised content-length and transferred data (excess of data or
premature END_STREAM flag on DATA frame). However, as found by
Martino Spagnuolo (r3verii), a subtle case remains: if the END_STREAM
appears on the HEADERS frame (i.e. a regular request for example),
then the check is not made. In this case it is possible to advertise
more contents than will really be transferred. If the other side uses
HTTP/1.1, and the server responds before the end of the transfer,
this means that the number of advertised bytes that will never be
transferred and that the server will drain will be taken from the
next request, effectively hiding a part of the header.
In practice this can be used to force subsequent requests to fail, or
when running with "http-reuse never" or when running with a totally
idle server, to perform a request smuggling by constructing specially
crafted request pairs where the first one is used to trigger an early
response and hide parts of or all headers of the second one, to
instead use a second embedded one that was not subject to analysis.
The risk remains moderate given the low prevalence of "http-reuse never"
in production environments, and of idle servers.
The fix consists in detecting if advertised content-length remains when
processing an END_STREAM flag on a HEADERS frame. It also does it for
trailers, which turn out to be another way to abuse the bug. However it
takes great care not to break bodyless responses (204, 304 and responses
to HEAD requests) that may present a content-length that doesn't reflect
the presence of a body in the response.
A temporary alternative to the fix is to disable HTTP/2 by specifying
"alpn http/1.1" on "bind" lines, and adding "option disable-h2-upgrade"
in HTTP frontends.
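For example, such a temporary mitigation could look like this sketch
(frontend name and certificate path are placeholders):

    frontend fe
        mode http
        bind :443 ssl crt /etc/haproxy/site.pem alpn http/1.1
        option disable-h2-upgrade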
This must be backported to all stable versions.
In 3.4-dev6, commit de5fc2f515 ("BUG/MINOR: server: set auto SNI for
dynamic servers") allowed to properly set the SNI, and return an error
message. However the error message is leaked after being printed on the
CLI.
This should be backported to 3.3.
In 3.4-dev7, commit e1738b665d ("MINOR: debug: read all libs in memory
when set-dumpable=libs") reads dependencies into memory to store them as
a tar archive for later debugging. There was an attempt to mark the whole
archive read-only, except that the size passed as argument to mprotect()
is wrong: lib_size is only assigned after the operation and is still zero
at the moment this is done. new_size ought to be used instead.
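For reference, a minimal sketch of the corrected call, assuming <area>
points to the in-memory archive (names are illustrative):

    #include <sys/mman.h>

    /* mark the archive read-only; new_size is already known at this
     * point, while lib_size is only computed later
     */
    if (mprotect(area, new_size, PROT_READ) != 0)
            ; /* non-fatal: the archive simply stays writable */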
This needs to be backported wherever the commit above is backported, at
least 3.2.
3 new enum values and a mask were added in latest -dev with commit
24e05fe33a ("MINOR: stream: Use a pcli transaction to replace pcli_*
members"), unfortunately the entries needed by the "flags" command were
forgotten.
No backport is needed.
Since commit 0af603f46f ("MEDIUM: threads: change the default
max-threads-per-group value to 16"), it was written "Tha minimum" instead
of "The minimum". No backport needed, this is only in latest -dev.
In 3.4-dev8, commit e264523112 ("MINOR: servers: Don't update last_sess
if it did not change") adjusted the last_sess date to avoid writing to
the same cache line all the time, however a typo makes it pick the wrong
second because it uses now_ms instead of now_ns (so the date would roughly
change every 12 days).
No backport needed.
In 2.8, commit ead43fe4f2 ("MEDIUM: compression: Make it so we can
compress requests as well.") added the ability to independently enable
compression on request and/or response. However there's a bug in the
"compression direction response" case, which preserves only the request
flag and adds the response direction instead of clearing the request
flag, so this directive would clear offload and make it impossible to
disable request compression if it was already enabled.
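As an illustrative example of the affected directive, assuming a typical
response-only compression setup:

    backend be
        compression direction response
        compression algo gzip
        compression type text/html text/plain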
This can be backported to stable releases as far as 2.8.
When an HTTP request is sent during an HTTP healthcheck, if an error is
triggered while the output buffer is a small buffer, another attempt is made
with a larger one. When this happens, the temporary chunk used to format
headers must be released.
No backport needed.
Rulesets are stored in a global tree, released at the deinit stage. All
errors are fatal and abort the configuration parsing. So the current
ruleset must not be released here.
The test to remove trailers from chunked messages was inverted and is
thus ineffective. The flag for requests was tested on the client side
and the flag for responses was tested on the server side. It should be
the opposite.
This patch must be backported as far as 3.2.
When the EOH block is processed, before sending message headers, there is a
test to know if there is no payload. In case of a chunked message, a
null-chunk is emitted, except for bodyless responses. For instance, a
response to a HEAD request has no payload at all and no null-chunk.
However, the test for bodyless responses is not correct. Only the
H1S_F_BODYLESS_RESP flag is tested. But this flag can be set on the
server side when we are processing the request. To fix the issue, the
test was adapted. The null-chunk is added if a message with no payload
is chunked and it is a request or a non-bodyless response.
This patch must be backported to all stable versions.
A typo in commit e51be30f78 ("BUG/MINOR: log: consider format expression
dependencies to decide when to log") made HRSHP appear twice (persistent
response) while the second one ought to be HRSHV (volatile response, e.g.
header values). This is harmless in practice since logs always wait for
at least headers.
This should be backported wherever the patch above was backported.
In h2_init(), if we have a failure while creating the h2c, and we
allocated shared_tx_bufs, don't forget to free it, otherwise we'll have
a memory leak.
This was introduced in 3.1 by commit a891534bfd ("MINOR: mux-h2: allocate
the array of shared rx bufs in the h2c"), so the fix should be backported
as far as 3.2.
When receiving a PRIORITY frame, when checking if the provided stream
ID is ours, ignore bit 31, as it is the exclusive bit and not part of
the stream ID. A peer sending a PRIORITY frame with its own ID and the
exclusive bit set will thus no longer be considered an error, as it
should per the RFC.
The impact is basically non-existent since we don't use PRIORITY frames,
it's only that we would ignore such an invalid frame instead of breaking
the connection.
The bug was introduced in 1.9 with commit 92153fccd3 ("BUG/MINOR: h2:
properly check PRIORITY frames") so the fix must be backported to all
versions.
Commit e67e36c9eb introduced tune.h2.log-errors, which lets you pick
whether you want to know about stream errors, connection errors, or no
errors.
However, a logic error made it so that no error would be picked for any
value except "none", in which case connection errors would be picked.
Fix that by just checking the strcmp() return value correctly.
This should be backported wherever e67e36c9eb
has been backported.
A malformed TCP option with an option length set to 0 can cause an
infinite loop in the ip.fp converter.
The patch also forces the computation to use an unsigned char to avoid
a shift back during the parsing.
This fix should be backported to all versions that include the ip.fp
converter.
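For illustration, a defensive TCP option walk avoiding this pitfall
could look like the sketch below; this is the generic pattern, not the
actual converter code:

    /* walk TCP options: a zero or too-small length must abort the
     * loop, otherwise the offset never advances
     */
    size_t i = 0;
    while (i + 1 < len) {
            unsigned char kind = opts[i];

            if (kind == 0)          /* EOL: end of option list */
                    break;
            if (kind == 1) {        /* NOP: single-byte option */
                    i++;
                    continue;
            }
            unsigned char optlen = opts[i + 1]; /* unsigned char avoids sign issues */
            if (optlen < 2)         /* malformed, e.g. length 0 */
                    break;
            i += optlen;
    }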
In task_schedule(), before attempting to set the new task expiration
date, make sure it is not running by trying to set the TASK_RUNNING
flag, and waiting if it is already there. Having the flag set will
ensure that the task won't be running while we're modifying it.
There is a very rare race condition, where the expire would be set by
task_schedule(), then the running task might set it to something else,
and if it sets it to TICK_ETERNITY before task_schedule() calls
__task_queue(), then we will hit a BUG_ON() there.
This is very hard to reproduce, but has been reported a few times,
including in GitHub issue #3327, which should now be fixed.
This should be backported as far back as 2.8.
The proxy error counter was not updated in h2c_frt_handle_headers() in
case of failure to decode a HEADERS frame. Make sure to keep it updated.
This can be backported to all stable versions.
Commit aab1a60977 ("BUG/MEDIUM: h2/htx: always fail on too large trailers")
explicitly returned an RST_STREAM on failure to decode some trailers, and
used the code H2_ERR_INTERNAL_ERROR. However there are multiple possible
causes for this failure to happen, and it turns out that it's much more
likely to be related to a protocol error than a decompression error. So
let's change this to PROTOCOL_ERROR, and count a protocol error on the
proxy and in the session.
This can be backported to all stable versions (with adjustments related
to these versions, maybe focusing on 3.2 max is reasonable).
This pointer was used during the appctx refactoring performed in 2.6. The
ctx union was still there and this pointer was used as the "shadow" of the
svcctx pointer used by most commands. In 2.7, the union was removed, making
the shadow pointer useless. Let's remove it now.
A new type of transaction was introduced for master-cli streams. So
SF_TXN_PCLI flag and functions to allocate and destroy PCLI transactions
were added.
In the stream structure, all pcli_* members were moved into the pcli
transaction and the txn union was updated accordingly.
When it was ambiguous, a test on the transaction type was performed,
for instance to destroy the transaction.
To be able to deal with different types of transactions for a stream,
new stream flags were added to know the transaction type when allocated.
For now only HTTP transactions can be allocated, so only SF_TXN_HTTP was
introduced. The mask SF_TXN_MASK must be used to get the transaction type.
The transaction type is set when it is allocated and removed when it is
destroyed.
The HTTP transaction is moved into a union. For now, it is the only
possible transaction that can be allocated. But that will change. Thanks
to this commit and the next one, it will be possible to deal with
different kinds of transactions for a stream.
This patch looks quite huge, but it is more or less a renaming of all
accesses to "txn" field by "txn.http".
The maximum size allowed for the payload pattern was increased to 64
bytes (65 bytes because of the trailing \0), to be able to use a sha256
of random data for instance. It could be useful to prevent any data
smuggling in the payload.
Note that on the CLI, it could be possible to have only the buffer size as a
limit, because the command line is only consumed once all commands are
executed. The payload pattern is only a pointer in the buffer where the
command line was copied. However, for the master CLI, the data are streamed
to the worker, so we must keep a copy of the payload pattern. This is why we
must limit its size.
It is now possible to deal with a payload too big to fit in a buffer,
without changing the buffer size. By default, a payload up to 128 KB can
be dynamically allocated. The "tune.cli.max-payload-size" global
parameter can be used to change this value, with some caution for huge
values.
For CLI command handler functions, there is no change at all. A pointer to
the payload is still passed as a parameter. Internally, an area is allocated
for the payload only if it is too big.
The payload pattern used to detect the end of the payload is part of
the allocated area.
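For example, to raise the limit, assuming the parameter accepts a size
in bytes (the value is purely illustrative):

    global
        tune.cli.max-payload-size 1048576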
The payload is now saved as a buffer in the CLI context instead of a simple
pointer. It is mandatory to be able to reallocate the payload if it is too
big.
Instead of copying the payload pattern in the CLI context, we now only save
a pointer to this pattern. It is possible because the command line is copied
in the CLI context. Arguments are already handled this way when the command
is processed.
Detect shell parser errors in test LOG files right after vtest execution
and mark the run as failed when such errors are found.
This turns malformed feature cmd expressions from warning-like diagnostics
into hard failures, so broken test conditions are caught reliably.
The single-threaded build is currently broken in development since commit
0af603f46f ("MEDIUM: threads: change the default max-threads-per-group
value to 16") because it doesn't set the default for the non-threaded
build. Let's set it to 1.
No backport is needed.
Add an i686 job in order to run reg-tests on 32-bit architecture.
Use the i386 SSL and PCRE2 libraries provided by Ubuntu.
VTest is still compiled in x86_64.
Commit 7d40b3134 ("MEDIUM: sched: do not run a same task multiple times
in series") required to slightly reorder a few fields in struct tasklet
and task in order to reuse an existing hole and keep tree nodes aligned.
The problem is that nice+expire were placed in struct task just before
rq, and that a 48-bit hole replaces them in struct tasklet on 64-bit
platforms, just before the struct list. However, on 32-bit platforms,
the hole is only 16-bit and preserves nice, but expire is overwritten
by the first pointer of the list element. This is not a problem for
real tasklets which do not use these fields, but it definitely is a
problem for tasks that are cast to tasklets in the run queues, because
the expire field can be overwritten when the task is woken up, and if
requeued as-is, it will expire at a completely random date.
This is what caused certain regtests to fail on i386 and 32-bit arm
machines.
This fix needs to be backported wherever the patch above was backported.
The bug has no effect on 64-bit platforms. The fix doesn't inflate
structs on 64-bit, but will raise struct tasklet from 40 to 44 bytes on
32-bit platforms.
Thanks to William for spotting the problem, bisecting it and providing
a working reproducer.
There was a test below the "release" label on conn->owner to decide
whether to kill the connection or not. But this test is not needed,
because:
- for frontends, it's always set so the test never matches
- for backends, it was NULL on the second stream once a connection
was being reused from an idle pool, so it couldn't be used to
discriminate between connections. In practice, the goal was to
try to detect certain dead connections but all cases leading to
such connections are either already handled in the tests before
(which don't reach this label), or are handled by the other
conditions.
Thus, let's remove this confusing test.
Some places use conn->owner to retrieve the session. It's valid because
each time it is done, it's on the frontend, though it's not always 100%
obvious and sometimes requires deep code analysis. Let's clarify these
points and even rely on an intermediary variable to make it clearer. One
case where the owner couldn't differ from the session without being NULL
was also eliminated.
When installing a mux on the backend, unless we have a good reason for
keeping the session set in conn->owner, we must reset it. Having the
session there just hides potential bugs and prevents certain tests from
being properly done.
Now it is much clearer: conn->owner remains set to the session on
frontend connections, is set to the session when the connection is
private or assimilated private and belongs to the session list, or
is NULL.
When an idle connection is private or considered private, session_add_conn()
is called to add it to the list of connections owned by the session. But
in case of allocation failure, the session is not set, which results in
a long list of possible situations that are all corner cases which are
difficult to test (and debug).
This commit relies on the fact that it is already permitted to have
conn->owner pointing to a session even if the connection couldn't be
added to the session's list, as this was already the case in
conn_backend_get() when dealing with HOL_RISK. Also as seen in commit
3aab17bd56 added in 2.4, it is already possible to have conn->owner
set with the connection not being in a list, and only the list element
is checked for this.
This commit modifies session_add_conn() to always set conn->owner, even
if the list element couldn't be allocated. This way it's possible to
always refer to conn->owner to find the session owning a private conn
even in case of failure to allocate an entry. This requires changing
the checks on conn->owner to a check of the list element to see if the
connection belongs to a session, the pre-assignment of sess to
conn->owner in conn_backend_get() is no longer needed, same for the
pre-assignment in http_wait_for_response(), and that's all.
The H1 mux remained unchanged because since it cannot multiplex, in
case it fails to allocate a pconn, it instantly kills the connection.
The bytes_in, bytes_out, {req,res}.bytes_{in,out} sample fetch functions
are marked as internal dependencies only. But that's not exact, they are
statistics. Request traffic (bytes_in, req.bytes*) is usable starting
from the request, while response traffic (bytes_out, res.bytes*) is usable
as soon as a response begins to be received, and all are valid till the
end of the transaction.
The impact is that the log-format below:
log-format "req.bytes_in=%[req.bytes_in] req.bytes_out=%[req.bytes_out] res.bytes_in=%[res.bytes_in] res.bytes_out=%[res.bytes_out]"
is emitted too early and only logs zeroes when uploading 1MB and
downloading 1MB:
req.bytes_in=0 req.bytes_out=0 res.bytes_in=15288 res.bytes_out=0
This patch marks the request stats RQFIN and the response stats RSFIN,
so that they're valid at any moment and the logs backend knows it must
wait for the latest moment to emit such a line. With this change, the
line above now correctly produces:
req.bytes_in=1000157 req.bytes_out=1000157 res.bytes_in=1048629 res.bytes_out=1048629
This should be backported as far as the latest LTS probably, along with
these 2 previous patches:
BUG/MINOR: log: consider format expression dependencies to decide when to log
MINOR: sample: make RQ/RS stats available everywhere
Sample fetch functions working on the request/response stats were marked
as being only compatible with the log phase. This is a mistake because
by definition, stats can be consulted anywhere from the moment they
start to appear. It's only that they are valid as far as the logs. At the
moment, no sample fetch function depends on RQFIN, and only res.timer.data
depends on RSFIN. But this will be needed to relax certain sample fetch
functions (and will need to be backported along with a few other patches).
Log-format properly takes into account the LW_* flags set by the log
aliases, however its consideration for the sample fetch expressions is
very minimalistic (HTTP y/n). It poses a problem because logging some
statistics doesn't work unless some log aliases are involved to force
the log to wait till the end.
Before this change, the following log-format:
log-format "res.timer.data=%[res.timer.data]"
would log "res.timer.data=0" regardless of the time taken to transfer
data, and the log would be emitted instantly. However, this line:
log-format "res.timer.data=%[res.timer.data] %B"
would properly log the time taken to transfer the data because %B which
carries the log flag LW_BYTES forces the log to wait till the end.
This patch makes sure that anything requiring response (headers or body)
waits for at least the response, and that anything requiring response body
or end of transfer (req/res) waits till the end (LW_BYTES). Thanks to
this, the log above is now correct even without the "%B" hack.
This should be backported at least till the latest LTS.
The ACME Profiles extension (draft-ietf-acme-profiles) allows a client
to request a specific certificate profile by including a "profile" field
in the newOrder request. This lets the CA select the appropriate
certificate issuance policy (e.g. "classic", "shortlived") for a given
order.
A new "profile" keyword is added to the acme section. When set, its
value is included in the newOrder JSON payload sent to the CA.
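A hedged configuration sketch; the section name, directory URL and
profile name are placeholders, only the "profile" keyword itself is
introduced by this patch:

    acme LE
        directory https://acme.example.org/directory
        profile shortlived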
pcre-devel (PCRE1) was removed in bded31dd3b but the make flags were
not updated to match; switch to USE_PCRE2/USE_PCRE2_JIT.
The 32-bit job was also lacking pcre2-devel.
The target address type has been added to checks in commit
d759e60a32, but part of that address type is the "alt_proto" field,
which was not properly set for dynamic servers. That could lead to
checks not working for any protocol that uses a non-zero alt_proto,
such as QUIC. So set it properly.
gcc flags aead_tag_trash as potentially NULL at the chunk_memcpy call
inside the (!dec && gcm) block, because it cannot correlate the
condition with the allocation that only happens in that same branch. Add
an explicit NULL check to silence the warning.
This was caught by cross-zoo.yml:
In file included from include/haproxy/connection.h:28,
from src/ssl_sample.c:27:
In function ‘b_orig’,
inlined from ‘sample_conv_aes’ at src/ssl_sample.c:540:23:
include/haproxy/buf.h:80:17: error: potential null pointer dereference [-Werror=null-dereference]
80 | return b->area;
| ~^~~~~~
In function ‘b_data’,
inlined from ‘sample_conv_aes’ at src/ssl_sample.c:540:3:
include/haproxy/buf.h:100:17: error: potential null pointer dereference [-Werror=null-dereference]
100 | return b->data;
| ~^~~~~~
When trying to read the QMux transport parameters frame, the record length
is checked to ensure it is not bigger than the buffer size. The
objective is to detect as soon as possible when receiving data that
cannot be handled and to close the connection.
In fact, this check was not accurate, as it did not take into account
the size of the Record length field itself. This patch fixes the
comparison by subtracting the size of the decoded varint.
No need to backport.
This patch is related to the issue reported in the previous patch about
QMux record length parsing.
QCC rx.rlen is used to store the decoded record length. Convert it into
a plain 64-bit integer instead of a size_t. This ensures it is
sufficient to decode the record length, even with an increase of the
max_record_length value (not currently implemented).
This should fix GitHub build issue #3334 for 32-bit architectures.
No need to backport.
QMux record lengths are encoded as a QUIC varint. Thus, in theory, a
64-bit integer is required to be able to read the whole value. In
practice, if the record is bigger than bufsize, the read operation
cannot be completed and an error must be reported.
This patch fixes record length decoding in the xprt_qstrm layer; it is
now performed in two steps. The value is first read into a 64-bit
integer instead of a size_t, whose size depends on the architecture. The
result is then checked against bufsize and, if smaller, stored in the
previously used variable (the xprt ctx rxrlen member).
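The two-step logic boils down to the following sketch, where
decode_varint() is a hypothetical stand-in for the actual QUIC varint
decoder:

    uint64_t rlen;

    /* step 1: decode into a full 64-bit integer, whatever the arch */
    if (!decode_varint(&rlen, &pos, end))
            return 0;  /* incomplete: wait for more data */

    /* step 2: bound-check before storing into the narrower field */
    if (rlen > bufsize)
            return -1; /* fatal: the record can never fit */
    ctx->rxrlen = rlen;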
This should partially fix the build issue reported on GitHub (#3334).
No need to backport.
halog does not make use of any of the "fancy" build flags that HAProxy
does and, apart from its own sources, only includes ebtree. There is no
need to build it as part of
the VTest workflows.
In commit 7640d794 ("CI: Integrate Musl build into vtest.yml"), the
alpine job was integrated into vtest.yml. However, most of the tasks
are still duplicated, and changes in the workflow require editing
copy/pasted code in two places because of that.
This commit deduplicates the code by making the alpine job part of the
matrix, like it was done for macOS.
Replaced SH_ARGS variables with 'set --' and "${@}" to ensure proper
quoting of haproxy command-line arguments. Then replaced individual
per-config run scripts with a single generic run-test-config.sh that
derives the configuration directory from its own filename. The former
scripts became symlinks, and a new run-empty.sh symlink was added.
Removed the insecure-fork-wanted runtime check from the OTel filter parser
and all related mentions from documentation and test configuration.
The OpenTelemetry C wrapper library can now explicitly start all necessary
OTel threads immediately after configuration parsing, so it is no longer
affected by the HAProxy thread/process creation restriction and the
insecure-fork-wanted option is no longer needed.
Added getopts argument parsing with -h, -r and -d options, making sample
rate limits and wrk runtime configurable. Introduced a dry-run variable
for debugging, httpd cleanup in sh_exit, and removal of the log directory
on exit if empty.
Added per-thread ID tracking for the OpenTelemetry C wrapper debug system.
Registered HAProxy worker threads are identified by their tid;
unregistered threads (such as those created internally by the OTel SDK)
receive unique IDs from an atomic offset counter.
An idle backend connection is useless if an HTTP/3 GOAWAY frame has
been received. Indeed, it is forbidden to open new streams on such a
connection.
Thus, this patch ensures such connections are removed as soon as
possible. This is performed via a new check in qcc_is_dead() on
QC_CF_CONN_SHUT flag for backend connections. This ensures that a shut
connection is released instead of being inserted in idle list on detach
operation.
This commit also completes qcc_recv() with a new call to qcc_is_dead()
at its end. This is necessary if a GOAWAY is received on an idle
connection. For now, this is only checked for backend connections, as a
GOAWAY is without any real effect for frontend connections. Thus, this
extra protection ensures that we do not break QUIC frontend support by
accident.
qcc_io_recv() also performs qcc_decode_qcs(). However, an extra
qcc_is_dead() is not necessary in this case as the following
qcc_io_process() already performs it.
Implement the reception of an HTTP/3 GOAWAY frame. This is performed via
the new function h3_parse_goaway_frm(). The advertised ID is stored in
the new <id_shut_r> h3c member. It serves to ensure that a bigger ID is
not advertised when receiving multiple GOAWAY frames.
GOAWAY frame reception is only really useful on the backend side for
haproxy. When this occurs, h3c is now flagged with H3_CF_GOAWAY_RECV.
QCC is also updated with the new flag QC_CF_CONN_SHUT. This flag
indicates that no new stream may be opened on the connection. Callback
avail_streams() is thus edited to report 0 in this case.
Rework GOAWAY emission handling at the HTTP/3 layer. Previously, the
h3c member <id_goaway> was updated during the connection on each new
stream attach. This ID was finally reused when a GOAWAY was emitted.
However, it is unnecessary to keep an updated ID during the connection
lifetime. Indeed, the <largest_bidi_r> QCC member can be used for the
same purpose. Note that this is only useful for the frontend side. For a
client connection, GOAWAY contains a PUSH ID, thus 0 can be used for
now.
Thus, <id_goaway> in h3c is renamed to <id_shut_l>. Now it is only sent
when the GOAWAY is emitted. This allows rejecting any stream with a
greater ID. This approach is considered simpler.
Note that <largest_bidi_r> is not strictly similar to the obsolete
<id_goaway>. Indeed, if an error occurs before the corresponding stream
layer allocation, the former would still be incremented. However,
this is not a real issue as GOAWAY specification is clear that lower IDs
are not guaranteed to being handled well, until either the stream is
closed or resetted, or the whole connection is teared down.
QUIC stream IDs are encoded as 62-bit integers and an ID cannot be
reused within a connection. It is necessary to take this limitation into
account for backend connections.
This patch implements this via the qmux_avail_streams() callback. In the
case where the connection is approaching the encoding limit, reduce the
advertised value until the limit is reached. Note that this is very
unlikely to happen as the value is pretty high.
This should be backported up to 3.3.
Add a comp_ prefix to all compression-related functions, in
anticipation of decompression functions that will be integrated in the
same file, so we don't get mixed up between the two.
No change of behavior expected.
Since commit 7d40b31 ("MEDIUM: sched: do not run a same task multiple
times in series") I noticed that any valid config, once haproxy was
started, would produce uninitialised reads on valgrind:
[NOTICE] (243490) : haproxy version is 3.4-dev9-0af603-2
[NOTICE] (243490) : path to executable is /tmp/haproxy/haproxy
[WARNING] (243490) : missing timeouts for proxy 'test'.
| While not properly invalid, you will certainly encounter various problems
| with such a configuration. To fix this, please ensure that all following
| timeouts are set to a non-zero value: 'client', 'connect', 'server'.
[NOTICE] (243490) : Automatically setting global.maxconn to 491.
==243490== Thread 4:
==243490== Conditional jump or move depends on uninitialised value(s)
==243490== at 0x44DBD7: run_tasks_from_lists (task.c:567)
==243490== by 0x44E99E: process_runnable_tasks (task.c:913)
==243490== by 0x395A41: run_poll_loop (haproxy.c:2981)
==243490== by 0x396178: run_thread_poll_loop (haproxy.c:3211)
==243490== by 0x4E2DAA3: start_thread (pthread_create.c:447)
==243490== by 0x4EBAA63: clone (clone.S:100)
==243490==
Looking at it, it is caused by the fact that the task->last_run member,
which was introduced and used by the commit above, is never assigned a
default value, so the first time it is used, reading from it causes an
uninitialised read.
To fix the issue, we simply ensure last_run task member gets a default
value when the task or tasklet is created. We use '0' as default value,
as the value itself is of minor importance because the member is used
to detect if the task has already been executed for the current loop cycle
so it will self-correct in any case.
No backport needed, unless 7d40b31 is backported.
Originally, valid backend connections always used to have conn->owner
pointing to the owner session. In 1.9, commit 93c885 enforced this when
implementing backend H2 support by making sure that no orphaned connection
was left on its own with no remaining stream able to handle it.
Later, idle connections were reworked so that they were no longer
necessarily attached to a stream, but could be directly in the server,
accessed via a hash, so it started to become possible to have conn->owner
left to NULL when picking such a connection. It in fact happens for
http-reuse always, when the second stream picks the connection because
its owner is NULL and it's not changed.
More recently, a case was identified where it could be theoretically
possible to reinsert a dead connection into an idle list, and commit
59c599f3f0 ("BUG/MEDIUM: mux-h2: make sure not to move a dead
connection to idle") addressed that possibility in 3.3 by adding the
h2c_is_dead() test in h2_detach() before deciding to reinsert a
connection into the idle list.
Unfortunately, the combination of changes above results in the following
sequence being possible:
- a stream requires a connection, connect_server() creates one, sets
conn->owner to the session, then when the session is being set up,
the SSL stack calls conn_create_mux() which gets the session from
conn->owner, passes it to mux->init() (h2_init), which in turn
creates the backend stream and assigns it this session.
- when the stream ends, it detaches (h2_detach), and the call to
h2c_is_dead() returns false because h2c->conn->owner is set. The
connection is thus added into the server's idle list.
- a new stream comes, it finds the connection in the server's list,
which doesn't require to set conn->owner, the stream is added via
h2_attach() which passes the stream's session, and that one is
properly set on h2s again, but never on conn->owner.
- the stream finishes, detaches, and this time the call to h2c_is_dead()
sees the owner is NULL, thus indicates that the connection seems dead
so it's not added again to the idle list, and it's destroyed.
Note that this mostly happens at low loads (at most one active stream
per connection, so typically no more than one active stream per thread),
where the H2 reuse ratio on a server configured with http-reuse always
or http-reuse aggressive is close to 50%. At high loads, this is much more
rare, though looking at the reuse stats for a server, it's visible that a
sustained load still shows around 1% of the connections being periodically
renewed.
Interestingly, for RHTTP the impact is more important because there
was already a workaround for this test in h2c_is_dead(), but it uses
conn_is_reverse(), which is never correct in this case (it should be
called conn_to_reverse() because it says the conn must be reversed
and has not yet been), so this extra test doesn't protect against the
NULL check, and connections are closed after each stream is terminated
(if there is no other stream left).
After a long analysis with Amaury and Olivier, it was concluded that:
- the h2c_is_dead() addition is finally not the best solution and
could be refined, however in the current state it's a bit tricky.
- the conn->owner test in h2c_is_dead() is no longer relevant,
probably since 2.4 when connections were stored using hash_nodes
in the servers and would no longer depend on a session, so that
test should be removed.
- the test conn_is_reverse() on the same line, that was added to
ignore the former for RHTTP, and which doesn't properly work either
should be removed as well.
Some further cleanups should be performed to clarify this situation.
This patch implements the points above, and it should be backported
wherever commit 59c599f3f0 was backported.
A lot of our subsystems start to be shared by thread groups now
(listeners, queues, stick-tables, stats, idle connections, LB algos).
This has allowed to recover the performance that used to be out of
reach on loosely shared platforms (typically AMD EPYC systems), but in
parallel other large unified systems (Xeon and large Arm in general)
still suffer from the remaining contention when placing too many
threads in a group.
A first test running on a 64-core Neoverse-N1 processor with a single
backend with one server and no LB algo specified shows 1.58 Mrps with
64 threads per group, and 1.71 Mrps with 16 threads per group. The
difference is essentially spent updating stats counters everywhere.
Another test is the connection:close mode, delivering 85 kcps with
64 threads per group, and 172 kcps (202%) with 16 threads per group.
In this case it's mostly the more numerous listeners which improve
the situation as the change is mostly in the kernel:
max-threads-per-group 64:
# perf top
Samples: 244K of event 'cycles', 4000 Hz, Event count (approx.): 61065854708 los
Overhead Shared Object Symbol
10.41% [kernel] [k] queued_spin_lock_slowpath
10.36% [kernel] [k] _raw_spin_unlock_irqrestore
2.54% [kernel] [k] _raw_spin_lock
2.24% [kernel] [k] handle_softirqs
1.49% haproxy [.] process_stream
1.22% [kernel] [k] _raw_spin_lock_bh
# h1load
time conns tot_conn tot_req tot_bytes err cps rps bps ttfb
1 1024 84560 83536 4761666 0 84k5 83k5 38M0 11.91m
2 1024 168736 167713 9559698 0 84k0 84k0 38M3 11.98m
3 1024 253865 252841 14412165 0 85k0 85k0 38M7 11.84m
4 1024 339143 338119 19272783 0 85k1 85k1 38M8 11.80m
5 1024 424204 423180 24121374 0 84k9 84k9 38M7 11.86m
max-threads-per-group 16:
# perf top
Samples: 1M of event 'cycles', 4000 Hz, Event count (approx.): 375998622679 lost
Overhead Shared Object Symbol
15.20% [kernel] [k] queued_spin_lock_slowpath
4.31% [kernel] [k] _raw_spin_unlock_irqrestore
3.33% [kernel] [k] handle_softirqs
2.54% [kernel] [k] _raw_spin_lock
1.46% haproxy [.] process_stream
1.12% [kernel] [k] _raw_spin_lock_bh
# h1load
time conns tot_conn tot_req tot_bytes err cps rps bps ttfb
1 1020 172230 171211 9759255 0 172k 171k 78M0 5.817m
2 1024 343482 342460 19520277 0 171k 171k 78M0 5.875m
3 1021 515947 514926 29350953 0 172k 172k 78M5 5.841m
4 1024 689972 688949 39270207 0 173k 173k 79M2 5.783m
5 1024 863904 862881 49184274 0 173k 173k 79M2 5.795m
So let's change the default value to 16. It also happens to match what's
used by default on EPYC systems these days.
This change was marked MEDIUM as it will increase the number of listening
sockets on some systems, to match their counterparts from other vendors,
which is easier for capacity planning.
It was spelled "max-thread-per-group" (without 's'). No backport is
needed unless commit 7e22d9c484 ("MEDIUM: cpu-topo: Add a new
"max-threads-per-group" global keyword") and its possible successors
are backported.
Released version 3.4-dev9 with the following main changes :
- DOC: config: fix ambiguous info in log-steps directive description
- MINOR: filters: add filter name to flt_conf struct
- MEDIUM: filters: add "filter-sequence" directive
- REGTESTS: add a test for "filter-sequence" directive
- Revert "CLEANUP: tcpcheck: Don't needlessly expose proxy_parse_tcpcheck()"
- MINOR: tcpcheck: reintroduce proxy_parse_tcpcheck() symbol
- BUG/MEDIUM: haterm: Move all init functions of haterm in haterm_init.c
- BUG/MEDIUM: mux-h1: Disable 0-copy forwarding when draining the request
- MINOR: servers: The right parameter for idle-pool.shared is "full"
- DOC: config: Fix two typos in the server param "healthcheck" description
- BUG/MINOR: http-act: fix a typo in the "pause" action error message
- MINOR: tcpcheck: Reject unknown keyword during parsing of healthcheck section
- BUG/MEDIUM: tcpcheck/server: Fix parsing of healthcheck param for dynamic servers
- BUG/MINOR: counters: fix unexpected 127 char GUID truncation for shm-stats-file objects
- BUG/MEDIUM: tcpcheck: Properly retrieve tcpcheck type to install the best mux
- BUG/MEDIUM: payload: validate SNI name_len in req.ssl_sni
- BUG/MEDIUM: jwe: fix NULL deref crash with empty CEK and non-dir alg
- BUG/MEDIUM: jwt: fix heap overflow in ECDSA signature DER conversion
- BUG/MEDIUM: jwe: fix memory leak in jwt_decrypt_secret with var argument
- BUG: hlua: fix stack overflow in httpclient headers conversion
- BUG/MINOR: hlua: fix stack overflow in httpclient headers conversion
- BUG/MINOR: hlua: fix format-string vulnerability in Patref error path
- BUG/MEDIUM: chunk: fix typo allocating small trash with bufsize_large
- BUG/MEDIUM: chunk: fix infinite loop in get_larger_trash_chunk()
- BUG/MINOR: peers: fix OOB heap write in dictionary cache update
- CI: VTest build with git clone + cache
- BUG/MEDIUM: connection: Wake the stconn on error when failing to create mux
- CI: github: update to cache@v5
- Revert "BUG: hlua: fix stack overflow in httpclient headers conversion"
- CI: github: fix vtest path to allow correct caching
- CI: github: add the architecture to the cache key for vtest2
- MEDIUM: connections: Really enforce mux protocol requirements
- MINOR: tools: Implement net_addr_type_is_quic()
- MEDIUM: check: Revamp the way the protocol and xprt are determined
- BUG/MAJOR: slz: always make sure to limit fixed output to less than worst case literals
- MINOR: lua: add tune.lua.openlibs to restrict loaded Lua standard libraries
- REGTESTS: lua: add tune.lua.openlibs to all Lua reg-tests
- BUG/MINOR: resolvers: fix memory leak on AAAA additional records
- BUG/MINOR: spoe: fix pointer arithmetic overflow in spoe_decode_buffer()
- BUG/MINOR: http-act: validate decoded lengths in *-headers-bin
- BUG/MINOR: haterm: Return the good start-line for 100-continue interim message
- BUG/MEDIUM: samples: Fix handling of SMP_T_METH samples
- BUG/MINOR: sample: fix info leak in regsub when exp_replace fails
- BUG/MEDIUM: mux-fcgi: prevent record-length truncation with large bufsize
- BUG/MINOR: hlua: fix use-after-free of HTTP reason string
- BUG/MINOR: mux-quic: fix potential NULL deref on qcc_release()
- BUG/MINOR: quic: increment pos pointer on QMux transport params parsing
- MINOR: xprt_qstrm: implement Rx buffering
- MINOR: xprt_qstrm/mux-quic: handle extra QMux frames after params
- MINOR: xprt_qstrm: implement Tx buffering
- MINOR: xprt_qstrm: handle connection errors
- MEDIUM: mux-quic: implement QMux record parsing
- MEDIUM: xprt_qstrm: implement QMux record parsing
- MEDIUM: mux-quic/xprt_qstrm: implement QMux record emission
- DOC: update draft link for QMux protocol
- BUG/MINOR: do not crash on QMux reception of BLOCKED frames
- Revert "BUG/MEDIUM: haterm: Move all init functions of haterm in haterm_init.c"
- BUG/MEDIUM: haterm: Properly initialize the splicing support for haterm
- BUG/MINOR: mux_quic: prevent QMux crash on qcc_io_send() error path
- BUG/MINOR: xprt_qstrm: do not parse record length on read again
- MEDIUM: otel: added OpenTelemetry filter skeleton
- MEDIUM: otel: added configuration and utility layer
- MEDIUM: otel: added configuration parser and event model
- MEDIUM: otel: added post-parse configuration check
- MEDIUM: otel: added memory pool and runtime scope layer
- MEDIUM: otel: implemented filter callbacks and event dispatcher
- MEDIUM: otel: wired OTel C wrapper library integration
- MEDIUM: otel: implemented scope execution and span management
- MEDIUM: otel: added context propagation via carrier interfaces
- MEDIUM: otel: added HTTP header operations for context propagation
- MEDIUM: otel: added HAProxy variable storage for context propagation
- MINOR: otel: added prefix-based variable scanning
- MEDIUM: otel: added CLI commands for runtime filter management
- MEDIUM: otel: added group action for rule-based scope execution
- MINOR: otel: added log-format support to the sample parser and runtime
- MINOR: otel: test: added test and benchmark suite for the OTel filter
- MINOR: otel: added span link support
- MINOR: otel: added metrics instrument support
- MINOR: otel: added log-record signal support
- MINOR: otel: test: added full-event test config
- DOC: otel: added documentation
- DOC: otel: test: added test README-* files
- DOC: otel: test: added speed test guide and benchmark results
- DOC: otel: added cross-cutting design patterns document
- MINOR: otel: added flt_otel_sample_eval and exposed flt_otel_sample_add_kv
- MINOR: otel: changed log-record attr to use sample expressions
- MINOR: otel: changed instrument attr to use sample expressions
- DOC: otel: added README.md overview document
- CLEANUP: ot: use the item API for the variables trees
- BUG/MINOR: ot: removed dead code in flt_ot_parse_cfg_str()
- BUG/MINOR: ot: fixed wrong NULL check in flt_ot_parse_cfg_group()
- BUILD: ot: removed explicit include path when building opentracing filter
- MINOR: ot: renamed the variable dbg_indent_level to flt_ot_dbg_indent_level
- CI: Drop obsolete `packages: write` permission from `quic-interop-*.yml`
- CI: Consistently add a top-level `permissions` definition to GHA workflows
- CI: Wrap all `if:` conditions in `${{ }}`
- CI: Fix regular expression escaping in matrix.py
- CI: Update to actions/checkout@v6
- CI: Simplify version extraction with `haproxy -vq`
- CI: Merge `aws-lc.yml` and `aws-lc-fips.yml` into `aws-lc.yml`
- CI: Merge `aws-lc-template.yml` into `aws-lc.yml`
- CI: Consistently set up VTest with `./.github/actions/setup-vtest`
- MINOR: mux_quic: remove duplicate QMux local transport params
- CI: github: add bash to the musl job
- BUG/MINOR: quic: do not use hardcoded values in QMux TP frame builder
- BUG/MINOR: log: Fix error message when using unavailable fetch in logfmt
- CLEANUP: log: Return `size_t` from `sess_build_logline_orig()`
- CLEANUP: stream: Explain the two-step initialization in `stream_generate_unique_id()`
- CLEANUP: stream: Reduce duplication in `stream_generate_unique_id()`
- CLEANUP: http_fetch: Use local `unique_id` variable in `smp_fetch_uniqueid()`
- CI: build WolfSSL job with asan enabled
- MINOR: tools: memvprintf(): remove <out> check that always true
- BUG/MEDIUM: cli: Properly handle too big payload on a command line
- REGTESTS: Never reuse server connection in reg-tests/jwt/jwt_decrypt.vtc
- MINOR: errors: remove excessive errmsg checks
- BUG/MINOR: haterm: preserve the pipe size margin for splicing
- MEDIUM: acme: implement dns-persist-01 challenge
- MINOR: acme: extend resolver-based DNS pre-check to dns-persist-01
- DOC: configuration: document dns-persist-01 challenge type and options
- BUG/MINOR: acme: read the wildcard flag from the authorization response
- BUG/MINOR: acme: don't pass NULL into format string
- BUG/MINOR: haterm: don't apply the default pipe size margin twice
- CLEANUP: Make `lf_expr` parameter of `sess_build_logline_orig()` const
- MINOR: Add `generate_unique_id()` helper
- MINOR: Allow inlining of `stream_generate_unique_id()`
- CLEANUP: log: Stop touching `struct stream` internals for `%ID`
- MINOR: check: Support generating a `unique_id` for checks
- MINOR: http_fetch: Add support for checks to `unique-id` fetch
- MINOR: acme: display the type of challenge in ACME_INITIAL_DELAY
- MINOR: mjson: reintroduce mjson_next()
- CI: Remove obsolete steps from musl.yml
- CI: Use `sh` in `actions/setup-vtest/action.yml`
- CI: Sync musl.yml with vtest.yml
- CI: Integrate Musl build into vtest.yml
- CI: Use `case()` function
- CI: Generate vtest.yml matrix on `ubuntu-slim`
- CI: Run contrib.yml on `ubuntu-slim`
- CI: Use `matrix:` in contrib.yml
- CI: Build `dev/haring/` as part of contrib.yml
- MINOR: htx: Add helper function to get type and size from the block info field
- BUG/MEDIUM: htx: Properly handle block modification during defragmentation
- BUG/MEDIUM: htx: Don't count delta twice when block value is replaced
- MINOR: ssl: add TLS 1.2 values in HAPROXY_KEYLOG_XX_LOG_FMT
- EXAMPLES: ssl: keylog entries are greater than 1024
- BUILD: Makefile: don't forget to also delete haterm on make clean
- MINOR: stats: report the number of thread groups in "show info"
- CLEANUP: sample: fix the comment regarding the range of the thread sample fetch
- MINOR: sample: return the number of the current thread group
- MINOR: sample: add new sample fetch functions reporting current CPU usage
- BUG/MEDIUM: peers: trash of expired entries delayed after fullresync
- DOC: remove the alpine/musl status job image
- MINOR: mux-quic: improve documentation for qcs_attach_sc()
- MINOR: mux-quic: reorganize code for app init/shutdown
- MINOR: mux-quic: perform app init in case of early shutdown
- MEDIUM: quic: implement fe.stream.max-total
- MINOR: mux-quic: close connection when reaching max-total streams
- REGTESTS: add QUIC test for max-total streams setting
- MEDIUM: threads: start threads by groups
- MINOR: acme: opportunistic DNS check for dns-persist-01 to skip challenge-ready steps
- BUG/MINOR: acme: fix fallback state after failed initial DNS check
- CLEANUP: acme: no need to reset ctx state and http_state before nextreq
- BUG/MINOR: threads: properly set the number of tgroups when non using policy
When nbthread is set, the CPU policies are not used and do not set
nbthread nor nbtgroups. When back into thread_detect_count(), these
are set respectively to thr_max and 1. The problem, which becomes very
visible with max-threads-per-group, is that setting this one in
combination with nbthread results in only one group with the calculated
number of threads per group. And there's not even a warning. So basically
a configuration having:
global
nbthread 64
max-threads-per-group 8
would only start 8 threads.
In this case, grp_min remains valid and should be used, so let's just
change the assignment so that the number of groups is always correct.
A few ifdefs had to move because the calculations were only made for
the USE_CPU_AFFINITY case. Now these parts have been refined so that
all the logic continues to apply even without USE_CPU_AFFINITY.
One visible side effect is that setting nbthread above 64 will
automatically create the associated number of groups even when
USE_CPU_AFFINITY is not set. Previously it was silently changed
to match the per-group limit.
Ideally this should be backported to 3.2 where the issue was
introduced, though it may change the behavior of configs that were
silently being ignored (e.g. "nbthread 128"), so the backport should
be considered with care. At least 3.3 should have it because it uses
cpu-policy by default so it's only for failing cases that it would be
involved.
The nextreq label already implements setting http_state to ACME_HTTP_REQ
and setting ctx->state to st. It is only needed to set the st variable
before jumping to nextreq.
When the opportunistic initial DNS check (ACME_INITIAL_RSLV_READY) fails,
the state machine was incorrectly transitioning to ACME_RSLV_RETRY_DELAY
instead of ACME_CLI_WAIT. This caused the challenge to enter the DNS retry
loop rather than falling back to the normal cond_ready flow that waits for
the CLI signal.
Also reorder ACME_CLI_WAIT in the state enum and trace switch to reflect
the actual execution order introduced in the previous commit: it comes after
ACME_INITIAL_RSLV_READY, not before ACME_INITIAL_RSLV_TRIGGER.
No backport needed.
For dns-persist-01, the "_validation-persist.<domain>" TXT record is set once
and never changes between renewals. Add an initial opportunistic DNS check
(ACME_INITIAL_RSLV_TRIGGER / ACME_INITIAL_RSLV_READY states) that runs before
the challenge-ready conditions are evaluated. If all domains already have the
TXT record, the challenge is submitted immediately without going through the
cli/delay/dns challenge-ready steps, making renewals faster once the record is
in place.
The new ACME_RDY_INITIAL_DNS flag is automatically set for
dns-persist-01 in cond_ready.
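As an illustration, the record this opportunistic lookup expects would
look like the following zone entry (domain, TTL and value are
placeholders):

    _validation-persist.example.com.  300  IN  TXT  "<persistent-validation-value>"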
Till now, threads were all started one at a time from thread 1. This
will soon cause us limitations once we want to reduce shared stuff
between thread groups.
Let's slightly change the startup sequence so that the first thread
starts one initial thread for each group, and that each of these
threads then starts all other threads from their group before switching
to the final task. Since this requires an intermediary step, we need to
store the threads' start function to access it from the group; it was
put into the tgroup_info, which still has plenty of room available.
It could also theoretically speed up the boot sequence, though in
practice it doesn't change anything because each thread's initialization
is made one at a time to avoid races during the early boot. However,
there is now a function in charge of starting all the extra threads of a
group, and it is called from within that group.
Add a new QUIC regtest to test the new frontend stream.max-total
setting.
This test relies on two haproxy instances, as QUIC client and server.
New setting stream.max-total is set to 3 on the server side. In total, 6
requests are performed, with a check to ensure that a new connection has
been reopened for the last ones.
This commit completes the previous one which implements a new setting to
limit the number of streams usable by a client on a QUIC connection.
When the connection becomes idle after reaching this limit, it is
immediately closed. This is implemented by extending checks in
qcc_is_dead(). This results in a CONNECTION_CLOSE emission, which is
useful to free resources as soon as possible.
Implement a new setting to limit the total number of bidirectional
streams that the client may use on a single connection. By default, it
is set to 0 which means it is not limited at all.
If a positive value is configured, the client can only open a fixed
number of request streams per QUIC connection. Internally, this is
implemented in two steps:
* First, MAX_STREAMS_BIDI flow control advertising will be reduced when
approaching the limit before being completely turned off when reaching
it. This guarantees that the client cannot exceed the limit without
violating the flow control.
* Second, when attaching the latest stream with ID matching max-total
setting, connection graceful shutdown is initiated. In HTTP/3, this
results in a GOAWAY emission. This allows the remaining streams to be
completed before the connection becomes completely idle.
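For illustration, a hedged configuration sketch (the keyword name is
taken from the companion commit below; the value is arbitrary):
global
tune.quic.fe.stream.max-total 100
With this, a client may open at most 100 request streams on a single
QUIC connection before it is gracefully shut down.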
Adds a qcc_app_init() call in qcc_app_shutdown(). This is necessary if
shutdown is performed early, before any invocation of qcc_io_send().
Currently, this should never occur in practice. However, this will
become necessary with the new setting tune.quic.fe.stream.max-total.
Indeed, when using a very small value, app-ops layer may be closed early
in the connection lifetime.
Refactor code related to app-layer init/shutdown operations. In short,
qcc_shutdown() is renamed to qcc_app_shutdown(). It is also moved next
to qcc_app_init() to better reflect their link.
Complete function doc for qcs_attach_sc() by using the proper
terminology related to stream/stconn/sedesc. The purpose of this
function should be clearer now.
stksess_new() sets the entry's expire to the table's expire delay; if
it is a new entry, set_entry inserts it at that position in the expire
tree. A touch_remote then updated the expire setting, but the tree's
re-ordering is not designed to move entries back in the past, resulting
in an entry that will only be trashed after a full table expire delay,
regardless of the expire set on the stksess.
This patch sets the new stksess' expire before the call to 'set_entry'.
This way a newly inserted entry is placed directly at the right position
in the tree, so it is trashed in time.
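A hedged sketch of the idea (names approximate, not the literal patch):
/* set the expiry before insertion so the entry is queued at its
 * correct position in the expire tree from the start
 */
newts->expire = tick_add(now_ms, MS_TO_TICKS(expire));
stktable_set_entry(t, newts);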
This patch should be backported to all supported branches, at least as
far as 2.8.
Some features can automatically turn on or off depending on CPU usage,
but it's not easy to measure it. Let's provide 3 new sample fetch functions
reporting the CPU usage as measured inside haproxy during the previous
polling loop, and reported in "idle" stats header / "show info", or used
by tune.glitches.kill.cpu-usage, or maxcompcpuusage:
- cpu_usage_thr: CPU usage between 0 and 100 of the current thread, as
used by the features above
- cpu_usage_grp: CPU usage between 0 and 100, averaged over all threads of
the same group as the current one.
- cpu_usage_proc: CPU usage between 0 and 100, averaged over all threads
of the current process
Note that the value will fluctuate since it only covers a few tens to
hundreds of requests of the last polling loop, but it reports what is
being used to take decisions.
It could also be used to disable some non-essential debugging/processing
under too high loads for example.
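For illustration, one hedged way to use these (threshold arbitrary):
frontend fe_http
http-request return status 503 if { cpu_usage_proc gt 95 }
This would shed new requests whenever the whole process runs close to
saturation.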
Just like we have a sample fetch function that returns the number of the
current thread, let's have the same with the thread group number. This
can be useful for troubleshooting, given that certain things are currently
per thread-group (e.g. idle backend connections, certain LB algos etc).
The comment says "between 1 and nbthread" while it's in fact between 0 and
nbthread-1 and this is also documented like this in the config manual. No
backport needed though it cannot hurt.
Since thread groups were enabled by default in 3.3, it has become an
important element of diagnostic that we're missing in "show info". Let's
add it under "NbThreadGroups".
haterm depends on the same source files as haproxy, yet it wasn't deleted
on "make clean", causing confusion when rebuilding, by making one believe
the freshly built binary was running. Let's just add it to the "clean" target. No
backport is needed since haterm is 3.4-only.
Adjust the log size to 2048; the default 1024 bytes of a log line are
too small since f28dd15 ("MINOR: ssl: add TLS 1.2 values in
HAPROXY_KEYLOG_XX_LOG_FMT")
Add the CLIENT_RANDOM line for TLS1.2 in HAPROXY_KEYLOG_FC_LOG_FMT and
HAPROXY_KEYLOG_BC_LOG_FMT. These are useful to produce a keylog file
compatible with both TLS1.3 and TLS1.2.
A regression was introduced by the commit a8887e55a ("BUG/MEDIUM: htx: Fix
function used to change part of a block value when defrag").
When a block value was replaced and a defragmentation was performed, the
delta between the old value and the new one was counted twice. htx_defrag()
is already responsible for setting the new size of the HTX message, so
this must not be done again in htx_replace_blk_value().
This patch must be backported with the commit above, so theoretically to
all stable versions.
A regression was introduced by the commit 0c6f2207f ("MEDIUM: htx: Refactor
htx defragmentation to merge data blocks").
When a defragmentation is performed, it is possible to alter a block
size. The main usage is to prepare a block value replacement. However, since
the commit above, this change is no longer handled. The block info is
changed but the size of the message is not modified accordingly.
This patch depends on the commit "MINOR: htx: Add helper function to get
type and size from the block info field"
No backport needed.
__htx_blkinfo_type() and __htx_blkinfo_size() functions were added to return,
respectively, the type and the size from the block info field. The main
usage for these functions is internal to the htx code.
This makes it much easier to add additional "smoke-tests" to contrib.yml. The
previous set-up also didn't make it easy to see all failures when a single
build failed, because it would abort after any failed step.
With the previous sync, these two workflows perform almost the same steps and
both logically belong to "Run VTest tests". Integrate musl.yml into vtest.yml,
which will hopefully encourage future changes to consistently apply to all jobs
in that workflow.
This syncs up musl.yml with vtest.yml as much as possible by:
- Aligning indentation.
- Reordering steps.
- Aligning step names.
- Adding missing functionality to musl.yml.
Bash might not always be preinstalled and we don't make use of any
bash-specific features either. Switch to POSIX sh for simplicity.
This partly reverts the fix in 073240044e, which
installed `bash` for the musl job.
The lack of mjson_next() prevents easy iteration and forces a hack:
iterating in a loop of snprintf + $.field[XXX] combined with
mjson_find().
This reintroduces mjson_next() so we can iterate without having to
build the string.
The patch does not reintroduce MJSON_ENABLE_NEXT, so the function can be
used without having to define it.
The ACME_INITIAL_DELAY state displays a message about 'dns-01', but this
state is also used for 'dns-persist-01'.
This patch displays the challenge that was configured instead of always
showing 'dns-01'.
This allows using the `unique-id` fetch within `tcp-check` or `http-check`
ruleset. The format is taken from the checked server's backend (which is
naturally inherited from the corresponding `defaults` section).
This is particularly useful with
http-check send ... hdr request-id %[unique-id]
to ensure all requests sent by HAProxy have a unique ID header attached.
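A fuller hedged sketch (the backend name and unique-id format are
illustrative only):
backend be_app
unique-id-format %{+X}o\ %ci:%cp_%fi:%fp_%Ts_%rt:%pid
option httpchk
http-check send meth GET uri /health hdr request-id %[unique-id]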
This resolves GitHub Issue #3307.
Reviewed-by: Volker Dusch <github@wallbash.com>
This implementation is directly modeled after `stream_generate_unique_id()` and
the corresponding `unique_id` field on `struct stream`.
It will be used in a future commit to enable the use of the `%[unique-id]`
fetch in check rules.
Use the return value of `stream_generate_unique_id()` instead of relying on the
`unique_id` field of `struct stream` when handling the `%ID` log placeholder.
This also made it possible to unify the "stream available" and "stream
not available" paths.
Reviewed-by: Volker Dusch <github@wallbash.com>
With the introduction of the `generate_unique_id()` helper, the actual
complicated logic is sitting in a different file. Allow inlining of
`stream_generate_unique_id()`, so that callers can benefit from an abstraction
without hiding away the access of `strm->unique_id` behind a function call.
This new function will handle the actual generation of the unique ID according
to a format. The caller is responsible for checking that no unique ID is
stored yet.
Commit 6d16b11022 ("BUG/MINOR: haterm: preserve the pipe size margin
for splicing") solved the issue of pipe size being sufficient for the
vmsplice() call, but as Christopher pointed out, the ratio was applied
to the default size of 64k, so now it's applied twice, giving 100k
instead of 80k. Let's drop it from there.
No backport needed.
Printing a "(null)" when NULL passed with the %s format specifier is a
GNU extension, so it must be avoided for portability reasons.
Must be backported as far as 3.2
The wildcard field was declared and used when building the dns-persist-01
TXT record value (policy=wildcard suffix), but was never populated from
the server's authorization response. Add the missing mjson_get_bool() call
to read $.wildcard before saving auth->dns.
Document the dns-persist-01 challenge type under the challenge keyword,
the challenge-ready dns option (existence-only TXT check for dns-persist-01),
and the default challenge-ready value when challenge is dns-persist-01.
Add challenge_type parameter to acme_rslv_start() to select the correct
DNS lookup prefix: _validation-persist.<domain> for dns-persist-01 and
_acme-challenge.<domain> for dns-01.
Default cond_ready to ACME_RDY_DNS|ACME_RDY_DELAY for dns-persist-01.
Extend ACME_CLI_WAIT to cover dns-persist-01 alongside dns-01.
In ACME_RSLV_READY, check only TXT record existence for dns-persist-01
since the resolver cannot parse multiple strings within a single TXT entry.
Implements draft DNS-PERSIST-01 challenge based on
https://datatracker.ietf.org/doc/html/draft-ietf-acme-dns-persist
Blog post: https://letsencrypt.org/2026/02/18/dns-persist-01
This challenge is designed to use preprovisioned DNS records,
unlike the DNS-01 challenge it doesn't need per-provider API
integration.
In short, instead of validating an order by crafting a custom response
based on input received from the ACME server, as other challenges do
(in particular DNS-01, HTTP-01, TLS-ALPN-01), this challenge authorizes
the domain statically: the ACME account key functions like a private
key, the accounturi in the record functions like a public key, and the
ACME server verifies that the account URI matches the account key and
authorizes based on that. You only need to write the DNS record once;
the accounturi binds to an account key and will only change if a new
account key is created, although it is possible to rotate the account
key without changing the account URI.
Main benefits of this challenge in contrast to DNS-01:
1. Security, no need to give reverse proxy write access to the DNS.
2. Simplicity, no complex per provider integrations like Lego needed.
3. Robustness, no worrying about DNS record cache each renewal.
It would be used like this:
1. generate an account key ahead of time
2. add required DNS record manually or automatically using IaC tools
3. start HAProxy with the same account key used
The intended way to use this challenge is with code that prints, and
perhaps sets, DNS records ahead of time. For example, that could be
integrated into the IaC provisioning step. This challenge type is
extremely recent though, so those integrations are yet to be written.
It is possible to do this challenge without extra tools too; with
pebble / challtestsrv the steps would be as follows:
After starting HAProxy it will print required records in the logs.
With challtestsrv you can then set those records like this:
curl -d '{
"host":"_validation-persist.localhost.",
"value": "pebble.letsencrypt.org; accounturi=...; policy=wildcard"}
' http://localhost:8055/set-txt
After setting the records run renew with the name of the certificate:
echo "acme renew @cert/localhost.pem" \
| socat stdio tcp4-connect:127.0.0.1:9999
Or just restart HAProxy.
Unlike with DNS-01, you don't have to worry about DNS records changing;
if there is any problem with the DNS records, you can just retry.
Originally in httpterm we used to allocate 5/4 of the size of a pipe to
permit the use of vmsplice, because some internal fragmentation or
overhead requires a bit of margin. While this was initially
applied to haterm as well, it was accidentally lost with commit fb82dece47
("BUG/MEDIUM: haterm: Properly initialize the splicing support for haterm"),
resulting in errors about vmsplice() whenever tune.pipesize is set. Let's
enforce the ratio again.
No backport is needed.
I noticed some strange checks for the presence of errmsg. The called
functions generate a non-empty error message in case of failure, so
checking that the error message address is non-NULL is enough.
No backport needed.
When the command line was parsed and the payload was too big, the error
was not properly handled. Instead of leaving the parsing function to
print the error, we looped infinitely trying to parse the remaining data.
When the command line is too big, we must exit the parsing function in
CLI_ST_PRINT_ERR state. Instead of exiting the function, we only left the
while loop, which set the cli applet to the CLI_ST_PROMPT state.
This patch must be backported as far as 3.2.
Reference: https://github.com/haproxy/haproxy/issues/3317
This allows distributing memory checking to WolfSSL code as well.
It only applies to the WolfSSL weekly job, which builds the wolfssl git
version.
Instead of relying on the implementation detail that
`stream_generate_unique_id()` will store the unique ID in `strm->unique_id` we
should use the returned value, especially since that one is already checked in
the `isttest()`.
Reviewed-by: Volker Dusch <github@wallbash.com>
The return value of the `if()` and `else` branch is identical. We can just move
it out of conditional paths.
Reviewed-by: Volker Dusch <github@wallbash.com>
`sess_build_logline_orig()` takes a `size_t maxsize` as input and accordingly
should also return `size_t` instead of `int` as the resulting length. In
practice most of the callers already stored the result in a `size_t` anyway.
The few places that used an `int` were adjusted.
This Coccinelle patch was used to check for completeness:
@@
type T != size_t;
T var;
@@
(
* var = build_logline(...)
|
* var = build_logline_orig(...)
|
* var = sess_build_logline(...)
|
* var = sess_build_logline_orig(...)
)
Reviewed-by: Volker Dusch <github@wallbash.com>
The following configuration:
defaults
unique-id-format TEST-%[srv_name]
frontend fe_http
mode http
bind :::8080 v4v6
Emitted the following error:
[ALERT] (219835) : Parsing [./patch.cfg:2]: failed to parse unique-id : sample fetch <srv_name]> may not be reliably used here because it needs 'server' which is not available here.
The `]` in the name of the sample fetch should not be there.
This bug exists since at least HAProxy 2.4, which is the oldest supported
version. The fix should be backported there.
Reviewed-by: Volker Dusch <github@wallbash.com>
Reuse QUIC transport parameters value set in xprt_qstrm layer in frame
builder function. Prior to this patch, mux_quic would use different
values from the advertised ones.
No need to backport.
Previous commit 6e67b59 ("CI: Consistently set up VTest with
./.github/actions/setup-vtest") requires bash to use the github action.
This commit adds bash to the list of installed packages in Alpine.
When QMux was first implemented, values used for emitted transport
parameters in xprt_qstrm and local flow control in mux_quic were
initialized separately. This is error prone, in particular if a value is
changed in one layer but not the other.
This patch fixes this by using xprt_qstrm_lparams() in QMux init
function. Mux flow control is then loaded with these values. Thus all
values are now initialized in a single place which is xprt_qstrm_init().
These two jobs run on exactly the same triggers and are effectively variations
of each other. There is no need to have two separate workflows for them.
This fixes:
.github/matrix.py:72: SyntaxWarning: "\." is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\."? A raw string is also an option.
return re.match('^v[0-9]+(\.[0-9]+)*$', version_string)
.github/matrix.py:89: SyntaxWarning: "\." is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\."? A raw string is also an option.
return re.match('^AWS-LC-FIPS-[0-9]+(\.[0-9]+)*$', version_string)
.github/matrix.py:106: SyntaxWarning: "\." is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\."? A raw string is also an option.
return re.match('^v[0-9]+(\.[0-9]+)*-stable$', version_string)
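As the warnings themselves suggest, the patterns become raw strings,
e.g. for the first one:
return re.match(r'^v[0-9]+(\.[0-9]+)*$', version_string)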
The thread-local variable dbg_indent_level used a generic name that could
collide with identifiers in other compilation units. Renamed it to
flt_ot_dbg_indent_level so that it carried the flt_ot_ prefix consistent
with the rest of the OpenTracing filter namespace. The rename covered the
declaration, definition, and all macro references in debug.h, parser.c and
util.c.
The -Iaddons/ot/include flag in OT_CFLAGS allowed source files to use a
bare #include "include.h", which was fragile because it depended on the
compiler search path. Removed that flag from the Makefile and changed
every source file under addons/ot/src/ to use the relative include path
../include/include.h instead. This made header resolution explicit and
consistent with standard addon conventions.
After calling flt_ot_conf_group_init() and storing the result in
flt_ot_current_group, the code incorrectly checked flt_ot_current_config
for NULL instead of the newly assigned flt_ot_current_group. This meant
a failed group init was never detected and the error path was never taken.
The local variable str was declared but never assigned a value other than
NULL. The error-handling block that called flt_ot_conf_str_free(&str) on
it was therefore a no-op. Removed both the unused variable and the dead
cleanup path.
In flt_ot_vars_scope_dump(), switched from cebu64_first()/cebu64_next() to
cebu64_imm_first()/cebu64_imm_next() for iterating the variable name trees.
Since this function only reads variables under a read lock, the immutable
traversal API is the correct choice. Also updated the container_of()
member from 'node' to 'name_node' to match the current struct var layout.
Replaced the static key-value attribute storage in update-form instruments
with sample-evaluated attributes, matching the log-record attr change.
The 'attr' keyword now accepts a key and a HAProxy sample expression
evaluated at runtime.
The struct (conf.h) changed from otelc_kv/attr_len to a list of
flt_otel_conf_sample entries. The parser (parser.c) calls
flt_otel_parse_cfg_sample() with n=1 per attr keyword. At runtime
(event.c) each attribute is evaluated via flt_otel_sample_eval() and
added via flt_otel_sample_add_kv() to a bare flt_otel_scope_data_kv,
which is passed to the meter.
Updated documentation, debug macro and test configurations.
Replaced the static key-value attribute storage in log-record with
sample-evaluated attributes. The 'attr' keyword now accepts a key and a
HAProxy sample expression evaluated at runtime, instead of a static string
value.
The struct (conf.h) changed from otelc_kv/attr_len to a list of
flt_otel_conf_sample entries. The parser (parser.c) calls
flt_otel_parse_cfg_sample() with n=1 per attr keyword. At runtime
(event.c) each attribute is evaluated via flt_otel_sample_eval() and added
via flt_otel_sample_add_kv() to a bare flt_otel_scope_data_kv, which is
passed to logger->log_span().
Updated documentation, debug macro and test configurations.
Factored the sample evaluation logic out of flt_otel_sample_add() into a
new flt_otel_sample_eval() function that evaluates a sample definition
into an otelc_value. Both the log-format path and the bare sample
expression path are handled, with a flag_native parameter controlling
native type preservation for single-expression samples.
flt_otel_sample_add() now calls flt_otel_sample_eval() and dispatches the
result.
Made flt_otel_sample_add_kv() non-static so callers outside util.c can
add key-value pairs directly to a bare flt_otel_scope_data_kv without
requiring the full flt_otel_scope_data structure.
The test directory gained a speed test guide (README-test-speed)
explaining how to run performance benchmarks at various rate-limit levels,
together with benchmark result files for the standalone, composite,
context-propagation, and frontend-backend test configurations.
Added README documentation for each test configuration (sa, cmp, ctx,
fe-be, empty, full) describing event coverage, signal usage, instrument
tables, span hierarchies and run instructions.
Added the full documentation set for the OpenTelemetry filter.
The main README served as the user-facing guide covering build
instructions, core OpenTelemetry concepts, the complete filter
configuration reference, usage examples with worked scenarios,
CLI commands, and known limitations.
Supplementary documents provided a detailed configuration guide with
worked examples (README-configuration), an internal C structure reference
for developers (README-conf), a function reference organized by source
file (README-func), an architecture and implementation review
(README-implementation), and miscellaneous notes (README-misc).
Added the 'full' test configuration that exercises all 29 supported OTel
filter events with all three signal types (traces, metrics, logs). Every
instrument definition has a corresponding update.
Added "log-record" as the third OpenTelemetry signal alongside traces
(span) and metrics (instrument). This includes the
flt_otel_conf_log_record structure definition, parser keyword defines,
the otel-scope section parser with optional "id", "event", "span", and
"attr" keywords followed by sample fetch expressions or a log-format
string, init/free lifecycle, scope list wiring, log-format evaluation
in flt_otel_scope_run_instrument_record(), a test configuration example,
log-record span reference validation in flt_otel_check(), and logger
handle creation, startup, and teardown in the filter lifecycle.
Added the "instrument" keyword to otel-scope sections for recording metric
measurements alongside traces.
Introduced flt_otel_conf_instrument holding instrument type, description,
unit, sample expressions, and optional key-value attributes. The
supported synchronous integer-precision instrument types were counters,
histograms, up-down counters, and gauges.
Instruments followed a two-form design: a "create" form defined a new
instrument with its type and value expression, while an "update" form
recorded measurements against an existing instrument with per-scope
attributes.
Instrument creation was performed lazily at first use with HA_ATOMIC_CAS
to guarantee thread-safe one-time initialization. The configuration
check phase validated that every update-form had a matching create-form
definition and that create-form names were unique across all scopes.
The meter lifecycle was integrated into filter init and deinit, starting
the meter alongside the tracer and shutting it down during cleanup.
Added span link support, allowing a span to reference other spans or
extracted contexts without establishing a parent relationship.
Introduced the flt_otel_conf_link structure and added a links list to
flt_otel_conf_span. The parser accepted both an inline syntax on the span
declaration line ("span <name> link <target>") and a standalone multi-
argument form ("link <span> ..."), each creating a conf_link entry
appended to the span's link list.
At runtime, each configured link name was resolved against the active
spans and extracted contexts in the runtime context. Resolved references
were collected into flt_otel_scope_data_link entries and passed to the C
wrapper add_link API during span creation.
Initialization, cleanup, and debug dump routines were added for the link
data structures at both configuration and runtime levels.
Added a test suite under addons/otel/test/ for the OpenTelemetry filter.
Five scenarios exercise different filter capabilities: standalone (sa)
covers all hook points including idle-timeout heartbeats, metrics and log
records; compact (cmp) covers the full request/response lifecycle with
ACL-based error handling; context (ctx) tests explicit inject/extract
propagation through numbered context variables; frontend/backend (fe/be)
tests distributed tracing across two HAProxy instances; and empty tests
bare filter initialisation with no active scopes.
A performance benchmarking script (test-speed.sh) uses wrk to measure
throughput and latency at different rate-limit settings (100% through 0%,
disabled, and filter-off). Each scenario includes comprehensive YAML
exporter definitions covering OTLP file/gRPC/HTTP, ostream, memory,
Zipkin, and Elasticsearch backends.
Extended flt_otel_parse_cfg_sample() to accept log-format strings in
addition to bare sample expressions. Added lf_expr and lf_used fields
to flt_otel_conf_sample.
Extended flt_otel_sample_add() to evaluate log-format expressions when
lf_used was set.
Added the "otel-group" action keyword that allows executing a named group
of OTel scopes from HAProxy TCP and HTTP action rule contexts.
The new group.c module registers the "otel-group" keyword for all four
action contexts (tcp-request, tcp-response, http-request, http-response)
and implements the action lifecycle callbacks.
The parser flt_otel_group_parse() accepts a filter ID and group ID as
arguments, duplicates them into the action rule's argument slots, and
wires up the check, action, and release callbacks.
The post-parse validator flt_otel_group_check() resolves the filter ID and
group ID string references into direct configuration pointers by searching
the proxy's filter list for a matching OTel filter and then looking up the
named group within that filter's configuration.
The action handler flt_otel_group_action() retrieves the filter and group
configuration from the resolved rule arguments, verifies the filter is
attached to the stream and not disabled, then iterates through all scopes
in the group and executes each via flt_otel_scope_run() with a shared
timestamp pair. This allows operators to trigger OTel instrumentation
conditionally from HAProxy rules, for example applying different tracing
scopes based on ACL conditions or request properties.
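For example (the filter and group IDs are made up):
http-request otel-group my-otel grp-api if { path_beg /api }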
Added HAProxy CLI commands that allow runtime inspection and modification
of OTel filter settings without requiring a configuration reload.
The new cli.c module registers CLI keywords under the "otel" prefix and
implements the following commands: flt_otel_cli_parse_status() displays a
comprehensive status report of all OTel filter instances including filter
ID, proxy, disabled state, hard-error mode, logging state, rate limit,
analyzer bits, and SDK diagnostic message count;
flt_otel_cli_parse_disabled() enables or disables filtering across all
instances; flt_otel_cli_parse_option() toggles the hard-error mode that
controls whether errors disable the filter for a stream or are silently
ignored; flt_otel_cli_parse_logging() manages the logging state with
support for off, on, and dontlog-normal modes; flt_otel_cli_parse_rate()
adjusts the sampling rate limit as a floating-point percentage; and
flt_otel_cli_parse_debug() sets the debug verbosity level in debug builds.
All modifications are applied atomically across every OTel filter instance
in every proxy.
The CLI initialization is called from flt_otel_ops_init() during filter
startup via flt_otel_cli_init(), which registers the keyword table through
cli_register_kw().
Supporting changes include the FLT_OTEL_U32_FLOAT macro for converting the
internal uint32_t rate representation to a human-readable percentage, the
FLT_OTEL_PROXIES_LIST_START/END iteration macros for traversing all OTel
filter instances across the proxy list, and flt_otel_filters_dump() for
debug logging of filter instances.
Introduced an alternative variable scanning strategy that directly walked
the CEB tree of HAProxy's variable store instead of maintaining a separate
tracking buffer.
The Makefile auto-detected whether struct var carried a "name" member
by inspecting include/haproxy/vars-t.h and conditionally defined
USE_OTEL_VARS_NAME. When enabled, the tracking buffer (flt_otel_ctx) and
its callback type were compiled out and replaced by direct tree walks.
flt_otel_vars_unset() walked the CEB tree for the resolved scope, removed
every variable whose normalized name matched the given prefix followed by
a dot, and adjusted the variable accounting. flt_otel_vars_get()
performed the same prefix scan under a read lock, denormalized each
matching variable name back to its original OTel form, and assembled the
results into an otelc_text_map.
A helper flt_otel_vars_get_scope() was added to resolve scope name strings
("txn", "sess", "proc", "req", "res") to the corresponding HAProxy
variable store. The set path skipped the tracking buffer update when
prefix scanning was available.
Added support for storing OTel span context in HAProxy transaction
variables as an alternative to HTTP headers, enabled by the OTEL_USE_VARS
compile flag.
The new vars.c module implements variable-based context propagation
through the HAProxy variable subsystem. Variable names are constructed
from a configurable prefix and the OTel propagation key, with dots
normalized to underscores for HAProxy variable name compatibility
and denormalized back during retrieval. The module provides
flt_otel_var_register() to pre-register variables at parse time,
flt_otel_var_set() and flt_otel_vars_unset() to store and clear context
key-value pairs in the txn scope, flt_otel_vars_get() to collect all
variables matching a prefix into an otelc_text_map for context extraction,
and flt_otel_vars_dump() for debug logging of all OTel variables.
The inject/extract keywords in the scope parser now accept an optional
"use-vars" argument alongside "use-headers", controlled by the new
FLT_OTEL_CTX_USE_VARS flag. Both storage types can be used simultaneously
on the same span context, allowing context to be propagated through both
HTTP headers and variables.
The scope runner in event.c was extended to handle variable-based
context in parallel with headers: during extraction, it reads matching
variables via flt_otel_vars_get() when FLT_OTEL_CTX_USE_VARS is set;
during injection, it stores each propagation key as a variable via
flt_otel_var_set(). The unused resource cleanup now also unsets context
variables when removing failed extraction contexts.
The filter attach callback registers and sets the sess.otel.uuid variable
with the generated session UUID, making the trace identifier available to
HAProxy log formats and ACL expressions.
The feature is conditionally compiled: the OTEL_USE_VARS flag controls
whether vars.c is included in the build and whether the "use-vars" keyword
is available in the configuration parser.
Added the HTTP header manipulation layer that enables span context
injection into and extraction from HAProxy's HTX message buffers,
completing the end-to-end context propagation path.
The new http.c module implements three public functions:
flt_otel_http_headers_get() extracts HTTP headers matching a name prefix
from the channel's HTX buffer into an otelc_text_map structure, stripping
the prefix and separator dash from header names before storage;
flt_otel_http_header_set() constructs a full header name from a prefix and
suffix joined by a dash, removes all existing occurrences, and optionally
adds the header with a new value; and flt_otel_http_headers_remove()
removes all headers matching a given prefix. A debug-only
flt_otel_http_headers_dump() logs all HTTP headers from a channel at
NOTICE level.
The scope runner in event.c now extracts propagation contexts from HTTP
headers before processing spans: for each configured extract context, it
calls flt_otel_http_headers_get() to read matching headers into a text
map, then passes the text map to flt_otel_scope_context_init() which
extracts the OTel span context from the carrier. After span execution,
the span runner injects the span context back into HTTP headers via
flt_otel_inject_http_headers() followed by flt_otel_http_header_set()
for each propagation key.
The unused resource cleanup in flt_otel_scope_free_unused() now also
removes contexts that failed extraction by deleting their associated
HTTP headers via flt_otel_http_headers_remove() before freeing the scope
context structure.
Added the span context injection and extraction layer that bridges the
OTel C wrapper's propagation API with HAProxy's HTTP headers and text map
carriers.
The new otelc.c module implements four public functions that wrap the
OTel C wrapper's context propagation methods: flt_otel_inject_text_map()
and flt_otel_inject_http_headers() serialize a span's context into
a text map or HTTP headers carrier for outbound propagation, while
flt_otel_extract_text_map() and flt_otel_extract_http_headers()
deserialize an inbound carrier into an otelc_span_context for parent
linking.
Each direction uses a pair of callbacks registered on the carrier
structure. The injection writers (flt_otel_text_map_writer_set_cb and
flt_otel_http_headers_writer_set_cb) store key-value pairs emitted by the
SDK into the carrier's text map via OTELC_TEXT_MAP_ADD(). The extraction
readers (flt_otel_text_map_reader_foreach_key_cb and
flt_otel_http_headers_reader_foreach_key_cb) iterate the carrier's text
map entries and pass each pair to the SDK's handler callback.
The scope context initialization in flt_otel_scope_context_init() now
calls flt_otel_extract_http_headers() to extract the span context from the
provided text map carrier and stores it in the scope context structure,
making extracted contexts available for parent linking in subsequent span
creation.
Implemented the scope execution engine that creates OTel spans, evaluates
sample expressions to collect telemetry data, and manages span lifecycle
during request and response processing.
The scope runner flt_otel_scope_run() was expanded from a stub into a
complete implementation that evaluates ACL conditions on the scope,
extracts span contexts from HTTP headers when configured, iterates over
the scope's span definitions calling flt_otel_scope_run_span() for each,
marks and finishes completed spans, and cleans up unused runtime
resources.
The span runner flt_otel_scope_run_span() creates OTel spans via the
tracer with optional parent references (from other spans or extracted
contexts), collects telemetry by calling flt_otel_sample_add() for each
configured attribute, event, baggage and status entry, then applies the
collected data to the span (attributes, events with their own key-value
arrays, baggage items, and status code with description) and injects the
span context into HTTP headers when configured.
The sample evaluation layer converts HAProxy sample expressions into OTel
telemetry data. flt_otel_sample_add() evaluates each sample expression
against the stream, converts the result via flt_otel_sample_to_value()
which preserves native types (booleans as OTELC_VALUE_BOOL, integers as
OTELC_VALUE_INT64, all others as strings), and routes the key-value pair
to the appropriate collector based on the sample type (attribute, event,
baggage, or status). The key-value arrays grow dynamically using the
FLT_OTEL_ATTR_INIT_SIZE and FLT_OTEL_ATTR_INC_SIZE constants.
Span finishing is handled in two phases: flt_otel_scope_finish_mark()
marks spans and contexts for completion using exact name matching or
wildcards ("*" for all, "*req*" for request-direction, "*res*" for
response-direction), and flt_otel_scope_finish_marked() ends all marked
spans with a common monotonic timestamp and destroys their contexts.
Connected the OpenTelemetry C wrapper library to the filter lifecycle by
implementing the library initialization, tracer creation, memory and
thread callbacks, shutdown sequence, and span completion.
The flt_otel_lib_init() function now verifies the C wrapper library
version against the compiled headers, calls otelc_init() with the absolute
configuration file path, and creates the tracer via otelc_tracer_create().
On success, it registers HAProxy pool-based memory callbacks
(flt_otel_mem_malloc, flt_otel_mem_free) and a thread ID callback
(flt_otel_thread_id) through otelc_ext_init(), so the C++ SDK allocates
span and context objects from pool_head_otel_span_context. A custom log
handler (flt_otel_log_handler_cb) is registered via otelc_log_set_handler()
to count OTel SDK internal diagnostic messages in the flt_otel_drop_cnt
counter.
The per-thread init callback now starts the tracer thread via
OTELC_OPS(tracer, start) instead of unconditionally returning success.
The deinit callback saves the tracer handle before freeing the
configuration, then shuts down the library via otelc_deinit() after the
pool is destroyed, ensuring the ext callbacks remain valid while the
configuration structures are still being freed. In debug builds, it logs
wrapper statistics, attach counters, and per-event HTX usage counters
before shutdown.
The runtime context cleanup in flt_otel_runtime_context_free() now ends
all active spans with a common monotonic timestamp via
OTELC_OPSR(span, end_with_options) before freeing them. The scope context
cleanup in flt_otel_scope_context_free() now destroys the underlying OTel
span context via OTELC_OPSR(context, destroy).
The parser gained static storage for the debug memory tracker
(OTELC_DBG_MEM) and its initialization in the parse entry point, used when
compiled with the OTELC_DBG_MEM flag.
Replaced the stub filter callbacks with full implementations that dispatch
OTel events through the scope execution engine, and added the supporting
debug, error handling and utility infrastructure.
The filter lifecycle callbacks (init, deinit, init_per_thread) now
initialize the OpenTelemetry C wrapper library, create the tracer from the
instrumentation configuration file, enable HTX stream filtering, and clean
up the configuration and memory pools on shutdown.
The stream callbacks (attach, stream_start, stream_set_backend,
stream_stop, detach, check_timeouts) create the per-stream runtime context
on attach with rate-limit based sampling, fire the corresponding OTel
events (on-stream-start, on-backend-set, on-stream-stop), manage the
idle timeout timer with reschedule logic in detach, and free the runtime
context in check_timeouts. The attach callback also registers the
required pre and post channel analyzers from the instrumentation
configuration.
The channel callbacks (start_analyze, pre_analyze, post_analyze,
end_analyze) register per-channel analyzers, map analyzer bits to event
indices via flt_otel_get_event(), and dispatch the matching events.
The end_analyze callback also fires the on-server-unavailable event
when response analyzers were configured but never executed.
The HTTP callbacks (http_headers, http_end, http_reply, and the debug-only
http_payload and http_reset) dispatch their respective request/response
events based on the channel direction.
The event dispatcher flt_otel_event_run() in event.c iterates over all
scopes matching a given event index and calls flt_otel_scope_run() for
each, sharing a common monotonic and wall-clock timestamp across all spans
within a single event.
Error handling is centralized in flt_otel_return_int() and
flt_otel_return_void(), which implement the hard-error/soft-error policy:
hard errors disable the filter for the stream, soft errors are silently
cleared.
The new debug.h header provides conditional debug macros
(FLT_OTEL_DBG_ARGS, FLT_OTEL_DBG_BUF) and the FLT_OTEL_LOG macro for
structured logging through the instrumentation's log server list. The
utility layer gained debug-only label functions for channel direction,
proxy mode, stream position, filter type, and analyzer bit name lookups.
Added the memory pool management and the runtime scope layer that track
per-stream OTel spans and contexts during request processing.
The pool layer in pool.c manages HAProxy memory pools for the runtime
structures used by the filter: scope spans, scope contexts, runtime
contexts, and span contexts. Each pool is conditionally compiled via
USE_POOL_OTEL_* macros defined in config.h and registered with
REGISTER_POOL(). The allocation functions (flt_otel_pool_alloc,
flt_otel_pool_strndup, flt_otel_pool_free) transparently fall back to
heap allocation when the corresponding pool is not enabled. Trash buffer
helpers (flt_otel_trash_alloc, flt_otel_trash_free) provide scratch space
using either HAProxy's trash chunk pool or direct heap allocation.
The scope layer in scope.c implements the per-stream runtime state. The
flt_otel_runtime_context structure is allocated when a stream starts and
holds the stream and filter references, hard-error/disabled/logging flags
copied from the instrumentation configuration, idle timeout state, a
generated UUID, and lists of active scope spans and extracted scope
contexts. Scope spans (flt_otel_scope_span) carry the operation name,
fetch direction, the OTel span handle, and optional parent references
resolved from other spans or extracted contexts. Scope contexts
(flt_otel_scope_context) hold an extracted span context obtained from
a carrier text map via the tracer. The scope data structures
(flt_otel_scope_data) aggregate growable key-value arrays for attributes
and baggage, a linked list of named events with their own attribute
arrays, and a span status code with description, representing the
telemetry collected during a single event execution.
Implemented the flt_otel_ops_check() callback that validates the parsed
OTel filter configuration after all HAProxy configuration sections have
been processed.
The check callback performs the following validations: resolves deferred
sample fetch arguments under full frontend and backend capabilities,
verifies uniqueness of filter IDs across all proxies, ensures the
instrumentation section and its configuration file are present, checks
for duplicate group and scope section names, verifies that groups are not
empty, resolves group-to-scope and instrumentation-to-group/scope
cross-references by linking placeholder entries to their definitions,
detects unused scopes, counts root spans and warns when the count differs
from one, and accumulates the required channel analyzer bits from all used
scopes into the instrumentation configuration.
The commit also added the flt_otel_counters structure to track per-event
diagnostic counters in debug builds, the FLT_OTEL_ALERT macro for
filter-scoped error messages, and the FLT_OTEL_DBG_LIST macro for
iterating and dumping named configuration lists.
Added the full configuration parser that reads the OTel filter's external
configuration file and the event model that maps filter events to HAProxy
channel analyzers.
The event model in event.h defines an X-macro table
(FLT_OTEL_EVENT_DEFINES) that maps each filter event to its HAProxy
channel analyzer bit, sample fetch direction, and event name. Events
cover stream lifecycle (start, stop, backend-set, idle-timeout), client
and server sessions, request analyzers (frontend and backend TCP and
HTTP inspection, switching rules, sticking rules, RDP cookie), response
analyzers (TCP inspection, HTTP response processing), and HTTP headers,
end, and reply callbacks. The event names are partially compatible with
the SPOE filter. The flt_otel_event_data[] table in event.c is generated
from the same X-macro and provides per-event metadata at runtime.
The parser in parser.c implements section parsers for the three OTel
configuration blocks: otel-instrumentation (tracer identity, log server,
config file path, groups, scopes, ACLs, rate-limit, options for
disabled/hard-errors/nolognorm, and debug-level), otel-group (group
identity and scope list), and otel-scope (scope identity, span definitions
with optional root/parent modifiers, attributes, events, baggages, status
codes, inject/extract context operations, finish lists, idle-timeout,
ACLs, and otel-event binding with optional if/unless ACL conditions).
Each section has a post-parse callback that validates the parsed state.
The top-level flt_otel_parse_cfg() temporarily registers these section
parsers, loads the external configuration file via parse_cfg(), and
handles deferred resolution of sample fetch arguments by saving them in
conf->smp_args for later resolution in flt_otel_check() when full frontend
and backend capabilities are available. The main flt_otel_parse() entry
point was extended to parse the filter ID and config file keywords, verify
that insecure-fork-wanted is enabled, and wire the parsed configuration
into the flt_conf structure.
The utility layer gained flt_otel_strtod() and flt_otel_strtoll() for
validated string-to-number conversion used by rate-limit and debug-level
parsing.
Added the configuration structures that model the OTel filter's
instrumentation hierarchy and the utility functions that support the
configuration parser.
The configuration is organized as a tree rooted at flt_otel_conf, which
holds the proxy reference, filter identity, and lists of groups and
scopes. Below it, flt_otel_conf_instr carries the instrumentation
settings: tracer handle, rate limiting, hard-error mode, logging state,
channel analyzers, and placeholder references to groups and scopes.
Groups (flt_otel_conf_group) aggregate scopes by name. Scopes
(flt_otel_conf_scope) bind an event to its ACL condition, span context
declarations, span definitions and a list of spans scheduled for
finishing. Spans (flt_otel_conf_span) carry attributes, events,
baggages and status entries, each represented as flt_otel_conf_sample
structures that pair a key with concatenated sample-expression arguments.
All configuration types share a common header macro (FLT_OTEL_CONF_HDR)
that embeds an identifier string, its length, a configuration line number,
and a list link. Their init and free functions are generated by the
FLT_OTEL_CONF_FUNC_INIT and FLT_OTEL_CONF_FUNC_FREE macros in
conf_funcs.h, with per-type custom initialization and cleanup bodies.
The utility layer in util.c provides argument counting and concatenation
for the configuration parser, sample data to string conversion covering
boolean, integer, IPv4, IPv6, string and HTTP method types, and debug
helpers for dumping argument arrays and linked list state.
The OpenTelemetry (OTel) filter enables distributed tracing of requests
across service boundaries, export of metrics such as request rates,
latencies and error counts, and structured logging tied to trace context,
giving operators a unified view of HAProxy traffic through any
OpenTelemetry-compatible backend.
The OTel filter is implemented using the standard HAProxy stream filter
API. Stream filters attach to proxies and intercept traffic at each stage
of processing: they receive callbacks on stream creation and destruction,
channel analyzer events, HTTP header and payload processing, and TCP data
forwarding. This allows the filter to collect telemetry data at every
stage of the request/response lifecycle without modifying the core proxy
logic.
This commit added the minimum set of files required for the filter to
compile: the addon Makefile with pkg-config-based detection of the
opentelemetry-c-wrapper library, header files with configuration
constants, utility macros and type definitions, and the source files
containing stub filter operation callbacks registered through
flt_otel_ops and the "opentelemetry" keyword parser entry point.
The filter uses the opentelemetry-c-wrapper library from HAProxy
Technologies, which provides a C interface to the OpenTelemetry C++ SDK.
This wrapper allows HAProxy, a C codebase, to leverage the full
OpenTelemetry observability pipeline without direct C++ dependencies
in the HAProxy source tree.
https://github.com/haproxytech/opentelemetry-c-wrapper
https://github.com/open-telemetry/opentelemetry-cpp
Build options:
USE_OTEL - enable the OpenTelemetry filter
OTEL_DEBUG - compile the filter in debug mode
OTEL_INC - force the include path to the C wrapper
OTEL_LIB - force the library path to the C wrapper
OTEL_RUNPATH - add the C wrapper RUNPATH to the executable
Example build with OTel and debug enabled:
make -j8 USE_OTEL=1 OTEL_DEBUG=1 TARGET=linux-glibc
conn_recv_qstrm() may be called several times per connection if the read
data is too short and a truncated record is received.
Previously, the record length was parsed every time the function was
invoked. However, this must only be performed if the record length
varint is incomplete. Once read and parsed, the length is removed from
the buffer via b_quic_dec_int(). Thus, the next conn_recv_qstrm() run
would reread the following data as a record length, this time getting an
invalid value.
This patch fixes this by only parsing the record length if the <rxrlen>
member is null. Prior to this, parsing of QMux transport parameters would fail in
case of a first truncated read, which would prevent the connection
initialization.
No need to backport.
A QCC connection may be flagged with QC_CF_ERRL to trigger a
CONNECTION_CLOSE emission. However, for now error reporting is not
functional with QMux, as it relies on quic_conn layer access.
To prevent a crash in qcc_io_send() when using QMux, add a
conn_is_quic() check when QC_CF_ERRL is set to ensure no access will be
performed on quic_conn layer. In the future, this should be extended so
that QMux is also able to emit CONNECTION_CLOSE for connection closure.
No need to backport.
First, we must not emit any warning if splicing is not configured and the
global maxpipes value is 0. Then we must not remove GTUNE_USE_SPLICE flag
when we fail to allocate the haterm master pipe. Instead, we test it when we
negotiate with the opposite side, to properly exclude splicing if it is
not usable.
No backport needed.
This reverts commit 8056117e98.
Moving haterm init from haproxy is not the right way to fix the issue
because it should be possible to use a haterm configuration in haproxy.
So let's revert the commit above.
Add QUIC BLOCKED frames in the list of supported types in
qstrm_parse_frm(). Nothing is really implemented for them, just as for
QUIC, but this prevents a crash when receiving one of them via QMux.
No need to backport.
QMux draft 01 support is mostly achieved thanks to the recent
implementation of the Record layer. This patch thus updates the link in
the documentation to the validated draft version.
This patch implements emission of the new Record layer for QMux frames.
This handles mux-quic and xprt_qstrm layers as this is performed
similarly in both cases.
Currently, the simplest approach has been preferred: each frame is
encoded in its own record. This is not the most efficient in size but it
is extremely simple to implement for first interop testing.
This patch implements the new QMux record layer parsing for xprt_qstrm.
This is mostly similar to the MUX code from the previous patch.
Along with this change, a new xprt_qstrm layer accessor exposes the
possible remaining record length after Transport parameters parsing.
This can only occur when xprt_qstrm Rx buffer is not completely emptied
due to other following frames. If stored in the same record, MUX layer
has to know the remaining record length.
Thus, xprt_qstrm_rxrlen() is now used in qmux_init() to preinitialize
<rx.rlen> QCC field.
This is the first patch of a series which aims to support the new Record
layer defined by the draft 01 of QMux protocol.
https://www.ietf.org/archive/id/draft-ietf-quic-qmux-01.html#name-qmux-records
This patch deals with QMux reception at the MUX layer. The function
qcc_qstrm_recv() is adapted to read record headers before frame parsing.
This requires keeping the last record length read in a new QCC field
named <rx.rlen>.
Frames are only parsed once a full record is received. One of the
advantages of the record layer is that it can only contain whole frames,
without truncation.
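A hedged sketch of the reception logic (field and helper names follow
the text, heavily simplified):
/* parse the record header only once per record */
if (!qcc->rx.rlen) {
        uint64_t rlen;
        size_t len;
        if (!b_quic_dec_int(&rlen, buf, &len))
                return 0; /* length varint incomplete: wait for more data */
        qcc->rx.rlen = rlen;
}
if (b_data(buf) < qcc->rx.rlen)
        return 0; /* truncated record: wait, frames are parsed whole */
/* a full record is available: parse every frame it contains */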
This patch implements proper connection error handling for the
xprt_qstrm layer. Basically, processing is interrupted if CO_FL_ERROR is
encountered after either rcv_buf or snd_buf operations. The connection
error is set to the newly defined value CO_ER_QSTRM.
This commit adds buffering on transmission for the xprt_qstrm layer.
This is necessary in the rare case where the send syscall only emits
partial data.
A new <txbuf> member is defined in xprt_qstrm context. On first send
invokation, buffer is allocated and then the QMux transport parameters
frame is encoded. Then emission is performed via snd_buf and each time
the send function is invoked.
The xprt_qstrm layer is responsible for reading the initial QMux
transport parameters frame. However, it could receive more data if some other
frames follow it. This extra content can only be handled by the MUX
layer once initialized.
Theoretically, it could have been implemented via MSG_PEEK. However,
this flag is currently ignored by the SSL layer. Besides, it is tedious
to implement safely. A new approach has been preferred, where the MUX
layer is responsible for retrieving the remaining data via the
xprt_qstrm_rxbuf() accessor function during its initialization.
Thus, qmux_init() now may retrieve the buffer from xprt_qstrm layer.
This is performed via b_xfer() which will result in a zero copy
transfer. If this happens, the tasklet is immediately scheduled to start
demuxing.
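A hedged sketch of the transfer (the accessor argument and field names
are approximate):
struct buffer *rxbuf = xprt_qstrm_rxbuf(conn);
if (rxbuf && b_data(rxbuf)) {
        /* dst is empty and the whole content is moved, so b_xfer()
         * swaps the buffer areas: a zero-copy transfer
         */
        b_xfer(&qcc->rx.buf, rxbuf, b_data(rxbuf));
        tasklet_wakeup(qcc->wait_event.tasklet);
}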
Implement buffering for reception on the xprt_qstrm layer. This is necessary
to handle reception of a truncated QMux transport parameters frame.
This is performed via a new dedicated <rxbuf> member in xprt_qstrm
context. Read is performed by reusing the buffer until a whole frame can
be read.
QUIC frame parser functions take a <pos> pointer as input argument for
the data to be parsed. If parsing is successful, <pos> must be
incremented to point to the next data.
The increment was not performed when parsing the QMux transport parameters
frame. This commit fixes this. Note that for now there is no real issue
as xprt_qstrm does not check the QMux frame length.
No need to backport.
In qcc_release(), <conn> may be NULL. Thus every access to it must be
tested.
With recent QMux introduction, a call to conn_is_quic() has been added
prior to registration of the stream rejection callback. It could lead to
NULL deref as <conn> is not tested there. Fix this by adding an extra
check on the pointer validity.
No need to backport.
hlua_applet_http_status() stored the result of luaL_optlstring()
directly in http_ctx->reason. The pointer references Lua-managed
string storage which is only guaranteed valid until the C function
returns to Lua. If the GC runs between applet:set_status(200, str)
and applet:start_response(), the pointer dangles.
hlua_applet_http_send_response() then calls ist(http_ctx->reason)
which does strlen() on freed memory, followed by memcpy into the
HTX status line. The freed-and-reallocated chunk contents are sent
verbatim to the HTTP client.
Trigger:
applet:set_status(200, table.concat({"Reason ", str:rep(50)}))
collectgarbage("collect"); collectgarbage("collect")
applet:start_response()
With heap grooming, adjacent allocation contents (session data, TLS
material from the same thread) leak into the response status line.
Anchor the Lua string in the registry keyed by the http_ctx field
address so it survives until the applet is done with it. The
registry entry is overwritten on each call (handles repeated
set_status) and naturally cleaned up when the lua_State is closed.
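A hedged sketch of the anchoring (the stack index of the reason string
is illustrative):
/* keep a reference to the Lua string so the GC cannot reclaim it while
 * http_ctx->reason still points into it; keyed by the field's address,
 * each call overwrites the previous reference
 */
lua_pushvalue(L, 3);
lua_rawsetp(L, LUA_REGISTRYINDEX, &http_ctx->reason);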
This patch should be backported to all stable versions.
FCGI content_length is a 16-bit field but fcgi_set_record_size()
is called with size_t/uint32_t arguments. With tune.bufsize >= 65544
(legal; cfgparse-global.c only enforces <= INT_MAX-16), a single
HTX DATA block or accumulated outbuf can exceed 65535 bytes. The
implicit conversion to uint16_t silently truncates the length field
while b_add(mbuf, outbuf.data) writes the full body.
A client posting ~99000 bytes can craft the body so that bytes
after the truncated length are parsed by PHP-FPM as fresh FCGI
records on the connection: a smuggled BEGIN_REQUEST + PARAMS with
arbitrary SCRIPT_FILENAME / PHP_VALUE bypasses all haproxy ACLs.
Fix the zero-copy path by refusing it when the block exceeds 65535
bytes (falls through to copy). Fix the copy path by capping
outbuf.size to 65535 + header so the data-fill loop naturally
stops at the FCGI maximum and emits the rest in a subsequent record.
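For illustration, the shape of both fixes could look like this (names
approximate; FCGI_RECORD_HEADER_SZ is assumed to be the record header
size):

  /* 1) zero-copy path: a single FCGI record cannot carry more than
   * 65535 bytes of content, so larger blocks fall through to copy */
  if (bsize > 65535)
      zero_copy = 0;

  /* 2) copy path: cap outbuf so the data-fill loop stops at the record
   * maximum; the remainder is emitted in a subsequent record */
  if (outbuf.size > 65535 + FCGI_RECORD_HEADER_SZ)
      outbuf.size = 65535 + FCGI_RECORD_HEADER_SZ;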
The PARAMS path at line 2084 is similarly affected but harder to
trigger (requires combined header+param size > 65535) and is
covered by the same outbuf.size cap pattern if applied there.
This patch must be backported to all stable versions.
exp_replace() returns int and returns -1 when the back-reference
expansion overflows the output buffer (regex.c:51). output->data is
size_t, so -1 becomes SIZE_MAX. There was no error check.
The subsequent comparisons interpret SIZE_MAX as a huge length:
"output->data > b_room(trash)" tries to grow trash, then
"max > output->data" is false so max stays at trash->size, and
memcpy(trash, output->area, trash->size) copies the full chunk.
output->area is a pool_alloc()'d chunk that is NOT zeroed; the
bytes after the partial exp_replace output are stale data from a
prior pool user (request headers, response bodies from the same
worker thread).
Trigger with a backreference whose expansion exceeds bufsize:
http-request set-header X %[req.hdr(In),regsub('(.+)','\1\1')]
and a request with In: of ~9000 bytes. The X header sent to the
backend then contains ~9KB of stale heap data.
With tune.bufsize.large set, get_larger_trash_chunk() upgrades trash
and the memcpy reads up to ~50KB past the (smaller) output->area
allocation.
http_ana.c:2728 and http_act.c:551 already check exp_replace() for
-1; this call site was missed when backreferences were added.
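For illustration, a minimal sketch of the missing check, mirroring the
other call sites (argument names assumed):

  int len = exp_replace(output->area, output->size, str, replace, matches);
  if (len == -1)
      goto leave;      /* expansion overflowed the output buffer */
  output->data = len;  /* only assign a validated, non-negative length */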
This patch must be backported to all stable versions.
Samples of type SMP_T_METH were not properly handled in smp_dup(),
smp_is_safe() and smp_is_rw(). For "other" methods, for instance PATCH,
a fallback is performed on the SMP_T_STR handling; only the buffer to
consider changes. "smp->data.u.meth.str" should be used for SMP_T_METH
samples while smp->data.u.str should be used for SMP_T_STR samples.
However, in smp_dup(), the result was stored in the wrong buffer, the
string one instead of the method one. In smp_is_safe() and smp_is_rw(),
the method buffer was not used at all.
We now take care to use the right buffer.
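For illustration, the shape of the smp_dup() fix could be (surrounding
code assumed from the description):

  case SMP_T_METH:
      if (smp->data.u.meth.meth != HTTP_METH_OTHER)
          break;
      trash = get_trash_chunk();
      trash->data = MIN(trash->size, smp->data.u.meth.str.data);
      memcpy(trash->area, smp->data.u.meth.str.area, trash->data);
      smp->data.u.meth.str = *trash;  /* was wrongly: smp->data.u.str */
      break;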
This patch must be backported to all stable versions.
When "Expect" header was found in request headers, "HTTP/1.1 100-continue"
was returned instead of "HTTP/1.1 100 continue". Let's fix it.
No backport needed.
http_action_set_headers_bin() decodes varint name and value lengths
from a binary sample but never validates that the decoded length
fits in the remaining sample data before constructing the ist.
If the value's varint decodes to a large number with only a few
bytes following, v.len exceeds the buffer and http_add_header()
memcpys past the sample, copying adjacent heap data into a header
sent to the backend (or client, with http-response).
The intended source for this action is the hdrs_bin sample fetch
which produces well-formed output, but nothing prevents an admin
from feeding it req.body or another untrusted source. With:
http-request set-var(txn.h) req.body
http-request add-headers-bin var(txn.h)
a POST body of [05]"X-Foo"[c8]"AB" produces v = {ptr="AB", len=200}
and 198 bytes of adjacent heap data go into X-Foo.
http_action_del_headers_bin() was fixed too.
Compare spoe_decode_buffer() which has the equivalent check.
Validate both name and value lengths against remaining data.
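For illustration, a minimal sketch of the validation, assuming
decode_varint() advances <p> and returns -1 on error:

  uint64_t nlen, vlen;
  if (decode_varint(&p, end, &nlen) == -1 || nlen > (uint64_t)(end - p))
      goto error;  /* name length exceeds remaining data */
  n = ist2(p, nlen); p += nlen;
  if (decode_varint(&p, end, &vlen) == -1 || vlen > (uint64_t)(end - p))
      goto error;  /* value length exceeds remaining data */
  v = ist2(p, vlen); p += vlen;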
No backport needed.
decode_varint() has no iteration cap and accepts varints decoding to
any uint64_t value. When sz is large enough that p + sz wraps modulo
2^64, the check "p + sz > end" passes, *buf is set to the wrapped
pointer, and the caller's parsing loop continues from an arbitrary
relative offset before the demux buffer.
A malicious SPOE agent sending an AGENT_HELLO frame with a key-name
length varint of 0xfffffffffffff000 causes spop_conn_handle_hello()
to dereference memory ~64KB before the dbuf allocation, resulting in
SIGSEGV (DoS) or, if the read lands on live heap data, parser
confusion. The relative offset is fully attacker-controlled and
ASLR-independent.
Compare against the remaining length instead of computing p + sz.
Since p <= end is guaranteed after a successful decode_varint(),
end - p is non-negative.
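In other words (sketch, local names assumed):

  /* "p + sz > end" can wrap: with sz = 0xfffffffffffff000 the sum
   * overflows uint64_t and compares below <end>. A length comparison
   * cannot wrap since p <= end holds after a successful decode: */
  if (sz > (uint64_t)(end - p))
      return -1;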
This patch must be backported to all stable versions.
Commit c84c15d393 ("BUG/MINOR: resolvers: Apply dns-accept-family
setting on additional records") converted a switch statement to an
if/else chain but left the break; in the AAAA branch. In the new
form, break exits the surrounding for loop instead of a switch case.
For every AAAA additional record in an SRV response:
- answer_record allocated at line 1460 is never freed and never
inserted into answer_tree -> ~580 bytes leaked per response
- all subsequent additional records in the response are silently
discarded
A DNS server controlling SRV responses for haproxy service discovery
can leak memory at MB/min rates given default resolution intervals.
Also breaks IPv6 SRV target resolution outright since the AAAA record
is leaked rather than attached to its SRV entry.
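For illustration, the hazard in reduced form (surrounding code
hypothetical):

  for (i = 0; i < nb_additional_records; i++) {
      if (rtype == DNS_RTYPE_A) {
          /* attach the record to its SRV entry */
      }
      else if (rtype == DNS_RTYPE_AAAA) {
          /* attach the record to its SRV entry */
          break;  /* leftover case break: now aborts the whole loop */
      }
  }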
Ensure that all Lua regression tests exercise the restricted library
mode by setting "tune.lua.openlibs none" in their global section.
Only txn_get_priv-thread.vtc requires "string,table".
HAProxy has always called luaL_openlibs() unconditionally, which opens
all standard Lua libraries including io, os, package and debug. This
makes it impossible to prevent Lua scripts from executing binaries
(os.execute, io.popen), loading native C modules (package/require), or
bypassing any Lua-level sandbox via the debug library.
Add a new global directive tune.lua.openlibs that accepts a comma-separated
list of library names to load:
tune.lua.openlibs none # only base + coroutine
tune.lua.openlibs string,math,table,utf8 # safe libs only
tune.lua.openlibs all # default, same as before
The base and coroutine libraries are always loaded regardless: base provides
core Lua functions that HAProxy relies on, and coroutine is required because
HAProxy overrides coroutine.create() with its own safe implementation.
When all libraries are enabled (the default), the fast path still calls
luaL_openlibs() directly with no overhead. A parse error is returned if
the directive appears after lua-load or lua-load-per-thread (the Lua state
is already initialised at that point), or if 'none' is combined with other
library names. Note that fork() and new thread creation are already blocked
by default regardless of this setting (see "insecure-fork-wanted").
Literals are sent in two ways:
- in EOB state, unencoded and prefixed with their length
- in FIXED state, huffman-encoded
And references are only sent in FIXED state.
The API promises that the amount of data will not grow by more than
5 bytes every 65535 input bytes (the comment was adjusted to remind
this last point). This is guaranteed by the literal encoding in EOB
state (BT, LEN, NLEN + bytes), which is supposed to be the worst
case by design.
However, as reported by Greg KH, this is currently not true: the test
that decides whether or not to switch to FIXED state to send references
doesn't properly account for the number of bytes needed to roll back
to the *exact* same state in EOB, which means sending EOB, BT,
alignment, LEN and NLEN in addition to the referenced bytes, versus
sending the encoding for the reference. By not taking into account the
cost of returning to the initial state (BT+LEN+NLEN), it was possible
to stay too long in the FIXED state and to consume the extra bytes that
are needed to return to the EOB state, resulting in producing much more
data in case of multiple switchovers (up to 6.25% increase was measured
in tests, or 1/16, which matches worst case estimates based on the code).
And this check is only valid when starting from EOB (in order to restore
the same state that offers this guarantee). When already in FIXED state,
the encoded reference is always smaller than or same size as the data.
The smallest match length we support is 4 bytes, and when encoded this
is no more than 28 bits, so it is safe to stay in FIXED state as long
as needed while checking the possibility of switching back to EOB.
This very slightly reduces the compression ratio (-0.17% on a linux
kernel source) but makes sure we respect the API promise of no more
than 5 extra bytes per 65535 of input. A side effect of the slightly
simpler check is an ~7.5% performance increase in compression speed.
Many thanks to Greg for the detailed report that made it possible to
reproduce the issue.
This is libslz upstream commit 002e838935bf298d967f670036efa95822b6c84e.
Note: in haproxy's default configuration (tune.bufsize 16384,
tune.maxrewrite 1024), this problem cannot be triggered, because the
reserve limits input to 15360 bytes, and the overflow is maximum
960 bytes resulting in 16320 bytes total, which still fits into the
buffer. However, reducing tune.maxrewrite below 964, or tune.bufsize
above 17408 can result in overflows for specially crafted patterns.
A workaround for larger buffers consists in always setting
tune.maxrewrite to at least 1/16 of tune.bufsize.
Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://www.mail-archive.com/haproxy@formilux.org/msg46837.html
Storing the protocol directly into the check was not a good idea,
because the protocol may not be determined until after a DNS resolution
on the server, and may even change at runtime, if the DNS changes.
What we can, however, figure out at startup, is the net_addr_type,
which will contain all that we need to find out which protocol to use
later.
Also revert the changes made by commit 07edaed191
that would not reuse the server xprt if a different alpn is set for
checks. The alpn is just a string, and should not influence the choice
of the xprt.
We'll now make sure to use the server xprt, unless an address is
provided, in which case we'll use whatever xprt matches that address, or
a port, in which case we'll assume we want TCP, and use check_ssl to
know whether we want the SSL xprt or not.
Now that the check contains all that is needed to know which protocol to
look up, always just use that when creating a new check connection if it
is the default check connection, and for now, always use TCP when a
tcp-check or http-check connect rule is used (which means those can't be
used for QUIC so far).
This should hopefully fix github issue #3324.
Commit 1b0dfff552 attempted to make it so the mux would expect a
QUIC-like protocol or not. However, it only ensured that we would not
instantiate a non-QUIC mux on a QUIC protocol, not that we would not
instantiate a QUIC mux on a non-QUIC protocol, so fix that.
The vtest binary does not seem to be cached correctly by actions/cache;
the cause of the problem seems to be that the binary is installed
outside the GitHub workspace. This patch installs the binary in ~/vtest/
to fix the issue.
This reverts commit a03120e228.
A WIP version of the patch was applied before the actual patch by
accident. The correct patch is 2db801c ("BUG/MINOR: hlua: fix stack
overflow in httpclient headers conversion")
github complains about cache@v4:
Node.js 20 actions are deprecated. The following actions are running on
Node.js 20 and may not work as expected: actions/cache@v4. Actions will
be forced to run with Node.js 24 by default starting June 2nd, 2026.
Node.js 20 will be removed from the runner on September 16th, 2026.
Please check if updated versions of these actions are available that
support Node.js 24. To opt into Node.js 24 now, set the
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true environment variable on the
runner or in your workflow file. Once Node.js 24 becomes the default,
you can temporarily opt out by setting
ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION=true. For more information see:
https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
When the app_ops were removed, direct calls to the SC wake callback
function were replaced by tasklet wakeups. However, in
conn_create_mux(), it was replaced by a direct call to
sc_conn_process(), which is only usable when the SC is attached to a
stream. A backend mux can be created for a healthcheck, and in this
context sc_conn_process() cannot be called.
Because of this bug, crashes can be experienced when an error is triggered
during a SSL connection attempt from a healthcheck.
To fix the issue, the call to sc_conn_process() was replaced by a tasklet
wakeup.
This patch should fix the issue #3326. No backport needed.
The VTest2 tarball URL at code.vinyl-cache.org/vtest/VTest2/archive/main.tar.gz
no longer works. Switch scripts/build-vtest.sh to use a git clone of the
repository instead.
Add a cache step in the setup-vtest CI action so VTest is only rebuilt
when its HEAD commit changes, keyed on the runner OS and the VTest2 HEAD
SHA.
When a peer sends a dictionary entry update with a value (the else
branch at line 2109), the entry id decoded from the wire was never
validated against dc->max_entries before being used as an array index
into dc->rx[].
A malicious peer can send id=N where N > 128 (PEER_STKT_CACHE_MAX_ENTRIES)
to:
- dc->rx[id-1].de at line 2123: OOB read followed by atomic decrement
and potential free of an attacker-controlled pointer via
dict_entry_unref()
- dc->rx[id-1].de = de at line 2124: OOB write of a heap pointer at
an attacker-controlled offset (16-byte stride, ~64 GiB range)
The bounds check was added to the key-only branch in commit f9e51beec
("BUG/MINOR: peers: Do not ignore a protocol error for dictionary
entries.") but was never added to the with-value branch. The bug has
been present since dictionary support was introduced in commit
8d78fa7def ("MINOR: peers: Make peers protocol support new
"server_name" data type.").
Reachable from any TCP client that knows the configured peer name
(no cryptographic authentication on the peers protocol). Requires a
stick-table with "store server_key" in the configuration.
Fix by hoisting the bounds check above the branch so it covers both
paths.
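For illustration, a minimal sketch of the hoisted check (field names per
the description):

  /* reject out-of-range ids before any dc->rx[] access, covering both
   * the key-only and the with-value branches */
  if (!id || id > dc->max_entries)
      goto malformed;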
Must be backported as far as 2.6.
When the input chunk is already the large buffer (chk->size ==
large_trash_size), the <= comparison still matched and returned
another large buffer of the same size. Callers that retry on a
non-NULL return value (sample.c:4567 in json_query) loop forever.
The json_query infinite loop is trivially triggered: mjson_unescape()
returns -1 not only when the output buffer is too small but also for
any \uXXYY escape where XX != "00" (mjson.c:305) and for invalid
escapes like \q. The retry loop assumes -1 always means "grow the
buffer", so a 14-byte JSON body of {"k":"\u0100"} hangs the worker
thread permanently. Send N such requests to exhaust all worker
threads.
Use < instead of <= so a chunk that is already large yields NULL.
This also fixes the json converter overflow at sample.c:2869 where
no recheck happens after the "growth" returned a same-size buffer.
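The one-character fix, with the surrounding function shape assumed:

  - if (chk->size <= large_trash_size)
  + if (chk->size < large_trash_size)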
Introduced in commit ce912271db ("MEDIUM: chunk: Add support for
large chunks"). No backport needed.
A copy-paste error in alloc_trash_buffers_per_thread() passes
global.tune.bufsize_large to alloc_small_trash_buffers() instead of
global.tune.bufsize_small. This sets small_trash_size = bufsize_large.
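The one-line fix, with names taken from the description (exact signature
assumed):

  - alloc_small_trash_buffers(global.tune.bufsize_large);
  + alloc_small_trash_buffers(global.tune.bufsize_small);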
When tune.bufsize.large is configured, get_larger_trash_chunk() then
incorrectly matches a large buffer against small_trash_size at line
169 and "grows" it to a regular (smaller) buffer. b_xfer() at line
179 attempts to copy the large buffer's contents into the smaller one:
- Default builds (DEBUG_STRICT=1): BUG_ON in __b_putblk() aborts
the process -> remote DoS
- DEBUG_STRICT=0 builds: BUG_ON becomes ASSUME() and the compiler
elides the check -> heap overflow with attacker-controlled bytes
Reachable via the json converter (sample.c:2862) when escaping
~bufsize_large/6 control characters in attacker-supplied data such
as a request header or body.
Introduced in commit 92a24a4e87 ("MEDIUM: chunk: Add support for
small chunks"). No backport needed.
hlua_error() is a printf-family function (calls vsnprintf), but
hlua_patref_set, hlua_patref_add, and _hlua_patref_add_bulk pass
errmsg directly as the format string. errmsg is built by pattern.c
helpers that embed the user-supplied key or value verbatim, e.g.
pat_ref_set_elt() generates "unable to parse '<value>'".
A Lua script calling:
ref:set("key", "%p.%p.%p.%p.%p.%p.%p.%p")
against a map with an integer output type (where the parse fails)
gets stack/register contents formatted into the (nil, err) return
value -> ASLR/canary leak. With %n and no _FORTIFY_SOURCE this
becomes an arbitrary write primitive.
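The classic fix is to pass a fixed format string (sketch):

  - return hlua_error(L, errmsg);
  + return hlua_error(L, "%s", errmsg);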
This must be backported as far as the Patref Lua API exists.
hlua_httpclient_table_to_hdrs() declares a VLA of size
global.tune.max_http_hdr (default 101) on the stack but never checks
hdr_num against that bound. A Lua script that supplies a header table
with more than 101 values writes struct http_hdr entries (two ist =
two heap pointers + two lengths) past the end of the VLA, smashing
the stack frame.
Trigger from any Lua action/task/service:
local hc = core.httpclient()
local v = {}
for i = 1, 300 do v[i] = "x" end
hc:get{ url = "http://127.0.0.1/", headers = { ["X"] = v } }
Each out-of-bounds entry writes a heap pointer (controllable
allocation contents via istdup) plus an attacker-chosen length onto
the stack, overwriting the saved return address.
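For illustration, a minimal sketch of the missing bound check (exact
bound and label assumed):

  if (hdr_num >= global.tune.max_http_hdr)
      goto error;  /* would write past the end of the VLA */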
[wla: this is only reachable if the Lua script passes more than
max_http_hdr header values, which requires access to the script itself]
This must be backported as far as the httpclient Lua API exists.
Signed-off-by: William Lallemand <wlallemand@haproxy.com>
When the secret argument to jwt_decrypt_secret is a variable
(ARGT_VAR) rather than a literal string, alloc_trash_chunk() is
called to hold the base64-decoded secret but the buffer is never
released. The end: label frees input, decrypted_cek, out, and the
decoded_items array but not secret.
Each request leaks one trash chunk (~tune.bufsize, default 16KB).
At ~65000 requests per GiB this allows slow memory exhaustion DoS
against any config of the form:
http-request set-var(txn.x) req.hdr(...),jwt_decrypt_secret(txn.key)
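For illustration, a minimal sketch of the fix (label and guard assumed):

  end:
      if (secret)
          free_trash_chunk(secret);  /* was leaked for ARGT_VAR secrets */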
This must be backported as far as JWE support exists.
convert_ecdsa_sig() calls i2d_ECDSA_SIG(ecdsa_sig, &p) where p
points into signature->area, a trash chunk of tune.bufsize bytes
(default 16384). i2d writes with no output bound.
The raw R||S input can be up to bufsize bytes (filled by
base64urldec at jwt.c:520-527), giving bignum_len up to 8192. The
DER encoding adds a SEQUENCE header (2-4 bytes), two INTEGER headers
(2-4 bytes each), and up to two leading-zero sign-padding bytes when
the bignum high bit is set. With two 8192-byte bignums having the
high bit set, the encoding is ~16398 bytes, overflowing the 16384-
byte buffer by ~14 bytes.
Triggered by any JWT with alg=ES256/384/512 and a ~21830-character
base64url signature. The signature does not need to verify
successfully; the overflow happens before verification. Reachable
from any config using jwt_verify with an EC algorithm.
Also fixes the existing wrong check: i2d returns -1 on error which
became SIZE_MAX in the size_t signature->data, defeating the
"== 0" test.
This must be backported as far as JWT support exists.
In sample_conv_jwt_decrypt_secret(), when a JWE token has an empty
encrypted-key section but the algorithm is not "dir" (e.g. A128KW),
neither branch initializes decrypted_cek. The NULL pointer is then
passed to decrypt_ciphertext() which dereferences it:
- For GCM encodings: aes_process() calls b_orig(NULL) -> SIGSEGV
- For CBC encodings: b_data(NULL) at jwe.c:463 -> SIGSEGV
A single HTTP request with a crafted Authorization header crashes the
worker process. Trigger token (JOSE header {"alg":"A128KW","enc":"A128GCM"},
empty CEK section between the two dots):
eyJhbGciOiJBMTI4S1ciLCJlbmMiOiJBMTI4R0NNIn0..AAAAAAAAAAAAAAAA.AA.AA
Reachable in any configuration using the jwt_decrypt_secret converter.
The other two decrypt converters (jwt_decrypt_jwk, jwt_decrypt_cert)
already have the check.
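For illustration, the missing guard (sketch):

  if (!decrypted_cek)
      goto error;  /* empty CEK section with a non-"dir" alg */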
This must be backported as far as JWE support exists.
The 16-bit name_len field is read directly from the ClientHello and
stored as the sample length without any validation against srv_len,
ext_len, or the channel buffer size. A 65-byte ClientHello with
name_len=0xffff produces a sample claiming 65535 bytes of data when
only ~4 bytes are actually present in the buffer.
Downstream consumers then read tens of kilobytes past the channel
buffer:
- pattern.c:741 XXH3() hashes 65535 bytes -> ~50KB OOB heap read
- sample.c smp_dup memcpy if large trash configured
- log-format %[req.ssl_sni] leaks heap contents to logs/headers
Reachable pre-authentication on any TCP frontend using req.ssl_sni
(req_ssl_sni), which is the documented way to do SNI-based content
switching in TCP mode. No SSL handshake is required; the parser
runs on raw buffer contents in tcp-request content rules.
Bug introduced in commit d4c33c8889 (2013). The ALPN parser in
the same file at line 1044 has the equivalent check; SNI never did.
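For illustration, a minimal sketch of the kind of check needed (pointer
names hypothetical):

  /* never trust the announced length: it must fit in the bytes really
   * present between the name and the end of the parsed data */
  if (name_len > (size_t)(end - name_ptr))
      goto not_ssl;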
This must be backported to all supported versions.
When the healthcheck section support was added, the tcpcheck type was moved
into the tcpcheck ruleset. However, the conn_install_mux_chk() function
was not updated accordingly. So the TCP mode was always returned.
No backport needed. This patch is related to #3324 but it is not the root
cause of the issue.
As reported by GH @phihos on GH #3320, using the shm-stats-file feature
with objects exceeding 127 chars would result in the object name being
unexpectedly truncated, while GUID API supports up to 128 chars.
Indeed, with the config below, and shm-stats-file enabled:
server s1 127.0.0.1:1 guid srv:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:SRV_1 disabled
server s10 127.0.0.1:1 guid srv:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:SRV_10 disabled
haproxy would store the second server object with the same id as the first
one, but upon reload, only the first one would be restored, which would
eventually cause shm-stats-file slot exhaustion with repetitive reloads.
@phihos found out the underlying issue: in counters.c we used snprintf()
with sizeof(shm_obj->guid) - 1 as the <size> parameter, while we should
have used sizeof(shm_obj->guid) instead, since shm_obj->guid is already
sized to account for the terminating NUL byte.
So we simply apply the fix suggested by @phihos, and hopefully this should
solve the shm-stats-file slot leak that was observed.
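The suggested one-line fix (format arguments assumed):

  - snprintf(shm_obj->guid, sizeof(shm_obj->guid) - 1, "%s", guid);
  + snprintf(shm_obj->guid, sizeof(shm_obj->guid), "%s", guid);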
Unfortunately, for now, we cannot warn the user that a duplicate
shm-stats-file object was found, because we accept duplicate objects
by design for 2 reasons. The first one is for a new process to be able
to change the object type for a previously known GUID while allowing
previous processes to use the old object as long as they are alive.
The second reason is that upon startup we cannot afford to scan the
whole object list, as soon as we find a match (type + GUID), we bind
the object, and this way we avoid unnecessary lookup time.
Perhaps we have room for improvement in the future, but for now let's
keep it this way.
It should be backported to 3.3.
Big thanks to @phihos for the bug description, analysis and
suggestions.
The parsing of the "healthcheck" parameter for dynamic servers was not
finished. The post-config was missing, leading to a crash because the
ruleset pointer was NULL.
To fix the issue, check_server_tcpcheck() function is called in
cli_parse_add_server().
No backport needed.
There were 2 typos here. First, the 'k' was missing in the parameter
name. Then "sectino" was used in the description instead of "section".
Let's fix them.
When an early response is sent to the client and the H1 connection is
switched to the draining state, we must take care to disable the 0-copy data
forwarding because the backend side is no longer there. It is an issue
because it prevents any regular receive from being performed.
This patch should fix the issue #3316. It must be backported as far as 3.0.
Functions used to initialize haterm (the splicing and the response
buffers) were defined and registered in haterm.c. The problem is that
this file is compiled with haproxy, so this may be an issue. And for the
splicing part, warnings may be emitted when haproxy is started.
To avoid any issue during haproxy startup and to avoid initializing some
parts of haterm, all init functions were moved into the haterm_init.c
file.
No backport needed.
We add a reg-test, filter_sequence.vtc, with associated lua file
dummy_filters.lua to cover the "filter-sequence" directive and
ensure it is working as expected, both for the request and response
paths.
This regtest will only be effective starting with 3.4-dev0
This is another pre-requisite work for upcoming decompression filter.
In this patch we implement the "filter-sequence" directive which can be
used in a proxy section (frontend, backend, listen) and takes 2
parameters. The first one is the direction (request or response), the
second one is a comma-separated list of filter names previously declared
on the proxy using the "filter" keyword.
The main goal of this directive is to be able to instruct haproxy in which
order the filters should be executed on request and response paths,
especially if the ordering between request and response handling must
differ, and without relying on the filter declaration ordering (within
the proxy) which is used by default by haproxy.
Another benefit of this feature is that it becomes possible to "ignore"
a previously declared filter on the proxy. Indeed, when filter-sequence
is defined for a given direction (request/response), then it will be used
over the implicit filter ordering, but if a filter which was previously
declared is not specified in the related filter-sequence, it will not be
executed on purpose. This can be used as a way to temporarily disable a
filter without completely removing its configuration.
Documentation was updated (check examples for more info)
The flt_conf struct stores the filter id, which is used internally to
match the filter against a static pointer identifier, and also used as
descriptive text to describe the filter. But the id is not consistent
with the public name as used in the configuration (for instance when
selecting a filter through the 'filter' directive).
What we do in this patch is that we add flt_conf->name member, which
stores the real filter name as seen in the configuration. This will
allow to select filters by their name from other directives in the
configuration.
log-steps takes <steps> as parameter. <steps> is made of individual
log origins separated by commas, as shown in the examples, but the
directive's description says it should be separated by spaces, which
is wrong.
Let's fix that.
It should be backported up to 3.2.
Released version 3.4-dev8 with the following main changes :
- MINOR: log: split do_log() in do_log() + do_log_ctx()
- MINOR: log: provide a way to override logger->profile from process_send_log_ctx
- MINOR: log: support optional 'profile <log_profile_name>' argument to do-log action
- BUG/MINOR: sock: adjust accept() error messages for ENFILE and ENOMEM
- BUG/MINOR: qpack: fix 62-bit overflow and 1-byte OOB reads in decoding
- MEDIUM: sched: do not run a same task multiple times in series
- MINOR: sched: do not requeue a tasklet into the current queue
- MINOR: sched: do not punish self-waking tasklets anymore
- MEDIUM: sched: do not punish self-waking tasklets if TASK_WOKEN_ANY
- MEDIUM: sched: change scheduler budgets to lower TL_BULK
- MINOR: mux-h2: assign a limited frames processing budget
- BUILD: sched: fix leftover of debugging test in single-run changes
- BUG/MEDIUM: acme: fix multiple resource leaks in acme_x509_req()
- MINOR: http_htx: use enum for arbitrary values in conf_errors
- MINOR: http_htx: rename fields in struct conf_errors
- MINOR: http_htx: split check/init of http_errors
- MINOR/OPTIM: http_htx: lookup once http_errors section on check/init
- MEDIUM: proxy: remove http-errors limitation for dynamic backends
- BUG/MINOR: acme: leak of ext_san upon insertion error
- BUG/MINOR: acme: wrong error when checking for duplicate section
- BUG/MINOR: acme/cli: wrong argument check in 'acme renew'
- BUG/MINOR: http_htx: fix null deref in http-errors config check
- MINOR: buffers: Move small buffers management from quic to dynbuf part
- MINOR: dynbuf: Add helper functions to alloc large and small buffers
- MINOR: quic: Use b_alloc_small() to allocate a small buffer
- MINOR: config: Relax tests on the configured size of small buffers
- MINOR: config: Report the warning when invalid large buffer size is set
- MEDIUM: htx: Add htx_xfer function to replace htx_xfer_blks
- MINOR: htx: Add helper functions to xfer a message to smaller or larger one
- MINOR: http-ana: Use HTX API to move to a large buffer
- MEDIUM: chunk: Add support for small chunks
- MEDIUM: stream: Try to use a small buffer for HTTP request on queuing
- MEDIUM: stream: Try to use small buffer when TCP stream is queued
- MEDIUM: stconn: Use a small buffer if possible for L7 retries
- MEDIUM: tree-wide: Rely on htx_xfer() instead of htx_xfer_blks()
- Revert "BUG/MEDIUM: mux-h2: make sure to always report pending errors to the stream"
- MEDIUM: mux-h2: Stop dealing with HTX flags transfer in h2_rcv_buf()
- MEDIUM: tcpcheck: Use small buffer if possible for healthchecks
- MINOR: proxy: Review options flags used to configure healthchecks
- DOC: config: Fix alphabetical ordering of proxy options
- DOC: config: Fix alphabetical ordering of external-check directives
- MINOR: proxy: Add use-small-buffers option to set where to use small buffers
- DOC: config: Add missing 'status-code' param for 'http-check expect' directive
- DOC: config: Reorder params for 'tcp-check expect' directive
- BUG/MINOR: acme: NULL check on my_strndup()
- BUG/MINOR: acme: free() DER buffer on a2base64url error path
- BUG/MINOR: acme: replace atol with len-bounded __strl2uic() for retry-after
- BUG/MINOR: acme/cli: fix argument check and error in 'acme challenge_ready'
- BUILD: tools: potential null pointer dereference in dl_collect_libs_cb
- BUG/MINOR: ech: permission checks on the CLI
- BUG/MINOR: acme: permission checks on the CLI
- BUG/MEDIUM: check: Don't reuse the server xprt if we should not
- MINOR: checks: Store the protocol to be used in struct check
- MINOR: protocols: Add a new proto_is_quic() function
- MEDIUM: connections: Enforce mux protocol requirements
- MEDIUM: server: remove a useless memset() in srv_update_check_addr_port.
- BUG/MINOR: config: Warn only if warnif_cond_conflicts report a conflict
- BUG/MINOR: config: Properly test warnif_misplaced_* return values
- BUG/MINOR: http-ana: Only consider client abort for abortonclose
- BUG/MEDIUM: acme: skip doing challenge if it is already valid
- MINOR: connections: Enhance tune.idle-pool.shared
- BUG/MINOR: acme: fix task allocation leaked upon error
- BUG/MEDIUM: htx: Fix htx_xfer() to consume more data than expected
- CI: github: fix tag listing by implementing proper API pagination
- CLEANUP: fix typos and spelling in comments and documentation
- BUG/MINOR: quic: close conn on packet reception with incompatible frame
- CLEANUP: stconn: Remove usless sc_new_from_haterm() declaration
- BUG/MINOR: stconn: Always declare the SC created from healthchecks as a back SC
- MINOR: stconn: flag the stream endpoint descriptor when the app has started
- MINOR: mux-h2: report glitches on early RST_STREAM
- BUG/MINOR: net_helper: fix length controls on ip.fp tcp options parsing
- BUILD: net_helper: fix unterminated comment that broke the build
- MINOR: resolvers: basic TXT record implementation
- MINOR: acme: store the TXT record in auth->token
- MEDIUM: acme: add dns-01 DNS propagation pre-check
- MEDIUM: acme: new 'challenge-ready' option
- DOC: configuration: document challenge-ready and dns-delay options for ACME
- SCRIPTS: git-show-backports: list new commits and how to review them with -L
- BUG/MEDIUM: ssl/cli: tls-keys commands warn when accessed without admin level
- BUG/MEDIUM: ssl/ocsp: ocsp commands warn when accessed without admin level
- BUG/MEDIUM: map/cli: map/acl commands warn when accessed without admin level
- BUG/MEDIUM: ssl/cli: tls-keys commands are missing permission checks
- BUG/MEDIUM: ssl/ocsp: ocsp commands are missing permission checks
- BUG/MEDIUM: map/cli: CLI commands lack admin permission checks
- DOC: configuration: mention QUIC server support
- MEDIUM: Add set-headers-bin, add-headers-bin and del-headers-bin actions
- BUG/MEDIUM: mux-h1: Don't set MSG_MORE on bodyless responses forwarded to client
- BUG/MINOR: http_act: Properly handle decoding errors in *-headers-bin actions
- MEDIUM: stats: Hide the version by default and add stats-showversion
- MINOR: backends: Don't update last_sess if it did not change
- MINOR: servers: Don't update last_sess if it did not change
- MINOR: ssl/log: add keylog format variables and env vars
- DOC: configuration: update tune.ssl.keylog URL to IETF draft
- BUG/MINOR: http_act: Make set/add-headers-bin compatible with ACL conditions
- MINOR: action: Add a sample expression field in arguments used by HTTP actions
- MEDIUM: http_act: Rework *-headers-bin actions
- BUG/MINOR: tcpcheck: Remove unexpected flag on tcpcheck rules for httchck option
- MEDIUM: tcpcheck: Refactor how tcp-check rulesets are stored
- MINOR: tcpcheck: Deal with disable-on-404 and send-state in the tcp-check itself
- BUG/MINOR: tcpcheck: Don't enable http_needed when parsing HTTP samples
- MINOR: tcpcheck: Use tcpcheck flags to know a healthcheck uses SSL connections
- BUG/MINOR: tcpcheck: Use tcpcheck context for expressions parsing
- CLEANUP: tcpcheck: Don't needlessly expose proxy_parse_tcpcheck()
- MINOR: tcpcheck: Add a function to stringify the healthcheck type
- MEDIUM: tcpcheck: Split parsing functions to prepare healthcheck sections parsing
- MEDIUM: tcpcheck: Add parsing support for healthcheck sections
- MINOR: tcpcheck: Extract tcpheck ruleset post-config in a dedicated function
- MEDIUM: tcpcheck/server: Add healthcheck server keyword
- REGTESTS: tcpcheck: Add a script to check healthcheck section
- MINOR: acme: add 'dns-timeout' keyword for dns-01 challenge
- CLEANUP: net_helper: fix typo in comment
- MINOR: acme: set the default dns-delay to 30s
- MINOR: connection: add function to identify a QUIC connection
- MINOR: quic: refactor frame parsing
- MINOR: quic: refactor frame encoding
- BUG/MINOR: quic: fix documentation for transport params decoding
- MINOR: quic: split transport params decoding/check
- MINOR: quic: remove useless quic_tp_dec_err type
- MINOR: quic: define QMux transport parameters frame type
- MINOR: quic: implement QMux transport params frame parser/builder
- MINOR: mux-quic: move qcs stream member into tx inner struct
- MINOR: mux-quic: prepare Tx support for QMux
- MINOR: mux-quic: convert init/closure for QMux compatibility
- MINOR: mux-quic: protect qcc_io_process for QMux
- MINOR: mux-quic: prepare traces support for QMux
- MINOR: quic: abstract stream type in qf_stream frame
- MEDIUM: mux-quic: implement QMux receive
- MINOR: mux-quic: handle flow-control frame on qstream read
- MINOR: mux-quic: define Rx connection buffer for QMux
- MINOR: mux_quic: implement qstrm rx buffer realign
- MEDIUM: mux-quic: implement QMux send
- MINOR: mux-quic: implement qstream send callback
- MINOR: mux-quic: define Tx connection buffer for QMux
- MINOR: xprt_qstrm: define new xprt module for QMux protocol
- MINOR: xprt_qstrm: define callback for ALPN retrieval
- MINOR: xprt_qstrm: implement reception of transport parameters
- MINOR: xprt_qstrm: implement sending of transport parameters
- MEDIUM: ssl: load xprt_qstrm after handshake completion
- MINOR: mux-quic: use QMux transport parameters from qstrm xprt
- MAJOR: mux-quic: activate QMux for frontend side
- MAJOR: mux-quic: activate QMux on the backend side
- MINOR: acme: split the CLI wait from the resolve wait
- MEDIUM: acme: initialize the dns timer starting from the first DNS request
- DEBUG: connection/flags: add QSTRM flags for the decoder
- BUG/MINOR: mux_quic: fix uninit for QMux emission
- MINOR: acme: remove remaining CLI wait in ACME_RSLV_TRIGGER
- MEDIUM: acme: split the initial delay from the retry DNS delay
- BUG/MINOR: cfgcond: properly set the error pointer on evaluation error
- BUG/MINOR: cfgcond: always set the error string on openssl_version checks
- BUG/MINOR: cfgcond: always set the error string on awslc_api checks
- BUG/MINOR: cfgcond: fail cleanly on missing argument for "feature"
- MINOR: ssl: add the ssl_fc_crtname sample fetch
- MINOR: hasterm: Change hstream_add_data() to prepare zero-copy data forwarding
- MEDIUM: haterm: Add support for 0-copy data forwading and option to disable it
- MEDIUM: haterm: Prepare support for splicing by initializing a master pipe
- MEDIUM: haterm: Add support for splicing and option to disable it
- MINOR: haterm: Handle boolean request options as flags
- MINOR: haterm: Add an request option to disable splicing
- BUG/MINOR: ssl: fix memory leak in ssl_fc_crtname by using SSL_CTX ex_data index
The ssl_crtname_index was registered with SSL_get_ex_new_index() but the
certificate name is stored on an SSL_CTX object via SSL_CTX_set_ex_data().
The free callback is only invoked for the object type matching the index
registration, so the strdup'd name was never freed when the SSL_CTX was
released.
Fix this by using SSL_CTX_get_ex_new_index() instead, which ensures the
free callback fires when the SSL_CTX is destroyed.
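Sketch of the change (callback name hypothetical):

  - ssl_crtname_index = SSL_get_ex_new_index(0, NULL, NULL, NULL, crtname_free);
  + ssl_crtname_index = SSL_CTX_get_ex_new_index(0, NULL, NULL, NULL, crtname_free);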
No backport needed.
Following request options are now handled as flags:
- ?k=1 => flag HS_ST_OPT_CHUNK_RES is set
- ?c=0 => flag HS_ST_OPT_NO_CACHE is set
- ?R=1 => flag HS_ST_OPT_RANDOM_RES is set
- ?A=A => flag HS_ST_OPT_REQ_AFTER_RES is set.
By default, none is set.
The support for the splicing was added and enabled by default, if
supported. The command line option '-dS' was also added to disable the
feature.
When the splicing can be used and the front multiplexer agrees to proceed,
tee() is used to "copy" data from the master pipe to the client pipe.
Now that the zero-copy data forwarding is supported, we will add the
splicing support. To do so, we first create a master pipe, filled with
vmsplice(), during haterm startup. It is only performed if splicing is
supported. And its size can be configured by setting the "tune.pipesize"
global parameter.
This master pipe will be used to fill the pipe towards the client.
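For illustration, a rough sketch of the mechanism (all names
hypothetical; error handling omitted):

  int master_pipe[2], client_pipe[2];
  struct iovec iov = { .iov_base = resp, .iov_len = resp_len };

  /* at startup: create the master pipe and fill it once */
  pipe(master_pipe);
  vmsplice(master_pipe[1], &iov, 1, 0);
  /* per client: duplicate the data into the client pipe without
   * consuming the master pipe, then push it to the socket */
  pipe(client_pipe);
  tee(master_pipe[0], client_pipe[1], resp_len, SPLICE_F_NONBLOCK);
  splice(client_pipe[0], NULL, client_fd, NULL, resp_len, SPLICE_F_NONBLOCK);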
The support for the zero-copy data forwarding was added and enabled by
default. The command line option '-dZ' was also added to disable the
feature.
Concretely, when haterm pushes the response payload, if the zero-copy
forwarding is supported, a dedicated function is used to do so.
hstream_ff_snd() will rely on se_nego_ff() to know how much data can be sent
and at the end, on se_done_ff() to really send data.
hstream_add_ff_data() function was added to perform the raw copy of the
payload in the sedesc I/O buffer.
hstream_add_data() function is renamed to hstream_add_htx_data() because
there will be a similar function to add data in zero-copy forwarding
mode. The function was also adapted to take the data length to add as a
parameter and to return the number of written bytes.
This new sample fetch returns the name of the certificate selected for
an incoming SSL/TLS connection, as it would appear in "show ssl cert".
It may be a filename with its relative or absolute path, or an alias,
depending on how the certificate was declared in the configuration.
The certificate name is stored as ex_data on the SSL_CTX at load time
in ckch_inst_new_load_store(), and freed via a dedicated free callback.
The "feature" predicate takes an argument name. Not passing one will
cause strstr() to always find something, including at the end of the
string, and to read past end that ASAN detects. We need to check that
we didn't reach end before proceeding.
This bug was reported by OSS Fuzz here:
https://issues.oss-fuzz.com/issues/499133314
The issue is present since 2.4 with commit 58ca706e16 ("MINOR: config:
add predicate "feature" to detect certain built-in features") so this
fix must be backported to all stable versions.
Using awslc_api_before() with an invalid argument results in "(null)"
appearing in the error message due to -1 being returned without the
error message being filled. Let's always fill the error message on error.
This was introduced in 3.3 with commit 3d15c07ed0 ("MINOR: cfgcond: add
"awslc_api_atleast" and "awslc_api_before""), and this fix must be
backported to 3.3.
Using openssl_version_before() with an invalid argument results in "(null)"
appearing in the error message due to -1 being returned without the error
message being filled. Let's always fill the error message on error.
This was introduced in 2.5 with commit 3aeb3f9347 ("MINOR: cfgcond:
implements openssl_version_atleast and openssl_version_before"), and
this fix must be backported to 2.6.
cfg_eval_condition() says that the <errptr> pointer will be set upon
error. However, cfg_eval_cond_expr() can fail (e.g. failure to handle
a dynamic argument) but would branch to "done" and leave errptr unset.
Let's check for this case as well.
This bug was reported by OSS Fuzz here:
https://issues.oss-fuzz.com/issues/499135825
The bug was introduced in 2.5 around commit ca81887599 ("MINOR:
cfgcond: insert an expression between the condition and the term") so
the fix must be backported as far as 2.6.
The previous ACME_RSLV_WAIT state served a dual role: it applied the
initial dns-delay before the first DNS probe and also handled the
delay between retries. There was no way to simply wait a fixed delay
before submitting the challenge without also triggering DNS pre-checks.
Replace ACME_RSLV_WAIT with two distinct states:
- ACME_INITIAL_DELAY: an optional initial wait before proceeding,
only applied when "challenge-ready" includes the new "delay" keyword
- ACME_RSLV_RETRY_DELAY: the delay between resolution retries, always
applied when DNS pre-checks are in progress
The new "delay" keyword in "challenge-ready" can be used standalone
(wait then submit the challenge directly) or combined with "dns" (wait
then start the DNS pre-checks). When "delay" is not set, the first DNS
probe fires immediately.
Update the documentation accordingly.
The TASK_WOKEN_TIMER check that previously handled the case where
RSLV_TRIGGER was reached directly from the CLI command is therefore dead
code and can be removed.
Fix the following build warning from obsolete compilers for <orig_frm>
variable in the qcc_qstrm_send_frames() function:
src/mux_quic_qstrm.c:266:17: warning: 'orig_frm' may be used
uninitialized in this function [-Wmaybe-uninitialized]
The variable is now explicitly initialized to NULL on each loop, which
should prevent this warning. Note that for code clarity, the variable is
renamed <next_frm>.
No need to backport.
Previously the dns timeout timer was initialized in ACME_RSLV_WAIT,
before the initial dns-delay expires. This meant the countdown started
before any DNS request was actually sent, so the effective timeout was
shorter than expected by one dns-delay period.
Move the initialization to ACME_RSLV_TRIGGER so the timer starts only
when the first DNS resolution attempt is triggered. Update the
documentation to clarify this behaviour.
Defines an API for xprt_qstrm so that the QMux transport parameters can
be retrieved by the MUX layer on its initialization. This concerns both
local and remote parameters.
Functions xprt_qstrm_lparams/rparams() are defined and exported for
this. They are both used in qmux_init() if QMux protocol is active.
On SSL handshake completion, the MUX layer can be initialized if not
already done. However, for the QMux protocol, it is first necessary to
perform the
transport parameters exchange, via the new xprt_qstrm layer. This patch
ensures this is performed if any flag CO_FL_QSTRM_* is set on the
connection.
Also, SSL layer registers itself via add_xprt. This ensures that it can
be used by xprt_qstrm for the emission/reception of the necessary
frames.
This patch implements QMux emission of transport parameters via
xprt_qstrm. Similarly to receive, this is performed in conn_send_qstrm()
which uses lower xprt snd_buf operation. The connection must first be
flagged with CO_FL_QSTRM_SEND to trigger this step.
Extend xprt_qstrm to implement the reception of QMux transport
parameters. This is performed via conn_recv_qstrm() which relies on the
lower xprt rcv_buf operation. Once received, parameters are kept in
xprt_qstrm context, so that the MUX can retrieve them on init.
For the reception of parameters to be active, the connection must first
be flagged with CO_FL_QSTRM_RECV.
Add get_alpn operation support for xprt_qstrm. This simply acts as a
passthrough method to the underlying XPRT layer.
This function is necessary for QMux when running above SSL, as mux-quic
will access ALPN during its initialization in order to instantiate the
proper application protocol layer.
Define a new XPRT layer for the new QMux protocol. Its role will be to
perform the initial exchange of transport parameters.
On completion, contrary to XPRT handshake, xprt_qstrm will first init
the MUX and then remove itself. This will be necessary so that the
parameters can be retrieved by the MUX during its initialization.
This patch only declares the new xprt_qstrm along with basic operations.
Future commits will implement the proper reception/emission steps.
Similarly to reception, a new buffer is defined in QCC connection to
handle emission for QMux protocol. This replaces the trash buffer usage
in qcc_qstrm_send_frames().
This buffer is necessary to handle partial emission. On retry, the
buffer must be completely emitted before starting to send new frames.
Each time a QUIC frame is emitted, mux-quic layer is notified via a
callback to update the underlying QCS. For QUIC, this is performed via
qc_stream_desc element.
In QMux protocol, this can be simplified as there is no
qc_stream_desc/quic_conn layer interaction. Instead, each time snd_buf
is called, QCS can be updated immediately using its return value. This
is performed via a new function qstrm_ctrl_send().
Its work is similar to the QUIC equivalent but in a simpler mode. In
particular, sent data can be immediately removed from the Tx buffer as
there is no need for retransmission when running above TCP.
This patch implements mux-quic emission for the new QMux protocol. This
is performed via the new function qcc_qstrm_send_frames(). Its interface
is similar to the QUIC equivalent: it takes a list of frames and encodes
them into a buffer before sending it via snd_buf.
Contrary to QUIC, a check on the CO_FL_ERROR flag is performed prior to
every qcc_qstrm_send_frames() invocation to interrupt emission. This is
necessary as the transport layer may set it during snd_buf. This is not
the case currently for the quic_conn layer, but maybe a similar
mechanism should be implemented as well for QUIC in the future.
The previous patch defines a new QCC buffer member to implement QMux
reception. This patch completes this by performing realign on it during
qcc_qstrm_recv(). This is necessary when there is not enough contiguous
data to read a whole frame.
When QMux is used, mux-quic must actively perform reception of new
content. This has been implemented by the previous patch.
The current patch extends this by defining a buffer on QCC dedicated to
this operation. This replaces the usage of the trash buffer. This is
necessary to deal with incomplete reads.
Implements parsing of frames related to flow-control for mux-quic
running on the new QMux protocol. This simply calls qcc_recv_*() MUX
functions already used by QUIC.
This patch implements a new function qcc_qstrm_recv() dedicated to the
new QMux protocol. It is responsible for performing data reception via
the rcv_buf() callback. This is defined in a new mux_quic_strm module.
Read data is parsed into frames. Each frame is handled via standard
mux-quic functions. Currently, only STREAM and RESET_STREAM types are
implemented.
One major difference between QUIC and QMux is that mux-quic is passive
on the reception side with the former protocol. With the new one, mux-quic
becomes active. Thus, a new call to qcc_qstrm_recv() is performed via
qcc_io_recv().
STREAM frame will also be used by the new QMux protocol. This requires
some adaptation in the qf_stream structure. Reference to qc_stream_desc
object is replaced by a generic void* pointer.
This change is necessary as QMux protocol will not use any
qc_stream_desc elements for emission.
Ensure mux-quic traces will be compatible with the new QMux protocol.
This is necessary as the quic_conn element is accessed to display some
transport information. Use conn_is_quic() to protect these accesses.
Ensure mux-quic operations related to initialization and shutdown will
be compatible with the new QMux protocol. This requires to use
conn_is_quic() before any access to the quic_conn element, in
qmux_init(), qcc_shutdown() and qcc_release().
Adapts mux-quic functions related to emission for future QMux protocol
support.
In short, QCS will not use a qc_stream_desc object but instead a plain
buffer. This is inserted as a union in QCS structure. Every access to
QUIC qc_stream_desc is protected by a prior conn_is_quic() check. Also,
pacing is useless for QMux and thus is disabled for such protocol.
Move <stream> field from qcs type into the inner structure 'tx'. This
change is only a minor refactoring without any impact. It is cleaner as
Rx buffer elements are already present in 'rx' inner structure.
This reorganization is performed before the introduction of a new Tx
buffer field used for the QMux protocol.
Implement parse/build methods for QX_TRANSPORT_PARAMETER frame. Both
functions may fail due to insufficient buffer space (encoding) or a
truncated frame (parsing).
Define a new frame type for QMux transport parameter exchange. Frame
type is 0x3f5153300d0a0d0a and is declared as an extra frame, outside of
quic_frame_parsers / quic_frame_builders.
The next patch will implement parsing/encoding of this frame payload.
The previous patch refactored QUIC transport parameters decoding and
validity checks. These two operations are now performed in two distinct
functions. This renders quic_tp_dec_err type useless. Thus, this patch
removes it. Function returns are converted to a simple integer value.
Function quic_transport_params_decode() is used for decoding received
parameters. Prior to this patch, it also contained validity checks on
some of the parameters. Finally, it also tested that mandatory
parameters were indeed found.
This patch separates these two parts. Params validity is now tested in a
new function quic_transport_params_check(), which can be called just
after decode operation.
This patch will be useful for QMux protocol, as this allows to reuse
decode operation without executing checks which are tied to the QUIC
specification, in particular for mandatory parameters.
The documentation for functions related to transport parameters decoding
is unclear or sometimes completely wrong on the meaning of the <server>
argument. It must be set to reflect the origin of the parameters,
contrary to what was implied in function comments.
Fix this by rewriting comments related to this <server> argument. This
should prevent any mistakes in the future.
This is purely a documentation fix. However, it could be useful to
backport it up to 2.6.
This patch is a direct follow-up of the previous one. This time,
refactoring is performed on qc_build_frm() which is used for frame
encoding.
The function prototype has changed as the packet argument is now removed. To be
able to check frame validity with a packet, one can use the new parent
function qc_build_frm_pkt() which relies on qc_build_frm().
As with the previous patch, there is no functional change expected. The
objective is to facilitate a future QMux implementation.
This patch refactors parsing in QUIC frame module. Function
qc_parse_frm() has been split into three:
* qc_parse_frm_type()
* qc_parse_frm_pkt()
* qc_parse_frm_payload()
No functional change. The main objective of this patch is to facilitate
a QMux implementation. One of the gain is the ability to manipulate QUIC
frames without any reference to a QUIC packet as it is irrelevant for
QMux. Also, quic_set_connection_close() calls are extracted as this
relies on qc type. The caller is now responsible to set the required
error code.
Set the default dns-delay to 30s so it can be more efficient with fast
DNS providers. The dns-timeout is set to 600s by default so this does
not have a big impact, it will only do more checks and allow the
challenge to be started more quickly.
When using the dns-01 challenge method with "challenge-ready dns", HAProxy
retries DNS resolution indefinitely at the interval set by "dns-delay". This
adds a "dns-timeout" keyword to set a maximum duration for the DNS check phase
(default: 600s). If the next resolution attempt would be scheduled beyond that
deadline, the renewal is aborted with an explicit error message.
A new "dnsstarttime" field is stored in the acme_ctx to record when DNS
resolution began, used to evaluate the timeout on each retry.
Thanks to this patch, it is now possible to specify a healthcheck section
on the server line. In that case, the server will use the tcpcheck as
defined in the corresponding healthcheck section instead of the proxy's one.
tcpcheck_ruleset struct was extended to host a config part that will be used
for healthcheck sections. This config part is mainly used to store
elements for the server's tcpcheck part.
When a healthcheck section is parsed, a ruleset is created with its name
(which must be unique). "*healthcheck-{NAME}" is used for these rulesets. So
it is not possible to mix them with regular rulesets.
For now, in a healthcheck section, the type must be defined, based on the
options name (tcp-check, httpchk, redis-check...). In addition, several
"tcp-check" or "http-check" rules can be specified, depending on the
healthcheck type.
Functions used to parse directives related to tcpchecks were split to have a
first step testing the proxy and creating the tcpcheck ruleset if necessary,
and a second step filling the ruleset. The aim of this patch is to prepare
the parsing of healthcheck sections. In this context, only the second step
will be used.
When log-format strings were parsed in the context of a tcpcheck, the
ARGC_SRV context was used instead of ARGC_TCK. This context is used to
report accurate errors.
This patch could be backported to all stable versions.
The proxy flag PR_O_TCPCHK_SSL is replaced by a flag on the tcpcheck
itself. When TCPCHK_FL_USE_SSL flag is set, it means the healthcheck will
use an SSL connection and the SSL xprt must be prepared for the server.
In tcpchecks context, when HTTP sample expressions are parsed, there is no
reason to set the proxy's http_needed value to 1. This value is only used
for streams to allocate an HTTP txn.
This patch could be backported to all stable versions.
disable-on-404 and send-state options, configured on an HTTP healthcheck,
were handled as proxy options. Now, these options are handled in the
tcp-check itself. So the corresponding PR_O and PR_O2 flags are removed.
The tcpcheck_rules structure is replaced by the tcpcheck structure. The
main difference is that the ruleset is now referenced in the tcpcheck
structure, instead of the rules list. The flags about the ruleset type are
moved into the ruleset structure and the flags tracking unused rules remain
on the tcpcheck structure. So it should be easier to track unused rulesets.
It also becomes possible to configure a set of tcpcheck rules outside of
the proxy scope.
The main idea of these changes is to prepare the parsing of a new
healthcheck section. So this patch is quite huge, but it is mainly about
renaming some fields.
When parsing the httpchk option, a wrong flag (TCPCHK_SND_HTTP_FROM_OPT)
was set on the rules, while it is in fact a flag for a send rule. Let's
remove it. There is no issue here because there is no corresponding flag
for tcpcheck rules.
This patch must be backported to all stable versions.
These actions were added recently and it appeared the way binary headers
were retrieved could be simplified.
First, there is no reason to retrieve a base64 encoded string. It is
possible to rely on the binary string directly. "b64dec" converter can be
used to perform a base64 decoding if necessary.
Then, using a log-format string is quite overkill and probably
counterintuitive. Most of the time, the headers will be retrieved from a
variable. So a sample expression is easier to use. Thanks to the previous
patch, it is quite easy to achieve.
This patch relies on the commit "MINOR: action: Add a sample expression
field in arguments used by HTTP actions". The documentation was updated
accordingly.
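A hypothetical usage sketch (action syntax assumed; the variables are
placeholders, typically filled by an SPOE agent):

    # headers blob stored as raw binary in a variable
    http-request set-headers-bin var(txn.hdrs_bin)
    # or base64-encoded, decoded on the fly with the b64dec converter
    http-request set-headers-bin var(txn.hdrs_b64),b64dec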
An error is erroneously triggered if an if/unless statement is found after
set-headers-bin and add-headers-bin actions. To make it work, during
parsing of these actions, we should leave when an unknown argument is found
to give the rule parser the opportunity to parse an if/unless statement.
No backport needed.
Add keylog_format_fc and keylog_format_bc global variables containing
the SSLKEYLOGFILE log-format strings for the frontend (client-facing)
and backend (server-facing) TLS connections respectively. These produce
output compatible with the SSLKEYLOGFILE format described at:
https://tlswg.org/sslkeylogfile/draft-ietf-tls-keylogfile.html
Both formats are also exported as environment variables at startup:
HAPROXY_KEYLOG_FC_LOG_FMT
HAPROXY_KEYLOG_BC_LOG_FMT
These variables contain \n, so they might not be compatible with syslog
servers; using them with stderr or a sink might be required.
These can be referenced directly in "log-format" directives to produce
SSLKEYLOGFILE-compatible output, usable by network analyzers such as
Wireshark to decrypt captured TLS traffic.
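A minimal sketch of such a setup (the frontend and log destination are
placeholders; any ring/sink definition would work the same way):

    frontend fe
        bind :443 ssl crt /etc/haproxy/site.pem
        # raw output, since the payload contains \n characters
        log stderr format raw local0
        log-format "${HAPROXY_KEYLOG_FC_LOG_FMT}"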
Check that last_sess actually changed before attempting to set it, as it
should only change once every second, that will avoid a lot of atomic
writes on a busy cache line.
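The pattern is a simple read-before-write guard; a minimal sketch (field
and variable names assumed, only the idea matters):

    /* only write when the value actually changes, to avoid redundant
     * atomic stores dirtying a cache line shared by all threads */
    if (HA_ATOMIC_LOAD(&counters->last_sess) != now.tv_sec)
        HA_ATOMIC_STORE(&counters->last_sess, now.tv_sec);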
Reverse the default, to hide the version from stats by default, and add
a new keyword, "stats show-version", to enable it, as we don't want to
disclose the version by default, especially on public websites.
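A sketch of how to opt back in:

    listen stats
        bind :8404
        stats enable
        # explicitly re-enable version display
        stats show-version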
When binary headers are decoded, the return value of the decode_varint()
function is not properly handled. On error, it can return -1. However,
the result is unconditionally added to an unsigned offset.
Now, a temporary variable is used to be able to test decode_varint()'s
return value. It is added to the offset on success only.
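A sketch of the fix (the exact decode_varint() signature is assumed):

    /* test the return value before touching the unsigned offset */
    int64_t ret = decode_varint(&ptr, end);

    if (ret < 0)
        goto error; /* truncated or malformed varint */
    ofs += ret;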
No backport needed.
When h1_snd_buf() inherits the CO_SFL_MSG_MORE flag from the upper layer, it
unconditionally propagates it to H1C_F_CO_MSG_MORE, which eventually sets
MSG_MORE on the sendmsg() call. For bodyless responses (HEAD, 204, 304), this
causes the kernel to cork the TCP connection for ~200ms waiting for body data
that will never be sent.
With an H1 frontend and H2 backend, this adds ~200ms of latency to many or
all bodyless responses. The 200ms corresponds to the kernel's tcp_cork_time
default. H1 backends are less affected because h1_postparse_res_hdrs() sets
HTX_FL_EOM during header parsing for bodyless responses, but H2 backends
frequently deliver the end-of-stream signal in a separate scheduling round,
leaving htx_expect_more() returning TRUE when headers are first forwarded.
The fix guards H1C_F_CO_MSG_MORE so it is only set when the connection is a
backend (H1C_F_IS_BACK) or the response is not bodyless
(!H1S_F_BODYLESS_RESP). This ensures bodyless responses on the front
connection are sent immediately without corking.
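A sketch of the guard, reconstructed from the flag names above:

    /* propagate MSG_MORE only when more data may actually follow */
    if ((flags & CO_SFL_MSG_MORE) &&
        ((h1c->flags & H1C_F_IS_BACK) ||
         !(h1s->flags & H1S_F_BODYLESS_RESP)))
        h1c->flags |= H1C_F_CO_MSG_MORE;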
This should be backported to all stable branches.
Co-developed-by: Billy Campoli <bcampoli@meta.com>
Co-developed-by: Chandan Avdhut <cavdhut@meta.com>
Co-developed-by: Neel Raja <neelraja@meta.com>
These actions allow setting, adding and deleting multiple headers from
the same action, without having to know the header names during parsing.
This is useful when doing things with SPOE.
Adds 'quic4@' / 'quic6@' as prefixes available for server addresses.
This is explicitly listed as experimental for now.
This must be backported up to 3.3.
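A hypothetical configuration sketch (SSL/ALPN parameters assumed, as a
QUIC server implies TLS):

    backend quic_servers
        server s1 quic4@203.0.113.10:443 ssl verify none alpn h3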
The CLI commands (get|add|del|clear|commit|set) (acl|map) do not
contain a permission check at the admin level.
Must be backported to 3.3. This can be a breaking change for some users.
Initially reported by Cameron Brown.
'set ssl ocsp-response', 'update ssl ocsp-response', 'show ssl
ocsp-response', 'show ssl ocsp-updates' are lacking permissions checks
on admin level.
Must be backported in 3.3. This can be a breaking change for some users.
Initially reported by Cameron Brown.
Both the 'set ssl tls-key' and 'show tls-keys' commands are missing
permission checks; add them so the commands can only be used in admin
mode.
Must be backported to 3.3. This can be a breaking change for some users.
Initially reported by Cameron Brown.
This commit adds an ha_warning() when map/acl commands are accessed
without admin level. This is to warn users that these commands will be
restricted to admin only in HAProxy 3.3.
Must be backported to every stable branch.
Initially reported by Cameron Brown.
This commit adds an ha_warning() when OCSP commands are accessed without
admin level. This is to warn users that these commands will be
restricted to admin only in HAProxy 3.3.
Must be backported to every stable branch.
Initially reported by Cameron Brown.
This commit adds an ha_warning() when 'show tls-keys' or 'set ssl
tls-key' are accessed without admin level. This is to warn users that
these commands will be restricted to admin only in HAProxy 3.3.
Must be backported to every stable branch.
Initially reported by Cameron Brown.
The new "-L" option is convenient for quick backport sessions, but it
doesn't list the commit subjects nor the review command. Let's just add
these to ease backport sessions. However we don't do it in quiet mode
(-q) because the output is sometimes parsed by automatic backport
scripts.
Add documentation for two new directives in the acme section:
- challenge-ready: configures the conditions that must be satisfied
before notifying the ACME server that a dns-01 challenge is ready.
Accepted values are cli, dns and none. cli waits for an operator
to signal readiness via the "acme challenge_ready" CLI command. dns
performs a DNS pre-check against the "default" resolvers section,
not the authoritative name servers. When both are combined, HAProxy
waits for the CLI confirmation before triggering the DNS check.
- dns-delay: configures the delay before the first DNS resolution
attempt and between retries when challenge-ready includes dns.
Default is 300 seconds.
The previous patch implemented the 'dns-check' option. This one replaces
it by a more generic 'challenge-ready' option, which allows the user to
choose the condition to validate the readiness of a challenge. It could
be 'cli', 'dns' or both.
When in dns-01 mode, it defaults to 'cli' so the external tool used to
configure the TXT record can validate it itself. If the tool does not
validate the TXT record, you can use 'cli,dns' so a DNS check is done
after the CLI validation with 'challenge_ready'.
For a fully automated validation of the challenge, it should be set to
'dns', which checks the TXT record by itself.
When using the dns-01 challenge type, TXT record propagation across
DNS servers can take time. If the ACME server verifies the challenge
before the record is visible, the challenge fails and it's not possible
to trigger it again.
This patch introduces an optional DNS pre-check mechanism controlled
by two new configuration directives in the "acme" section:
- "dns-check on|off": enable DNS propagation verification before
notifying the ACME server (default: off)
- "dns-delay <time>": delay before querying DNS (default: 300s)
When enabled, three new states are inserted in the state machine
between AUTH and CHALLENGE:
- ACME_RSLV_WAIT: waits dns-delay seconds before starting
- ACME_RSLV_TRIGGER: starts an async TXT resolution for each
pending authorization using HAProxy's resolver infrastructure
- ACME_RSLV_READY: compares the resolved TXT record against the
expected token; retries from ACME_RSLV_WAIT if any record is
missing or does not match
The "acme_rslv" structure is implemented in acme_resolvers.c, it holds
the resolution for each domain. The "auth" structure which contains each
challenge to resolve contains an "acme_rslv" structure. Once
ACME_RSLV_TRIGGER leaves, the DNS tasks run on the same thread, and the
last DNS task which finishes will wake up acme_process().
Note that the resolution goes through the configured resolvers, not
through the authoritative name servers of the domain. The result may
therefore still be affected by DNS caching at the resolver level.
This patch adds support for TXT records. It allows retrieving the first
string of a TXT record, which is limited to 255 characters.
The rest of the record is ignored.
Latest commit a336c467a0 ("BUG/MINOR: net_helper: fix length controls
on ip.fp tcp options parsing") was malformed and broke the build. This
should be backported wherever the fix above is backported.
If the option length is truncated by tcplen, we may read 1 byte past the
end of the TCP header.
There are also missing checks when parsing MSS and WS: we may compute
invalid fingerprint values by reading past the TCP header in case of
truncated options.
This patch should be backported to versions including ip.fp.
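The kind of bound checking involved looks roughly like the following
sketch (not the actual patch; TCPOPT_* come from <netinet/tcp.h>):

    /* walk TCP options without ever reading past the header end */
    const unsigned char *opt = opts, *end = opts + optlen;

    while (opt < end) {
        if (*opt == TCPOPT_EOL)
            break;
        if (*opt == TCPOPT_NOP) {
            opt++;
            continue;
        }
        /* need the length byte, a sane value and the whole option */
        if (opt + 1 >= end || opt[1] < 2 || opt + opt[1] > end)
            break;
        opt += opt[1];
    }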
We leverage the SE_FL_APP_STARTED flag to detect whether the application
layer had a chance to run or not when an RST_STREAM is received. This
allows us to triage RST_STREAM between regular ones and harmful ones,
and to count glitches for them. It proves extremely effective at
detecting fast HEADERS+RST pairs.
It could be useful to backport it to 3.2, though it depends on these
two previous patches to be backported first (the first one was already
planned and the second one is harmless, though will require to drop
the haterm changes):
BUG/MINOR: stconn: Always declare the SC created from healthchecks as a back SC
MINOR: stconn: flag the stream endpoint descriptor when the app has started
In order to improve our ability to distinguish operations that had
already started from others under high loads, it would be nice to know
if an application layer (stream) has started to work with an endpoint
or not. The use case typically is a frontend mux instantiating a stream
to instantly cancel it. Currently this info will take some time to be
detected and processed if the application's task takes time to wake up.
By flagging the sedesc with SE_FL_APP_STARTED the first time the app
layer starts, the lower layers can know whether they're cancelling a
stream that has started to work or not, and act accordingly. For now
this is done unconditionally on the backend, and performed early in the
only two app layers that can be reached by a frontend: process_stream()
and process_hstream() (for haterm).
The SC created from a healthcheck is always a back SC. But the
SC_FL_ISBACK flag was missing. Instead of passing it when
sc_new_from_check() is called, the function was simplified to set the
SC_FL_ISBACK flag systematically when a
SC is created from a healthcheck.
This patch should be backported as far as 2.6.
RFC 9000 lists each supported frame and the types of packets in which
it can be present.
Prior to this patch, a packet with an incompatible frame is dropped.
However, QUIC specification mandates that the connection is immediately
closed with PROTOCOL_VIOLATION error code. This patch completes
qc_parse_frm() to add such connection closure.
This must be backported up to 2.6.
The GitHub API silently caps per_page at 100, so passing per_page=200
was silently returning at most 100 tags. AWS-LC-FIPS tags appear late
in the list, causing version detection to fail.
Replace the single-page fetch in get_all_github_tags() with a loop that
iterates all pages.
Could be backported in previous branches.
When an htx DATA block is partially transferred, we must take care to
remove exactly the copied size. To do so, we must save the size of the
last copied block and not rely on the last data block after the copy.
Indeed, data can be merged with an existing DATA block, so the last block
size can be larger than the last part copied.
Because of this issue, it is possible to remove more data than
expected. Worse, this could lead to a crash by performing an integer
overflow on the block size.
No backport needed.
Fix a leak of the task object in acme_start_task() when one of the
conditions in the function fails.
Fix issue #3308.
Must be backported to 3.2 and later.
There are two settings to control idle connection sharing across
threads.
tune.idle-pool.shared, which enables or disables it, and
tune.takeover-other-tg-connections, which controls whether idle
connections may be taken from other thread groups.
Add a new keyword for tune.idle-pool.shared, "full", that lets you get
connections from other thread groups (equivalent to the "full" keyword
for tune.takeover-other-tg-connections). The "on" keyword is now
equivalent to "restrict", which allowed getting connections from other
thread groups only when not doing so would result in a connection
failure (when reverse-http or strict-maxconn are used).
tune.takeover-other-tg-connections will be deprecated.
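A configuration sketch:

    global
        # allow takeover of idle connections from any thread group
        tune.idle-pool.shared full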
If the server returns an auth with status valid, it seems that the
client needs to always skip it; the CA can recycle authorizations, and
without this change haproxy fails to obtain certificates in that case.
It is also something that is explicitly allowed and stated in the
dns-persist-01 draft RFC.
Note that it would be better to change how haproxy does status polling
and implements the state machine, but that will take some thought and
time; this patch is a quick fix of the problem.
See:
https://github.com/letsencrypt/boulder/issues/2125
https://github.com/letsencrypt/pebble/issues/133
This must be backported to 3.2 and later.
When abortonclose option is enabled (by default since 3.3), the HTTP rules
can no longer yield if the client aborts. However, stream aborts were also
considered. So it was possible to interrupt yielding rules, especially on
the response processing, while the client was still waiting for the
response.
So now, when the abortonclose option is enabled, we take care to only
consider client aborts to prevent HTTP rules from yielding.
Many thanks to @DirkyJerky for his detailed analysis.
This patch should fix the issue #3306. It should be backported as far as
2.8.
warnif_misplaced_* functions return 1 when a warning is reported and 0
otherwise. So the caller must properly handle the return value.
When parsing a proxy, ERR_WARN code must be added to the error code instead
of the return value. When a warning was reported, ERR_RETRYABLE (1) was
added instead of ERR_WARN.
And when tcp rules were parsed, warnings were ignored. Messages were
emitted but the return values were ignored.
This patch should be backported to all stable versions.
When warnif_cond_conflicts() is called, we must take care to emit a warning
only when a conflict is reported. We cannot rely on the err_code variable
because some warnings may have been already reported. We now rely on the
errmsg variable. If it contains something, a warning is emitted. It is
good enough because warnif_cond_conflicts() only reports warnings.
This patch should fix the issue #3305. It is a 3.4-dev specific issue. No
backport needed.
When picking a mux, pay attention to its MX_FL_FRAMED flag. If it is
set, then it means we explicitly want QUIC, so don't use that mux for
any protocol that is not QUIC.
When parsing the check address, store the associated proto too.
That way we can use a notation like quic4@address, and the right
protocol will be used. It is possible for checks to use a different
protocol than the server, ie we can have a QUIC server but want to run
TCP checks, so we can't just reuse whatever the server uses.
WIP: store the protocol in checks
Don't assume the check will reuse the server's xprt. It may not be true
if some settings such as the ALPN have been set and differ from the
server's. If the server is QUIC, and we want to use TCP for checks,
we certainly don't want to reuse its XPRT.
Permission checks on the CLI for ACME are missing.
This patch adds a check on the ACME commands
so they can only be run in admin mode.
ACME is still a feature in experimental-mode.
Initial report by Cameron Brown.
Must be backported to 3.2 and later.
Permission checks on the CLI for ECH are missing.
This patch adds a check for "(add|set|del|show) ssl ech" commands
so they can only be run in admin mode.
ECH is still a feature in experimental-mode and is not compiled by
default.
Initial report by Cameron Brown.
Must be backported to 3.3.
This patch fixes a warning that can be reproduced with gcc-8.5 on RHEL8
(gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-28)).
This should fix issue #3303.
Must be backported everywhere 917e82f283 ("MINOR: debug: copy debug
symbols from /usr/lib/debug when present") was backported, which is
to branch 3.2 for now.
Fix the check of arguments of the 'acme challenge_ready' command, which
was checking whether all arguments were NULL instead of any of them.
Must be backported to 3.2 and later.
Replace atol() by _strl2uic() in the cases where the inputs are ISTs
when parsing the retry-after header. There's no risk of an error since
it will stop at the first non-digit.
Must be backported to 3.2 and later.
In acme_req_finalize(), the data buffer is only freed when a2base64url
succeeds. This patch moves the allocation so that the DER buffer is
freed in every case.
Must be backported to 3.2 and later.
In the documentation of 'http-check expect' directive, the parameter
'status-code' was missing. Let's add it.
This patch could be backported to all stable versions.
Thanks to previous commits, it is possible to use small buffers at
different places: to store the request when a connection is queued or
when L7 retries are enabled, or for health-check requests. However,
there was no configuration parameter to fine-tune small buffer use.
It is now possible, thanks to the proxy option "use-small-buffers".
Documentation was updated accordingly.
When healthchecks were configured for a proxy, an enum-like was used to
specify the check's type. The idea was to reserve some values for future
types of healthcheck. But it is overkill. I doubt we will ever have
something else than tcp and external checks. So the corresponding PR_O2
flags were slightly reviewed and a hole was filled.
Thanks to this change, some bits were released in options2 bitfield.
If support for small buffers is enabled, we now try to use them for
healthcheck requests. First, we check whether the tcpcheck ruleset may
use small buffers. Send rules using LF strings or too-large data are
excluded. The ability to use small buffers is set on the ruleset; all
send rules of the ruleset must be compatible. This info is then
transferred to the server's healthchecks relying on this ruleset.
Then, when a healthcheck is running, when a send rule is evaluated, if
possible, we try to use small buffers. On error, the ability to use small
buffers is removed and we retry with a regular buffer. It means on the first
error, the support is disabled for the healthcheck and all other runs will
use regular buffers.
In h2_rcv_buf(), HTX flags are transferred with data when htx_xfer() is
called. There is no reason to continue to deal with them in the H2 mux. In
addition, there is no reason to set SE_FL_EOI flag when a parsing error was
reported. This part was added before the stconn era. Nowadays, when an HTX
parsing error is reported, an error on the sedesc should also be reported.
This reverts commit 44932b6c41.
The patch above was only necessary to handle partial headers or trailers
parsing. There was nothing to prevent the H2 multiplexer to start to add
headers or trailers in an HTX message and to stop the processing on error,
leaving the HTX message with no EOH/EOT block.
From the HTX API point of view, it is unexpected. And this was fixed
thanks to the commit ba7dc46a9 ("BUG/MINOR: h2/h3: Never insert partial
headers/trailers in an HTX message").
So this patch can be reverted. It is important to not report a parsing
error too early, when there are still data to transfer to the upper
layer.
This patch must be backported where 44932b6c4 was backported, but only
after backporting ba7dc46a9 first.
When a HTX stream is queued, if the request is small enough, it is moved
into a small buffer. This should save memory on instances intensively using
queues.
Applet and connection receive functions were updated to block receiving
when a small buffer is in use.
In the same way support for large chunks was added to properly work with
large buffers, we now add support for small chunks because it is
possible to process small buffers.
So a dedicated memory pool is added to allocate small chunks.
alloc_small_trash_chunk() must be used to allocate a small chunk.
alloc_trash_chunk_sz() and free_trash_chunk() were updated to support
small chunks.
In addition, small trash buffers are also created, using the same
mechanism as for regular trash buffers. So three thread-local trash
buffers are created. get_small_trash_chunk() must be used to get a small
trash buffer.
And get_trash_chunk_sz() was updated to also deal with small buffers.
htx_move_to_small_buffer()/htx_move_to_large_buffer() and
htx_copy_to_small_buffer()/htx_copy_to_large_buffer() functions can now be
used to move or copy blocks from a default buffer to a small or large
buffer. The destination buffer is allocated and then each block is
transferred into it.
These functions rely on the htx_xfer() function.
htx_xfer() function should replace htx_xfer_blks(). It will be a bit easier to
maintain and to use. The behavior of htx_xfer() can be changed by calling it
with specific flags:
* HTX_XFER_KEEP_SRC_BLKS: Blocks from the source message are just copied
* HTX_XFER_PARTIAL_HDRS_COPY: It is allowed to partially xfer headers or trailers
* HTX_XFER_HDRS_ONLY: only headers are xferred
By default (HTX_XFER_DEFAULT or 0), all blocks from the source message
are moved into the destination message, i.e. copied into the destination
message and removed from the source message.
The caller must still define the maximum amount of data (including meta-data)
that can be xferred.
It is no longer necessary to specify a block type to stop the copy. Most
of the time, with htx_xfer_blks(), this parameter was set to
HTX_BLK_UNUSED. And otherwise it was only specified to transfer headers.
It is important to note that the caller is responsible for verifying that
the original HTX message is well-formatted. In particular, it must make
sure the headers and trailers parts are complete (finished by an EOH/EOT
block).
For now, htx_xfer_blks() is not removed for compatibility reasons. But it
is deprecated.
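A usage sketch (prototype assumed from the description above):

    /* move all blocks from src to dst, metadata counted in "count" */
    ret = htx_xfer(dst, src, count, HTX_XFER_DEFAULT);

    /* or transfer only the headers, leaving the rest in place */
    ret = htx_xfer(dst, src, count, HTX_XFER_HDRS_ONLY);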
When the small buffer size was greater than the default buffer size, an
error was triggered. We now do the same as for large buffers: a warning
is emitted and the small buffer size is set to 0 to disable small buffer
allocation.
Because small buffers were only used by QUIC streams, the pool used to
allocate these buffers was located in the quic code. However, their usage
will be
extended to other parts. So, the small buffers pool was moved into the
dynbuf part.
http-errors parsing has been refactored in a recent series of patches.
However, a null deref was introduced by the following patch in case a
non-existent http-errors section is referenced by an "errorfiles"
directive.
commit 2ca7601c2d
MINOR/OPTIM: http_htx: lookup once http_errors section on check/init
Fix this by delaying ha_free() so that it is called after ha_alert().
No need to backport.
The cfg_parse_acme() function checks if an 'acme' section is already
existing in the configuration with cur_acme->linenum > 0. But the wrong
filename and line number were displayed in the error message.
Must be backported to 3.2 and later.
This patch fixes a leak of the ext_san structure when
sk_X509_EXTENSION_push() failed. sk_X509_EXTENSION_pop_free() is already
supposed to free it, so ext_san must be set to NULL upon success to
avoid a double-free.
Must be backported to 3.2 and later.
Use proxy_check_http_errors() on defaults proxy instances. This will
emit alert messages for errorfiles directives referencing a non-existing
http-errors section, or a warning if an explicitly listed status code
is not present in the target section.
This is a small behavior change, as previously this was only performed
for regular proxies. Thus, errorfile/errorfiles directives in an unused
defaults were never checked.
This may prevent startup of haproxy with a configuration file previously
considered as valid. However, this change is considered as necessary to
be able to use http-errors with dynamic backends. Any invalid defaults
will be detected on startup, rather than having to discover it at
runtime via "add backend" invocation.
Thus, any restriction on http-errors usage is now lifted for the
creation of dynamic backends.
The previous patch has split the original proxy_check_errors() function
in two, so that check and init steps are performed separately.
However, this renders the code inefficient for "errorfiles" directive as
tree lookup on http-errors section is performed twice.
Optimize this by adding a reference to the section in conf_errors
structure. This is resolved during proxy_check_http_errors() and
proxy_finalize_http_errors() can reuse it.
No need to backport.
Function proxy_check_errors() is used when configuration parsing is
over. This patch splits it in two newly named ones.
The first function is named proxy_check_http_errors(). It is responsible
for checking the validity of any "errorfiles" directive, which could
reference a non-existent http-errors section or a code not defined in
such a section. This function is now called via proxy_finalize().
The second function is named proxy_finalize_http_errors(). It converts
each conf_errors type used during parsing in a proper http_reply type
for runtime usage. This function is still called via post-proxy-check,
after proxy_finalize().
This patch does not bring any functional change. However, it will become
necessary to ensure http-errors can be used as expected with dynamic
backends.
This patch is the second part of the refactoring for http-errors
parsing. It renames some fields in <conf_errors> structure to clarify
their usage. In particular, union variants are renamed "inl"/"section",
which better highlight the link with the newly defined enum
http_err_directive.
In conf_errors struct, arbitrary integer values were used for both
<type> field and <status> array. This renders the code difficult to
follow.
Replace these values with proper enum types. Two new types are defined
for each of these fields. The first one represents the directive type,
derived from the keyword used (errorfile vs errorfiles). This directly
represents which part of <info> union should be manipulated.
The second enum is used for the errorfiles directive with a reference to
an http-errors section. It indicates whether a status code should be
imported from this section, and whether this import is explicit or
implicit.
Several resources were leaked on both success and error paths:
- X509_NAME *nm was never freed. X509_REQ_set_subject_name() makes
an internal copy, so nm must be freed separately by the caller.
- str_san allocated via my_strndup() was never freed on either path.
- On error paths after allocation, x (X509_REQ) and exts
(STACK_OF(X509_EXTENSION)) were also leaked.
Fix this by adding proper cleanup of all allocated resources in both
the success and error paths. Also move sk_X509_EXTENSION_pop_free()
after X509_REQ_sign() so it is not skipped when sign fails, and
initialize nm to NULL to make early error paths safe.
Must be backported as far as 3.2.
There was a leftover of "activity[tid].ctr1++" in commit 7d40b3134
("MEDIUM: sched: do not run a same task multiple times in series")
that unfortunately only builds in development mode :-(
This introduces 3 new settings: tune.h2.be.max-frames-at-once and
tune.h2.fe.max-frames-at-once, which limit the number of frames that
will be processed at once for backend and frontend side respectively,
and tune.h2.fe.max-rst-at-once which limits the number of RST_STREAM
frames processed at once on the frontend.
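A configuration sketch (values are illustrative only):

    global
        tune.h2.fe.max-frames-at-once 128
        tune.h2.be.max-frames-at-once 128
        # a budget of 1 proved very effective against RST floods,
        # as measured below
        tune.h2.fe.max-rst-at-once 1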
We can now yield when reading too many frames at once, which allows us
to limit the latency caused by processing too many frames in large
buffers.
However if we stop due to the RST budget being depleted, it's most likely
the sign of a protocol abuse, so we make the tasklet go to BULK since
the goal is to punish it.
By limiting the number of RST per loop to 1, the SSL response time drops
from 95ms to 1.6ms during an H2 RST flood attack, and the maximum SSL
connection rate drops from 35.5k to 28.0k instead of 11.8k. A moderate
SSL load that shows 1ms response time and 23kcps increases to 2ms with
15kcps versus 95ms and 800cps before. The average loop time goes down
from 270-280us to 160us, while still doubling the attack absorption
rate with the same CPU capacity.
This patch may usefully be backported to 3.3 and 3.2. Note that to be
effective, this relies on the following patches:
MEDIUM: sched: do not run a same task multiple times in series
MINOR: sched: do not requeue a tasklet into the current queue
MINOR: sched: do not punish self-waking tasklets anymore
MEDIUM: sched: do not punish self-waking tasklets if TASK_WOKEN_ANY
MEDIUM: sched: change scheduler budgets to lower TL_BULK
Having less yielding tasks in TL_BULK and more in TL_NORMAL, we need
to rebalance these queues' priorities. Tests have shown that raising
TL_NORMAL to 40% and lowering TL_BULK to 3% seems to give about the
best tradeoffs.
Self-waking tasklets are currently punished and go to the BULK list.
However it's a problem with muxes or the stick-table purge that just
yield and wake themselves up to limit the latency they cause to the
rest of the process, because by doing so to help others, they punish
themselves. Let's check if any TASK_WOKEN_ANY flag is present on
the tasklet and stop sending tasks presenting such a flag to TL_BULK.
Since tasklet_wakeup() by default passes TASK_WOKEN_OTHER, it means
that such tasklets will no longer be punished. However, tasks which
only want a best-effort wakeup can simply pass 0.
It's worth noting that a comparison was made between going into
TL_BULK at all and only setting the TASK_SELF_WAKING flag, and
it shows that the average latencies are ~10% better when entirely
avoiding TL_BULK in this case.
Nowadays, due to yielding etc., it's counter-productive to permanently
punish self-waking tasklets; let's abandon this principle as it prevents
finer task priority handling.
We continue to check for the TASK_SELF_WAKING flag to place a task
into TL_BULK in case some code wants to make use of it in the future
(similarly to TASK_HEAVY), but no code sets it anymore. It could
possibly make sense in the future to replace this flag with a one-shot
variant requesting low priority.
As found by Christopher, the concept of waking a tasklet up into the
current queue is totally flawed, because if a task is in TL_BULK or
TL_HEAVY, all the tasklets it will wake up will end up in the same
queue. Not only this will clobber such queues, but it will also
reduce their quality of service, and this can contaminate other
tasklets due to the numerous wakeups there are now with the subscribe
mechanism between layers.
There's always a risk that some tasks run multiple times if they wake
each other up. Now we include the loop counter in the task struct and
stop processing the queue it's in when meeting a task that has already
run. We only pick 16 bits since that's only what remains free in the
task common part, so from time to time (once every 65536) it will be
possible to wrongly match a task as having already run and stop evaluating
its queue, but it's rare enough that we don't care, because this will
be OK on the next iteration.
This patch improves the robustness of the QPACK varint decoder and fixes
potential 1-byte out-of-bounds reads in qpack_decode_fs().
In qpack_decode_fs(), two 1-byte OOB reads were possible on truncated
streams between two varint decoding. These occurred when trying to read
the byte containing the Huffman bit <h> and the Value Length prefix
immediately following an Index or a Name Length.
Note that these OOB are limited to a single byte because
qpack_get_varint() already ensures that its input length is non-zero
before consuming any data.
The fixes in qpack_decode_fs() are:
- When decoding an index, we now verify that at least one byte remains
to safely access the following <h> bit and value length.
- When decoding a literal, we now check len < name_len + 1 to ensure
the byte starting the header value is reachable.
In qpack_get_varint(), the maximum value is now strictly capped at 2^62-1
as per RFC. This is enforced using a budget-based check:
(v & 127) > (limit - ret) >> shift
This prevents values from overflowing into the 63rd or 64th bits, which
would otherwise break subsequent signed comparisons (e.g., if (len < name_len))
by interpreting the length as a negative value, leading to false positive
tests.
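In context, the budget-based check looks roughly like this sketch
(variable names assumed; the formula is the one quoted above):

    #define QPACK_VARINT_MAX ((1ULL << 62) - 1)

    /* refuse any continuation byte that would push ret past 2^62-1 */
    if ((uint64_t)(*buf & 127) > ((QPACK_VARINT_MAX - ret) >> shift))
        return 0; /* decoding error */
    ret += (uint64_t)(*buf & 127) << shift;
    shift += 7;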
Thank you to @jming912 for having reported this issue in GH #3302.
Must be backported as far as 2.6
In the ENFILE and ENOMEM cases, when accept() fails, an irrelevant
global.maxsock value was printed that doesn't reflect system limits.
Now actconn is printed, which gives a hint about the failure reason.
Should be backported in all stable branches.
We anticipated that the do-log action would be expanded with optional
arguments at some point. Now that we have heard of multiple use-cases
that could be achieved with the do-log action, but that are limited by
the fact that all do-log statements inherit the implicit log-profile
defined on the logger, we need to provide a way for the user to specify
a custom log-profile that can be used by each do-log action individually.
This is what we try to achieve in this commit, by leveraging the
prerequisite work performed by the last 2 commits.
In process_send_log(), now also consider the ctx if ctx->profile != NULL.
In that case, we do as if logger->prof was set, but we consider
ctx->profile in priority over the logger's one. What this means is that
it becomes possible to point ctx.profile to a profile that will be used
no matter what to generate the log payload.
This is a pre-requisite to implement optional "profile" argument for
do-log action
do_log() is just a wrapper to use do_log_ctx() with pre-filled ctx, but
we now have the low-level do_log_ctx() variant which can be used to
pass specific ctx parameters instead.
Released version 3.4-dev7 with the following main changes :
- BUG/MINOR: stconn: Increase SC bytes_out value in se_done_ff()
- BUG/MINOR: ssl-sample: Fix sample_conv_sha2() by checking EVP_Digest* failures
- BUG/MINOR: backend: Don't get proto to use for webscoket if there is no server
- BUG/MINOR: jwt: Missing 'jwt_tokenize' return value check
- MINOR: flt_http_comp: define and use proxy_get_comp() helper function
- MEDIUM: flt_http_comp: split "compression" filter in 2 distinct filters
- CLEANUP: flt_http_comp: comp_state doesn't bother about the direction anymore
- BUG/MINOR: admin: haproxy-reload use explicit socat address type
- MEDIUM: admin: haproxy-reload conversion to POSIX sh
- BUG/MINOR: admin: haproxy-reload rename -vv long option
- SCRIPTS: git-show-backports: hide the common ancestor warning in quiet mode
- SCRIPTS: git-show-backports: add a restart-from-last option
- MINOR: mworker: add a BUG_ON() on mproxy_li in _send_status
- BUG/MINOR: mworker: don't set the PROC_O_LEAVING flag on master process
- Revert "BUG/MINOR: jwt: Missing 'jwt_tokenize' return value check"
- MINOR: jwt: Improve 'jwt_tokenize' function
- MINOR: jwt: Convert EC JWK to EVP_PKEY
- MINOR: jwt: Parse ec-specific fields in jose header
- MINOR: jwt: Manage ECDH-ES algorithm in jwt_decrypt_jwk function
- MINOR: jwt: Add ecdh-es+axxxkw support in jwt_decrypt_jwk converter
- MINOR: jwt: Manage ec certificates in jwt_decrypt_cert
- DOC: jwt: Add ECDH support in jwt_decrypt converters
- MINOR: stconn: Call sc_conn_process from the I/O callback if TASK_WOKEN_MSG state was set
- MINOR: mux-h2: Rely on h2s_notify_send() when resuming h2s for sending
- MINOR: mux-spop: Rely on spop_strm_notify_send() when resuming streams for sending
- MINOR: muxes: Wakup the data layer from a mux stream with TASK_WOKEN_IO state
- MAJOR: muxes: No longer use app_ops .wake() callback function from muxes
- MINOR: applet: Call sc_applet_process() instead of .wake() callback function
- MINOR: connection: Call sc_conn_process() instead of .wake() callback function
- MEDIUM: stconn: Remove .wake() callback function from app_ops
- MINOR: check: Remove wake_srv_chk() function
- MINOR: haterm: Remove hstream_wake() function
- MINOR: stconn: Wakup the SC with TASK_WOKEN_IO state from opposite side
- MEDIUM: stconn: Merge all .chk_rcv() callback functions in sc_chk_rcv()
- MINOR: stconn: Remove .chk_rcv() callback functions
- MEDIUM: stconn: Merge all .chk_snd() callback functions in sc_chk_snd()
- MINOR: stconn: Remove .chk_snd() callback functions
- MEDIUM: stconn: Merge all .abort() callback functions in sc_abort()
- MINOR: stconn: Remove .abort() callback functions
- MEDIUM: stconn: Merge all .shutdown() callback functions in sc_shutdown()
- MINOR: stconn: Remove .shutdown() callback functions
- MINOR: stconn: Totally app_ops from the stconns
- MINOR: stconn: Simplify sc_abort/sc_shutdown by merging calls to se_shutdown
- DEBUG: stconn: Add a CHECK_IF() when I/O are performed on a orphan SC
- MEDIUM: mworker: exiting when couldn't find the master mworker_proc element
- BUILD: ssl: use ASN1_STRING accessors for OpenSSL 4.0 compatibility
- BUILD: ssl: make X509_NAME usage OpenSSL 4.0 ready
- BUG/MINOR: tcpcheck: Fix typo in error error message for `http-check expect`
- BUG/MINOR: jws: fix memory leak in jws_b64_signature
- DOC: configuration: http-check expect example typo
- DOC/CLEANUP: config: update mentions of the old "Global parameters" section
- BUG/MEDIUM: ssl: Handle receiving early data with BoringSSL/AWS-LC
- BUG/MINOR: mworker: always stop the receiving listener
- BUG/MEDIUM: ssl: Don't report read data as early data with AWS-LC
- BUILD: makefile: fix range build without test command
- BUG/MINOR: memprof: avoid a small memory leak in "show profiling"
- BUG/MINOR: proxy: do not forget to validate quic-initial rules
- MINOR: activity: use dynamic allocation for "show profiling" entries
- MINOR: tools: extend the pointer hashing code to ease manipulations
- MINOR: tools: add a new pointer hash function that also takes an argument
- MINOR: memprof: attempt different retry slots for different hashes on collision
- MINOR: tinfo: start to add basic thread_exec_ctx
- MINOR: memprof: prepare to consider exec_ctx in reporting
- MINOR: memprof: also permit to sort output by calling context
- MINOR: tools: add a function to write a thread execution context.
- MINOR: debug: report the execution context on thread dumps
- MINOR: memprof: report the execution context on profiling output
- MINOR: initcall: record the file and line declaration of an INITCALL
- MINOR: tools: decode execution context TH_EX_CTX_INITCALL
- MINOR: tools: support decoding ha_caller type exec context
- MINOR: sample: store location for fetch/conv via initcalls
- MINOR: sample: also report contexts registered directly
- MINOR: tools: support an execution context that is just a function
- MINOR: actions: store the location of keywords registered via initcalls
- MINOR: actions: also report execution contexts registered directly
- MINOR: filters: set the exec context to the current filter config
- MINOR: ssl: set the thread execution context during message callbacks
- MINOR: connection: track mux calls to report their allocation context
- MINOR: task: set execution context on task/tasklet calls
- MINOR: applet: set execution context on applet calls
- MINOR: cli: keep the info of the current keyword being processed in the appctx
- MINOR: cli: keep track of the initcall context since kw registration
- MINOR: cli: implement execution context for manually registered keywords
- MINOR: activity: support aggregating by caller also for memprofile
- MINOR: activity: raise the default number of memprofile buckets to 4k
- DOC: internals: short explanation on how thread_exec_ctx works
- BUG/MINOR: mworker: only match worker processes when looking for unspawned proc
- MINOR: traces: defer processing of "-dt" options
- BUG/MINOR: mworker: fix typo &= instead of & in proc list serialization
- BUG/MINOR: mworker: set a timeout on the worker socketpair read at startup
- BUG/MINOR: mworker: avoid passing NULL version in proc list serialization
- BUG/MINOR: sockpair: set FD_CLOEXEC on fd received via SCM_RIGHTS
- BUG/MEDIUM: stconn: Don't forget to wakeup applets on shutdown
- BUG/MINOR: spoe: Properly switch SPOE filter to WAITING_ACK state
- BUG/MEDIUM: spoe: Properly abort processing on client abort
- BUG/MEDIUM: stconn: Fix abort on close when a large buffer is used
- BUG/MEDIUM: stconn: Don't perform L7 retries with large buffer
- BUG/MINOR: h2/h3: Only test number of trailers inserted in HTX message
- MINOR: htx: Add function to truncate all blocks after a specific block
- BUG/MINOR: h2/h3: Never insert partial headers/trailers in an HTX message
- BUG/MINOR: http-ana: Swap L7 buffer with request buffer by hand
- BUG/MINOR: stream: Fix crash in stream dump if the current rule has no keyword
- BUG/MINOR: mjson: make mystrtod() length-aware to prevent out-of-bounds reads
- MEDIUM: stats-file/clock: automatically update now_offset based on shared clock
- MINOR: promex: export "haproxy_sticktable_local_updates" metric
- BUG/MINOR: spoe: Fix condition to abort processing on client abort
- BUILD: spoe: Remove unsused variable
- MINOR: tools: add a function to create a tar file header
- MINOR: tools: add a function to load a file into a tar archive
- MINOR: config: support explicit "on" and "off" for "set-dumpable"
- MINOR: debug: read all libs in memory when set-dumpable=libs
- DEV: gdb: add a new utility to extract libs from a core dump: libs-from-core
- MINOR: debug: copy debug symbols from /usr/lib/debug when present
- MINOR: debug: opportunistically load libthread_db.so.1 with set-dumpable=libs
- BUG/MINOR: mworker: don't try to access an initializing process
- BUG/MEDIUM: peers: enforce check on incoming table key type
- BUG/MINOR: mux-h2: properly ignore R bit in GOAWAY stream ID
- BUG/MINOR: mux-h2: properly ignore R bit in WINDOW_UPDATE increments
- OPTIM: haterm: use chunk builders for generated response headers
- BUG/MAJOR: h3: check body size with content-length on empty FIN
- BUG/MEDIUM: h3: reject unaligned frames except DATA
- BUG/MINOR: mworker/cli: fix show proc pagination losing entries on resume
- CI: github: treat vX.Y.Z release tags as stable like haproxy-* branches
- MINOR: freq_ctr: add a function to add values with a peak
- MINOR: task: maintain a per-thread indicator of the peak run-queue size
- MINOR: mux-h2: store the concurrent streams hard limit in the h2c
- MINOR: mux-h2: permit to moderate the advertised streams limit depending on load
- MINOR: mux-h2: permit to fix a minimum value for the advertised streams limit
- BUG/MINOR: mworker: fix sort order of mworker_proc in 'show proc'
- CLEANUP: mworker: fix tab/space mess in mworker_env_to_proc_list()
Since version 3.1, the display order of old workers in 'show proc' was
accidentally reversed. The oldest worker was shown first and the newest
last, which was not the intended behavior. This regression was introduced
during the master-worker rework.
Fix this by sorting the list during deserialization in
mworker_env_to_proc_list().
An alternative fix would have been to iterate the list in reverse order
in the show proc function, but that approach risks introducing
inconsistencies when backporting to older versions.
Must be backported to 3.1 and later.
When using rq-load on tune.h2.fe.max-concurrent-streams, it's easy to
reach a situation where only one stream is allowed. There's nothing
wrong with this, but it turns out that slightly higher values do not
necessarily cause significantly higher loads and will improve the user
experience. For this reason the keyword now also supports "min" to
specify a minimum value. Experimentation shows that values from 5 to 15
remain very effective at protecting the run queue while allowing a great
level of parallelism that keeps a site fluid.
Global setting tune.h2.fe.max-concurrent-streams now supports an optional
"rq-load" option to pass either a target load, or a keyword among "auto"
and "ignore". These are used to quadratically reduce the advertised streams
limit when the thread's run queue size goes beyond the configured value,
and automatically reduce the load on the process from new connections.
With "auto", instead of taking an explicit value, it uses as a target the
"tune.runqueue-depth" setting (which might be automatic). Tests have shown
that values between 50 and 100 are already very effective at reducing the
loads during attacks from 100000 to around 1500. By default, "ignore"
is in effect, which means that the dynamic tuning is not enabled.
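Combining this with the "min" keyword described above, a configuration
could hypothetically look like this sketch (exact option syntax
assumed):

    global
        # shrink the advertised limit once the run queue exceeds 80,
        # but never advertise fewer than 10 concurrent streams
        tune.h2.fe.max-concurrent-streams 100 rq-load 80 min 10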
The hard limit on the number of concurrent streams is currently
determined only by configuration and returned by
h2c_max_concurrent_streams(). However this doesn't permit changing such
settings on the fly without risking breaking connections, and it doesn't
allow a connection to pick a different value, which
could be desirable for example to try to slow abuse down.
Let's store a copy of h2c_max_concurrent_streams() at connection
creation time into the h2c as streams_hard_limit. This inflates
the h2c size from 1324 to 1328 (0.3%) which is acceptable for the
expected benefits.
The new field th_ctx->rq_tot_peak contains the computed peak run queue
length averaged over the last 512 calls. This is computed when entering
process_runnable_tasks. It will not take into account new tasks that are
created or woken up during this round nor those which are evicted, which
is the reason why we're using a peak measurement to increase chances to
observe transient high values. Tests have shown that 512 samples are good
to provide a relatively smooth average measurement while still fading
away in a matter of milliseconds at high loads. Since this value is
only updated once per round, it cannot be used as a statistic and
shouldn't be exposed, it's only for internal use (self-regulation).
Sometimes it's desirable to observe fading-away peak values, where a new
value that is higher than the historical one instantly replaces it, and
otherwise contributes to it. It is convenient when trying to observe
certain phenomenons like peak queue sizes. The new function
swrate_add_peak_local() does that to a private variable (no atomic ops
involved as it's not worth the cost since such use cases are typically
local).
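A sketch of what such a helper may look like (the actual implementation
may differ; like the other swrate_* helpers, the average is assumed to
be maintained as a sum over n samples):

    /* fading peak: a higher sample replaces the average outright,
     * a lower one is blended in like a regular sliding average */
    static inline unsigned int
    swrate_add_peak_local(unsigned int *sum, unsigned int n, unsigned int v)
    {
        if (v > *sum / n)
            *sum = v * n;                          /* instant replacement */
        else
            *sum = *sum - (*sum + n - 1) / n + v;  /* regular fading */
        return *sum;
    }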
Add detection of release tags matching the vX.Y.Z pattern so they use
the same stable CI configuration as haproxy-* branches, rather than the
development one.
It prevents stable tags from triggering the CI with docker images and
SSL libraries only used for development.
Must be backported in stable releases.
After commit 594408cd61 ("BUG/MINOR: mworker/cli: fix show proc
pagination using reload counter"), the old-workers pagination stores
ctx->next_reload = child->reloads on flush failure, then skips entries
with child->reloads >= ctx->next_reload on resume.
The >= comparison is direction-dependent: it assumes the list is in
descending reload order (newest first). On current master, proc_list
is in ascending order (oldest first) because mworker_env_to_proc_list()
appends deserialized entries before mworker_prepare_master() appends
the new worker. This means the skip logic is inverted and can miss
entries or loop incorrectly depending on the version.
We fix this by renaming the context field to resume_reload and changing its
semantics: it now tracks the reload count of the last *successfully
flushed* row rather than the failed one. On flush failure, resume_reload
is left unchanged so the failed row is replayed on the next call. On
resume, entries are skipped by walking the list until the marker entry is
found (exact == match), which works regardless of list direction.
Additionally, we have to handle the unlikely case where the marker entry
is deleted from proc_list between handler calls (e.g. the process exits and
SIGCHLD processing removes it). Detect this by tracking the previous
LEAVING entry's reload count during the skip phase: if two consecutive
entries straddle the skip value (one > skip, the other < skip), the
deleted entry's former position has been crossed, so skipping stops and
the current entry is emitted.
This should be backported to all stable branches. On branches where
proc_list is in descending order (2.9, 3.0), the fix applies the
same way since the skip logic is now direction-agnostic.
The HTTP/3 parser cannot deal with unaligned frames, except for DATA. As
it was expected that such a case would not occur, a simple BUG_ON() was
written to protect HEADERS parsing.
First, this BUG_ON() was incorrectly written due to an incorrect operator
'>=' vs '>' when checking if data wraps. Thus this patch corrects it.
However this correction is not sufficient as it is still possible to
handle a large unaligned HEADERS frame, which would trigger this
BUG_ON(). This
is very unlikely as HEADERS is the first received frame on a request
stream, but not completely impossible. As HTTP/3 frame header (type +
length) is parsed first and removed, this leaves a small gap at the
buffer beginning. If this small gap is then filled with the remaining
frame payload, it would result in unaligned data. Trailers are also
sensitive here, as in this case a HEADERS frame is handled after
other frames.
The objective of this patch is to ensure that an unaligned frame is now
handled in a safe way. This is extended to all HTTP/3 frames (except DATA)
and not only to HEADERS type. Parsing is interrupted if frame payload is
wrapping in the buffer. This should never happen except maybe with some
weird clients, so the connection is closed with H3_EXCESSIVE_LOAD error.
This approach is considered the safest one, in particular for backport
purpose. In the future, realign operation via copy may be implemented
instead if considered as useful.
This must be backported up to 2.6.
In QUIC, a STREAM frame may be received with no data but with FIN bit
set. This situation is tedious to handle and haproxy's parsing code has
changed several times to deal with it. Now, the H3 and H09 layers'
parsing code is skipped in favor of the shared function
qcs_http_handle_standalone_fin() used to handle the HTX EOM emission.
However, this shortcut bypasses an important HTTP/3 validation check on
the received body size vs the announced content-length header. Under
some conditions, this could cause a desynchronization with the backend
server which could be exploited for request smuggling.
Fix HTTP/3 parsing code by adding a call to h3_check_body_size() prior
to qcs_http_handle_standalone_fin() if content-length header has been
found. If the body size is incorrect, the stream is immediately resetted
with H3_MESSAGE_ERROR code and the error is forwarded to the stream
layer.
Thanks to Martino Spagnuolo for his detailed report on this issue and
for contacting us about it via the security mailing list.
This must be backported up to 2.6.
hstream_build_http_resp() currently uses snprintf() to build the
status code and the generated X-req/X-rsp header values.
These strings are short and are fully derived from already parsed request
state, so they can be assembled directly in the HAProxy trash buffer using
`chunk_strcat()` and `ultoa_o()`.
This keeps the generated output unchanged while removing the remaining
`snprintf()` calls from the response-building path.
No functional change is expected.
Signed-off-by: Aleksandar Lazic <al-haproxy@none.at>
The window size increments are 31 bits and the topmost bit is reserved
and should be ignored, however it was not masked, so a peer sending it
set would emit a negative value which could actually reduce the current
window instead of increasing it. Note that the window cannot reach zero
as there's already a test for this, but transfers could slow down to
the same speed as if an initial window of just a few bytes had been
advertised. Let's just mask the reserved bit before processing.
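The fix essentially boils down to a one-line mask (sketch, variable
name assumed):

    /* bit 31 of a WINDOW_UPDATE increment is reserved (R): drop it */
    inc &= 0x7fffffff;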
This should be backported to all stable versions.
The stream ID indicated in GOAWAY frames must have its bit 31 (R) ignored
and this wasn't the case. The effect is that if this bit was present, the
GOAWAY frame would mark the last acceptable stream as negative, which is
the default situation (unlimited), thus would basically result in this
GOAWAY frame to be ignored since it would replace a negative last_sid
with another negative one. The impact is thus basically that if a peer
would emit anything non-zero in the R bit, the GOAWAY frame would be
ignored and new streams would still be initiated on the backend, before
being rejected by the server.
Thanks to Haruto Kimura (Stella) for finding and reporting this bug.
This fix needs to be backported to all stable versions.
The key type received over the peers protocol is not checked for
validity and as a result can crash the process when passed through
peer_int_key_type[] in peer_treat_definemsg(). The risk remains
very low since only trusted peers may exchange tables, however it
represents a risk the day haproxy supports new key types, because
mixing old and new versions could then cause the old ones to crash.
Let's add the required check in peer_treat_definemsg().
It is also worth noting that in this function a few protocol identifiers
of type int are read directly from a var_int via intdecode(), and that
some protocol aliasing may occur (e.g. table_id, table_id_len etc). This
is
not supposed to be a problem but it could hide implementation bugs and
cause interoperability issues once fixed, so these should be addressed
in a future commit that will not be marked for backporting.
Thanks to Haruto Kimura (Stella) for finding and reporting this bug.
This fix needs to be backported to all stable versions.
In pcli_prefix_to_pid(), when resolving a worker by absolute pid
(@!<pid>) or by relative pid (@1), a worker that still has PROC_O_INIT
set (i.e. not yet ready, still initializing) could be returned as a
valid target.
During a reload, if a client connects to the master CLI and sends a
command targeting a worker (e.g. @@1 or @@!<pid>), the master resolves
the target pid and attempts to forward the command by transferring a fd
over the worker's sockpair. If the worker is still initializing and has
not yet sent its READY signal, its end of the sockpair is not usable,
causing send_fd_uxst() to fail with EPIPE. This results in the
following alert being repeated in a loop:
[ALERT] (550032) : socketpair: Cannot transfer the fd 13 over sockpair@5. Giving up.
The situation is even worse if the initializing worker has already
exited (e.g. due to a bind failure) but has not yet been removed from
the process list: in that case the sockpair's remote end is already
closed, making the failure immediate and unrecoverable until the dead
worker is cleaned up.
This was not possible before 3.1 because the master's polling loop only
started once all workers were fully ready, making it impossible to
receive CLI connections while a worker was still initializing.
Fix this by skipping workers with PROC_O_INIT set in both the absolute
and relative pid resolution paths of pcli_prefix_to_pid(), so that
only fully initialized workers can be targeted.
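The resolution loops thus boil down to something like this (a sketch based
on the existing proc_list iteration):
    list_for_each_entry(child, &proc_list, list) {
        if (child->options & PROC_O_INIT)
            continue; /* still initializing: not a valid CLI target yet */
        /* ... match on absolute or relative pid as before ... */
    }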
Must be backported to 3.1 and later.
When loading libs into the core dump, let's also try to load
libthread_db.so.1 that gdb usually requires. It can significantly help
decoding the threads on systems which require it, and the file is quite
small (usually a few tens of kB). It can appear at a few different
locations and is generally next to libpthread.so, or alternately libc,
so we first look where we found them, and fall back to a few other
common places.
When set-dumpable=libs, let's also pick the debug symbols for the libs
we're loading. For now we only try /usr/lib/debug/<path>, which is quite
common and easy to guess. Build IDs could also be used but are more
complex to deal with, so let's stay simple for now.
This utility takes in argument the path to a core dump, and it looks
for the archive signature of libraries embedded with "set-dumpable libs",
and either emits the offset and size on stdout, or directly dumps the
contents so that the tar file can be extracted directly by piping the
output to tar xf.
When "set-dumpable" is set to "libs", in addition to marking the process
dumpable, haproxy also reads the binary and shared objects into memory as
a tar archive in a page-aligned location so that these files are easily
extractable from a future core dump. The goal here is to always have
access to the exact same binary and libs as those which caused the core
to happen. It's indeed very frequent to miss some of these, or to get
mismatching files due to a local update that didn't experience a reload,
or to get those of a host system instead of the container.
The in-memory tar file presents everything under a directory called
"core-%d" where %d corresponds to the PID of the worker process. In
order to ease the finding of these data in the core dump, the memory
area is contiguous and surrounded by PROT_NONE pages so that it appears
in its own segment in the core file. The total size used by this is a
few tens of MB, which is not a problem on large systems.
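The layout can be pictured with this simplified sketch (error handling
omitted; requires <sys/mman.h> and <unistd.h>):
    size_t pg = sysconf(_SC_PAGESIZE);
    char *area = mmap(NULL, tar_size + 2 * pg, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* only the middle part holds the tar data; the two PROT_NONE guard
     * pages force the area to appear in its own core segment.
     */
    mprotect(area + pg, tar_size, PROT_READ | PROT_WRITE);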
The global "set-dumpable" keyword currently is only positional. Let's
extend its syntax to support arguments. For now we support both "on"
and "off" to explicitly enable or disable it.
New function load_file_into_tar() concatenates a file into an in-memory
tar archive and grows its size. Only the base name and a provided prefix
are used to name the file. If the file cannot be loaded, it's added as
size zero and permissions 0 to show that it failed to load. This will
be used to load post-mortem information so it needs to remain simple.
The purpose here is to create a tar file header in memory from a known
file name, prefix, size and mode. It will be used to prepare archives
of libs in use for improved debugging, but may probably be useful for
other purposes due to its simplicity.
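For reference, a ustar header is a 512-byte block whose numeric fields are
octal ASCII; a sketch of the relevant part (standard tar format, not
haproxy-specific code):
    struct tar_hdr {            /* 512 bytes total */
        char name[100];         /* file name (prefix + base name) */
        char mode[8], uid[8], gid[8];
        char size[12];          /* file size in octal ASCII */
        char mtime[12];
        char chksum[8];         /* sum of all header bytes, this field
                                 * being counted as 8 spaces */
        char typeflag;
        /* ... remaining fields, zero-padded up to 512 bytes ... */
    };
    snprintf(hdr.size, sizeof(hdr.size), "%011o", (unsigned int)file_size);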
Since 7a1382da7 ("BUG/MINOR: spoe: Fix condition to abort processing on
client abort"), the chn variable is no longer used in
spoe_process_event(). Let's remove it.
This patch must be backported with the commit above, as far as 3.1.
The test to detect client aborts in the SPOE, introduced by commit b3be3b94a
("BUG/MEDIUM: spoe: Properly abort processing on client abort"), was not
correct. Producer flags must not be tested. Only the frontend SC must be
tested when the abortonclose option is set.
Because of this bug, when a client aborted, the SPOE processing was aborted
too, regardless of the abortonclose option.
This patch must be backported with the commit above, so as far as 3.1.
haproxy_sticktable_local_updates corresponds to the table->localupdate
counter, which is used internally by the peers protocol to identify
update messages in order to send and ack them among peers.
Here we decide to expose this information, as it is already the case in
"show peers" output, because it turns out that this value, which is
cumulative and grows in sync with the number of updates triggered on the
table due to changes initiated by the current process, can be used to
compute the update rate of the table. Computing the update rate of the
table (from the process point of view, ie: updates sent by the process and
not those received by the process), can be a great load indicator in order
to properly scale the infrastructure that is intended to handle the
table updates.
Note that there is a pitfall: the value will eventually wrap since it
is stored as an unsigned 32-bit integer. Scripts or systems making use
of this value must take wrapping into account to properly compute the
effective number of updates performed between two readings. Also, they
must ensure that the "polling" rate between readings is small enough so
that the value cannot wrap behind their back.
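For example, a wrap-safe rate computation only needs the unsigned
difference between two readings (valid as long as fewer than 2^32 updates
happen between them):
    uint32_t delta = curr_reading - prev_reading; /* handles one wrap */
    double rate = (double)delta / poll_interval_seconds;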
We no longer rely on now_offset stored in the shm-stats-file. Instead
haproxy automatically computes the now_offset relative to the monotonic
clock and the shared global clock.
Indeed, the previous model based on static now_offset when monotonic
clock is available proved to be insufficient when used in
combination with shm-stats-file (that is when monotonic clock is shared
between multiple co-processes). In ideal situation co-processes would
correctly apply the offset to their local monotonic clock and end up
with consistent now_ns. But when restarting from an existing
shm-stats-file from a previous session (ie: prior to reboot), then the
local monotonic clock would no longer be consistent with the one used
to update the file previously, so applying a static offset would fail
to restore clock consistency.
For this specific issue, a workaround was brought by 09bf116
("BUG/MEDIUM: stats-file: detect and fix inconsistent shared clock when resuming from shm-stats-file")
but the solution implemented there was deemed too fragile, because there
is a 60-second window where the fix would fail to detect an inconsistent
clock and would leave haproxy with a clock offset ranging from 0 to 60
seconds, which can be huge.
By recomputing the now_offset each time we learn from another
process (through the shared map, by reading global_now_ns), we simply
recompute our local offset (the difference between OUR monotonic clock
and the SHARED one). Also, in clock_update_global_date(), we make
sure we always recompute the now_offset as now_ms may have been
updated from the shared clock if the shared clock was ahead of us.
Thanks to that new logic, interrupted processes, resumed processes, and
processes started with a shm-stats-file from a previous session now
correctly recover from those various situations, and multiple
co-processes with diverging clocks on startup end up converging to
the same values.
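The principle can be summarized as follows (simplified sketch with
illustrative names):
    /* whenever the shared clock is read, refresh the local offset */
    now_offset = global_now_ns - local_monotonic_ns();
    /* local time then always derives from the monotonic clock */
    now_ns = local_monotonic_ns() + now_offset;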
Since it is no longer relevant to save now_offset in the map, it was
removed; but to prevent shm-stats-file incompatibility with previous
versions, an 8-byte hole was kept in its place, and we didn't bump the
shm-stats-file version on purpose.
This patch may be backported in 3.3 after a solid period of observation
to ensure we didn't break things.
mystrtod() was not length-aware and relied on null-termination or a
non-numeric character to stop. The fix adds a length parameter as a
strict upper bound for all pointer accesses.
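In spirit, the parsing loop becomes (a sketch, not the exact code):
    const char *p = str, *end = str + len; /* strict upper bound */
    while (p < end && isdigit((unsigned char)*p))
        p++; /* never dereference at or beyond 'end' */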
The practical impact in haproxy is essentially null: all callers embed
the JSON payload inside a large haproxy buffer, so the speculative read
past the last digit lands on memory that is still within the same
allocation. ASAN cannot detect it in a normal haproxy run for the same
reason — the overread never escapes the enclosing buffer. Triggering a
detectable fault requires placing the JSON payload at the exact end of
an allocation.
Note: the 'path' buffer was using a null-terminated string so the result
of strlen is passed to it, this part was not at risk.
Thanks to Kamil Frankowicz for the original bug report.
This patch must be backported to all maintained versions.
The commit 9f1e9ee0e ("DEBUG: stream: Display the currently running rule in
stream dump") revealed a bug. When a stream is dumped, if it is blocked on a
rule, we must take care the rule has a keyword to display its name.
Indeed, some action parsings are inlined with the rule parser. In that case,
there is no keyword attached to the rule.
Because of this bug, crashes can be experienced when a stream is
dumped. Now, when there is no keyword, "?" is displayed instead.
This patch must be backported as far as 2.6.
When an L7 retry is performed, we should not rely on b_xfer() to swap the L7
buffer with the request buffer. When it is performed, the request buffer is
not allocated. b_xfer() must not be called with an unallocated destination
buffer. The swap remains an optimization; for instance, it is not performed
on buffers of different sizes. So the caller is responsible for providing an
allocated destination buffer with enough free space to transfer data.
However, when an L7 retry is performed, we cannot allocate a request buffer
because we cannot yield: if we had to wait for a buffer, an error would be
reported and handled by process_stream(). But we can swap the buffers by
hand. At this stage, we know there is no request buffer, so we can easily
swap it with the L7 buffer.
Note there is no real bug for now.
This patch could be backported to all stable versions.
In HTX, headers and trailers parts must always be complete. It is unexpected
to find header blocks without the EOH block or trailer blocks without the
EOT block. So, during H2/H3 message parsing, we must take care to remove any
HEADER/TRAILER block inserted when an error is encountered. It is mandatory
to be sure to properly report parsing errors to the upper layer.
This is now performed by calling the htx_truncat_blk() function on the error
path. The tail block is saved before converting any HEADERS/TRAILERS frame
to HTX. It is used to remove all inserted blocks on error.
This patch relies on the following one:
"MINOR: htx: Add function to truncate all blocks after a specific block"
It should be backported with the commit above to all stable versions for
the H2 part and as far as 2.8 for the H3 one.
When H2 or H3 trailers are inserted in an HTX message, we must take care to
not exceed the maximum number of trailers allowed in a message (the same as
the maximum number of headers, i.e. tune.http.maxhdr). However, all HTX
blocks in the HTX message were considered. Only TRAILERS HTX blocks must be
considered.
To fix the issue, in h2_make_htx_trailers(), we rely on the "idx" variable
at the end of the for loop. In h3_trailers_to_htx(), we rely on the
"hdr_idx" variable.
This patch must be backported to all stable versions for the H2 part and as
far as 2.8 for the H3 one.
L7 retries are buggy when a large buffer is used on the request channel. A
memcpy is used to copy data from the request buffer into the L7 buffer. The
L7 buffer is for now always a standard buffer. So if a larger buffer is
used, this leads to a buffer overflow and crashes the process.
The best way to fix the issue is to disable L7 retries when a large buffer
was allocated for the request channel. In that case, we don't want to
allocate an extra large buffer.
No backport needed.
When a large buffer is used on a channel, once we've started to send data to
the opposite side, receives are blocked temporarily to be sure to flush the
large buffer ASAP to be able to fall back on regular buffers. This was
performed by skipping calls to the endpoint (connection or applet). However,
doing so broke abortonclose and, more generally, masked any shut or error
events reported by the lower layer.
To fix the issue, instead of skipping receives, we now try a receive but
with a requested size set to 0.
No backport needed.
A client abort when abortonclose is configured was ignored when messages
were sent on events, while it works properly when messages are sent via a
"send-spoe-group" action.
To fix the issue, when the SPOE filter is waiting for the SPOE applet
response, it must check if a client abort was reported and if so, must
interrupt its processing.
This patch should be backported as far as 3.1.
When the SPOE applet is created, the SPOE filter is set in SENDING_MSGS
state. When the applet has transferred data, it should switch the filter to
WAITING_ACK state. Concretely, there is no bug. At best, it could save some
useless applet wakeups.
This patch should be backported as far as 3.1.
When SC's shudown callback functions were merged, a regression was
introduced. The applet was no longer woken up. Because of this bug, an
applet could remain blocked, waiting for an I/O event or a timeout.
This patch should fix the issue #3301.
No backport needed.
FDs received through recv_fd_uxst() do not have FD_CLOEXEC set.
The equivalent sock_accept_conn() already handles this correctly:
any FD accepted or received in the master must be marked close-on-exec
to avoid leaking it across the execvp() performed on soft-reload.
This is currently triggering a leak in the master since 3.1: the worker
sends a socketpair fd to the master to issue the _send_status CLI
command, and recv_fd_uxst() receives it without setting FD_CLOEXEC. If a
re-exec is emitted before the master had the chance to close that fd, it
survives execvp() and appears as an untracked unnamed AF_UNIX socket in
the new master generation.
This must be backported to all maintained branches.
Add a NULL guard for the version field. This has no functional impact
since the master process never uses this field for its own mworker_proc
element, and should be the only one impacted. This avoids seeing "(null)"
in the version field when debugging.
Must be backported to 3.1 and later.
During a soft reload, a starting worker sends sock_pair[0] to the master
via send_fd_uxst(), then reads on sock_pair[1] waiting for the master to
acknowledge receipt. Because of a documented macOS sendmsg(2) bug, the
worker must keep sock_pair[0] open until the master confirms the fd was
received by the CLI applet. This means the read() on sock_pair[1] will
never return 0 (EOF), since the worker itself still holds a reference to
sock_pair[0]. The worker can only unblock when the master actively sends
a byte back. If the master crashes before doing so, the worker blocks
indefinitely in read().
Fix this by setting a 2-second SO_RCVTIMEO on sock_pair[1] before the
read(), so the worker can unblock and continue regardless of the master's
state.
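The fix essentially amounts to (a sketch):
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    setsockopt(sock_pair[1], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    /* read() now fails with EAGAIN after 2s instead of blocking forever */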
This was introduced by d7f6819161 ("BUG/MEDIUM: mworker: fix startup
and reload on macOS").
This should be backported to 3.1 and later.
In mworker_proc_list_to_env(), a typo used '&=' instead of '&' when
checking PROC_O_TYPE_WORKER in child->options. This would corrupt the
options field by clearing all bits except PROC_O_TYPE_WORKER, but since
the function is called right before the master re-execs itself during a
reload, the corruption has no actual effect: the in-memory proc_list is
discarded by the exec, and the options field is not serialized to the
environment anyway.
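The one-character difference, for reference:
    if (child->options &= PROC_O_TYPE_WORKER) /* bug: assigns, clears other bits */
    if (child->options &  PROC_O_TYPE_WORKER) /* fix: plain bit test */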
This should be backported to all maintained versions.
We defer processing of the "-dt" options until after the configuration
file has been read. This will be useful if we ever allow trace sources
to be registered later, for instance with Lua.
No backport needed.
In master-worker mode, when a freshly forked worker looks up its own
entry in proc_list to send its "READY" status to the master, the loop
was breaking on the first process with pid == -1 regardless of its
type. If a non-worker process (e.g. a master or program) also had
pid == -1, the wrong entry could be selected, causing send_fd_uxst()
to use an invalid ipc_fd.
Fix this by adding a PROC_O_TYPE_WORKER check to the loop condition,
and add a BUG_ON() assertion to catch any case where the loop exits
without finding a valid worker entry.
Must be backported to 3.1.
"show profiling" supports "aggr" for tasks but it was ignored for
memory. Now that we have many more entries, it makes sense to support it
there as well, ignoring the call path and merging similar operations.
Keywords registered out of an initcall will have a TH_EX_CTX_CLI_KWL
execution context pointing to the keyword list. The report will indicate
the first 5 words of the first command of the list, e.g.:
exec_ctx: cli kwl starting with 'debug counters '
This should also work for CLI keywords registered in Lua.
Now CLI keywords registered via an initcall will be tracked during
execution, by keeping a link to their initcall location. "show threads"
now shows "exec_ctx: kw registered at @debug.c:3093" which indeed
corresponds to the initcall for the debugging commands.
Till now the CLI didn't know what keyword was being processed after it
was parsed. In order to report the execution context, we'll need to
store it. And this may even help for post-mortem analysis to know the
exact keyword being processed, so let's store the pointer in the cli_ctx
part of the appctx.
It makes it possible to know when a thread is currently running inside an applet.
For example now "show threads" will show "applet '<CLI>'" for the thread
issuing this command.
It now appears almost everywhere due to callbacks (e.g. ssl_sock_io_cb).
Muxes also become visible now on memory profiling. A small test on h1+ssl
yields 838 lines of statistics. The number of buckets should definitely
be increased, and more grouping criteria should be added.
A performance test was conducted to observe the possible effect of
setting the execution context on each task switch, and it didn't change
at all, remaining at about 1.01 billion ctxsw/s on a 128-thread EPYC.
Most calls to mux ops were instrumented with a CALL_MUX_WITH_RET() or
CALL_MUX_NO_RET() macro in order to make the current thread's context
point to the called mux and be able to track its allocations. Only
a bunch of harmless mux_ctl() and ->subscribe/unsubscribe calls were
left untouched since useless. But destroy/detach/shut/init/snd_buf
and rcv_buf are now tracked.
It will not show allocations performed in I/O callbacks via tasklet
wakeups, however.
In order to ease reading of the output, cmp_memprof_ctx() knows about
muxes and sorts based on the .subscribe function address instead of
the mux_ops address so as to keep various callers grouped.
In order to be able to track memory allocation performed from message
callbacks, let's set the thread execution context to a generic function
pointing to them during their call. This allows for example to observe
the share of SSL allocations caused by ssl_sock_parse_clienthello() when
SSL captures are enabled.
The release calls are automatic from the SSL library for these, and are
registered directly via SSL_get_ex_new_index(). Maybe we should improve
the internal API to wrap that function and systematically track free
calls as well. In this case, maybe even registering the message callback
registration could take both the callback and the release function.
There are few such users however, essentially capture and keylog.
Doing this allows reporting the allocations/releases performed by filters
when running with memory profiling enabled. The flt_conf pointer is kept
and the report shows the filter name.
A bit similar to what was done for sample fetch functions and converters,
we now store with each action keyword the location of the initcall when
they're registered this way. Since there are many functions only calling
a LIST_APPEND() (one per ruleset), we now implement a dedicated function
to store the context in all keywords before doing the append.
However that's not sufficient, because keywords are not mandatory for
actions, so we cannot safely rely on rule->kw. Thus we then set the
exec_ctx per rule when they are all scanned in check_action_rules(),
based on the keyword if it exists, otherwise we make a context from
the action_ptr function if it is set (it should).
Finally at all call points we now check rule->exec_ctx.
The purpose here is to be able to spot certain callbacks, such as the
SSL message callbacks, which are difficult to associate to anything.
Thus we introduce a new context type, TH_EX_CTX_FUNC, for which the
context is just the function pointed to by the void *pointer. One
difficulty with callbacks is that the allocation and release contexts
will likely be different, so the code should be properly structured
to allow proper tracking, either by instrumenting all calls, or by
making sure that the free calls are easy to spot in a report.
With the two new context types TH_EX_CTX_SMPF/CONV, we can now also
report contexts corresponding to direct calls to sample_register_fetches()
and sample_register_convs(). In this case, the first word of the keyword
list is reported.
Now keywords are registered with an exec_ctx and this one is passed
when calling ->process. The ctx is of type INITCALL when passed via
an initcall where we know the file name and line number.
This was tested with an extra "malloc(15)" added in smp_fetch_path()
which shows that it works:
$ socat /tmp/sock1 - <<< "show profiling memory"|grep via
Calls | Tot Bytes | Caller and method [via]
1893399 0 60592592 0| 0x78b2ec task_run_applet+0x3339c malloc(32) [via initcall @http_fetch.c:2416]
When the execution context is set to TH_EX_CTX_INITCALL, the pointer
points to a valid initcall, and the decoder will show "kw registered
at %s:%d" with file and line number of the initcall declaration. It's
up to the caller to make the initcall pointer point to the one that was
set during the initcall. The purpose here is to be able to preserve and
pass that knowledge of an initcall down the chain so that future calls
to functions registered via the initcall are still assigned to it.
The INITCALL macros will now store the file and line number where they
are declared into the initcall struct, and RUN_INITCALLS() will assign
them to the global caller_file and caller_line variables, and will even
set caller_initcall to the current initcall so that at any instant such
functions know where their caller declared them. This will help with
error messages and traces where a bit of context will be welcome.
Now we have one extra line saying "exec_ctx: something" in thread dumps
when it's known. It may help with warnings and panics to figure what
is ongoing.
The new function chunk_append_thread_ctx() appends to a buffer the given
execution context based on its type and pointer. The goal is to easily
use it in profiling output and thread dumps. For now it only handles
TH_EX_CTX_NONE (which prints nothing) and TH_EX_CTX_OTHER (which indicates
"other ctx" followed by the pointer). It will be extended by new types as
they arrive.
By passing "byctx" to "show profiling memory", it's possible to sort by
the calling context first, which could help group certain calls by
subsystem and ease the interpretation of the output.
This now makes it possible to report the same function in multiple bins based on the
th_ctx's exec_ctx discriminant. It's also worth noting that the context is
not atomically committed, but this shouldn't be a problem since a single
entry can get it. In the worst case, a second thread trying to create the
same context in parallel would create a different bin just for this call,
which is harmless. The same situation already exists with the caller
pointer.
We have the struct made of a type and a pointer in the th_ctx and a
function to switch it for the current thread. Two macros are provided
to enclose a callee within a temporary context. For now only type OTHER
is supported (only a generic pointer).
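A sketch of the pattern (illustrative declarations, not the exact ones):
    struct th_exec_ctx {
        int type;        /* TH_EX_CTX_NONE, TH_EX_CTX_OTHER, ... */
        const void *ptr; /* generic pointer qualifying the context */
    };
    /* run <expr> under a temporary execution context, then restore */
    #define WITH_EXEC_CTX(_type, _ptr, expr) do {       \
        struct th_exec_ctx _prev = th_ctx->exec_ctx;    \
        th_ctx->exec_ctx.type = (_type);                \
        th_ctx->exec_ctx.ptr  = (_ptr);                 \
        (expr);                                         \
        th_ctx->exec_ctx = _prev;                       \
    } while (0)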
When two pointers hash to the same memprofile bin, we currently try again
with the same bin until we find a spare one or we reach the limit of 16.
Olivier suggested to try with a different step for different pointers so
as to limit the number of bins to visit in such a case, so let's split
the pointer hash calculation so that we keep the raw hash before reduction
and use its lowest bits as the retry step. We force lowest bit to 1 to
avoid integral multiples that would oscillate between only a few positions.
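In code, the idea is roughly the following (illustrative names):
    /* keep the full hash: reduce it for the bucket index, and reuse its
     * lowest bits as the probing step. Forcing the step odd avoids
     * integral multiples cycling over only a few positions in a
     * power-of-two sized table.
     */
    bucket = raw_hash >> (64 - TABLE_BITS);
    step   = (raw_hash & ((1UL << TABLE_BITS) - 1)) | 1;
    while (collision && retries++ < 16)
        bucket = (bucket + step) & ((1UL << TABLE_BITS) - 1);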
Quick tests with h1+h2 requests show that for ~744 distinct entries, we
used to have 1.17 retries per lookup before and 0.6 now so we're halving
the cost of hash collisions. A heavier workload that used to produce 920
entries with 2.01 retries per lookup now reaches 966 entries (94.3% usage
vs 89.8% before) with only 1.44 retries per lookup.
This should be safe to backport, but depends on this previous commit:
MINOR: tools: extend the pointer hashing code to ease manipulations
The purpose here is to combine two pointers and a long argument instead
of having the caller perform the mixing. Also it's cleaner and more
efficient this way, as the arg is mixed after the multiplications, and
modern processors are efficient at multiplying then adding.
We'll need to further extend the pointer hashing code to pass extra
parameters and to retrieve the dropped bits, so let's first split the
part that hashes the pointer from the part that reduces the hash to
the desired size.
Historically, the data manipulated by "show profiling" were copied
onto the stack for sorting and aggregating, but not only does this limit
the number of entries we can keep, it also has an impact on CPU
usage (having to redo the whole copy+sort upon each resume) and on the
output accuracy (if sorting changes lines, resume may happen from an
incorrect one).
Instead, let's dynamically allocate the work buffer and place it into
the service context. We only allocate it immediately before needing it
and release it immediately afterwards so that it doesn't stay long. It
also requires a release handler to free the buffers allocated by interrupted
dumps, but that's all. The overall result is now much cleaner, more
accurate, faster and safer.
This patch may be backported to older LTS releases.
In check_config_validity() and proxy_finalize() we check the consistency
of all rule sets, but the quic_initial rules were not placed there. This
currently has little to no impact, however we're going to use that to
also finalize certain debugging info so better call the function. This
can be backported to 3.1 (proxy_finalize is 3.4-only).
In 3.1, per-DSO statistics were added to the memprofile output by
commit 401fb0e87a ("MINOR: activity/memprofile: show per-DSO stats").
However an strdup() is performed there on the .info field, that is
never freed when leaving the function. Let's do it each time we leave
it. Ironically, this was found thanks to "show profiling" showing
itself as an unbalanced caller of strdup().
This needs to be backported to 3.0 since that commit was backported
there.
In 3.3, the "make range" target adopted a test command via the TEST_CMD
variable, with commit 90b70b61b1 ("BUILD: makefile: implement support
for running a command in range"). However now it breaks the script when
TEST_CMD is not set due to the shell expansion leaving two '||' operators
side by side. Let's fix this by passing the contents of the makefile
variable in positional arguments before executing them.
To read early data with AWS-LC (and BoringSSL), we have to use
SSL_read(). But SSL_read() will also try to do the handshake if it
hasn't been done yet, and at some point will do the handshake and will
return data that are actually not early data. So use SSL_in_early_data()
to make sure that the data we received are actually early data, and only
if so add the CO_FL_EARLY_DATA flag. Otherwise any data first received will be
considered early, and an Early-data header will be added.
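The check amounts to something like this (a sketch):
    ret = SSL_read(ssl, buf, count);
    if (ret > 0 && SSL_in_early_data(ssl))
        conn->flags |= CO_FL_EARLY_DATA; /* bytes really are early data */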
As this bug was introduced by 76ba026548,
it should be backported with it.
Upon _send_status, always stop the listener from which the request
was received, rather than looking it up from the proc_list entry via
fdtab[proc->ipc_fd[0]].owner.
A BUG_ON is added to verify that the listener which received the
request is the one expected for the reported PID.
This means it is no longer possible to send "_send_status READY XXX"
manually through the master CLI for testing, as that would trigger
the BUG_ON.
Must be backported as far as 3.1.
The API for early data is a bit different with BoringSSL and AWS-LC than
it is for OpenSSL. As it was implemented, early data would be accepted,
but would not be processed until the handshake is done. Change that by
doing something similar to what OpenSSL does, and, if 0RTT has been
enabled on the listener, use SSL_read() to try to get early data before
starting the handshake, and if there's any, provide them to the mux the
same way it is done for OpenSSL.
That replaces a bunch of #ifdef SSL_READ_EARLY_DATA_SUCCESS blocks where
something specific to OpenSSL had to be done.
This should be backported to 3.3.
The name of "Global section" was changed only in the summary, not in the
text itself. The names of some related refs were also updated.
Should be backported as far as 3.2.
EVP_MD_CTX is allocated using EVP_MD_CTX_new() but was never freed.
ctx should be initialized to NULL otherwise EVP_MD_CTX_free(ctx) could
segfault.
Must be backported as far as 3.2.
With a config:
backend bk_app
http-check expect status 200 string "status: ok"
This now correctly emits the error:
config : parsing [./patch.cfg:2] : 'http-check expect' : only one pattern expected.
The line containing the typo has been unchanged since at least HAProxy 2.2,
so the patch should be backported to all supported branches.
Starting with OpenSSL 4.0, X509_get_subject_name(), X509_get_issuer_name(),
and X509_CRL_get_issuer() return a const-qualified X509_NAME pointer.
Similarly, X509_NAME_get_entry() returns a const X509_NAME_ENTRY *, and
X509_NAME_ENTRY_get_data() returns a const ASN1_STRING *.
Introduce the __X509_NAME_CONST__ macro (defined to 'const' for OpenSSL
>= 4.0.0, empty for WolfSSL and older OpenSSL version which lacks const
on these APIs) and use it to qualify X509_NAME * variables and the
parameters of the three DN helper functions ssl_sock_get_dn_entry(),
ssl_sock_get_dn_formatted(), and ssl_sock_get_dn_oneline(). This avoids
both const-qualifier warnings on OpenSSL 4.0 and discarded-qualifier
warnings on WolfSSL, without needing explicit casts at call sites.
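The macro looks like this in spirit (the exact version test below is an
assumption, not a copy of the patch):
    #if defined(OPENSSL_VERSION_NUMBER) \
        && (OPENSSL_VERSION_NUMBER >= 0x40000000L) \
        && !defined(USE_OPENSSL_WOLFSSL)
    #define __X509_NAME_CONST__ const
    #else
    #define __X509_NAME_CONST__
    #endif
    __X509_NAME_CONST__ X509_NAME *name = X509_get_subject_name(cert);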
In ssl_sock.c (ssl_get_client_ca_file) and ssl_gencert.c
(ssl_sock_do_create_cert), a __X509_NAME_CONST__ X509_NAME * variable was
being reused to store the result of X509_NAME_dup() and then passed to
mutating functions (X509_NAME_add_entry_by_txt, X509_NAME_free). Introduce
separate X509_NAME * variables (xn_dup, subject) to hold the mutable
duplicate.
Original patch from Alexandr Nedvedicky <sashan@openssl.org>:
https://www.mail-archive.com/haproxy@formilux.org/msg46696.html
In OpenSSL 4.0, the ASN1_STRING struct was made opaque and direct access
to its members (->data, ->length, ->type) no longer compiles. Replace
these accesses in ssl_sock_get_serial(), ssl_sock_get_time(), and
asn1_generalizedtime_to_epoch() with the proper accessor functions
ASN1_STRING_get0_data(), ASN1_STRING_length(), and ASN1_STRING_type().
The old direct access is preserved under USE_OPENSSL_WOLFSSL since
WolfSSL does not provide these accessor functions.
Original patch from Alexandr Nedvedicky <sashan@openssl.org>:
https://www.mail-archive.com/haproxy@formilux.org/msg46696.html
When a master process is reloading, the HAPROXY_PROCESSES variable is
deserialized. In older versions of the master-worker (< 1.9), no master
element existed in this variable.
This is not supposed to happen anymore, and could have provoked problems
in the master anyway.
This patch changes the behavior by exiting the master with an alert if
no master element was found in this variable.
When no endpoint is attached to a SC, it is unexpected to have I/O (receive
or send). But we honestly don't know if it happens or not. So a CHECK_IF()
is added to be able to track such calls.
Calls to se_shutdown() were not the same between applets and mux endpoints.
Only the SHUTW flag was not the same. However, only the multiplexers are
sensitive to the true SHUTW flag. The applets handle all of them the same
way. So calls to se_shutdown() from sc_abort() and sc_shutdown() can be
merged to always use the multiplexer version.
The wake_srv_chk() function is now only used by srv_chk_io_cb(), the
health-check I/O callback function. So let's remove it; the code of the
function was moved into srv_chk_io_cb().
When we fail to create a mux, in conn_create_mux(), instead of calling the
app_ops .wake() callback function, we can directly call sc_conn_process().
At this stage, we know we are using a connection, so it is safe to do so.
At the end of task_run_applet() and task_process_applet(), instead of
calling the app_ops .wake() callback function, we can directly call
sc_applet_process(). At this stage, we know we are using an applet, so it is
safe to do so.
Thanks to previous commits, it is now possible to wake the data layer up,
via a tasklet_wakeup, instead of using the app_ops .wake() callback
function.
When a data layer must be notified of a mux event (an error for instance),
we now always perform a tasklet_wakeup(). TASK_WOKEN_MSG state is used by
default. TASK_WOKEN_IO is eventually added if the data layer was subscribed
to receives or sends.
Changes are not trivial at all. We replaced a synchronous call to the
sc_conn_process() function by a tasklet_wakeup().
Now, when a mux stream is waking its data layer up for receives or sends, it
uses the TASK_WOKEN_IO state. The state is not used by the stconn I/O
callback function for now.
In spop_resume_each_sending_spop_strm(), there was exactly the same code
as in spop_strm_notify_send(). So let's use spop_strm_notify_send() instead
of duplicating the code.
It is the first commit of a series to refactor the SC app_ops. The first
step is to remove the .wake() callback function from the app_ops to replace
all uses by a wakeup of the SC tasklet.
Here, when the SC is woken up, the state is now tested and if TASK_WOKEN_MSG
is set, sc_conn_process() is called.
When ECDH-ES algorithm is used in a JWE token, no cek is provided and
one must be built in order to decrypt the contents of the token. The
decrypting key is built by deriving a temporary key out of a public key
provided in the token and the private key provided by the user and
performing a concatKDF operation.
When the encoding is of the ECDH family, the optional "apu" and "apv"
fields of the JOSE header must be parsed, as well as the mandatory "epk"
field that contains an EC public key used to derive a key that allows
either to decrypt the contents of the token (in case of ECDH-ES) or to
decrypt the content encoding key (cek) when using ECDH-ES+AES Key Wrap.
Convert a JWK with the "EC" key type ("kty") into an EVP_PKEY. The JWK
can either represent a public key if it only contains the "x" and "y"
fields, or a private key if it also contains the "d" field.
The 'jwt_tokenize' function that can be used to split a JWT token into
its subparts can either fully process the token (from beginning to end)
when we need to check its signature, or only partially when using the
jwt_header_query or jwt_member_query converters. In this case we relied
on the fact that the return value of the 'jwt_tokenize' function was not
checked because a '-1' was returned (which was not actually an error).
In order to make this logic more explicit, the 'jwt_tokenize' function
now has a way to warn the caller that the token was invalid (less
subparts than the specified 'item_num') or that the token was not
processed in full (enough subparts found without parsing the token all
the way).
The function will now only return 0 if we found strictly the same number
of subparts as 'item_num'.
The master process in the proc_list mustn't set the PROC_O_LEAVING flag
since the reload doesn't mean the master will leave.
Could be backported as far as 3.1.
mproxy_li is supposed to be used in _send_status to stop the sockpair FD
between the master and the new worker, which is a listener.
This can only work if the listener has been stored in the fdtab owner,
and there's no reason it shouldn't be here.
It's always a bit tricky to avoid already backported patches when they
just got a different ID (e.g. a critical fix in a topic branch). Most
often with stable topic branches we just want to pick all stable commits
since the last backported one. New option -L instead of -m does exactly
this: it enumerates only commits that were added to the reference branch
after its most recent backport.
The -vv option used --verbose as its long form, which was identical to
the long form of -v. Since the case statement matches top-to-bottom,
--verbose would always trigger -v (VERBOSE=2), making -vv unreachable
via its long option. The long form is renamed to --verbose=all to avoid
the conflict, and the usage string is updated accordingly.
Must be backported to 3.3.
The script relied on a bash-specific process substitution (< <(...)) to
feed socat's output into the read loop. This is replaced with a standard
POSIX pipe into a command group.
The response parsing is also simplified: instead of iterating over each
line with a while loop and echoing them individually, the status line is
read first, the "--" separator consumed, and the remaining output is
streamed to stderr or discarded as a whole depending on the verbosity
level.
Could be backported to 3.3 as it makes it more portable, but introduces a
slight change in the error format.
socat was used with the ${MASTER_SOCKET} variable directly, letting it
auto-detect the network protocol. However, when given a plain filename
that does not point to a UNIX socket, socat would create a file at that
path instead of reporting an error.
To fix this, the address type is now determined explicitly: if
MASTER_SOCKET points to an existing UNIX socket file (checked with -S),
UNIX-CONNECT: is used; if it matches a <host>:<port> pattern, TCP: is
used; otherwise an error is reported. The socat_addr variable is also
properly scoped as local to the reload() function.
Could be backported to 3.3.
No need to have duplicated comp_ctx and comp_algo for request vs response
in the comp_state struct, because thanks to the previous commit a compression
filter is either oriented to the request or the response, and 2 distinct
filters are instantiated when we need to handle both request and response
compression.
Thus we can save ourselves the duplicated struct members and related
operations.
Existing "compression" filter is a multi-purpose filter that will try
to compress both requests and responses according to "compression"
settings, such as "compression direction".
One of the pre-requisite work identified to implement decompression
filter is that we needed a way to manually define the sequence of
enabled filters to chain them in the proper order to make
compression and decompression chains work as expected in regard
to the intended use-case.
Due to the current nature of the "compression" filter this was not
possible, because the filter has a combined action as it will try
to compress both requests and responses, and as we are about to
implement "filter-sequence" directive, we will not be able to
change the order of execution of the compression filter between
requests and responses.
A possible solution we identified to solve this issue is to split the
existing "compression" filter into 2 distinct filters, one which is
request-oriented, "comp-req", and another one which is response-oriented
"comp-res". This is what we are doing in this commit. Compression logic
in itself is unchanged, "comp-req" will only aim to compress the request
while "comp-res" will try to compress the response. Both filters will
still be invoked on request and responses hooks, but they only do their
part of the job.
From now on, to compress both requests and responses, both filters have
to be enabled on the proxy. To preserve the original behavior, the
"compression" filter is still supported: it simply instantiates both
"comp-req" and "comp-res" filters implicitly, as the compression filter is
now effectively split into 2 separate filters under the hood.
When using "comp-res" and "comp-req" filters explicitly, the use of the
"compression direction" setting is not relevant anymore. Indeed, the
compression direction is assumed as soon as one or both filters are
enabled. Thus "compression direction" is kept as a legacy option in
order to configure the "compression" generic filter.
Documentation was updated.
The proxy_get_comp() function can be used to retrieve proxy->comp options or
allocate and initialize it if missing.
For now, it is solely used by parse_compression_options(), but the goal is
to be able to use this helper from multiple origins.
There was a "jwt_tokenize" call whose return value was not checked.
This was found by coverity and raised in GitHub #3277.
This patch can be backported to all stable branches.
In connect_server(), it is possible to have no server defined (dispatch mode
or transparent backend). In that case, we must be careful to check the srv
variable in all calls involving the server. This was not done in one place,
where the protocol to use for websocket is retrieved. This must not be done
when there is no server.
This patch should fix the first report in #3144. It must be backported to
all stable versions.
In sample_conv_sha2(), calls to EVP_Digest* can fail. So we must check the
return value of each call, report an error on failure and release the
digest context.
This patch should fix the issue #3274. It should be backported as far as
2.6.
Released version 3.4-dev6 with the following main changes :
- CLEANUP: acme: remove duplicate includes
- BUG/MINOR: proxy: detect strdup error on server auto SNI
- BUG/MINOR: server: set auto SNI for dynamic servers
- BUG/MINOR: server: enable no-check-sni-auto for dynamic servers
- MINOR: haterm: provide -b and -c options (RSA key size, ECDSA curves)
- MINOR: haterm: add long options for QUIC and TCP "bind" settings
- BUG/MINOR: haterm: missing allocation check in copy_argv()
- BUG/MINOR: quic: fix counters used on BE side
- MINOR: quic: add BUG_ON() on half_open_conn counter access from BE
- BUG/MINOR: quic/h3: display QUIC/H3 backend module on HTML stats
- BUG/MINOR: acme: acme_ctx_destroy() leaks auth->dns
- BUG/MINOR: acme: wrong labels logic always memprintf errmsg
- MINOR: ssl: clarify error reporting for unsupported keywords
- BUG/MINOR: acme: fix incorrect number of arguments allowed in config
- CLEANUP: haterm: remove unreachable labels hstream_add_data()
- CLEANUP: haterm: avoid static analyzer warnings about rand() use
- CLEANUP: ssl: Remove a useless variable from ssl_gen_x509()
- CI: use the latest docker for QUIC Interop
- CI: remove redundant "halog" compilation
- CLEANUP: cfgparse: accept-invalid-http-* does not support "no"/"defaults"
- BUG/MEDIUM: spoe: Acquire context buffer in applet before consuming a frame
- MINOR: traces: always mark trace_source as thread-aligned
- MINOR: ncbmbuf: improve itbmap_next() code
- MINOR: proxy: improve code when checking server name conflicts
- MINOR: quic: add a new metric for ncbuf failures
- BUG/MINOR: haterm: cannot reset default "haterm" mode
- BUG/MEDIUM: cpu-topo: Distribute CPUs fairly across groups
- BUG/MINOR: quic: missing app ops init during backend 0-RTT sessions
- CLEANUP: ssl: remove outdated comments
- MINOR: mux-h2: also count glitches on invalid trailers
- MINOR: mux-h2: add a new setting, "tune.h2.log-errors" to tweak error logging
- BUG/MEDIUM: mux-h2: make sure to always report pending errors to the stream
- BUG/MINOR: server: adjust initialization order for dynamic servers
- CLEANUP: tree-wide: drop a few useless null-checks before free()
- CLEANUP: quic-stats: include counters from quic_stats
- REORG: stats/counters: move extra_counters to counters not stats
- CLEANUP: stats: drop stats.h / stats-t.h where not needed
- MEDIUM: counters: change the fill_stats() API to pass the module and extra_counters
- CLEANUP: counters: only retrieve zeroes for unallocated extra_counters
- MEDIUM: counters: add a dedicated storage for extra_counters in various structs
- MINOR: counters: store a tgroup step for extra_counters to access multiple tgroups
- MEDIUM: counters: store the number of thread groups accessing extra_counters
- MINOR: counters: add EXTRA_COUNTERS_BASE() to retrieve extra_counters base storage
- MEDIUM: counters: return aggregate extra counters in ->fill_stats()
- MEDIUM: counters: make EXTRA_COUNTERS_GET() consider tgid
- BUG/MINOR: call EXTRA_COUNTERS_FREE() before srv_free_params() in srv_drop()
- MINOR: promex: test applet resume in stress mode
- BUG/MINOR: promex: fix server iteration when last server is deleted
- BUG/MINOR: proxy: add dynamic backend into ID tree
- MINOR: proxy: convert proxy flags to uint
- MINOR: server: refactor srv_detach()
- MINOR: proxy: define a basic "del backend" CLI
- MINOR: proxy: define proxy watcher member
- MINOR: stats: protect proxy iteration via watcher
- MINOR: promex: use watcher to iterate over backend instances
- MINOR: lua: use watcher for proxies iterator
- MINOR: proxy: add refcount to proxies
- MINOR: proxy: rename default refcount to avoid confusion
- MINOR: server: take proxy refcount when deleting a server
- MINOR: lua: handle proxy refcount
- MINOR: proxy: prevent backend removal when unsupported
- MINOR: proxy: prevent deletion of backend referenced by config elements
- MINOR: proxy: prevent backend deletion if server still exists in it
- MINOR: server: mark backend removal as forbidden if QUIC was used
- MINOR: cli: implement wait on be-removable
- MINOR: proxy: add comment for defaults_px_ref/unref_all()
- MEDIUM: proxy: add lock for global accesses during proxy free
- MEDIUM: proxy: add lock for global accesses during default free
- MINOR: proxy: use atomic ops for default proxy refcount
- MEDIUM: proxy: implement backend deletion
- REGTESTS: add a test on "del backend"
- REGTESTS: complete "del backend" with unnamed defaults ref free
- BUG/MINOR: hlua: fix return with push nil on proxy check
- BUG/MEDIUM: stream: Handle TASK_WOKEN_RES as a stream event
- MINOR: quic: use signed char type for ALPN manipulation
- MINOR: quic/h3: reorganize stream reject after MUX closure
- MINOR: mux-quic: add function for ALPN to app-ops conversion
- MEDIUM: quic/mux-quic: adjust app-ops install
- MINOR: quic: use server cache for ALPN on BE side
- BUG/MEDIUM: hpack: correctly deal with too large decoded numbers
- BUG/MAJOR: qpack: unchecked length passed to huffman decoder
- BUG/MINOR: qpack: fix 1-byte OOB read in qpack_decode_fs_pfx()
- BUG/MINOR: quic: fix OOB read in preferred_address transport parameter
- BUG/MEDIUM: qpack: correctly deal with too large decoded numbers
- BUG/MINOR: hlua: Properly enable/disable line receives from HTTP applet
- BUG/MEDIUM: hlua: Fix end of request detection when retrieving payload
- BUG/MINOR: hlua: Properly enable/disable receives for TCP applets
- MINOR: htx: Add a function to retrieve the HTTP version from a start-line
- MINOR: h1-htx: Reports non-HTTP version via dedicated flags
- BUG/MINOR: h1-htx: Be sure that H1 response version starts by "HTTP/"
- MINOR: http-ana: Save the message version in the http_msg structure
- MEDIUM: http-fetch: Rework how HTTP message version is retrieved
- MEDIUM: http-ana: Use the version of the opposite side for internal messages
- DEBUG: stream: Display the currently running rule in stream dump
- MINOR: filters: Use filter API as far as possible to break loops on filters
- MINOR: filters: Set last_entity when a filter fails on stream_start callback
- MINOR: stream: Display the currently running filter per channel in stream dump
- DOC: config: Use the right alias for %B
- BUG/MINOR: channel: Increase the stconn bytes_in value in channel_add_input()
- BUG/MINOR: sample: Fix sample to retrieve the number of bytes received and sent
- BUG/MINOR: http-ana: Increment scf bytes_out value if an haproxy error is sent
- BUG/MAJOR: fcgi: Fix param decoding by properly checking its size
- BUG/MAJOR: resolvers: Properly lowered the names found in DNS response
- BUG/MEDIUM: mux-fcgi: Use a safe loop to resume each stream eligible for sending
- MINOR: mux-fcgi: Use a dedicated function to resume streams eligible for sending
- CLEANUP: qpack: simplify length checks in qpack_decode_fs()
- MINOR: counters: Introduce COUNTERS_UPDATE_MAX()
- MINOR: listeners: Update the frequency counters separately when needed
- MINOR: proxies: Update beconn separately
- MINOR: stats: Add an option to disable the calculation of max counters
Add a new option, "stats calculate-max-counters [on|off]".
It makes it possible to disable the calculation of max counters, as they
can have a performance cost.
Update beconn separately from the call to COUNTERS_UPDATE_MAX(), as there
will soon be an option to make COUNTERS_UPDATE_MAX() do nothing, and
we still want beconn to be properly updated, as it is used for other
purposes.
Update the frequency counters that are exported to the stats page
outside of the call to COUNTERS_UPDATE_MAX(), so that they will
happen even if COUNTERS_UPDATE_MAX() ends up doing nothing.
Introduce COUNTERS_UPDATE_MAX(), and use it instead of using
HA_ATOMIC_UPDATE_MAX() directly.
For now it just calls HA_ATOMIC_UPDATE_MAX(), but will later be modified
so that we can disable max calculation.
This can be backported up to 2.8 if the usage of COUNTERS_UPDATE_MAX()
generates too many conflicts.
This patch simplifies the decoding loop by merging the variable-length
integer truncation check (len == -1) with the subsequent buffer
availability check (len < length).
This removes redundant code blocks and improves readability without
changing the decoding logic.
Note that the second removal is correct, as the check was duplicated and
unnecessary.
At the end of fcgi_send(), if the connection is not full anymore, we loop on
the send list to resume FCGI streams for sending. But a stream may be
removed from this list during the loop. So a safe loop must be used.
This patch should be backported to all stable versions.
Names found in DNS responses are lowered to be compared. A name is composed
of several labels, strings preceded by their length on one byte. For
instance:
3www7haproxy3org
There is a bug when labels are lowered. The label length is not skipped and
the tolower() function is called on it. So for label lengths in the range
[65-90] (uppercase chars), 32 is added to the label length due to the
conversion of an uppercase char to lowercase. This bug can lead to OOB
reads later in the resolvers code.
The fix is quite obvious: the label length must be skipped when the label is
lowered.
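The corrected loop walks the name label by label, along these lines (a
sketch):
    unsigned char *p = name;
    while (*p) {
        int label_len = *p++;  /* length byte: skip it, never lower it */
        while (label_len--) {
            *p = tolower(*p);
            p++;
        }
    }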
Thank you to Kamil Frankowicz for having reported this.
This patch must be backported to all stable versions.
In functions used to decode a FCGI parameter, the test on the data length
before reading the parameter's name and value did not consider the offset
value used to skip already parsed data. So it was possible to read more data
than available (OOB read). To do so, a malicious FCGI server must send a
forged GET_VALUES_RESULT record containing a parameter with wrong name/value
length.
Thank you to Kamil Frankowicz for having reported this.
This patch must be backported to all stable versions.
There was an issue in the if/else statement in smp_fetch_bytes() function.
When req.bytes_in or req.bytes_out was requested, res.bytes_in was always
returned. It is now fixed.
This patch must be backported to 3.3.
This function is no longer used. So it is not really a bug. But it is still
available and could be used by legacy applets. In that case, we must take
care to increment the stconn bytes_in value accordingly when input data are
inserted.
This patch must be backported to 3.3.
In the custom log format part, %[req.bytes_in] was erroneously documented as
the alias of %B. The correct alias is %[res.bytes_in]. It is now fixed.
This patch must be backported to 3.3.
Since 3.1, when a stream's info is dumped, it is possible to print the
yielding filter on each channel, if any. It is useful to detect a buggy
filter in a spinning loop. But it is not possible to detect a filter
consuming too much CPU per execution. We can see a filter was executing in
the backtrace reported by the watchdog, but we are unable to spot the
specific one.
Thanks to this patch, it is now possible. When a dump is emitted, the
running or yielding filters on each channel are now displayed with their
current state (RUNNING or YIELDING).
This patch could be backported as far as 3.2 because it could be useful to
spot issues. But the filter API was slightly refactored in 3.4, so this
patch should be adapted.
On the stream, the last_entity should reference the last rule or the last
filter evaluated during the stream processing. However, this info was not
saved when a filter failed in the stream_start callback function. It is now
fixed.
This patch could be backported as far as 3.1.
When the filters API was refactored to improve loops on filters, some places
were not updated (or not fully updated). Some loops were not relying on
resume_filter_list_break() while it was possible. So let's do so with this
patch.
Since 2.5, when a stream's info is dumped, it is possible to print the
yielding rule, if any. It is useful to detect buggy rules in a spinning
loop. But it is not possible to detect a rule consuming too much CPU
per execution. We can see a rule was executing in the backtrace reported by
the watchdog, but we are unable to spot the specific rule.
Thanks to this patch, it is now possible. When a dump is emitted, the
running or yielding rule is now displayed with its current state (RUNNING or
YIELDING).
This patch could be backported as far as 3.2 because it could be useful to
spot issues.
When the response is sent by HAProxy, from an applet or from the analyzers,
the request version is used for the response. The main reason is that there
is no real version for the response when this happens. "HTTP/1.1" is used,
but it is in fact just an HTX response. So the version of the request is
used.
In the same manner, when the request is sent from an applet (httpclient),
the response version is used, once available.
The purpose of this change is to return the most accurate version from the
user's point of view.
Thanks to previous patches, we can now rely on the version stored in the
http_msg structure to get the request or the response version.
"req.ver" and "res.ver" sample fetch functions returns the string
representation of the version, without the prefix, so "<major>.<minor>", but
only if the version is valid. For the response, "res.ver" may be added from
a health-check context, in that case, the HTX message is used.
"capture.req.ver" and "capture.res.ver" does the same but the "HTTP/" prefix
is added to the result. And "capture.res.ver" cannot be called from a
health-check.
To ease the version formatting and avoid code duplication, a helper
function was added. So these samples are now relying on "get_msg_version()".
When the request or the response is received, the numerical value of the
message version is now saved. To do so, the field "vsn" was added in the
http_msg structure. It is an unsigned char. The 4 MSB bits are used for the
major digit and the 4 LSB bits for the minor one.
Of course, the version must be valid. The HTX_SL_F_NOT_HTTP flag of the
start-line is used to be sure the version is valid. But because this flag is
quite new, we also take care that the string representation of the version
is 8 bytes long. 0 means the version is not valid.
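The packing and unpacking are straightforward:
    /* store the "HTTP/<major>.<minor>" digits in one byte */
    msg->vsn = ((major & 0x0F) << 4) | (minor & 0x0F);
    /* and to read them back */
    major = msg->vsn >> 4;
    minor = msg->vsn & 0x0F;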
When the response is parsed, we test the version to be sure it is
valid. However, the protocol was not tested. Now we take care that the
response version starts with "HTTP/", otherwise an error is returned.
Of course, it is still possible to by-pass this test with
"accept-unsafe-violations-in-http-response" option.
This patch could be backported to all stable versions.
Now, when the HTTP version format is not strictly valid, flags are set on
the h1 parser and the HTX start-line. H1_MF_NOT_HTTP is set on the H1 parser
and HTX_SL_F_NOT_HTTP is set on the HTX start-line. These flags were
introduced to avoid parsing the version again and again to know whether it
is valid or not, especially because it is valid most of the time.
htx_sl_vsn() function can now be used to retrieve the ist string
representing the HTTP version from a start-line passed as parameter. This
function takes care to return the right part of the start-line, depending on
its type (request or response).
From a lua TCP applet, in functions used to retrieve data (receive,
try_receive and getline), we must take care to disable receives when data
are returned (or on failure) and to restart receives when these functions
are called again. In addition, when an applet execution is finished, we must
restart receives to properly drain the request.
This patch should be backported to 3.3. On older version, no bug was
reported so we can wait a report first.
When the lua HTTP applet was refactored to use its own buffers, a bug was
introduced in the receive() and getline() functions. We rely on the
HTX_FL_EOM flag to detect the end of the request. But there is nothing
preventing extra calls to these functions once the whole request was
consumed. If this happens, the call will yield waiting for more data with no
way to stop it.
To fix the issue, the APPLET_REQ_RECV flag was added to know the whole
request was received.
This patch should fix #3293. It must be backported to 3.3.
From a lua HTTP applet, in the getline() function, we must take care to
disable receives when a line is retrieved and to restart receives when the
function is called again. In addition, when an applet execution is finished,
we must restart receives to properly drain the request.
This patch could help to fix #3293. It must be backported to 3.3. On
older versions, no bug was reported so we can wait for a report first.
But in that case, hlua_applet_http_recv() should also be fixed (on 3.3
it was fixed during the applets refactoring).
Same fix as this one for hpack:
7315428615 ("BUG/MEDIUM: hpack: correctly deal with too large decoded numbers")
Indeed, the encoding of integers for QPACK is the same as for HPACK,
but with 64-bit integers.
Must be backported as far as 2.6.
This bug impacts only the QUIC backend: only a QUIC client receives a
server preferred address transport parameter.
In quic_transport_param_dec_pref_addr(), the boundary check for the
connection ID was inverted and incorrect. This could lead to an
out-of-bounds read during the following memcpy.
This patch fixes the comparison to ensure the buffer has enough input data
for both the CID and the mandatory Stateless Reset Token.
Thank you to Kamil Frankowicz for having reported this.
Must be backported to 3.3.
In qpack_decode_fs_pfx(), if the first qpack_get_varint() call
consumes the entire buffer, the code would perform a 1-byte
out-of-bounds read when accessing the sign bit via **raw.
This patch adds an explicit length check at the beginning of
qpack_get_varint(), which systematically secures all other callers
against empty inputs. It also adds a necessary check before the
second varint call in qpack_decode_fs_pfx() to ensure data is still
available before dereferencing the pointer to extract the sign bit,
returning QPACK_RET_TRUNCATED if the buffer is exhausted.
Thank you to Kamil Frankowicz for having reported this.
Must be backported as far as 2.6.
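A minimal sketch of the second guard, assuming the layout of the QPACK
field section prefix (the surrounding variable names are illustrative):

    /* After the first varint (Required Insert Count) was decoded, <raw>
     * and <len> were advanced past the consumed bytes. One byte must
     * remain before reading the sign bit of the Delta Base field.
     */
    if (!len)
        return QPACK_RET_TRUNCATED; /* exhausted: cannot read the S bit */
    sign = !!(*raw & 0x80);         /* S bit of the Delta Base octet */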
A call to the huffman decoder function (huff_dec()) is made from
qpack_decode_fs() without checking the buffer length passed to this
function, leading to an OOB read which can crash the process.
Thank you to Kamil Frankowicz for having reported this.
Must be backported as far as 2.6.
The varint hpack decoder supports unbounded numbers but returns 32-bit
results. This means that truncation may happen on some field lengths or
indexes that would be emitted as quantities that do not fit in a 32-bit
number. The final value will also depend on how the left shift operation
behaves on the target architecture (e.g. whether bits are lost or the
shift count is taken modulo 32). This could lead to a desynchronization
of the HPACK stream decoding compared to what an external observer would
see (e.g. from a network traffic capture). However, there isn't any
impact between streams: HPACK is performed at the connection level,
not at the stream level, so no stream may try to leverage this
limitation to have any effect on another one.
For the fix, instead of adding checks everywhere in the loop and for
the final stage, let's rewrite the decoder to compare the read value
to a max value that is shifted by 7 bits for every 7 bits read. This
allows a sender to continue to emit zeroes for higher bits without
being blocked, while detecting that a received value would overflow.
The loop is now simpler as it deals both with values with the higher
bit set and the final ones, and stops once the final value was recorded.
A test on non-zero before performing the shift was added to please
ubsan, though in practice zero shifted by any quantity remains zero.
But the test is cheap so that's OK.
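A minimal sketch of this scheme, assuming a 7-bit continuation encoding
(this illustrates the principle only, not the actual HAProxy decoder):

    #include <stddef.h>
    #include <stdint.h>

    /* Continue decoding an HPACK-style integer once the N-bit prefix
     * has been stored in <val>. <raw>/<len> hold the continuation
     * bytes. Returns the number of bytes consumed, or -1 on truncated
     * input or 32-bit overflow.
     */
    static int varint_decode_cont(const uint8_t *raw, size_t len, uint32_t *val)
    {
        unsigned shift = 0;
        size_t pos = 0;

        while (1) {
            uint32_t chunk;
            uint8_t b;

            if (pos >= len)
                return -1;              /* truncated input */
            b = raw[pos++];
            chunk = b & 0x7f;

            if (chunk) {
                /* a non-zero group must fit in the room left in 32
                 * bits; zero groups are always accepted, so a sender
                 * may keep emitting zero high-order groups without
                 * being rejected
                 */
                if (shift > 31 || chunk > ((UINT32_MAX - *val) >> shift))
                    return -1;          /* value overflows 32 bits */
                *val += chunk << shift; /* only shift non-zero (ubsan) */
            }
            if (!(b & 0x80))
                return (int)pos;        /* final group: value complete */
            shift += 7;
        }
    }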
Thanks to Guillaume Meunier, Head of Vulnerability Operations Center
France at Orange Cyberdefense, for reporting this bug.
This should be backported to all stable versions.
On the backend side, QUIC MUX may be started preemptively before the
ALPN negotiation. This is useful notably for 0-RTT implementation.
However, this was a source of crashes. The ALPN was expected to be
retrieved from the server cache, but QUIC MUX still used the ALPN from
the transport layer. This could cause a crash, especially when several
connections run in parallel, as the server cache is shared among
threads.
Thanks to the previous patch which reworks QUIC MUX init, this issue can
now be fixed. Indeed, if conn_get_alpn() is not successful, the MUX can
look at the server cache again to use the expected value.
Note that this could still prevent the MUX from working as expected if
the server cache is reset between connect_server() and MUX init. Thus,
the ultimate solution would be to copy the cached ALPN into the
connection. This problem is not specific to QUIC though, and must be
fixed in a separate patch.
This patch reworks the installation of app-ops layer by QUIC MUX.
Previously, app_ops field was stored directly into the quic_conn
structure. Then the MUX reused it directly during its qmux_init().
This patch removes app_ops field from quic_conn and replaces it with a
copy of the negotiated ALPN. By using quic_alpn_to_app_ops(), it ensures
it remains compatible with a known application layer.
On the MUX layer, qcc_install_app_ops() now uses the standard
conn_get_alpn() to retrieve the ALPN from the transport layer. This is
done via the newly defined <get_alpn> QUIC xprt callback.
This new architecture should be cleaner as it better highlights the
responsibility of each layer in the ALPN/app negotiation.
Extract the conversion from ALPN to qcc_app_ops type from quic_conn
source file into QUIC MUX. The newly created function is named
quic_alpn_to_app_ops(). This will serve as a central point to identify
which ALPNs are currently supported in our QUIC stack.
This patch is purely a small refactoring. It will be useful for the
next one which reworks the MUX app-ops layer init. The current cleanup
notably allows to remove the H3/hq-interop headers from the quic_conn
source file.
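A sketch of what this central mapping point may look like (the exact
signature is an assumption; h3_ops/hq_interop_ops refer to the H3 and
hq-interop application layers mentioned above):

    #include <string.h>

    /* Map a negotiated ALPN token to the matching application layer
     * callbacks. Returns NULL when the ALPN is not supported by the
     * QUIC stack.
     */
    static const struct qcc_app_ops *
    quic_alpn_to_app_ops(const char *alpn, size_t len)
    {
        if (len == 2 && !memcmp(alpn, "h3", 2))
            return &h3_ops;         /* HTTP/3 */
        if (len == 10 && !memcmp(alpn, "hq-interop", 10))
            return &hq_interop_ops; /* interop-testing HTTP/0.9 */
        return NULL;                /* unsupported ALPN */
    }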
The QUIC MUX layer may be closed before its transport counterpart. It
may then be necessary to reject any new streams opened by the remote
peer. This operation depends however on the application protocol.
Previously, a function qc_h3_request_reject() was directly implemented
in the quic_conn source file for use when HTTP/3 was previously
negotiated. However, this solution was not extensible and broke
layering.
This patch introduces a proper separation with a <strm_reject>
callback defined in the quic_conn structure. When set, it will be used
to preemptively close any new stream. QUIC MUX is responsible for
setting it just before its closure.
No functional change. This patch is purely a refactoring with a better
architecture design. Especially, H3-specific code is now completely
removed from the transport layer.
In most of the haproxy code, the ALPN is used as a signed char pointer.
In the QUIC code instead, it is manipulated as unsigned.
Unify this by using the signed type in the QUIC code. This allows to
remove a bunch of unnecessary casts.
The conversion of TASK_WOKEN_RES to a stream event was missing. Among other
things, this wakeup reason is used when a stream is dequeued. So it was
possible to skip the connection establishment if the stream was also woken
up for a timer reason. When this happened, the stream was blocked till the
queue timeout expiration.
Converting TASK_WOKEN_RES to STRM_EVT_RES fixes the issue.
This patch should fix the issue #3290. It must be backported as far as 3.2.
hlua_check_proxy() may now return NULL if the target proxy instance has
been flagged for deletion. Thus, proxy methods have been adjusted and
may push nil to report such a case.
This patch fixes these error paths. When nil is pushed, 1 must be
returned instead of 0. This represents the count of pushed values on the
stack which can be retrieved by the caller.
No need to backport.
Complete the delete backend regtests by checking deletion of a proxy
with a reference on an unnamed defaults instance. This operation is
sensitive as the defaults refcount is decremented, and when the last
backend is removed, the defaults instance is also freed.
Add a reg-test for the "del backend" CLI command. First, checks are
performed to ensure a backend cannot be deleted if not in the expected
state.
Then, a "del backend" success is tested. Stats are dumped to ensure the
backend instance is indeed removed.
This patch finalizes "del backend" handler by implementing the proper
proxy deletion.
After ensuring backend deletion can be performed, several steps are
executed. First, any watcher elements are updated to point on the next
proxy instance. The backend is then removed from ID and name global
trees and is finally detached from proxies_list.
Once the backend instance is removed from proxies_list, the backend
cannot be found anymore by new lookups. Thread isolation is lifted and
proxy_drop() is called, which will purge the proxy if its refcount is
null. Thanks to recently introduced PROXIES_DEL_LOCK, proxy_drop() is
thread safe.
The default proxy refcount <def_ref> is used to account for references
on a default proxy instance by standard proxies. Currently, this is
necessary when a default proxy defines TCP/HTTP rules or a tcpcheck
ruleset.
Transform every access to <def_ref> so that atomic operations are now
used. Currently, this is not strictly needed as default proxy references
are only manipulated at init or deinit in single-thread mode. However,
when dynamic backend deletion is implemented, <def_ref> will also be
decremented at runtime.
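Illustrative only (the field path below is assumed; HA_ATOMIC_* are the
usual HAProxy atomic macros):

    /* taking a reference on the defaults instance */
    HA_ATOMIC_INC(&defpx->conf.def_ref);

    /* dropping it */
    if (HA_ATOMIC_SUB_FETCH(&defpx->conf.def_ref, 1) == 0) {
        /* last reference: the defaults instance can now be released */
    }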
This patch is similar to the previous one, but this time it deals with
functions related to defaults proxy instances. The PROXIES_DEL_LOCK lock
is used to protect accesses to the global collections.
This patch will be necessary to implement dynamic backend deletion, even
if defaults won't be used as direct targets of a "del backend" command.
However, a backend may have a reference on a defaults instance. When the
backend is freed, this reference is released, which can in turn cause
the freeing of the default proxy instance. All of this will occur at
runtime, outside of thread isolation.
Define a new lock with label PROXIES_DEL_LOCK. Its purpose is to protect
operations performed on global lists or trees while a proxy is freed.
Currently, this lock is unneeded as proxies are only freed on
single-thread init or deinit. However, with the incoming dynamic backend
deletion, this operation will also be performed at runtime, outside of
thread isolation.
Implement the be-removable argument to CLI wait. This is implemented
via a be_check_for_deletion() invocation, also used by the "del backend"
handler.
The objective is to test whether a backend instance can be removed. If
this is not the case, the command may return immediately if the target
proxy is incompatible with dynamic removal or if a user action is
required. Otherwise, the command will wait until the temporary
restriction is lifted.
Currently, a quic_conn on the backend side may access its parent proxy
instance during its lifetime. In particular, this is the case for
counters updates, with the <prx_counters> field directly referencing a
proxy memory zone.
As such, this prevents safe backend removal. One solution would be to
check if the upper connection instance is still alive, as a proxy cannot
be removed if connections are still active. However, this would
completely prevent proxy counters updates via
quic_conn_prx_cntrs_update(), as this is performed on quic_conn release.
Another solution would be to use a refcount, or a dedicated counter
which accounts for QUIC connections on a backend instance. However,
refcounts are currently only used by short-term references, and this
could also have a negative impact on performance.
Thus, the simplest solution for now is to disable removal of a backend
if a QUIC server is/was used in it. This is considered acceptable for
now as QUIC on the backend side is experimental.
Ensure a backend instance cannot be removed if there are still servers
in it. This is checked via be_check_for_deletion() to ensure "del
backend" cannot be executed. The only solution is to first remove all
the server instances with "del server".
This check only covers servers not yet targeted via "del server". For
deleted servers not yet purged (due to their refcount), the proxy
refcount is incremented but this does not block "del backend"
invocation.
Define a new proxy flag PR_FL_NON_PURGEABLE. This is used to mark every
proxy instance explicitly referenced in the config. Such instances
cannot be deleted at runtime.
Static use_backend/default_backend rules are handled in
proxy_finalize(). Also, sample expression proxy references are protected
via smp_resolve_args().
Note that this last case also incidentally protects any proxy referenced
via a CLI "set var" expression. This should not be necessary, as in this
case the variable value is instantly resolved so the proxy reference is
not needed anymore. This also affects dynamic servers.
Prevent removal of a backend which relies on features not compatible
with dynamic backends. This is the case if either dispatch or
transparent option is used, or if a stick-table is declared.
These limitations are similar to the "add backend" ones.
Implement proxy refcount for Lua proxy class. This is similar to the
server class.
In summary, proxy_take() is used to increment refcount when a Lua proxy
is instantiated. proxy_drop() is called via Lua garbage collector. To
ensure a deleted backend is released asap, hlua_check_proxy() now
returns NULL if PR_FL_DELETED is set.
This approach is directly dependent on Lua GC execution. As such, it
probably suffers from the same limitations as the ones already described
in the previous commit. With the current patch, "del backend" is not
directly impacted though. However, the final proxy deinit may happen
after a long period of time, which could cause memory pressure increase.
One final observation regarding deinit: it is necessary to delay a
BUG_ON() which checks that the defaults proxies list is empty. Now this
must be executed after Lua deinit (called via post_deinit_list). This
should guarantee that all proxies and their defaults refcounts are null.
When a server is deleted via "del server", increment refcount of its
parent backend. This is necessary as the server is not referenced
anymore in the backend, but can still access it via its own <proxy>
member. Thus, backend removal must not happen until the complete purge
of the server.
The proxy refcount is released in srv_drop() if the flag SRV_F_DELETED
is set, which indicates that "del server" was used. This operation is
performed after the complete release of the server instance to ensure no
access will be performed on the proxy via the server. The refcount must
not be decremented if a server is freed without a "del server"
invocation.
Another solution could be for servers to always increment the refcount.
However, refcount usage is currently limited in haproxy, so the current
approach is preferred. It also has the advantage that a refcount still
incremented on deinit may indicate that some servers were not completely
purged themselves.
Note that this patch may cause issues if "del backend" is used in
parallel with Lua scripts referencing servers. Currently, any server
referenced by Lua must be released by its garbage collector to ensure it
can be finally freed. However, it appears that in some cases the GC does
not run for several minutes. At least this has been observed with Lua
version 5.4.8. In the end, this will result in "del backend" commands
blocking indefinitely.
Rename proxy conf <refcount> to <def_ref>. This field only serves for
defaults proxy instances. The objective is to avoid confusion with the
newly introduced <refcount> field used for dynamic backends.
As an optimization, it could be possible to remove <def_ref> and only
use <refcount> also for defaults proxies usage. However for now the
simplest solution is implemented.
This patch does not bring any functional change.
Implement a refcount notion in the proxy structure. The objective is to
be able to increment the refcount on a proxy to prevent its deletion
temporarily. This is similar to the server refcount: "del backend" is
not blocked and will remove the targeted instance from the global
proxies_list. However, the final free operation is delayed until the
refcount is null.
As stated above, the API is similar to servers. Proxies are initialized
with a refcount of 1. Refcount can be incremented via proxy_take(). When
no longer useful, refcount is decremented via proxy_drop() which
replaces the older free_proxy(). Deinit is only performed once refcount
is null.
This commit also defines flag PR_FL_DELETED. It is set when a proxy
instance has been removed via a "del backend" CLI command. This should
serve as indication to modules which may still have a refcount on the
target proxy so that they can release it as soon as possible.
Note that this new refcount is completely ignored for a default proxy
instance. For them, proxy_take() is a pure noop. Free is immediately
performed on the first proxy_drop() invocation.
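A usage sketch of this API (the calling context is illustrative):

    /* hold a reference so the proxy cannot be freed under us, e.g.
     * across an applet yield; "del backend" may still unlink it from
     * the global proxies_list in the meantime
     */
    proxy_take(px);

    /* ... long-lived usage; PR_FL_DELETED can be checked to release
     * the reference as soon as possible once the proxy was deleted ...
     */

    /* drop the reference; the proxy is freed once the refcount is null */
    proxy_drop(px);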
Ensure proxies iteration in the promex applet is safe via a new watcher
member. The principle is similar to the one already used for servers
iteration.
Note that ctx.p[0] is not updated manually anymore at the end of a
function, as this is automatically done via the watcher itself.
Define a new <px_watch> watcher member in stats applet context. It is
used to register the applet on a proxy when iterating over the proxies
list. <obj1> is automatically updated via the watcher interaction.
Watcher is first initialized prior to stats_dump_proxies() invocation.
This guarantees that stats dump is safe even if applet yields and a
backend is removed in parallel.
Define a new member watcher_list in proxy. It will be used to register
modules which iterate over the proxies list. This will ensure that the
operation is safe even if a backend is removed in parallel.
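A sketch of the resulting iteration pattern, assuming a watcher API of
roughly this shape (the real helper names and signatures may differ):

    /* <ctx->px> was registered at init time with something like
     * watcher_init(&ctx->px_watch, &ctx->px, ...): moving the watcher
     * updates <ctx->px> itself.
     */
    watcher_attach(&ctx->px_watch, proxies_list);
    while (ctx->px) {
        /* dump <ctx->px>; if the applet yields here and the watched
         * proxy is deleted in parallel, the deletion code rewires the
         * registered pointer to the next live proxy
         */
        watcher_next(&ctx->px_watch, ctx->px->next);
    }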
Add "del backend" handler which is restricted to admin level. Along with
it, a new function be_check_for_deletion() is used to test if the
backend is removable.
Correct documentation for srv_detach() which previously stated that this
function could be called for a server even if not stored in its proxy
list. In fact there is a BUG_ON() which detects this case.
The proxy flags member was of type char. This will soon not be
sufficient as new flags will be defined. As such, convert the flags
member to an unsigned int.
Add the missing proxy_index_id() call in the "add backend" handler.
This step is responsible for storing the newly created proxy instance in
the used_proxy_id global tree.
No need to backport.
Servers iteration via promex is now resilient to server runtime deletion
thanks to the watcher mechanism. However, the watcher was not correctly
initialized which could cause duplicate metrics reporting.
This issue happens when the promex dump yields while manipulating the
last server of a proxy. If this server is removed in parallel, the <sv>
pointer will be set to NULL when promex resumes. Instead of switching to
another proxy, the code would reuse the same one and iterate again over
the same server list.
To fix this issue, the <sv> pointer must not be reinitialized just
after a resumption point. Instead, this is now performed before
promex_dump_srv_metrics(), or just after switching to another proxy
instance. Thus, on resumption, if promex_dump_srv_metrics() is started
with <sv> as NULL, it means that the server was deleted and the end of
the current proxy's server list was reached, hence iteration is
restarted on the next proxy instance.
Note that ctx.p[1] does not need to be manually updated at the end of
promex_dump_srv_metrics() as srv_watch already does that.
This patch must be backported up to 3.0.
Implement a stress mode which forces a yield in the promex applet each
time a metric is displayed. This is implemented by returning 0 in
promex_dump_ts() each time the output buffer is not empty.
To test this, haproxy must be compiled with DEBUG_STRESS and the
following configuration must be used:
global
stress-level 1
As seen with the last changes to counters allocation, the move of the
counters storage to the thread group as operated in commit 04a9f86a85
("MEDIUM: counters: add a dedicated storage for extra_counters in various
structs") causes some random errors when using ASAN, because the extra
counters are freed in srv_drop() after calling srv_free_params(), which
is responsible for freeing the per-thread group storage.
For the proxies however it's OK because free calls are made before the
call to deinit_proxy() which frees the per_tgrp area.
No backport is needed, this is purely 3.4-dev.
Now we store and retrieve only counters for the current tgid when more
than one is supported. This allows to significantly reduce contention
on shared stats. The haterm utility saw its performance increase from
4.9 to 5.8M req/s in H1, and 6.0 to 7.6M for H2, both with 5 groups of
16 threads, showing that we don't necessarily need insane amounts of
groups.
Now thanks to the new macro EXTRA_COUNTERS_AGGR() we can iterate over all
thread group storages when returning the data for a given metric. This
remains convenient and mostly transparent. The caller continues to pass
the pointer to the metric in the first group, and offsets are calculated
for all other groups and data summed. For now all groups except the
first one contain only zeroes but reported values are nevertheless
correct.
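A minimal sketch of the aggregation principle, assuming the per-group
areas are laid out <tgrp_step> bytes apart (the function below is an
illustration, not the actual macro):

    #include <stddef.h>
    #include <stdint.h>

    /* Sum a 64-bit metric across all thread group areas. <first_ptr>
     * is the caller's pointer to the metric in the first group's
     * storage; the same offset is applied to every other group's area.
     */
    static uint64_t counters_aggr_u64(const void *first_ptr,
                                      const void *first_area,
                                      size_t tgrp_step, unsigned nb_tgroups)
    {
        size_t ofs = (const char *)first_ptr - (const char *)first_area;
        uint64_t total = 0;
        unsigned grp;

        /* tgrp_step == 0 means a single shared storage (no scaling) */
        if (!tgrp_step)
            return *(const uint64_t *)first_ptr;

        for (grp = 0; grp < nb_tgroups; grp++)
            total += *(const uint64_t *)((const char *)first_area +
                                         grp * tgrp_step + ofs);
        return total;
    }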
The goal is to always retrieve the storage address of the first thread
group for the given module. This will be used to iterate over all thread
groups. For now it returns the same value as EXTRA_COUNTERS_GET().
In order to be able to properly allocate all storage and retrieve data
from there, we'll need to know how many thread groups are supposed to
access it. Let's store the number of thread groups at init time. If the
tgrp_step is zero, there's always only one tg though.
Now EXTRA_COUNTERS_ALLOC() takes this number of thread groups in argument
and stores it in the structure. It also allocates as many areas as needed,
incrementing the datap pointer by the step for each of them.
EXTRA_COUNTERS_FREE() uses this info to free all allocated areas.
EXTRA_COUNTERS_INIT() initializes all allocated areas, this is used
elsewhere to clear/preset counters, e.g. in proxy_stats_clear_counters().
It involves a memcpy() call for each array, which is normally preset to
something empty but might also be used to preset certain non-scalar
fields such as an instance name.
We'll need to permit any user to update its own tgroup's extra counters
instead of the global ones. For this we now store the per-tgroup step
between two consecutive data storages, for when they're stored in a
tgroup array. When shared (e.g. resolvers or listeners), we just store
zero to indicate that it doesn't scale with tgroups. For now only the
registration was handled, it's not used yet.
Servers, proxies, listeners and resolvers all use extra_counters. We'll
need to move the storage to per-tgroup for those where it matters. Now
we're relying on an external storage, and the data member of the struct
was replaced with a pointer to the data pointer, called datap. When
the counters are registered, these datap are set to point to relevant
locations. In the case of proxies and servers, it points to the first
tgrp's storage. For listeners and resolvers, it points to a local
storage. The rationale here is that listeners are limited to a single
group anyway, and that resolvers have a low enough load so that we do
not care about contention there.
Nothing should change for the user at this point.
Since version 2.4 with commit 7f8f6cb926 ("BUG/MEDIUM: stats: prevent
crash if counters not alloc with dummy one") we can afford to always
update extra_counters because we know they're always either allocated
or linked to a dedicated trash. However, the ->fill_stats() callbacks
continue to access such values, making it technically possible to
retrieve random counters from this trash, which is not really clean.
Let's implement an explicit test in the ->fill_stats() functions to
only return 0 for the metric when not allocated like this. It's much
cleaner because it guarantees that we're returning an empty counter
in this case rather than random values.
The situation currently happens for dummy servers like the ones used
in Lua proxies as well as those used by rings (e.g. used for logging
or traces). Normally, none of the objects retrieved via stats or
Prometheus is concerned by this unallocated extra_counters situation,
so this is more about a cleanup than a real fix.
We'll soon need to iterate over thread groups in the fill_stats() functions,
so let's first pass the extra_counters and stats_module pointers to the
fill_stats functions. They now call EXTRA_COUNTERS_GET() themselves with
these elements in order to retrieve the required pointer. Nothing else
changed, and it's getting even a bit more transparent for callers.
This doesn't change anything visible however.
A number of C files include stats.h or stats-t.h, many of which were
just to access the counters. Now those which really need counters rely
on counters.h or counters-t.h, which already reduces the amount of
preprocessed code to be built (~3000 lines or about 0.05%).
It was always difficult to find extra_counters when the rest of the
counters are in counters-t.h. Let's move the types to counters-t.h
and the macros to counters.h. Stats include them since they're used
there. But some users could now be cleaned of the stats definitions.
There's something a bit awkward in the way stats counters are inherited
through the QUIC modules: quic_conn-t includes quic_stats-t.h, which
declares quic_stats_module as extern from a type that's not known from
this file. And anyway externs should not be exported from type definitions
since they're not part of the ABI itself.
This commit moves the declaration to quic_stats.h which now takes care
to include stats-t.h to get the definition of struct stats_module. The
few users who used to learn it through quic_conn-t.h now include it
explicitly. As a bonus this reduces the number of preprocessed lines
by 5000 (~0.1%).
By the way, it looks like struct stats_module could benefit from being
moved off stats-t.h since it's only used at places where the rest of
the stats is not needed. Maybe something to consider for a future
cleanup.
We only support platforms where free(NULL) is a NOP so that
null checks are useless before free(). Let's drop them to keep
the code clean. There were a few in cfgparse-global, flt_trace,
ssl_sock and stats.
It appears that in cli_parse_add_server(), we're calling srv_alloc_lb()
and stats_allocate_proxy_counters_internal() before srv_preinit() which
allocates the thread groups. LB algos can make use of the per_tgrp part
which is initialized by srv_preinit(). Fortunately for now no algo uses
both tgrp and ->server_init() so this explains why this remained
unnoticed to date. Also, extra counters will soon require per_tgrp to
already be initialized. So let's move these between srv_preinit() and
srv_postinit(). It's possible that other parts will have to be moved
in between.
This could be backported to recent versions for the sake of safety but
it looks like the current code cannot tell the difference.
Some stream parsing errors that do not affect the connection result in
the parsed block not being transferred from the rx buffer to the channel
and not being reported upstream in rcv_buf(), causing the stconn to time
out. Let's detect this condition, and propagate term flags anyway since
no more progress will be made otherwise.
This should be backported at least till 3.2, probably even 2.8.
The H2 mux currently logs whenever some decoding fails. Most of the errors
happen at the connection level, but some are even at the stream level,
meaning that multiple logs can be emitted for a given connection, which
can quickly use some resource for little value. This new setting allows
to tweak this and decide to only log errors that affect the connection,
or even none at all.
This should be backported at least as far as 3.2.
Two cases were not causing glitches to be incremented:
- invalid trailers
- trailers on closed streams
This patch addresses this. It could be backported, at least to 3.2.
ssl_sock_srv_try_reuse_sess() was modified by this commit to no longer
fail (it now returns void), but the related comments remained:
BUG/MINOR: quic: missing app ops init during backend 0-RTT sessions
This patch cleans them up.
The QUIC mux requires "application operations" (app ops), which are a list
of callbacks associated with the application level (i.e., h3, h0.9) and
derived from the ALPN. For 0-RTT, when the session cache cannot be reused
before activation, the current code fails to reach the initialization of
these app ops, causing the mux to crash during its initialization.
To fix this, this patch restores the behavior of
ssl_sock_srv_try_reuse_sess(), whose purpose was to reuse sessions stored
in the session cache regardless of whether 0-RTT was enabled, prior to
this commit:
MEDIUM: quic-be: modify ssl_sock_srv_try_reuse_sess() to reuse backend
sessions (0-RTT)
With this patch, this function now does only one thing: attempt to reuse a
session, and that's it!
This patch allows ignoring whether a session was successfully reused from
the cache or not. This directly fixes the issue where app ops
initialization was skipped upon a session cache reuse failure. From a
functional standpoint, starting a mux without reusing the session cache
has no negative impact; the mux will start, but with no early data to
send.
Finally, there is the case where the ALPN is reset when the backend is
stopped. It is critical to continue locking read access to the ALPN to
secure shared access, which this patch does. It is indeed possible for the
server to be stopped between the call to connect_server() and
quic_reuse_srv_params(). But this cannot prevent the mux from starting
without app ops. This is why a 'TODO' section was added, as a reminder
that a race condition regarding the ALPN reset still needs to be fixed.
Must be backported to 3.3.
Make sure CPUs are distributed fairly across groups, in case the number
of groups to generate is not a divisor of the number of CPUs, otherwise
we may end up with a few groups that have no CPU bound to them.
This was introduced in 3.4-dev2 with commit 56fd0c1a5c ("MEDIUM: cpu-topo:
Add an optional directive for per-group affinity"). No backport is
needed unless this commit is backported.
When "mode haterm" was set in a "defaults" section, it could not be
overridden in subsequent sections using the "mode" keyword. This is because
the proxy stream instantiation callback was not being reset to the
default stream_new() value.
This could break the stats URI with a configuration such as:
defaults
mode haterm
# ...
frontend stats
bind :8181
mode http
stats uri /
This patch ensures the ->stream_new_from_sc() proxy callback is reset
to stream_new() when the "mode" keyword is parsed for any mode other
than "haterm".
No need to backport.
During proxy_finalize(), a lookup is performed over the servers-by-name
tree to detect any collision. Only the first conflict for each server
instance is reported to avoid a combinatorial explosion with too many
alerts shown.
Previously, this was written using a for loop which never actually
iterated. Replace this with a simple if statement as this is cleaner.
This should fix github issue #3276.
itbmap_next() advances an iterator over a ncbmbuf buffer storage. When
reaching the end of the buffer, <b> field is set to NULL, and the caller
is expected to stop working with the iterator.
Complete this part to ensure that the itbmap type is fully initialized in
case a null iterator value is returned. This is not strictly required
given the above description, but this is better to avoid any possible
future mistake.
This should fix coverity issue from github #3273.
This could be backported up to 2.8.
Some perf profiles occasionally show that reading the trace source's
state can take some time, which is not expected at all. It just happens
that the trace_source is not cache-aligned so depending on linkage, it
may share a cache line with a more active variable, thereby inducing a
slow down to all threads trying to read the variable.
Let's always mark it aligned to avoid this. For now the problem was not
observed again.
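For illustration, such an alignment fix boils down to something like
this (64 bytes is assumed as the cache line size; the variable name is
illustrative):

    /* keep the read-mostly trace source on its own cache line so it
     * cannot share one with a frequently written variable
     */
    static struct trace_source trace_src __attribute__((aligned(64)));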
Changes brought to support large buffers revealed a bug in the SPOE
applet when a frame is copied into the SPOE context buffer. A b_xfer()
was performed without allocating the SPOE context buffer first. This is
not expected. As stated in the function documentation, the caller is
responsible for ensuring there is enough space in the destination
buffer. So first of all, it must ensure this buffer was allocated.
With recent changes, we are able to hit a BUG_ON() because the swap is
no longer possible if the source and destination buffer sizes are not
the same.
This patch should fix the issue #3286. It could be backported as far as
3.1.
Some options do not support "no" nor "defaults" and they're placed after
the check for their absence. However, "accept-invalid-http-request" and
"accept-invalid-http-response" still used to check for the flags that
come with these prefixes, but Coverity noticed this was dead code in
github issue #3272. Let's just drop the test.
No backport needed as it's just dead code.
This function was recently created by moving code from acme_gen_tmp_x509()
(in acme.c) to ssl_gencrt.c (ssl_gen_x509()). The <ctmp> variable was
initialized and then freed without ever being used. This was already the
case in the original acme_gen_tmp_x509() function.
This patch removes these useless statements.
Reported in GH #3284
Avoid such warnings from coverity:
CID 1645121: (#1 of 1): Calling risky function (DC.WEAK_CRYPTO)
dont_call: random should not be used for security-related applications,
because linear congruential algorithms are too easy to break.
Reported in GH #3283 and #3285
This patch changes the registration of the following keywords to be
unconditional:
- ssl-dh-param-file
- ssl-engine
- ssl-propquery, ssl-provider, ssl-provider-path
- ssl-default-bind-curves, ssl-default-server-curves
- ssl-default-bind-sigalgs, ssl-default-server-sigalgs
- ssl-default-bind-client-sigalgs, ssl-default-server-client-sigalgs
Instead of excluding them at compile time via #ifdef guards in the keyword
registration table, their parsing functions now check feature availability
at runtime and return a descriptive error when the feature is missing.
For features controlled by the SSL library (providers, curves, sigalgs,
DH), the error message includes the actual OpenSSL version string via
OpenSSL_version(OPENSSL_VERSION), so users can immediately identify which
library they are running rather than seeing cryptic internal macro names.
For ssl-dh-param-file, the message also includes "(no DH support)" as a
hint, since OPENSSL_NO_DH can be set either by an OpenSSL build or by
HAProxy itself in certain configurations.
For ssl-engine, which depends on a HAProxy build-time flag (USE_ENGINE),
the message retains the flag name as it is more actionable for the user.
This addresses issue https://github.com/haproxy/haproxy/issues/3246.
In acme_req_finalize(), acme_req_challenge(), acme_req_neworder(),
acme_req_account(), and acme_post_as_get(), the success path
unconditionally calls memprintf(errmsg, ...).
This may result in a leak of errmsg.
Additionally, acme_res_chkorder(), acme_res_finalize(), acme_res_auth(),
and acme_res_neworder() had unused 'out:' labels that were removed.
Must be backported as far as 3.2.
365a696 ("MINOR: acme: emit a log for DNS-01 challenge response")
introduced the auth->dns member which is istdup()'ed. But this member is
never freed; instead auth->token was freed twice by mistake.
Must be backported to 3.2.
QUIC is now implemented on the backend side. Complete definitions for
QUIC/H3 stats module to add STATS_PX_CAP_BE capability.
This change is necessary to display QUIC/H3 counters on backend lines
for HTML stats page.
This should be backported up to 3.3.
half_open_conn is a proxy counter used to account for quic_conn
instances in the half-open state: this represents a connection whose
address is not yet validated (handshake successful, or via token
validation).
This counter only makes sense on the frontend side. Currently, the code
is safe as the access is only performed if the quic_conn is not yet
flagged with QUIC_FL_CONN_PEER_VALIDATED_ADDR, which is always set for
backend connections.
To better reflect this, add a BUG_ON() when half_open_conn is
incremented/decremented to ensure this never occurs for backend
connections.
quic_conn is initialized with a pointer to its proxy counters. These
counters are then updated during the connection lifetime.
The counters pointer was incorrect for backend quic_conn, as it always
referenced the frontend counters. For pure backends, no stats would be
updated. For listen instances, this resulted in incorrect stats
reporting.
Fix this by correctly setting the proxy counters based on the connection
side.
This must be backported up to 3.3.
This is a very minor bug with a very low probability of occurring.
However, it could be flagged by a static analyzer or result in a small
contribution, which is always time-consuming for very little gain.
Add the --quic-bind-opts and --tcp-bind-opts long options to append
settings to all QUIC and TCP bind lines. This requires modifying the argv
parser to first process these new options, ensuring they are available
during the second argv pass to be added to each relevant "bind" line.
Add -b and -c options to the haterm argv parser. Use -b to specify the RSA
private key size (in bits) and -c to define the ECDSA certificate curves.
These self-signed certificates are required for haterm SSL bindings.
Allow the server keyword "no-check-sni-auto" for dynamic servers. This
may be necessary for users who do not want to benefit from auto SNI for
checks.
Keyword "check-sni-auto" is still deactivated for dynamic servers, for
the same reason as "sni-auto" (cf the previous patch for a complete
explanation).
This must be backported up to 3.3.
Auto SNI is configured during the config validity check. However,
nothing was implemented for dynamic servers.
Fix this by implementing auto SNI configuration during the "add server"
CLI handler. The auto SNI configuration code is moved into a dedicated
function srv_configure_auto_sni() called both for static and dynamic
servers.
Along with this, allow the keyword "no-sni-auto" on dynamic servers, so
that this process can be deactivated if wanted. Note that "sni-auto"
remains unavailable as it only makes sense with default-servers which
are never used for dynamic server creation.
This must be backported up to 3.3.
There was no check on the result of the strdup() used to set up auto
SNI on a server instance during the config validity check. In case of
failure, the error would be silently ignored as the following
server_parse_exprs() does nothing when the <sni_expr> server field is
NULL. Hence, no SNI would be used on the server, without any error nor
warning reported.
Fix this by adding a check on the strdup() return value. On error,
ERR_ABORT is reported along with an alert; parsing should be interrupted
as soon as possible.
This must be backported up to 3.3. Note that the related code in this
case is present in cfgparse.c source file.
Released version 3.4-dev5 with the following main changes :
- DOC: internals: addd mworker V3 internals
- BUG/MINOR: threads: Initialize maxthrpertgroup earlier.
- BUG/MEDIUM: threads: Differ checking the max threads per group number
- BUG/MINOR: startup: fix allocation error message of progname string
- BUG/MINOR: startup: handle a possible strdup() failure
- MINOR: cfgparse: validate defaults proxies separately
- MINOR: cfgparse: move proxy post-init in a dedicated function
- MINOR: proxy: refactor proxy inheritance of a defaults section
- MINOR: proxy: refactor mode parsing
- MINOR: backend: add function to check support for dynamic servers
- MINOR: proxy: define "add backend" handler
- MINOR: proxy: parse mode on dynamic backend creation
- MINOR: proxy: parse guid on dynamic backend creation
- MINOR: proxy: check default proxy compatibility on "add backend"
- MEDIUM: proxy: implement dynamic backend creation
- MINOR: proxy: assign dynamic proxy ID
- REGTESTS: add dynamic backend creation test
- BUG/MINOR: proxy: fix clang build error on "add backend" handler
- BUG/MINOR: proxy: fix null dereference in "add backend" handler
- MINOR: net_helper: extend the ip.fp output with an option presence mask
- BUG/MINOR: proxy: fix default ALPN bind settings
- CLEANUP: lb-chash: free lb_nodes from chash's deinit(), not global
- BUG/MEDIUM: lb-chash: always properly initialize lb_nodes with dynamic servers
- CLEANUP: haproxy: fix bad line wrapping in run_poll_loop()
- MINOR: activity: support setting/clearing lock/memory watching for task profiling
- MEDIUM: activity: apply and use new finegrained task profiling settings
- MINOR: activity: allow to switch per-task lock/memory profiling at runtime
- MINOR: startup: Add the SSL lib verify directory in haproxy -vv
- BUG/MINOR: ssl: SSL_CERT_DIR environment variable doesn't affect haproxy
- CLEANUP: initcall: adjust comments to INITCALL{0,1} macros
- DOC: proxy-proto: underline the packed attribute for struct pp2_tlv_ssl
- MINOR: queues: Check minconn first in srv_dynamic_maxconn()
- MINOR: servers: Call process_srv_queue() without lock when possible
- BUG/MINOR: quic: ensure handshake speed up is only run once per conn
- BUG/MAJOR: quic: reject invalid token
- BUG/MAJOR: quic: fix parsing frame type
- MINOR: ssl: Missing '\n' in error message
- MINOR: jwt: Convert an RSA JWK into an EVP_PKEY
- MINOR: jwt: Add new jwt_decrypt_jwk converter
- REGTESTS: jwt: Add new "jwt_decrypt_jwk" tests
- MINOR: startup: Add HAVE_WORKING_TCP_MD5SIG in haproxy -vv
- MINOR: startup: sort the feature list in haproxy -vv
- MINOR: startup: show the list of detected features at runtime with haproxy -vv
- SCRIPTS: build-vtest: allow to set a TMPDIR and a DESTDIR
- MINOR: filters: rework RESUME_FILTER_* macros as inline functions
- MINOR: filters: rework filter iteration for channel related callback functions
- MEDIUM: filters: use per-channel filter list when relevant
- DEV: gdb: add a utility to find the post-mortem address from a core
- BUG/MINOR: deviceatlas: add missing return on error in config parsers
- BUG/MINOR: deviceatlas: add NULL checks on strdup() results in config parsers
- BUG/MEDIUM: deviceatlas: fix resource leaks on init error paths
- BUG/MINOR: deviceatlas: fix off-by-one in da_haproxy_conv()
- BUG/MINOR: deviceatlas: fix cookie vlen using wrong length after extraction
- BUG/MINOR: deviceatlas: fix double-checked locking race in checkinst
- BUG/MINOR: deviceatlas: fix resource leak on hot-reload compile failure
- BUG/MINOR: deviceatlas: fix deinit to only finalize when initialized
- BUG/MINOR: deviceatlas: set cache_size on hot-reloaded atlas instance
- MINOR: deviceatlas: check getproptype return and remove pprop indirection
- MINOR: deviceatlas: increase DA_MAX_HEADERS and header buffer sizes
- MINOR: deviceatlas: define header_evidence_entry in dummy library header
- MINOR: deviceatlas: precompute maxhdrlen to skip oversized headers early
- CLEANUP: deviceatlas: add unlikely hints and minor code tidying
- DEV: gdb: use unsigned longs to display pools memory usage
- BUG/MINOR: ssl: lack crtlist_dup_ssl_conf() declaration
- BUG/MINOR: ssl: double-free on error path w/ ssl-f-use parser
- BUG/MINOR: ssl: fix leak in ssl-f-use parser upon error
- BUG/MINOR: ssl: clarify ssl-f-use errors in post-section parsing
- BUG/MINOR: ssl: error with ssl-f-use when no "crt"
- MEDIUM: backend: make "balance random" consider tg local req rate when loads are equal
- BUG/MAJOR: Revert "MEDIUM: mux-quic: add BUG_ON if sending on locally closed QCS"
- BUG/MEDIUM: h3: reject frontend CONNECT as currently not implemented
- MINOR: mux-quic: add BUG_ON_STRESS() when draining data on closed stream
- REGTESTS: fix quoting in feature cmd which prevents test execution
- BUG/MEDIUM: mux-h2/quic: Stop sending via fast-forward if stream is closed
- BUG/MEDIUM: mux-h1: Stop sending vi fast-forward for unexpected states
- BUG/MEDIUM: applet: Fix test on shut flags for legacy applets (v2)
- DEV: term-events: Fix hanshake events decoding
- BUG/MINOR: flt-trace: Properly compute length of the first DATA block
- MINOR: flt-trace: Add an option to limit the amount of data forwarded
- CLEANUP: compression: Remove unused static buffers
- BUG/MEDIUM: shctx: Use the next block when data exactly filled a block
- BUG/MINOR: http-ana: Stop to wait for body on client error/abort
- MINOR: stconn: Add missing SC_FL_NO_FASTFWD flag in sc_show_flags
- REORG: stconn: Move functions related to channel buffers to sc_strm.h
- BUG/MEDIUM: jwe: fix timing side-channel and dead code in JWE decryption
- MINOR: tree-wide: Use the buffer size instead of global setting when possible
- MINOR: buffers: Swap buffers of same size only
- BUG/MINOR: config: Check buffer pool creation for failures
- MEDIUM: cache: Don't rely on a chunk to store messages payload
- MEDIUM: stream: Limit number of synchronous send per stream wakeup
- MEDIUM: compression: Be sure to never compress more than a chunk at once
- MEDIUM: mux-h1/mux-h2/mux-fcgi/h3: Disable 0-copy for buffers of different size
- MEDIUM: applet: Disable 0-copy for buffers of different size
- MINOR: h1-htx: Disable 0-copy for buffers of different size
- MEDIUM: stream: Offer buffers of default size only
- BUG/MEDIUM: htx: Fix function used to change part of a block value when defrag
- MEDIUM: htx: Refactor transfer of htx blocks to merge DATA blocks if possible
- MEDIUM: htx: Refactor htx defragmentation to merge data blocks
- MEDIUM: htx: Improve detection of fragmented/unordered HTX messages
- MINOR: http-ana: Do a defrag on unaligned HTX message when waiting for payload
- MINOR: http-fetch: Use pointer to HTX DATA block when retrieving HTX body
- MEDIUM: dynbuf: Add a pool for large buffers with a configurable size
- MEDIUM: chunk: Add support for large chunks
- MEDIUM: stconn: Properly handle large buffers during a receive
- MEDIUM: sample: Get chunks with a size dependent on input data when necessary
- MEDIUM: http-fetch: Be able to use large chunks when necessary
- MINPR: htx: Get large chunk if necessary to perform a defrag
- MEDIUM: http-ana: Use a large buffer if necessary when waiting for body
- MINOR: dynbuf: Add helpers to know if a buffer is a default or a large buffer
- MINOR: config: reject configs using HTTP with large bufsize >= 256 MB
- CI: do not use ghcr.io for Quic Interop workflows
- BUG/MEDIUM: ssl: SSL backend sessions used after free
- CI: vtest: move the vtest2 URL to vinyl-cache.org
- CI: github: disable windows.yml by default on unofficials repo
- MEDIUM: Add connect/queue/tarpit timeouts to set-timeout
- CLEANUP: mux-h1: Remove unneeded null check
- DOC: remove openssl no-deprecated CI image
- BUG/MINOR: acme: fix X509_NAME leak when X509_set_issuer_name() fails
- BUG/MINOR: backend: check delay MUX before conn_prepare()
- OPTIM: backend: reduce contention when checking MUX init with ALPN
- DOC: configuration: add the ACME wiki page link
- MINOR: ssl/ckch: Move EVP_PKEY and cert code generation from acme
- MINOR: ssl/ckch: certificates generation from "load" "crt-store" directive
- MINOR: trace: add definitions for haterm streams
- MINOR: init: allow a fileless init mode
- MEDIUM: init: allow the redefinition of argv[] parsing function
- MINOR: stconn: stream instantiation from proxy callback
- MINOR: haterm: add haterm HTTP server
- MINOR: haterm: new "haterm" utility
- MINOR: haterm: increase thread-local pool size
- BUG/MEDIUM: stats-file: fix shm-stats-file recover when all process slots are full
- BUG/MINOR: stats-file: manipulate shm-stats-file heartbeat using unsigned int
- BUG/MEDIUM: stats-file: detect and fix inconsistent shared clock when resuming from shm-stats-file
- CI: github: only enable OS X on development branches
Don't use the macOS job on maintenance branches; it's mainly used for
development and checking portability, and we don't actively support
macOS on stable branches.
When leveraging shm-stats-file, global_now_ms and global_now_ns are stored
(and thus shared) inside the shared map, so that all co-processes share
the same clock source.
Since the global_now_{ns,ms} clocks are derived from now_ns, and given
that now_ns is a monotonic clock (hence inconsistent from one host to
another or reset after reboot) special care must be taken to detect
situations where the clock stored in the shared map is inconsistent
with the one from the local process during startup, and cannot be
relied upon anymore. A common situation where the current implementation
fails is resuming from a shared file after reboot: the global_now_ns stored
in the shm-stats-file will be greater than the local now_ns after reboot,
and applying the shared offset doesn't help since it was only relevant to
processes prior to rebooting. Haproxy's clock code doesn't expect that
(once the now offset is applied) global_now_ns > now_ns, and it creates
ambiguous situation where the clock computations (both haproxy oriented
and shm-stats-file oriented) are broken.
To fix the issue, when we detect that the clock stored in the shm is
off by more than SHM_STATS_FILE_HEARTBEAT_TIMEOUT (60s) from the
local now_ns, since this situation is not supposed to happen in a normal
environment on the host, we assume that the shm file was previously used
on a different system (or that the current host rebooted).
In this case, we perform a manual adjustment of the now offset so that
the monotonic clock from the current host is consistent again with the
global_now_ns stored in the file. Doing so we can ensure that clock-
dependent objects (such as freq_counters) stored within the map will keep
working as if we just (re)started where we left off when the last process
stopped updating the map.
Normally it is not expected that we update the now offset stored in the
map once the map was already created (because of concurrent accesses to
the file when multiple processes are attached to it), but in this specific
case, we know we are the first process on this host to start working
(again) on the file, thus we update the offset as if we created the
file ourself, while keeping existing content.
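A conceptual sketch of the detection and rebasing described above
(variable names and units are illustrative, not the real code):

    /* if the shared clock is off from the local monotonic clock by
     * more than the heartbeat timeout (assumed here in seconds), the
     * file comes from another host or from before a reboot: rebase
     * the offset as if we had created the file, keeping its content
     */
    int64_t drift = (int64_t)(*global_now_ns - now_ns);

    if (drift > (int64_t)SHM_STATS_FILE_HEARTBEAT_TIMEOUT * 1000000000LL ||
        drift < -((int64_t)SHM_STATS_FILE_HEARTBEAT_TIMEOUT * 1000000000LL))
        now_offset = drift; /* local now_ns + offset == global_now_ns */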
It should be backported to 3.3.
The shm-stats-file heartbeat is derived from now_ms with an extra time
added to it, thus it should be handled using the same type as now_ms.
Until now, we used to handle the heartbeat using a signed integer. This
was not found to cause severe harm but it could result in improper
handling due to early wrapping because of signedness for instance, so
let's better fix that before it becomes a real issue.
It should be backported to 3.3.
Amaury reported that when the following warning is emitted by haproxy:
[WARNING] (296347) : config: failed to get shm stats file slot for 'haproxy.stats', all slots are occupied
haproxy would segfault right after during clock update operation.
The reason for the warning being emitted is not the object of this commit
(all shm-stats-file slots occupied by simultaneous co-processes) but since
it was is intended that haproxy is able to keep working despite that
warning (ignoring the use of shm-stats-file), we should fix the crash.
The crash is caused by the fact that we detach from the shared memory while
the global_now_ns and global_now_ms clock pointers still point to the shared
memory. Instead we should revert to using our local clock instead before
detaching from the map.
It should be backported to 3.3.
QUIC uses many objects and the default pool size causes a lot of
thrashing at the current request rate, taking ~12% CPU in pools.
Let's increase it to 3MB, which allows us to reach around 11M
req/s on a 80-core machine.
haterm_init.c is added to implement haproxy_init_args() which overloads
the one defined by haproxy.c. This way, haterm program uses its own argv[]
parsing function. It generates its own configuration in memory that is
parsed during boot and executed by the common code.
Contrary to haproxy, httpterm does not support all the HTTP protocols.
Furthermore, it has become easier to handle inbound/outbound
connections / streams since the rework done at conn_stream level.
This patch implements the httpterm HTTP server services into haproxy.
To do so, it proceeds the same way as the TCP checks, which use only one
stream connector, but on the frontend side.
The makefile is modified to handle haterm.c in addition to all the
haproxy C files in order to build the new haterm program. The haterm
server also instantiates a haterm stream (hstream struct) attached to a
stream connector for each incoming connection, without any backend
stream connector. This is the role of sc_new_from_endp() called by the
muxes to instantiate streams/hstreams.
As for stream_new(), hstream_new() instantiates a task named
process_hstream() (see haterm.c) which has the same role as
process_stream() but for haterm streams.
haterm built into haproxy takes advantage of the HTTP muxes and the HTX
API to support all the HTTP protocols supported by haproxy.
Add a function pointer to proxies, the ->stream_new_from_sc struct
member, to instantiate a stream from a connection, as this is done by
all the muxes when they call sc_new_from_endp(). The default value for
this pointer is obviously stream_new(), which is exported by this patch.
This patch allows the argv[] parsing function to be redefined from
other C modules. This is done by extracting the function which really
parses the argv[] array to implement haproxy_init_args(). This function
is declared as a weak symbol which may be overloaded by other C modules.
Same thing for copy_argv() which checks/cleans up/modifies the argv
array. One may want this function to be redefined. This is the case when
other C modules do not handle the same command line options. Copying
such an argv[] would lead to conflicts with the original haproxy argv[]
during the copy.
This patch provides the possibility to initialize haproxy without a
configuration file. This may be identified by the new global and
exported <fileless_mode> and <fileless_cfg> variables which may be used
to provide a struct cfgfile to haproxy by other means than a physical
file (built in memory).
When enabled, this fileless mode skips all configuration file parsing.
Add definitions for haterm stream as arguments to be used by the TRACE API.
This will be used by the haterm module to come which will have to handle
hstream struct objects (in place of stream struct objects).
Add "generate-dummy" on/off type keyword to "load" directive to
automatically generate dummy certificates as this is done for ACME from
ckch_conf_load_pem_or_generate() function which is called if a "crt"
keyword is also provide for this directive.
Also implement "keytype" to specify the key type used for these
certificates. Only "RSA" or "ECDSA" is accepted. This patch also
implements "bits" keyword for the "load" directive to specify the
private key size used for RSA. For ECDSA, a new "curves" keyword is also
provided by this patch to specify the curves to be used for the EDCSA
private keys generation.
ckch_conf_load_pem_or_generate() is modified to use these parameters
provided by "keytype", "bits" and "curves" to generate the private key
with ssl_gen_EVP_PKEY() before generating the X509 certificate by
calling ssl_gen_x509().
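An illustrative crt-store configuration using these keywords (the exact
accepted values are assumptions based on the description above):

    crt-store
        # generate a dummy ECDSA certificate on the P-256 curve if
        # "site1.pem" cannot be loaded
        load crt "site1.pem" generate-dummy on keytype ECDSA curves P-256
        # generate a dummy 2048-bit RSA certificate
        load crt "site2.pem" generate-dummy on keytype RSA bits 2048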
Move the acme_EVP_PKEY_gen() implementation to ssl_gencrt.c and rename
it to ssl_EVP_PKEY_gen(). Also extract from acme_gen_tmp_x509() the
generic part to implement ssl_gen_x509() in ssl_gencrt.c.
To generate a self-signed expired certificate, ssl_EVP_PKEY_gen() must
be used to generate the private key. Then, ssl_gen_x509() must be called
with the private key as argument. acme_gen_tmp_x509() is also modified
to call these two functions to generate a temporary certificate as done
before modifying this part.
Such an expired self-signed certificate should not be used in the field
but only during testing and development steps.
In connect_server(), MUX initialization must be delayed if ALPN
negotiation is configured, unless ALPN can already be retrieved via the
server cache.
A readlock is used to consult the server cache. Prior to this patch, it
was always taken even if no ALPN is configured. The lock was thus used
for every new backend connection instantiation.
Rewrite the check so that now the lock is only used if ALPN is
configured. Thus, no lock access is done if SSL is not used or if ALPN
is not defined.
In practice, there will be no performance gain, as the read lock should
never block if ALPN is not configured. However, the code is cleaner as
it better reflects that only the access to the server nego_alpn requires
the path_params lock protection.
In connect_server(), when a new connection must be instantiated, MUX
initialization is delayed if an ALPN setting is present on the server
line configuration, as negotiation must be performed to select the
correct MUX. However, this is not the case if the ALPN can already be
retrieved on the server cache.
This check is performed too late, however, and may cause an issue with
the QUIC stack. The problem can happen when the server ALPN is not yet
set. In the normal case, the quic_conn layer is instantiated and MUX
init is delayed until handshake completion. When the MUX is finally
instantiated, it reuses without any issue the app_ops from its
quic_conn, which is derived from the negotiated ALPN.
However, there is a race condition if another QUIC connection populates
the server ALPN cache. If this happens after the first quic_conn init
but prior to the MUX delay check, the MUX will thus immediately start in
connect_server(). When app_ops is retrieved from its quic_conn, a crash
occurs in qcc_install_app_ops() as the QUIC handshake is not yet
finalized:
#0 0x000055e242a66df4 in qcc_install_app_ops (qcc=0x7f127c39da90, app_ops=0x0) at src/mux_quic.c:1697
1697 if (app_ops->init && !app_ops->init(qcc)) {
[Current thread is 1 (Thread 0x7f12810f06c0 (LWP 25758))]
To fix this, the MUX delay check is moved up in connect_server(). It is
now performed prior to conn_prepare(), which is responsible for the
quic_conn layer instantiation. This ensures consistency for the QUIC
stack: MUX init is always delayed if the quic_conn does not itself reuse
the SSL session and ALPN server cache (no quic_reuse_srv_params()).
This must be backported up to 3.3.
In acme_gen_tmp_x509(), if X509_set_issuer_name() fails, the code
jumped to the mkcert_error label without freeing the previously
allocated X509_NAME object. The other error paths after X509_NAME_new()
(X509_NAME_add_entry_by_txt and X509_set_subject_name) already properly
freed the name before jumping to mkcert_error, but this one was missed.
Fix this by freeing name before the goto, consistent with the other
error paths in the same function.
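The corrected pattern, sketched with the plain OpenSSL API (this is not
the verbatim haproxy function, only the shape of the fix):

  #include <openssl/x509.h>

  static int set_names(X509 *x509, const char *cn)
  {
          X509_NAME *name = X509_NAME_new();

          if (!name)
                  return -1;
          if (!X509_NAME_add_entry_by_txt(name, "CN", MBSTRING_ASC,
                                          (const unsigned char *)cn, -1, -1, 0) ||
              !X509_set_subject_name(x509, name) ||
              !X509_set_issuer_name(x509, name)) {
                  X509_NAME_free(name); /* was leaked on the issuer path */
                  return -1;
          }
          X509_NAME_free(name); /* X509_set_*_name() copies the name */
          return 0;
  }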
Must be backported as far as 3.3.
Since 3.1, a task is always created when H1 connections are initialized,
so the null check performed later, before task_queue(), became unneeded.
Could be backported with 3c09b3432 (BUG/MEDIUM: mux-h1: Fix how timeouts
are applied on H1 connections).
Add the ability to set connect, queue and tarpit timeouts from the
set-timeout action. This is especially useful when using set-dst to
dynamically connect to servers.
This patch also adds the relevant fe_/be_/cur_ sample fetches for these
timeouts.
Disable the windows job in repositories that are not in the "haproxy"
organization. It is mostly used for portability checks during
development and only makes noise during the maintenance cycle.
Must be backported to every branch.
This bug impacts only the backends. Cached sessions could be used after
being freed because of a missing write lock in
ssl_sock_handle_hs_error() when freeing such objects. This issue could
only rarely be reproduced, and only with QUIC, with difficulty (random
CRYPTO data corruption and instrumented code).
Must be backported as far as 2.6.
Due to some (yet unknown) changes in ghcr.io, we are not able to pull
images from it anymore. Let's temporarily switch to "local only" image
storage.
No functional change.
b_is_default() and b_is_large() can now be used to know if a buffer is a
default buffer or a large one. _b_free() now relies on them.
These functions are also used when possible (stream_free(),
stream_release_buffers() and http_wait_for_msg_body()).
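A plausible shape for these helpers, assuming the buffer size is simply
compared against the two configured sizes (the bufsize_large field name
is an assumption, not taken from the patch):

  /* Sketch only: haproxy's struct buffer carries its <size>. */
  static inline int b_is_default(const struct buffer *b)
  {
          return b->size == global.tune.bufsize;
  }

  static inline int b_is_large(const struct buffer *b)
  {
          return b->size == global.tune.bufsize_large;
  }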
Thanks to previous patches, it is now possible to allocate a large buffer to
store the message payload in the context of the "wait-for-body" action. To
do so, the "use-large-buffer" option must be set.
This means it is no longer necessary to increase the regular buffer size
to get the message payloads of some requests or responses.
The function used to fetch parameters was updated to get a chunk according
to the parameter size. The same was done where the message body is
retrieved.
The function used to duplicate a sample was updated to support large buffers.
In addition, several converters were reviewed to deal with large buffers. For
instance, base64 encoding and decoding must use chunks of the same size as
the sample. For some of them, a retry is performed to enlarge the chunk if
possible.
TODO: Review reg_sub, concat and add_item to get larger chunk if necessary
While large buffers are still unused internally, functions receiving data
from endpoints (connections or applets) were updated to block receives
when channels are using a large buffer and data forwarding has started.
The goal of this patch is to be able to flush large buffers at the end of
the analysis stage, to return to regular buffers as soon as possible.
Because there is now a memory pool for large buffers, we must also add
support for large chunks. So, if large buffers are configured, a dedicated
memory pool is created to allocate large chunks. alloc_large_trash_chunk()
must be used to allocate a large chunk. alloc_trash_chunk_sz() can be used
to allocate a chunk with the best size. However, free_trash_chunk() remains
the only way to release a chunk, regular or large.
In addition, large trash buffers are also created, using the same mechanism
as for regular trash buffers. So three thread-local trash buffers are
created. get_large_trash_chunk() must be used to get a large trash buffer,
and get_trash_chunk_sz() may be used to get a trash buffer with the best
size.
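A usage sketch of the helpers named above (prototypes assumed, error
handling trimmed):

  /* pick the best-fitting chunk for <needed> bytes, then release it
   * through the single common path */
  struct buffer *chk = alloc_trash_chunk_sz(needed);

  if (chk) {
          /* ... fill and use the chunk ... */
          free_trash_chunk(chk); /* valid for regular and large chunks */
  }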
Add support for large buffers. A dedicated memory pool is added. The size
of these buffers must be explicitly configured by setting the
"tune.bufsize.large" directive. If it is not set, the pool is not
created. In addition, if the size for large buffers is the same as for
regular buffers, the feature is automatically disabled.
For now, large buffers remain unused.
In sample fetch functions retrieving the message payload (req.body,
res.body...), instead of copying the payload into a trash buffer, we now
directly return a pointer to the payload in the HTX message. To do so, we
must be sure there is only one HTX DATA block. Thanks to previous patches,
this is now possible. However, we must take care to perform a
defragmentation if necessary.
When we are waiting for the request or response payload, it is usually
because the payload will be analyzed in one way or another. So, perform a
defrag if necessary. This should ease payload analysis.
First, an HTX flag was added to know when blocks are unordered. It may
happen when a header is added while part of the payload was already
received, or when the start-line is replaced by a new one. In these cases,
the block indexes are in the right order but not the blocks' payloads.
Knowing a message is unordered can be useful to trigger a defragmentation,
mainly to be able to append data properly, for instance.
Then, detection of fragmented messages was improved, especially when a
header or a start-line is replaced by a new one.
Finally, when data are added to a message and cannot be appended to the
previous DATA block because the message is not aligned, a defragmentation
is performed to realign the message and append the data.
When an HTX message is defragmented, the HTX DATA blocks are now merged into
one block. Just like the previous commit, this will help all payload
analysis, if any. However, there is an exception when the reference on a
DATA block must be preserved, via the <blk> parameter. In that case, this
DATA block is not merged with the previous block.
htx_replace_blk_value() is buggy when a defrag is performed. This only
happens on data expansion. But in that case, because a defragmentation is
performed, the blocks' data are moved and the old data of the updated block
are no longer accessible.
To fix the bug, we now use a chunk to temporarily copy the new data of the
block. This way we can safely perform the HTX defragmentation and then
recopy the data from the chunk to the HTX message.
It is theoretically possible to hit this bug, but concretely it is pretty
hard.
This patch should be backported to all stable versions.
Check the channel buffer's size on release before trying to offer it to
waiting entities. Only normal buffers must be considered. This will be
mandatory when the large buffers support on channels is added.
When a message payload is parsed, it is possible to swap buffers. We must
only take care that both buffers have the same size. This will be mandatory
when the large buffers support on channels is added.
Just like the previous commit, we must take care to never swap buffers of
different sizes when data are exchanged between an applet and an SC. This
will be mandatory when the large buffers support on channels is added.
Today, it is useless to check the buffer sizes before performing a 0-copy in
muxes when data are sent, but it will be mandatory when the large buffers
support on channels is added. Indeed, muxes will still rely on normal
buffers, so we must take care to never swap buffers of different sizes.
When the compression is performed, a trash chunk is used. So be sure to
never compress more data than the trash size, otherwise the compression
could fail. Today, this cannot happen. But with the large buffers support
on channels, it could be an issue.
Note that this part should be reviewed to evaluate if we should use a larger
chunk too to perform the compression, maybe via an option.
It is not a bug fix, because there is no way to hit the issue for now. But
there is nothing preventing a loop of synchronous sends in process_stream().
Indeed, when a synchronous send is successfully performed, we restart the
SCs evaluation, and at the end another synchronous send is attempted. So
with an endpoint consuming data bit by bit, or with a filter forwarding a
few bytes at each call, it is possible to loop for a while in
process_stream().
Because this is not expected, we now limit the number of synchronous sends
per wakeup to two calls. In a nominal case, it should never be more. This
commit is mandatory to be able to handle large buffers on channels.
There is no reason to backport this commit unless the large buffers
support on channels is backported.
When the response payload is stored in the cache, we can avoid using a
trash chunk as a temporary storage area before copying everything into the
cache in one call. Instead we can directly write each HTX block into the
cache, one by one. It should not be an issue because, most of the time,
there is only one DATA block.
This commit depends on "BUG/MEDIUM: shctx: Use the next block when data
exactly filled a block".
The call to init_buffer() during the worker startup may fail. In that case,
an error message is displayed but the error was not properly handled. So
let's add the proper check and exit on error.
In many places, we rely on the global.tune.bufsize value instead of using
the buffer size. For now, it is not a problem. But if we want to be able to
deal with buffers of different sizes, it is good to reduce dependencies on
the global value as far as possible. Most of the time, we can use the
b_size() or c_size() functions. The main change is performed on the error
snapshot, where the buffer size was added into the error_snapshot
structure.
Fix two issues in JWE token processing:
- Replace memcmp() with CRYPTO_memcmp() for authentication tag
  verification in build_and_check_tag() to prevent timing
  side-channel attacks. Also add a tag length validation check
  before the comparison to avoid a potential buffer over-read when
  the decoded tag length doesn't match the expected HMAC half (a
  sketch follows below).
- Remove the unreachable break statement after the JWE_ALG_A256GCMKW
  case in decrypt_cek_aesgcmkw().
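The tag check sketched with the real OpenSSL primitive (illustrative
names, not the actual haproxy function):

  #include <openssl/crypto.h>

  static int check_tag(const unsigned char *computed, size_t computed_len,
                       const unsigned char *received, size_t received_len)
  {
          /* reject mismatched lengths first: avoids reading past a buffer */
          if (received_len != computed_len)
                  return 0;
          /* CRYPTO_memcmp() runs in time independent of where bytes differ */
          return CRYPTO_memcmp(computed, received, computed_len) == 0;
  }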
sc_have_buff(), sc_need_buff(), sc_have_room() and sc_need_room() are
related to the channel's buffer. So we can move them into the sc_strm.h
header file. In addition, this will be mandatory for the next commit.
During the message analysis, we must take care to stop waiting for the
message body if an error was reported on the client side, or if an abort
was detected with abort-on-close configured (which is now the default).
The bug was introduced when the "wait-for-body" action was added. Only the
producer state was tested. So, when we were waiting for the request payload,
there was no issue. But when we were waiting for the response payload, an
error or abort on the client side was not considered.
This patch should be backported to all stable versions.
When the hot list was removed in 3.0, a regression was introduced.
Theoretically, it is possible to overwrite data in a block when new data
are appended. It happens when data are copied. If the data size is a
multiple of the block size, all data are copied and the last used block is
full. But instead of saving a reference to the next block as the restart
point for the next copies, we keep a reference to the last full one. On the
next read, we reuse this block and old data are overwritten. To hit the
bug, no new blocks must be reserved between the two data copy attempts.
Concretely, for now, it seems impossible to hit the bug. But with a block
size set to 1024, if more than 1024 bytes are reserved, with a first copy
of 1024 bytes and a second one with the remaining data, data in the first
block will be overwritten.
To fix the bug, the reference to the last block used to write data (which
is in fact the next one to use for the next copy) is only updated when a
block is full. In that case, the next block is used.
This patch should be backported as far as 3.0 after a period of observation.
Since the legacy HTTP code was removed, the global and thread-local buffers,
tmpbuf and zbuf, are no longer used. So let's remove them.
This could theoretically be backported to all supported versions. At least
it could be good to do so as far as 3.2, as it saves 2 buffers per thread.
This bug is quite old. When the length of the first DATA block is computed,
the offset is used instead of the block length minus the offset. It is only
used with random forwarding, and there is a test just after to prevent any
issue, so there is no effect.
It could be backported to all stable versions.
Handshake events were not properly decoded. Only send errors were decoded
as expected; other events were reported with a '-'. It is now fixed.
This patch could be backported as far as 3.2.
The previous fix was wrong. When shut flags are tested for legacy applets,
to know if the I/O handler can be called or not, we must be sure shuts for
reads and for writes are both set to skip the applet I/O handler.
This bug introduced a regression, at least for the peer applet and the DNS
applet.
This patch must be backported with abc1947e1 ("BUG/MEDIUM: applet: Fix test
on shut flags for legacy applets"), so as far as 3.0.
If a producer tries to send data via the fast-forward mechanism while the
message is in an unexpected state from the consumer point of view, the
fast-forward is now disabled. Concretely, we now take care that the message
is in its data/tunnel stage to proceed in h1_nego_ff().
By disabling fast-forward in that case, we will automatically fall back on
the regular sending path and be able to handle the error in h1_snd_buf().
This patch should be backported as far as 3.0.
It is illegal to send data if the stream is already closed. The case is
properly handled when data are sent via snd_buf(), by draining the data.
But it was still possible to process these data via nego_ff().
So, in this patch, both for the H2 and QUIC multiplexers, the fast-forward
is disabled if the stream is closed and nothing is performed. Doing so, we
will automatically fall back on the regular sending path and be able to
drain data in snd_buf().
Thanks to Mike Walker for his investigation on the subject.
This patch should be backported as far as 3.0.
Remove extra quote in feature cmd used to test SSL compatibility with
set_ssl_cafile QUIC regtest. Due to this syntax error, the test was
never executed.
No need to backport.
HTTP/3 CONNECT transcoding is not properly implemented on the frontend
side. Neither plain tunnel mode nor extended CONNECT is currently
functional.
Clarify this situation by rejecting any CONNECT attempt on the frontend
side. The stream is thus now closed via a RESET_STREAM with error code
REQUEST_REJECTED.
This should be backported to all stable versions.
This reverts commit 235e8f1afd.
Prior to the above commit, snd_buf callback for QUIC MUX was able to
deal with data even after stream closure. The excess was simply
discarded, as no STREAM frame can be emitted after FIN/RESET_STREAM.
This code was later removed and replaced by a BUG_ON() to ensure snd_buf
is never called after stream closure.
However, this approach is too strict. Indeed, there is nothing in the
haproxy stream architecture which forbids this scheduling, in part
because the QUIC MUX is solely responsible for the stream closure. As
such, it is preferable to revert to the old code to prevent any triggering
of a BUG_ON() failure.
Note that nego_ff does not implement data draining if called after
stream closure. This will be done in a future patch.
Thanks to Mike Walker for his investigation on the subject.
This must be backported up to 2.8.
This is a follow-up to b6bdb2553 ("MEDIUM: backend: make "balance random"
consider req rate when loads are equal")
In the above patch, we used the global sess_per_sec metric to choose which
server we should be using, but the original intent was to use the
per-thread-group statistic.
No backport needed, the previous patch already improved the situation in
3.3, so let's not take the risk of breaking that.
ssl-f-use lines try to load a crt file, but the "crt" keyword is not
mandatory. That could lead to crtlist_load_crt() being called with a
NULL path and trying to do a stat.
In this particular case we don't need to try anything, and it's better to
exit with an actual error.
Must be backported as far as 3.2.
crtlist_load_crt() in post_section_frontend_crt_init() won't give
details about the line being parsed; this should be done by the caller.
Modify post_section_frontend_crt_init() to output the right error format.
Must be backported to 3.2.
In post_section_frontend_crt_init(), the crt_entry is populated with the
ssl_conf from the cfg_crt_node. On the error path, the crt_list is
completely freed, including the ssl_conf structure. But the ssl_conf
structure was already freed when freeing the cfg_crt_node.
Fix the issue by doing a crtlist_dup_ssl_conf(n->ssl_conf) in the
crtlist_entry instead of a plain assignment.
Fix issue #3268.
Needs to be backported as far as 3.2. The previous patch, which adds the
crtlist_dup_ssl_conf() declaration, is needed.
The pools memory usage calculation was done using ints by default, making
it harder to identify large ones. Let's switch to unsigned long for the
size calculations.
Add unlikely() hints on error paths in init, conv and fetch functions.
Remove unnecessary zero-initialization of local buffers that are
always written before use. Fix indentation in da_haproxy_checkinst()
and remove unused loop variable initialization.
Precompute the maximum header name length from the atlas evidence
headers at init and hot-reload time. Use it in da_haproxy_fetch() to
skip headers early that cannot match any known DeviceAtlas evidence
header, avoiding unnecessary string copies and comparisons.
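In pseudo-C, the early skip looks like this (names are illustrative, not
the module's exact code):

  /* init / hot reload: remember the longest evidence header name */
  size_t maxhdrlen = 0;

  for (size_t i = 0; i < nb_evidence; i++) {
          size_t l = strlen(evidence_hdr[i]);

          if (l > maxhdrlen)
                  maxhdrlen = l;
  }

  /* fetch path: a name longer than <maxhdrlen> cannot match any
   * evidence header, so skip it before any copy or comparison */
  if (hdr_name_len > maxhdrlen)
          continue;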
Add the struct header_evidence_entry definition to the dummy dac.h
to accommodate the ongoing deviceatlas module update which now
iterates over atlas header_priorities to precompute maxhdrlen.
The struct was already referenced by struct da_atlas but lacked
a definition in the dummy header.
Increase DA_MAX_HEADERS from 24 to 32 and hbuf from 24 to 64 to
accommodate current DeviceAtlas data files which may use more headers
and longer header names.
Check the return value of da_atlas_getproptype() and skip the property
on failure instead of using an uninitialized proptype. Also remove the
unnecessary pprop pointer indirection, using prop directly.
When hot-reloading the atlas in da_haproxy_checkinst(), the configured
cache_size was not applied to the new instance, causing it to use the
default value.
This should be backported to lower branches.
da_fini() was called unconditionally in deinit_deviceatlas() even when
da_init() was never called. Move it inside the daset check. Also remove
the erroneous shm_unlink() call which could affect the dadwsch shared
memory used by the scheduling process.
This should be backported to lower branches.
In da_haproxy_checkinst(), when da_atlas_compile() failed, the cnew
buffer was leaked. Add a free(cnew) in the else branch.
This should be backported to lower branches.
In da_haproxy_checkinst(), base[0] was checked before acquiring the
lock but not re-checked after. Another thread could have already
processed the reload between the initial check and the lock
acquisition, leading to a race condition.
This should be backported to lower branches.
In da_haproxy_fetch(), vlen was set from v.len (the raw header value
length) instead of the truncated copy length. Also the cookie-specific
vlen calculation used an incorrect subtraction instead of the actual
extracted cookie value length (pl) returned by
http_extract_cookie_value().
This should be backported to lower branches.
The user-agent string copy had an off-by-one error: the buffer size
limit did not account for the null terminator, and the memcpy length
used i-1 which truncated the last character of the user-agent string.
This should be backported to lower branches.
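The corrected bounded copy, as a sketch (not the module's exact code):
keep one byte for the terminator, and copy <len> bytes rather than
len - 1, which used to drop the final character.

  #include <string.h>

  static size_t copy_ua(char *dst, size_t dstsz, const char *src, size_t len)
  {
          if (!dstsz)
                  return 0;
          if (len > dstsz - 1)
                  len = dstsz - 1; /* reserve room for the terminator */
          memcpy(dst, src, len);   /* full <len>, not len - 1 */
          dst[len] = '\0';
          return len;
  }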
When da_atlas_compile() or da_atlas_open() failed in init_deviceatlas(),
atlasimgptr was leaked and da_fini() was never called. Also add a NULL
check on strdup() for the default cookie name with proper cleanup of
the atlas and image pointer on failure.
This should be backported to lower branches.
Add missing NULL checks after strdup() for the json file path in
da_json_file() and the cookie name in da_properties_cookie().
This should be backported to lower branches.
da_log_level() and da_cache_size() were missing a return -1 on error,
causing fall-through to the normal return 0 path when invalid values
were provided.
This should be backported to lower branches.
More and more often, core dumps retrieved on systems that build with
-fPIE by default are becoming unexploitable. Even functions and global
symbols get relocated and gdb cannot figure their final position.
Ironically the post_mortem struct lying in its own section that was
meant to ease its finding is not exempt from this problem.
The only remaining way is to inspect the core to search for the
post-mortem magic, figure its offset from the file and look up the
corresponding virtual address with objdump. This is quite a hassle.
This patch implements a simple utility that opens a 64-bit core dump,
scans the program headers looking for a data segment which contains
the post-mortem magic, and prints it on stdout. It also places the
"pm_init" command alone on its own line to ease copy-pasting into the
gdb console. With this, at least the other commands in this directory
work again and allow inspecting the program's state. E.g.:
$ ./getpm core.57612
Found post-mortem magic in segment 5:
Core File Offset: 0xfc600 (0xd5000 + 0x27600)
Runtime VAddr: 0x5613e52b6600 (0x5613e528f000 + 0x27600)
Segment Size: 0x28000
In gdb, copy-paste this line:
pm_init 0x5613e52b6600
It's worth noting that the program has so few dependencies that it even
builds with nolibc, allowing a static executable to be uploaded into
containers being debugged that lack development tools and compilers. The
build procedure is indicated in the source code.
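For the curious, a condensed sketch of the same idea: walk the core's
64-bit program headers and search each PT_LOAD segment for a magic
string. The magic value and error handling here are simplified
placeholders; the real getpm lives in the haproxy dev tools.

  #include <elf.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(int argc, char **argv)
  {
          static const char magic[] = "POST-MORTEM"; /* placeholder */
          Elf64_Ehdr eh;
          Elf64_Phdr ph;
          FILE *f;

          if (argc < 2 || !(f = fopen(argv[1], "rb")))
                  return 1;
          if (fread(&eh, sizeof(eh), 1, f) != 1 ||
              memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
              eh.e_ident[EI_CLASS] != ELFCLASS64)
                  return 1;

          for (int i = 0; i < eh.e_phnum; i++) {
                  char *buf;

                  fseek(f, eh.e_phoff + i * eh.e_phentsize, SEEK_SET);
                  if (fread(&ph, sizeof(ph), 1, f) != 1 || ph.p_type != PT_LOAD)
                          continue;
                  if (!(buf = malloc(ph.p_filesz)))
                          continue;
                  fseek(f, ph.p_offset, SEEK_SET);
                  if (fread(buf, 1, ph.p_filesz, f) == ph.p_filesz) {
                          for (size_t o = 0; o + sizeof(magic) <= ph.p_filesz; o++) {
                                  if (!memcmp(buf + o, magic, sizeof(magic) - 1)) {
                                          /* runtime vaddr = segment vaddr + offset */
                                          printf("pm_init 0x%llx\n",
                                                 (unsigned long long)(ph.p_vaddr + o));
                                          return 0;
                                  }
                          }
                  }
                  free(buf);
          }
          return 1;
  }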
In the historical implementation, all filter related information was
stored at the stream level (using a struct strm_flt * context), and
filter iteration was performed at the stream level too.
We identified that this was not ideal and would make the implementation
of future filters more complex, since filter ordering must be handled
differently during request and response handling, for decompression for
instance.
To make such a thing possible, in this commit we migrate some channel
specific filter contexts into the channel directly (request or response),
and we implement 2 additional filter lists, one on the request channel
and another on the response channel. The historical stream filter list
is kept as-is because in some contexts only the stream is available and
we have to iterate over all filters. But for functions where we are only
interested in request side or response side filters, we now use the
dedicated channel filter lists instead.
The only overhead is that the "struct filter" was expanded by two "struct
list".
For now, no change of behavior is expected.
Multiple channel related functions share the same construction: they use
list_for_each_entry() to work on a given filter from the stream+channel
combination. In future commits we will try to use the filter list from
the dedicated channel list instead of the stream one, so in this patch
we need, as a prerequisite, to implement and use the flt_list_{start,next}
API to iterate over a filter list, giving the API the responsibility to
iterate over the correct list depending on the context, while the calling
function remains free to use the iteration construction it needs. This
way we will be able to easily change the way we iterate over the filter
list without duplicating the code for requests and responses.
There is no need to have those helpers defined as macros, and since it
is not mandatory, code maintenance is much easier using functions,
so let's switch to function definitions.
Also, we change the way we iterate over the list so that the calling
function now has a pseudo API to get and iterate over filter pointers
while keeping control over how it implements the iterating logic.
One benefit of this is that we will also be able to switch between lists
depending on the channel type, which is a prerequisite for the upcoming
rework that splits the filter list over request and response channels
(commit will follow).
No change of behavior is expected.
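As an illustration, iteration with such helpers could look like this
(the exact prototypes are an assumption, not the committed API):

  struct filter *flt;

  /* flt_list_start()/flt_list_next() pick the per-channel list when the
   * context provides one, else fall back to the stream-wide list */
  for (flt = flt_list_start(chn); flt; flt = flt_list_next(chn, flt)) {
          /* per-filter processing for this channel */
  }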
Implement a way to set a destination directory using DESTDIR, and a tmp
directory using TMPDIR.
By default:
- DESTDIR is ../vtest, as was done previously
- TMPDIR is mktemp -d
Only the vtest binary is copied in DESTDIR.
Example:
TMPDIR=/dev/shm/ DESTDIR=/home/user/.local/bin/ ./scripts/build-vtest.sh
Features prefixed by "HAVE_WORKING_" in the haproxy -vv feature list are
features that are detected at runtime.
This patch splits these features onto another line in haproxy -vv. This
line is named "Detected feature list".
The feature list in haproxy -vv is partly generated from the Makefile
using the USE_* keywords, but it's also possible to add keywords to the
feature list using hap_register_feature(), which adds the keyword at the
end of the list. When doing so, the list is not correctly sorted anymore.
This patch fixes the problem by splitting the string into an array of
ist and applying a qsort() to it.
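The approach, sketched with plain strings instead of haproxy's ist type
to keep the example standalone:

  #include <stdlib.h>
  #include <string.h>

  static int cmp_feat(const void *a, const void *b)
  {
          return strcmp(*(const char * const *)a, *(const char * const *)b);
  }

  /* ... split the feature string into an array, then sort it: */
  const char *feats[] = { "OPENSSL", "EPOLL", "LINUX_CAP" };
  qsort(feats, sizeof(feats) / sizeof(*feats), sizeof(*feats), cmp_feat);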
The TCP_MD5SIG ifdef is not enough to check if the feature is usable.
The code might compile but the OS could prevent its use.
This patch tries to use the TCP_MD5SIG setsockopt before adding
HAVE_WORKING_TCP_MD5SIG to the feature list, so reg-tests won't be
started if the OS can't run it.
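A sketch of such a runtime probe (this is the spirit of the patch, not
the exact haproxy detection code):

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  static int tcp_md5sig_works(void)
  {
  #ifdef TCP_MD5SIG
          struct tcp_md5sig md5 = { 0 };
          struct sockaddr_in sa = { .sin_family = AF_INET };
          int fd, ret;

          if ((fd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
                  return 0;
          memcpy(&md5.tcpm_addr, &sa, sizeof(sa));
          md5.tcpm_keylen = 5;
          memcpy(md5.tcpm_key, "probe", 5);
          ret = setsockopt(fd, IPPROTO_TCP, TCP_MD5SIG, &md5, sizeof(md5));
          close(fd);
          return ret == 0; /* only then advertise HAVE_WORKING_TCP_MD5SIG */
  #else
          return 0;
  #endif
  }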
Test the new "jwt_decrypt_jwk" converter that takes a JWK as argument,
either as a string or in a variable.
Only "RSA" and "oct" types are managed for now.
This converter takes a private key in the JWK format (RFC 7517) that can
be provided as a string or via a variable.
The only keys managed for now are of type 'RSA' or 'oct'.
Add helper functions that convert a JWK (JSON representation of an RSA
private key) into an EVP_PKEY (containing the private key).
Those functions are not used yet; they will be used in the upcoming
'jwt_decrypt_jwk' converter.
QUIC frame type is encoded as a varint. Initially, haproxy parsed it as
a single byte, which was enough to cover frames defined in RFC9000.
The code has been extended recently to support multi-byte encoded
values, in anticipation of QUIC frame extension support. However, there
was no check on the varint format. On error, the type is erroneously
interpreted as a PADDING frame, since this serves as the initial value.
Thus the rest of the packet is incorrectly handled, with various
resulting effects, including infinite loops and/or crashes.
This patch fixes this by checking the return value of quic_dec_int(). If
the varint cannot be parsed, the connection is immediately closed.
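For context, QUIC varints encode their size in the two high bits of the
first byte; a decoder with the kind of bounds check the fix enforces
looks like this (sketch, not haproxy's actual quic_dec_int()):

  #include <stddef.h>
  #include <stdint.h>

  /* Returns the number of bytes consumed, or 0 if the buffer is too
   * short: the caller must treat 0 as a fatal parsing error instead of
   * keeping the initial PADDING value. (RFC 9000 section 16 encoding.) */
  static size_t quic_varint_dec(const uint8_t *buf, size_t len, uint64_t *val)
  {
          size_t sz;

          if (!len)
                  return 0;
          sz = (size_t)1 << (buf[0] >> 6); /* 1, 2, 4 or 8 bytes */
          if (len < sz)
                  return 0;
          *val = buf[0] & 0x3f;
          for (size_t i = 1; i < sz; i++)
                  *val = (*val << 8) | buf[i];
          return sz;
  }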
This issue is assigned to CVE-2026-26080 report.
This must be backported up to 3.2.
Reported-by: Asim Viladi Oglu Manizada <manizada@pm.me>
Token parsing code on INITIAL packet for the NEW_TOKEN format is not
robust enough and may even crash on some rare malformed packets.
This patch fixes this by adding a check on the expected length of the
received token. The packet is now rejected if the token does not match
QUIC_TOKEN_LEN. This check is legitimate as haproxy should only parse
tokens emitted by itself.
This issue has been introduced with the implementation of NEW_TOKEN
tokens parsing required for 0-RTT support.
This issue is assigned to CVE-2026-26081 report.
This must be backported up to 3.0.
Reported-by: Asim Viladi Oglu Manizada <manizada@pm.me>
When a duplicated CRYPTO frame is received during handshake, a server
may consider that there was a packet loss and immediately retransmit its
pending CRYPTO data without having to wait for PTO expiration. However,
RFC 9002 indicates that this should only be performed at most once per
connection to avoid excessive packet transmission.
The QUIC connection is flagged with QUIC_FL_CONN_HANDSHAKE_SPEED_UP to
mark that a fast retransmit has been performed. However, during the
refactoring of CRYPTO handling with the storage conversion from ncbuf to
ncbmbuf, the check on the flag was accidentally removed. The faulty patch
is the following one:
commit f50425c021
MINOR: quic: remove received CRYPTO temporary tree storage
This patch adds back the check on QUIC_FL_CONN_HANDSHAKE_SPEED_UP
before initiating fast retransmit. This ensures this is only performed
once per connection.
This must be backported up to 3.3.
In srv_dynamic_maxconn(), we'll decide that the max number of connections
is the server's maxconn if 1) the proxy's number of connections is over
fullconn, or 2) minconn was not set.
Check if minconn is not set first, as it will be true most of the time,
and as the proxy's "beconn" variable is in a busy cache line, it can be
costly to access it, while minconn/maxconn is in a cache line that
should very rarely change.
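In short, the test now reads along these lines (sketch, field names
taken from the text above, not the verbatim code):

  /* cheap, usually-true check first: only touch the contended
   * px->beconn counter when minconn is actually set */
  if (!srv->minconn || px->beconn >= px->fullconn)
          max = srv->maxconn;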
Oto Valek rightfully reported in issue #3262 that the proxy-protocol
doc makes no mention of the packed attribute on struct pp2_tlv_ssl,
which is mandatory since fields are not type-aligned in it. Let's
add it in the definition and make an explicit mention about it to
save implementers from wasting their time trying to debug this.
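For reference, the structure from the spec with the attribute now
mentioned; without it, most compilers would insert 3 bytes of padding
between <client> and <verify>, breaking wire compatibility:

  #include <stdint.h>

  struct pp2_tlv {
          uint8_t type;
          uint8_t length_hi;
          uint8_t length_lo;
          uint8_t value[0];
  } __attribute__((packed));

  struct pp2_tlv_ssl {
          uint8_t  client;
          uint32_t verify;  /* not naturally aligned at offset 1 */
          struct pp2_tlv sub_tlv[0];
  } __attribute__((packed));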
It can be backported.
The documentation of @system-ca specifies that one can override the
value provided by the SSL library using SSL_CERT_DIR.
However, it seems like X509_get_default_cert_dir() is not affected by
this environment variable, and X509_get_default_cert_dir_env() needs to
be used in order to get the variable name and fetch the value manually.
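A sketch of the resulting lookup (real OpenSSL calls; the surrounding
function is illustrative):

  #include <stdlib.h>
  #include <openssl/x509.h>

  static const char *system_ca_dir(void)
  {
          /* honour the variable named by the library (usually
           * "SSL_CERT_DIR") before the compiled-in default */
          const char *dir = getenv(X509_get_default_cert_dir_env());

          if (!dir)
                  dir = X509_get_default_cert_dir();
          return dir;
  }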
This could be backported to every stable branch. Note that older
branches don't have the memprintf in ssl_sock.c.
SSL libraries built manually might lack the right
X509_get_default_cert_dir() value.
The common way to fix the problem is to build openssl with
./configure --openssldir=/etc/ssl/
In order to verify this setting, output it with haproxy -vv.
Given that we already have "set profiling task", it's easy to permit
enabling/disabling the lock and/or memory profiling at run time. However,
the change will only be applied the next time task profiling is switched
from off/auto to on.
The patch is very minor and is best viewed with git show -b because it
indents a whole block that moves in a "if" clause.
This can be backported to 3.3 along with the two previous patches.
In continuity of previous patch, this one makes use of the new profiling
flags. For this, based on the global "profiling" setting, when switching
profiling on, we set or clear two flags on the thread context,
TH_FL_TASK_PROFILING_L and TH_FL_TASK_PROFILING_M to indicate whether
lock profiling and/or malloc profiling are desired when profiling is
enabled. These flags are checked along with TH_FL_TASK_PROFILING to
decide when to collect time around a lock or a malloc. And by default
we're back to the behavior of 3.2 in that neither lock nor malloc times
are collected anymore.
This is sufficient to see the CPU usage spent in the VDSO to significantly
drop from 22% to 2.2% on a highly loaded system.
This should be backported to 3.3 along with the previous patch.
Damien Claisse reported in issue #3257 a performance regression between
3.2 and 3.3 when task profiling is enabled, more precisely after the
following patches were merged:
98cc815e3e ("MINOR: activity: collect time spent with a lock held for each task")
503084643f ("MINOR: activity: collect time spent waiting on a lock for each task")
9d8c2a888b ("MINOR: activity: collect CPU time spent on memory allocations for each task")
The issue mostly comes from the first two patches. What happens is that
the local time is taken when entering and leaving each lock, which costs
a lot on a contended system. The problem here is the lack of fine-grained
settings for lock and malloc profiling.
This patch introduces a better approach. The task profiler goes back to
its default behavior in on/auto modes, but the configuration now accepts
new extra options "lock", "no-lock", "memory", "no-memory" to precisely
indicate other timers to watch for each task when profiling turns on.
This is achieved by setting two new flags HA_PROF_TASKS_LOCK and
HA_PROF_TASKS_MEM in the global "profiling" variable.
This patch only parses the new values and assigns them to the global
variable from the config file for now. The doc was updated.
Commit 3674afe8a0 ("BUG/MEDIUM: threads: Atomically set TH_FL_SLEEPING
and clr FL_NOTIFIED") accidentally left a strange-looking line wrapping
making one think of an editing mistake, let's fix it and keep it on a
single line given that even indented wrapping is almost as large.
This can be backported with the fix above till 2.8 to keep the patch
context consistent between versions.
An issue was introduced in 3.0 with commit faa8c3e024 ("MEDIUM: lb-chash:
Deterministic node hashes based on server address"): the new server_key
field and lb_nodes entries initialization were not updated for servers
added at run time with "add server": server_key remains zero and the key
used in lb_node remains the one depending only on the server's ID.
This will cause trouble when adding new servers with consistent hashing,
because the hash-key will be ignored until the server's weight changes
and the key difference is detected, leading to its recalculation.
This is essentially caused by the poorly placed lb_nodes initialization
that is specific to lb-chash and had to be replicated in the code dealing
with server addition.
This commit solves the problem by adding a new ->server_init() function
in the lbprm proxy struct, that is called by the server addition code.
This also allows to abandon the complex check for LB algos that was
placed there for that purpose. For now only lb-chash provides such a
function, and calls it as well during initial setup. This way newly
added servers always use the correct key now.
While it should also theoretically have had an impact on servers added
with the "random" algorithm, it's unlikely that the difference between
proper server keys and those based on their ID could have had any visible
effect.
This patch should be backported as far as 3.0. The backport may be eased
by a preliminary backport of previous commit "CLEANUP: lb-chash: free
lb_nodes from chash's deinit(), not global", though this is not strictly
necessary if context is manually adjusted.
There's an ambiguity regarding the ownership of lb_nodes in chash: it's
allocated by chash but freed by the server code in srv_free_params() from
srv_drop() upon deinit. Let's move this free() call to a chash-specific
function
which will own the responsibility for doing this instead. Note that
the .server_deinit() callback is properly called both on proxy being
taken down and on server deletion.
For "add backend" implementation, postparsing code in
check_config_validity() from cfgparse.c has been extracted in a new
dedicated function named proxy_finalize() into proxy.c.
This has caused an unexpected compilation issue, as in the latter file
the TLSEXT_TYPE_application_layer_protocol_negotiation macro may be
undefined, in particular when building without QUIC support. Thus, code
related to the default ALPN on binds is discarded after the preprocessing
stage.
Fix this by including openssl-compat header file into proxy source file.
This should be sufficient to ensure SSL related defines are properly
included.
This should fix recent issues on SSL regtests.
No need to backport.
Emeric suggested that it's sometimes convenient to instantly know if a
client has advertised support for window scaling or timestamps for
example. While the info is present in the TCP options output, it's hard
to extract since it respects the options order.
So here we're extending the 56-bit fingerprint with 8 extra bits that
indicate the presence of options 2..8, and any option above 9 for the
last bit. In practice this is sufficient since higher options are not
commonly used. Also TCP option 5 is normally not sent on the SYN (SACK,
only SACK_perm is sent), and echo options 6 & 7 are no longer used
(replaced with timestamps). These fields might be repurposed in the
future if some more meaningful options are to be mapped (e.g. MPTCP,
TFO cookie, auth).
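A sketch of how such presence bits could be folded into the fingerprint
(the bit layout is inferred from the description above, not taken from
the code):

  #include <stdint.h>

  /* map TCP option kinds 2..8 to bits 0..6, any kind >= 9 to bit 7,
   * then OR the byte into the top 8 bits of the 56-bit fingerprint */
  static uint64_t fp_add_opt_bits(uint64_t fp56, const uint8_t *kinds, int nb)
  {
          uint8_t bits = 0;

          for (int i = 0; i < nb; i++) {
                  if (kinds[i] >= 2 && kinds[i] <= 8)
                          bits |= 1 << (kinds[i] - 2);
                  else if (kinds[i] >= 9)
                          bits |= 1 << 7;
          }
          return fp56 | ((uint64_t)bits << 56);
  }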
When a backend is created at runtime, the new proxy instance is inserted
at the end of proxies_list. This operation is buggy if this list is
empty: the code causes a null dereference which will lead to a crash.
This also triggers the following compilation warning:
CC src/proxy.o
src/proxy.c: In function 'cli_parse_add_backend':
src/proxy.c:4933:36: warning: null pointer dereference [-Wnull-dereference]
4933 | proxies_list->next = px;
| ~~~~~~~~~~~~~~~~~~~^~~~
This patch fixes this issue. Note that in reality it cannot occur at
this moment as proxies_list cannot be empty (haproxy requires at least
one frontend to start, and the list also always contains internal
proxies).
No need to backport.
This patch fixes the following compilation error :
src/proxy.c:4954:12: error: format string is not a string literal
(potentially insecure) [-Werror,-Wformat-security]
4954 | ha_notice(msg);
| ^~~
No need to backport.
Add a new regtest to validate backend creation at runtime. A server is
then added and requests are made to validate the newly created instance,
both before (with force-be-switch) and after publishing.
Implement proxy ID generation for dynamic backends. This is performed
through the already existing function proxy_get_next_id().
As an optimization, the lookup will be performed starting from a global
variable <dynpx_next_id>. It is initialized to the greatest ID assigned
after parsing, and updated each time a backend instance is created. When
backend deletion is implemented, it could be lowered to the newly
available slot.
Implement the required operations for the "add backend" handler. This
requires a new proxy allocation, settings copied from the specified
default instance, and proxy config finalization. All handlers registered
via REGISTER_POST_PROXY_CHECK() are also called on the newly created
instance.
If no errors were encountered, the newly created proxy is finally
attached to the proxies list.
This commit completes the "add backend" handler with some checks
performed on the specified default proxy instance. These are additional
checks outside of the already existing inheritance rules, specific to
dynamic backends.
For now, a default proxy is considered incompatible if it is not in
TCP/HTTP mode. Also, a default proxy is rejected if it references HTTP
errors. This limitation may be lifted in the future, when HTTP errors
are partially reworked.
Add an optional "mode" argument to "add backend" CLI command. This
argument allows to specify if the backend is in TCP or HTTP mode.
By default, it is mandatory, unless the inherited default proxy already
explicitely specifies the mode. To differentiate if TCP mode is implicit
or explicit, a new proxy flag PR_FL_DEF_EXPLICIT_MODE is defined. It is
set for every defaults instances which explicitely defined their mode.
Define a basic CLI handler for "add backend".
For now, this handler only performs a parsing of the name argument and
returns an error if a duplicate already exists. It runs under thread
isolation, to guarantee thread safety during the proxy creation.
This feature is considered in development. CLI command requires to set
experimental-mode.
Move the backend compatibility checks performed during 'add server' into
a dedicated function be_supports_dynamic_srv(). This should simplify the
addition of future restrictions.
This function will be reused when implementing backend creation at
runtime.
Define a new utility function str_to_proxy_mode() which is able to
convert a string into the corresponding proxy mode if possible. This new
function is used for the parsing of the "mode" proxy configuration
keyword.
This patch will be reused for the dynamic backend implementation, in
order to parse a similar "mode" argument via a CLI handler.
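A plausible shape for such a helper (the PR_MODE_* constants exist in
haproxy, but the exact prototype here is an assumption):

  #include <string.h>

  /* convert "tcp"/"http" to the matching PR_MODE_* value;
   * returns 1 on success, 0 if the string is not a known mode */
  static int str_to_proxy_mode(const char *str, int *mode)
  {
          if (strcmp(str, "tcp") == 0)
                  *mode = PR_MODE_TCP;
          else if (strcmp(str, "http") == 0)
                  *mode = PR_MODE_HTTP;
          else
                  return 0;
          return 1;
  }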
If a proxy is referencing a defaults instance, some checks must be
performed to ensure that inheritance will be compatible. The refcount of
the defaults instance may also be incremented if some settings cannot be
copied. This operation is performed when parsing a new proxy or defaults
section which references a defaults instance, either implicitly or
explicitly.
This patch extracts this code into a dedicated function named
proxy_ref_defaults(). This in turn may call defaults_px_ref()
(previously called proxy_ref_defaults()) to increment its refcount.
The objective of this patch is to be able to reuse defaults inheritance
validation for dynamic backends created at runtime, outside of the
parsing code.
A lot of proxy initialization code is delayed to the post-parsing stage,
as it depends on the configuration being fully parsed. This is performed
via a loop on proxies_list.
Extract this code into a dedicated function proxy_finalize(). This patch
will be useful for dynamic backend creation.
Note that for the moment the code has been extracted as-is. With each
new feature, some init code was added there. This has become a giant
loop with no real ordering. A future patch may provide some cleanup in
order to reorganize this.
Default proxies validation occurs during post-parsing. The objective is
to report any tcp/http-rules which could not behave as expected.
Previously, this was performed while looping over the standard proxies
list, when such a proxy references a defaults instance. This was enough,
as only named referenced proxies were kept after parsing. However, this
is not the case anymore in the context of dynamic backend creation at
runtime.
As such, this patch now performs validation on every named defaults
instance outside of the standard proxies list loop. This should not
cause any behavior difference, as defaults are validated without using
the proxies which rely on them.
Along with this change, the PR_FL_READY proxy flag is removed. Its usage
was only really needed for defaults, to avoid validating the same
instance multiple times. With the validation of defaults in their own
loop, it is now redundant.
Fix an unhandled strdup() failure when initializing global.log_tag.
The bug was introduced with the UAF fix for the global progname pointer
in 351ae5dbe. So it must be backported as far as 3.1.
Initially, when init_early was introduced, the progname string was a
local variable used for temporary storage of log_tag. Now it is global
and sufficiently detached from log_tag. Thus, in the past a log_tag
allocation failure could be reported, but this was no longer the case.
Must be backported to where the progname string became global, that is
v3.1-dev9-96-g49772c55e.
Defer checking the max threads per group number until we're done
parsing the configuration file, as it may be set after a "thread-group"
directive. Otherwise the default value of 64 will be used, even if there
is a max-threads-per-group directive.
This should be backported to 3.3.
Give global.maxthrpertgroup its default value at global creation,
instead of later when we're trying to detect the thread count.
It is used when verifying the configuration file validity, and if it was
not set in the config file, in a few corner cases, the value of 0 would
be used, which would then reject perfectly fine configuration files.
This should be backported to 3.3.
Document the mworker V3 implementation introduced in HAProxy 3.1.
Explain the rationale behind moving configuration parsing out of the
master process to improve robustness.
Could be backported to 3.1.