OpenTracing support has long been best-effort and was deprecated in 3.3
with removal planned in 3.5. Let's clean it up now.
This commit removes addons/ot, the build script, ARGC_OT, USE_OT and
OT_* variables in the Makefile, and replaces the config section with a
mention for the OpenTelemetry filter instead.
For more info, see GH issues #1640 and #2782, as well as the wiki's
"breaking changes" page.
Since a connection's target may no longer be a proxy and is necessarily
a server, let's simplify such checks. This is essentially in mux install
code and in the debugging code.
These ones were deprecated in 3.3-dev2 with commits 5c15ba5eff ("MEDIUM:
proxy: mark the "dispatch" directive as deprecated") and e93f3ea3f8
("MEDIUM: proxy: deprecate the "transparent" and "option transparent"
directives"), and were planned for removal in 3.5. See also:
https://github.com/orgs/haproxy/discussions/2921
as well as the wiki page about breaking changes.
They've lived their lives and always cause internal limitations
(exceptions between connecting to server or connecting to proxy), and
are even confusing to some extent (especially "transparent" which users
often get wrong).
This commit removes the ability to configure them, tests based on them
and all the doc related to them. The keywords remain detected by the
parser and indicate how to proceed instead.
It's likely that other deeper parts will be changed as well (e.g.
conn->target will no longer be of OBJ_TYPE_PROXY). This will be done
over the long term.
This adds a new class, TL_RT, which is processed before other queues for
one (and only one) tasklet featuring the TASK_RT flag. This is meant to
process real time wakeups under load with even less latency. We only
process one entry to make sure it will not be abused for unimportant
stuff, and if tune.sched.low-latency is set, we also avoid picking more
tasks from the current run queues and looping after the first call to
run_tasks_from_list().
Measurements under a load of 10k concurrent conns injection at 10 Gbps
(~58k 20kB objects/s) on 4 threads and with task profiling enabled show
that the average wakeup latency for wakeups every 10ms dropped from 220
microseconds to 1.8 microseconds, and even ~550 nanoseconds when
tune.sched.low-latency is set, or 400 times less.
The doc was updated, including the schematics.
For some very rare tasks that need to be woken up at an exact date (right
now the only known use case is haload's periodic stats collection), it's
currently difficult to guarantee the wake up date on a heavily loaded
run queue.
This patch introduces TASK_RT for real-time tasks. Right now, all it does
is modify __task_wakeup() to immediately switch to __tasklet_wakeup_*()
and effectively bypass the priority-based run queue. Doing it here has
the benefit of making sure that it automatically applies to tasks found
in the wait queue, and that it will also work for _task_drop_running().
For now nothing uses it. The doc was updated.
The ambiguity in usage for __tasklet_wakeup_on() is now gone. All known
callers that used to be able to pass a negative value now call
__tasklet_wakeup_here(), and remaining ones always pass an explicit
thread number. This means that we can remove the "if (thr<0)" branch,
but still leave a BUG_ON_HOT() to catch any possibly missed case. The
comment around tasklet_wakeup_on() not supporting remotely waking a
tasklet whose tid<0 was also removed since it was addressed long ago.
This patch moves the tid check higher up in the chain, in task_instant_wakeup()
so as to branch to _tasklet_wakeup_here() for run-anywhere tasks, or
_tasklet_wakeup_on() for designated threads.
At this point there is no longer any direct caller of __tasklet_wakeup_on()
passing a negative thread value.
This patch moves the tid check higher up in the chain, in tasklet_wakeup()
so as to branch to _tasklet_wakeup_here() for run-anywhere tasklets, or
_tasklet_wakeup_on() for designated threads. The tid is retrieved via
__task_get_current_owner() so that the call remains compatible with
tasklets that would have a super-negative tid due to being tasks used
as tasklets.
The current tasklet_wakeup() call relies on tasklet_wakeup_on(tl->tid),
which was already quite ambiguous due to its sole reliance on tid being
negative or not to decide to run locally. It no longer works correctly
when used to wake tasks up now that ->tid can take a new set of negative
values (particularly if some code calls __tasklet_wakeup_on() on a task,
as is done in task_instant_wakeup()).
The problem is that it is not possible in the current API to explicitly
say that we want a task/tasklet to run locally or remotely without having
to play games with a thread number. The chosen approach to address this
is to change tasklet_wakeup_on() to always be remote and have
tasklet_wakeup_here() which will always be local, with tasklet_wakeup()
choosing one or the other depending on the tid, for backwards compat
only.
This patch implements tasklet_wakeup_here() and __tasklet_wakeup_here(),
which reimplement the part of __tasklet_wakeup_on() that used to deal
with the local thread only (negative tid). No other change was made.
For now it remains unused.
The doc was updated.
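As an illustration only, the backwards-compatible choice made by
tasklet_wakeup() can be sketched as below. The enum, the struct and
choose_wakeup() are hypothetical names that do not exist in the tree; only
the rule "negative tid runs locally, otherwise run on the designated
thread" comes from the description above.

```c
/* Hypothetical sketch, not the actual scheduler code. */
enum wake_mode {
	WAKE_HERE, /* always local: what tasklet_wakeup_here() stands for */
	WAKE_ON,   /* always remote-capable: what tasklet_wakeup_on() stands for */
};

struct wake_choice {
	enum wake_mode mode;
	int thr;   /* explicit thread number for WAKE_ON, -1 for WAKE_HERE */
};

/* Picks the wakeup flavour from a tasklet's tid, for backwards compat:
 * a negative tid means "run anywhere", hence on the calling thread.
 */
static struct wake_choice choose_wakeup(int tid)
{
	struct wake_choice c;

	if (tid < 0) {
		c.mode = WAKE_HERE;
		c.thr = -1;
	} else {
		c.mode = WAKE_ON;
		c.thr = tid;
	}
	return c;
}
```

New code is expected to call the explicit variant directly instead of
going through such a dispatch.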
The checks on TH_FL_TASK_PROFILING that are used to decide whether or not
to set t->wake_date from now_mono_time() used to be made in callers of
__tasklet_wakeup_on() and __tasklet_wakeup_after(). Not only does this
needlessly inflate the code by placing the check in every caller (~4kB),
it also renders the design fragile since each caller needs to blindly
copy-paste that statement.
Let's move the operation into the callees instead. As a bonus, this allows
checking the flag on the target thread and not on the calling thread
(which was arguably a bug though without a noticeable effect since for
now profiling is for all threads or none).
Refactor the Lua HTTP client to defer initialization. core.httpclient()
no longer initializes the internal HTTP client immediately. Instead,
initialization now occurs within hlua_httpclient_send() when a request
method (e.g., get, put, head) is invoked.
The HTTPClient class now serves as a factory for accessing methods, while
a new class, HTTPClientRequest, has been introduced to represent individual
requests and manage the HTTP client lifecycle.
This change allows multiple requests to be executed using a single
HTTP client instance:
local hc = core.httpclient()
local res1 = hc:get({url = "...", headers = ...})
local res2 = hc:post({url = "...", headers = ...})
local res3 = hc:put({url = "...", headers = ...})
This refactor maintains backward compatibility, as existing scripts that
instantiate a new core.httpclient() for every request will continue to
work as expected.
Move the lua httpclient code from hlua.c to http_client.c
The code is almost the same, except for the registration of the class, which
is now done in hlua_http_client_init_state(), registered through
REGISTER_HLUA_STATE_INIT(). check_args() calls have been replaced by
hlua_check_args().
hlua_httpclient_destroy_all() is exported so it can be called in hlua.c.
hlua_httpclient_table_to_hdrs() is made static.
hlua_pusherror() and check_args() are being exported.
check_args() is now a macro to hlua_check_args() so it's not confusing
when called outside hlua.c.
Now that there is no longer a shared wait queue, chances are if a shared task
is scheduled, it will always end up on the same thread. In
wake_expired_tasks(), when a task has to be woken up, randomly look at
three other threads, and if the runqueue of the current thread is at least
twice as big as the runqueue of one of the other threads, then give
that task to that thread, so that our load gets reduced.
If we're giving the task to another thread, then we have to add the
TASK_RUNNING flag until we have woken it up, otherwise the other thread
could just run it, if it gets woken up from another path, and free it
while we're still not done with it.
2 times has been chosen somewhat arbitrarily, and may be tweaked at a
later date if deemed not optimal.
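A minimal sketch of the selection rule, assuming a plain array of run queue
lengths and rand() as the random source. pick_target_thread(), NB_PROBES
and LOAD_RATIO are hypothetical names, and the guard against an empty local
run queue is an addition of this sketch, not something stated above:

```c
#include <stdlib.h>

#define NB_PROBES  3 /* number of other threads sampled, per the text */
#define LOAD_RATIO 2 /* "2 times", chosen somewhat arbitrarily */

/* Returns the thread that should receive the task: <self>, or one of up to
 * NB_PROBES randomly probed threads whose run queue is at most half as long
 * as ours. <rq_len> holds one run queue length per thread.
 */
static int pick_target_thread(const unsigned int *rq_len, int nbthread, int self)
{
	int i;

	if (nbthread < 2)
		return self;

	for (i = 0; i < NB_PROBES; i++) {
		/* pick a random thread other than ourselves */
		int other = rand() % (nbthread - 1);

		if (other >= self)
			other++;

		/* sketch-only guard: nothing to offload when we are idle */
		if (rq_len[self] > 0 &&
		    rq_len[self] >= LOAD_RATIO * rq_len[other])
			return other;
	}
	return self;
}
```

The sketch leaves out the TASK_RUNNING handling described above, which is
what protects the task while it is being handed over to the other thread.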
Modify task_instant_wakeup() to use __task_set_state_and_tid().
It uses the new ownership behavior, but that's okay because
task_instant_wakeup() was not used anywhere.
Totally remove the per-thread group wait queue. This was potentially a
source of contention, because there was only one global lock for all
those wait queues.
Instead, for shared tasks, there is now the concept of ownership for the
task. When a task is in the wait queue, run queue, or is running on that
particular thread, the task's tid is set to -2 - thread_tid, and only
that thread will be responsible for it until it is no longer running,
and in none of its queues.
When a shared task is scheduled to be run at a later time, if its
current tid is -1, then the current thread will take ownership, and put
it in its own wait queue. If it is already owned, then TASK_WOKEN_WQ is
added to the task's state, and a task_wakeup() is done, so that the
owner thread will add it in its wait queue.
If there is any owner, then a task_wakeup() will just add the task to
the owner's runqueue, otherwise the current thread will become the
owner.
Introduce a new function, __task_get_current_owner, that returns the
owner of a task based on its current tid.
-1 means there is no current owner, otherwise either the tid is >= 0, in
which case it will just return it, or it's < -1, in which case it will
return -2 - tid, the tid of the thread with the current ownership.
Introduce __task_get_new_tid_field(), that provides the tid to be used
for a task.
For shared tasks, to mark temporary ownership of a task, instead of -1,
the tid will be set to -2-tid, tid being the tid of the current thread.
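The arithmetic of this encoding and of its inverse can be sketched as
follows. tid_encode_owner() and tid_decode_owner() are hypothetical helper
names used for illustration; they are not the functions introduced by
these patches, only the -2-tid rule comes from the text.

```c
/* Hypothetical helpers; only the arithmetic comes from the description. */

/* Encodes temporary ownership of a shared task by thread <thr> (>= 0).
 * The result is always <= -2 so it cannot be confused with -1 (unowned).
 */
static inline int tid_encode_owner(int thr)
{
	return -2 - thr;
}

/* Returns the thread currently responsible for a task given its tid field,
 * or -1 if the task has no owner.
 */
static inline int tid_decode_owner(int tid_field)
{
	if (tid_field >= -1)
		return tid_field;  /* -1: no owner, >= 0: bound to that thread */
	return -2 - tid_field;     /* < -1: shared task temporarily owned */
}
```

The mapping is its own inverse for values below -1, so thread 0 is stored
as -2, thread 1 as -3, and so on.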
Introduce a new function, __task_set_state_and_tid, that can atomically
set a task's state and its tid. This will be used later, as the tid will
be used to indicate task ownership even for shared tasks.
Add EVENT_HDL_SUB_ACME_DEPLOY to the ACME family. It is published in
the dns-01 challenge path after the TXT record information has been
prepared, carrying the certificate store name, domain, account
thumbprint, dns_record value, and optionally the provider and vars
strings.
Lua subscribers using core.event_sub() receive the event data as an
AcmeEvent object, which is the same class used for ACME_NEWCERT and
carries the fields relevant to the event type.
Add a new EVENT_HDL_SUB_ACME_NEWCERT event type in the ACME family.
It is published after a new certificate has been successfully fetched
and installed. The event carries the certificate store name, allowing
subscribers to act on newly available certificates.
Lua subscribers using core.event_sub() receive the event data as an
AcmeEvent object with a crtname field containing the certificate store
name.
Right now the only way to report info that is only displayed in diag
mode with -dD is to use ha_diag_warning(). The problem is that this is
then counted as a warning and may result in errors when combined with
-dW, as happens for the CPU topology info:
$ printf "global\nstats socket /tmp/sock1\n" | ./haproxy -dD -dW -c -f /dev/stdin; echo $?
[NOTICE] (10406) : haproxy version is 3.5-dev0-5091ac-35
[NOTICE] (10406) : path to executable is ./haproxy
[DIAG] (10406) : Created 20 threads split into 2 groups
[ALERT] (10406) : Some warnings were found and 'zero-warning' is set. Aborting.
1
We need another level. This commit introduces ha_diag_notice() which only
emits a notification that doesn't count as a warning. Note that we could
even introduce an info level and revisit various messages so that notice
only reports certain events while info is for anything (like versions
above). That could be a future improvement.
The Linux tls module requires a socket to be in TCP_ESTABLISHED state
before we can enable the TLS ULP on the socket. If the socket is in any
other state, the setsockopt() call will fail, and we won't use
kTLS on that socket.
To make sure we're not doing it too early, defer it until the TLS
handshake is done, which means the TCP connection is established.
This should be backported up to 3.3.
Signed-off-by: Karol Kucharski <kkucharski@fastlogic.pl>
__LJMP, WILL_LJMP() and MAY_LJMP() were defined locally in hlua.c,
making them unavailable to other modules that implement Lua bindings.
Move them to include/haproxy/hlua.h so they can be used outside of
hlua.c.
Add a registration mechanism so that modules outside of hlua.c can hook
into each lua_State creation. Modules call hap_register_hlua_state_init()
(or the REGISTER_HLUA_STATE_INIT() macro) with a callback of the form:
int my_init(lua_State *L, char **errmsg);
The callback returns an ERR_* code. ERR_ALERT and ERR_WARN trigger
ha_alert()/ha_warning() respectively; any other non-zero errmsg is
emitted via ha_notice(). ERR_FATAL or ERR_ABORT cause exit(1).
Registered entries are freed in hlua_deinit().
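For readers unfamiliar with this kind of hook list, here is a simplified,
self-contained sketch of the pattern. All names below (register_state_init,
run_state_inits, free_state_inits, the demo callbacks and DEMO_WARN) are
hypothetical and do not reflect the real implementation; lua_State is
only forward-declared so that the sketch builds without the Lua headers.

```c
#include <stdlib.h>

struct lua_State; /* opaque here, normally provided by <lua.h> */

typedef int (*state_init_fct)(struct lua_State *L, char **errmsg);

struct init_entry {
	state_init_fct fct;
	struct init_entry *next;
};

static struct init_entry *init_list;
static struct init_entry **init_tail = &init_list;

/* Appends <fct> to the list. Returns 0 on success, -1 on alloc failure. */
static int register_state_init(state_init_fct fct)
{
	struct init_entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->fct = fct;
	e->next = NULL;
	*init_tail = e;
	init_tail = &e->next;
	return 0;
}

/* Calls every registered callback in registration order for state <L> and
 * returns the OR of their return codes.
 */
static int run_state_inits(struct lua_State *L, char **errmsg)
{
	struct init_entry *e;
	int ret = 0;

	for (e = init_list; e; e = e->next)
		ret |= e->fct(L, errmsg);
	return ret;
}

/* Releases all entries, as a deinit function would do. */
static void free_state_inits(void)
{
	while (init_list) {
		struct init_entry *e = init_list;

		init_list = e->next;
		free(e);
	}
	init_tail = &init_list;
}

/* Two demo callbacks; DEMO_WARN is a placeholder, not a real ERR_* value */
#define DEMO_WARN 0x08
static int demo_calls;

static int demo_init_ok(struct lua_State *L, char **errmsg)
{
	(void)L; (void)errmsg;
	demo_calls++;
	return 0;
}

static int demo_init_warn(struct lua_State *L, char **errmsg)
{
	(void)L; (void)errmsg;
	demo_calls++;
	return DEMO_WARN;
}
```

In the sketch the caller of run_state_inits() would be the one deciding
how to report the message and whether to abort, based on the returned code.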
Extract the challenge-readiness logic from cli_acme_chall_ready_parse()
into a new acme_challenge_ready(crt, dns) function so it can be called
from other contexts such as Lua event handlers.
It slightly changes the messages on the CLI.
Having a single task to take care of idle connection cleanup across all
servers leads to high contention. It uses a lock to maintain its tree of
servers to track, and then can acquire the idle_conns lock for each thread.
Instead, have one task per thread. Each thread will maintain its own
tree, so there will be no need for any lock, and it will just acquire
its own idle_conns lock, so it will lead to less contention.
This is a performance improvement, so backporting is optional, but may be
considered if it is worth it. That would require backporting commit
6f8dab2583 too.
There's a corner case with get_trash_chunk_sz() combined with the use
of small bufs: if some incoming data is going to be inflated by a
converter in a non-predictable way (say url_enc etc) then there are
two possibilities:
- either we try to allocate a size that corresponds to the data, but
we risk allocating a small buf to convert a 900B chunk, that will
now fail if it contains too many non-printable chars;
- or we try to allocate 3x the size to be conservative, but without
large bufs we'd fail to transcode any chunk larger than 5.3kB, even
if it contains only printable chars.
The approach should definitely be refined and it is not 100% reliable
for now. Better temporarily ignore the small buffers for these particular
cases where the savings are not relevant, and see how to pass the knowledge
of the expected size ranges deeper down the API in 3.5. We may possibly rely
on the current trash size (instead of contents) or other mechanisms that
are yet to be specified. alloc_small_trash_chunk() gets the same change
BTW for the same reasons.
The comment for get_trash_chunk_sz() was updated to restate the importance
of being conservative when requesting a size.
No backport is needed.
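The figures quoted above can be checked with a few lines of arithmetic.
This sketch assumes a regular buffer of 16384 bytes, a small buffer of
1024 bytes and a worst-case expansion of 3 output bytes per input byte;
the helper names are hypothetical and the small buffer size is an
assumption of the sketch.

```c
#include <stddef.h>

/* Largest input guaranteed to fit in <bufsize> bytes when each input byte
 * may expand to up to <factor> output bytes.
 */
static size_t max_safe_input(size_t bufsize, unsigned int factor)
{
	return factor ? bufsize / factor : 0;
}

/* Returns non-zero if <len> input bytes are guaranteed to fit in <bufsize>
 * bytes after a worst-case expansion by <factor>.
 */
static int worst_case_fits(size_t len, unsigned int factor, size_t bufsize)
{
	return len <= max_safe_input(bufsize, factor);
}
```

With these assumptions, max_safe_input(16384, 3) gives 5461 bytes, i.e.
the ~5.3kB limit mentioned above, and a 900B chunk is only guaranteed to
fit in a 1024-byte buffer if it does not expand at all.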
Historically, we considered that a channel could not send before the
connection was established. This was useful to know if the reserve should
still be respected for the receives. This was because it was possible to
rewrite the request on connection retry (because of http-send-name-header
option).
However nowadays, it is a useless limitation. Once data forwarding is
started, there are no longer rewrites on the request at the stream layer
(http-send-name-header option is handled by the muxes). And, since it is
possible to use small buffers to queue requests, it could be an issue,
because the reserve and the small buffer size are the same by default. Once
a small request was finally dequeued, the receives on client side were not
re-armed because we should still respect the reserve on receives
(channel_recv_limit() was returning 0 in that case).
To fix the issue, we must consider that a channel may send as soon as the
underlying stconn has reached the SC_ST_REQ state, instead of SC_ST_EST.
Doing so, we
are able to ignore the reserve earlier and the receives can be re-armed even
with small buffers.
There is no reason to backport this patch, except if an issue is reported,
because only 3.4 is concerned. But it could theoretically be backported to
all stable versions.
This adds "-dA[file]" on the command line, which dumps an archive of all
dependencies detected at runtime into the designated file in tar format.
This is equivalent to "set-dumpable libs", but instead of keeping the libs
in memory, it dumps them into a file. This may be used after a core dump,
in order to provide developers with all the libraries they need to
analyze the core. This may not be available on all operating systems.
When shared libs were loaded via "set-dumpable libs", better release
them upon deinit, it will make valgrind happier. For this we now have
a new function free_collected_libs() in tools.c and call it in deinit().
In in46un_to_addr(), when copying a struct sockaddr_in6, copy the
sin6_flowinfo and sin6_scope_id, as they are part of the structure too.
They are unlikely to be of any use for us, but this is more correct
anyway.
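As an illustration, a field-by-field copy that does not forget these two
members could look like the sketch below. copy_sockaddr_in6() is a
hypothetical helper written for this example, it is not the function
being patched and does not share its prototype.

```c
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Copies IPv6 socket address <src> into <dst> field by field, including
 * sin6_flowinfo and sin6_scope_id. Hypothetical helper for illustration.
 */
static void copy_sockaddr_in6(struct sockaddr_in6 *dst,
                              const struct sockaddr_in6 *src)
{
	memset(dst, 0, sizeof(*dst));
	dst->sin6_family   = AF_INET6;
	dst->sin6_port     = src->sin6_port;
	dst->sin6_flowinfo = src->sin6_flowinfo;
	dst->sin6_addr     = src->sin6_addr;
	dst->sin6_scope_id = src->sin6_scope_id;
}
```

The scope ID matters for link-local addresses, where it designates the
interface, so dropping it can make such a copied address unusable.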
Allocating and freeing an OpenSSL EVP_PKEY_CTX context via
EVP_PKEY_CTX_new_id() and EVP_PKEY_CTX_free() on every HKDF cryptographic
operation (such as during stateless reset token generation) induces
unnecessary memory allocation overhead.
Optimize this by introducing a global per-thread context array
'quic_tls_hkdf_ctxs'. These contexts are allocated and initialized once
at startup via a POST_CHECK hook (quic_tls_alloc_hkdf_ctxs) and are
properly freed at exit via a POST_DEINIT hook (quic_tls_dealloc_hkdf_ctxs).
The functions quic_hkdf_extract(), quic_hkdf_expand(), and
quic_hkdf_extract_and_expand() now reuse the pre-allocated context
corresponding to the current thread ID ('tid'), removing dynamic
allocations from these frequent execution paths.
As a cleanup, quic_hkdf_expand() is now static and unexported from the
header file.
Should be easily backported to all versions for optimization purposes.
QPACK_LFL_WLN_BIT and related encoded field line bitmasks were defined
in both qpack-enc.c and qpack-dec.c. Moved them to qpack-t.h where
they are shared between encoder and decoder, eliminating the duplicate
definitions.
Should be backported to ease any further commit to come.
Although qpack_idx_to_name and qpack_idx_to_value are currently only
called within uncompiled debug code, they contained an index bug. They
passed absolute indexes directly to qpack_get_dte instead of relative
dynamic table indexes.
This patch fixes the logic by subtracting QPACK_SHT_SIZE and guarding
against static table index lookups.
Should be easily backported to all versions.
Thanks to previous patches, the request messages are now sanitized to
properly handle Upgrade requests. Now, if a 'connection: upgrade' header
value was found but no 'Upgrade' header, the 'upgrade' value is removed
from the 'connection' header. The opposite is also performed: if an
'Upgrade' header was found but no 'connection: upgrade' header value, all
occurrences of the 'Upgrade' header are refused.
This patch depends on following ones:
* MINOR: h1: Add a H1M flag to specify a non-empty 'Upgrade:' header was parsed
* MINOR: http: Add function to remove all occurrences of a value in a header
It should fix the issue 3397. But the H2 part should be reviewed too, and
probably the H1 response parsing, to be consistent with this change.
The series should be backported as far as 2.4.
http_remove_header_value() function was added to parse a header value and
remove all occurrences of a specific value.
This patch is mandatory to fix a bug.
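To give an idea of what such an operation involves, here is a standalone
sketch working on a NUL-terminated, comma-separated value. It is not the
function added by this patch: remove_list_value() is a hypothetical name,
the sketch works on a plain C string, it ignores quoted strings, and it
normalizes the separators of the remaining elements to a single comma.

```c
#include <string.h>
#include <strings.h>

/* Removes every occurrence of <val> (case-insensitive) from the
 * comma-separated list in <hdr>, in place. Empty elements are dropped
 * and the remaining ones are re-joined with a single ','. Returns the
 * number of removed occurrences.
 */
static int remove_list_value(char *hdr, const char *val)
{
	size_t vlen = strlen(val);
	char *in = hdr, *out = hdr;
	int removed = 0;

	while (*in) {
		char *next = strchr(in, ',');
		char *end = next ? next : in + strlen(in);
		char *start = in;

		/* trim blanks around the current element */
		while (start < end && (*start == ' ' || *start == '\t'))
			start++;
		while (end > start && (end[-1] == ' ' || end[-1] == '\t'))
			end--;

		/* move past the element and its separator */
		in = next ? next + 1 : in + strlen(in);

		if ((size_t)(end - start) == vlen &&
		    strncasecmp(start, val, vlen) == 0) {
			removed++;
			continue;
		}
		if (end == start)
			continue;

		/* the output never overtakes the input: at most one
		 * separator is emitted per separator consumed.
		 */
		if (out != hdr)
			*out++ = ',';
		memmove(out, start, (size_t)(end - start));
		out += end - start;
	}
	*out = '\0';
	return removed;
}
```

Applied to "keep-alive, Upgrade" with the value "upgrade", the sketch
leaves "keep-alive" and reports one removal.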
The current PRNG is xoroshiro128**, it was introduced in 2.2 with
commit 52bf83939 ("BUG/MEDIUM: random: implement a thread-safe and
process-safe PRNG"). It features a 2^128 sequence and can perform
2^64 or 2^96 jumps, though only the 2^96 jump is implemented. It
was initially designed to support both processes and threads, and
implements a shared state between threads instead of allocating
distinct sequences based on PID and thread numbers.
Since then, the PRNG's usage grew and processes have disappeared,
but the lock or the DWCAS are still there due to its shared nature,
and it's possible to trigger watchdog warnings by issuing 100 UUIDs
in a single log-format string.
Also, UUID and QUIC retry tokens now consume 128 bits from the PRNG
in two 64-bit calls, and used to weaken the PRNG by rapidly disclosing
its internal state on reasonably idle systems. This indicates that
most of the time we now need 128 bits.
This patch modernizes the internal generator by switching to xoshiro256**,
which has comparable properties (it's even faster), and features even
longer 2^256 periods, still returning 64 bits per call. It can be
initialized with 2^128 and 2^192 jumps. More details here:
https://prng.di.unimi.it/
https://prng.di.unimi.it/xoshiro256starstar.c
Here we implement a thread-local state instead of the old shared one,
so there is no more need for synchronization. The state is seeded at
boot, and each thread performs as many 2^192 jumps as its TID value.
The master process performs a 2^128 jump where it used to
perform a 2^96 jump so that it doesn't overlap with any worker thread.
However a cleaner approach could be to perform a 2^128 jump for each
fork() (here the worker) and 2^192 for each thread. This might be for
a future improvement.
ha_random64_internal() is now the new PRNG, so that everything else
remains totally transparent. _ha_random64_pair_hashed() continues to
hash the first 128 bits of the state.
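For reference, the core step of xoshiro256** on a thread-local state can
be sketched as below. This follows the public reference algorithm and is
not the code from this patch: prng_state, prng_seed() and prng_next() are
hypothetical names, and the 2^128/2^192 jump functions are deliberately
left out (their polynomial constants are in the reference source linked
above).

```c
#include <stdint.h>

/* one independent state per thread: no lock nor DWCAS is needed */
static _Thread_local uint64_t prng_state[4];

static inline uint64_t rotl64(uint64_t x, int k)
{
	return (x << k) | (x >> (64 - k));
}

/* Seeds the calling thread's state. It must not be all zeroes. */
static void prng_seed(const uint64_t seed[4])
{
	int i;

	for (i = 0; i < 4; i++)
		prng_state[i] = seed[i];
}

/* Returns 64 random bits and advances the calling thread's state by one
 * step of xoshiro256**.
 */
static uint64_t prng_next(void)
{
	uint64_t *s = prng_state;
	const uint64_t result = rotl64(s[1] * 5, 7) * 9;
	const uint64_t t = s[1] << 17;

	s[2] ^= s[0];
	s[3] ^= s[1];
	s[1] ^= s[2];
	s[0] ^= s[3];
	s[2] ^= t;
	s[3] = rotl64(s[3], 45);
	return result;
}
```

In this sketch, two threads seeded identically would produce the same
sequence, which is precisely what the per-thread jumps described above
are meant to avoid.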
A simple config generating 100 UUIDs on 20 threads jumps from 135k to
1.25M req/s, which translates to a bump from 13.5M to 125M UUID/s,
or 9 times faster. And no DWCAS can be seen anymore in perf top:
Before: 13.5M/s
Overhead Shared Object Symbol
99.04% haproxy [.] ha_random64_internal
0.66% haproxy [.] _ha_random64_pair_hashed
0.03% libc-2.42.so [.] __printf_buffer
0.02% [kernel] [k] _raw_spin_lock
0.01% libc-2.42.so [.] __strchrnul_avx2
0.01% [kernel] [k] ktime_get
0.01% [kernel] [k] lapic_next_deadline
0.01% haproxy [.] sample_process
0.01% haproxy [.] chunk_printf
0.01% libc-2.42.so [.] __printf_buffer_write
0.01% [kernel] [k] hrtimer_active
0.01% libc-2.42.so [.] __memmove_avx_unaligned_erms
0.01% libc-2.42.so [.] _itoa_word
After: 125M/s
18.84% libc-2.42.so [.] __printf_buffer
9.84% haproxy [.] sample_process
8.33% libc-2.42.so [.] __strchrnul_avx2
6.61% libc-2.42.so [.] __memmove_avx_unaligned_erms
6.06% libc-2.42.so [.] __printf_buffer_write
4.43% haproxy [.] strlcpy2
4.09% libc-2.42.so [.] _itoa_word
2.62% haproxy [.] sess_build_logline_orig
2.12% haproxy [.] _ha_random64_pair_hashed
1.28% haproxy [.] pool_put_to_cache
1.06% haproxy [.] __pool_alloc
1.00% haproxy [.] smp_fetch_uuid
0.93% haproxy [.] lf_text_len
0.82% haproxy [.] ha_generate_uuid_v4
A lot of places call two ha_random64() in a row to generate a 128-bit
random. While it's now safe against linear analysis thanks to the XXH64
call, it's still particularly expensive due to the lock.
Here we introduce a new function ha_random64_pair_hashed(), that feeds
two uint64_t with a hash of the PRNG's internal state, and makes it
advance. This will cut in half the number of calls to ha_random64()
and should recover a part of the performance lost in the lock. For
now it's not used.