Commit graph

65 commits

Author SHA1 Message Date
Aurelien DARRAGON
c91d93ed1c MINOR: stats-file: introduce shm-stats-file directive
add initial support for the "shm-stats-file" directive and
associated "shm-stats-file-max-objects" directive. For now they are
flagged as experimental directives.

The shared memory file is automatically created by the first process.
The file is created using open() so it is up to the user to provide
relevant path (either on regular filesystem or ramfs for performance
reasons). The directive takes only one argument which is path of the
shared memory file. It is passed as-is to open().

The maximum number of objects per thread-group (hard limit) that can be
stored in the shm is defined by "shm-stats-file-max-objects" directive,

Upon initial creation, the main shm stats file header is provisioned with
the version which must remains the same to be compatible between processes
and defaults to 2k. which means approximately 1mb max per thread group
and should cover most setups. When the limit is reached (during startup)
an error is reported by haproxy which invites the user to increase the
"shm-stats-file-max-objects" if desired, but this means more memory will
be allocated. Actual memory usage is low at start, because only the mmap
(mapping) is provisionned with the maximum number of objects to avoid
relocating the memory area during runtime, but the actual shared memory
file is dynamically resized when objects are added (resized by following
half power of 2 curve when new objects are added, see upcoming commits)

For now only the file is created, further logic will be implemented in
upcoming commits.
2025-09-03 15:59:22 +02:00
Valentine Krasnobaeva
0c63883be1 MINOR: debug: add distro name and version in postmortem
Since 2012, systemd compliant distributions contain
/etc/os-release file. This file has some standardized format, see details at
https://www.freedesktop.org/software/systemd/man/latest/os-release.html.

Let's read it in feed_post_mortem_linux() to gather more info about the
distribution.

(cherry picked from commit f1594c41368baf8f60737b229e4359fa7e1289a9)
Signed-off-by: Willy Tarreau <w@1wt.eu>
2025-07-11 11:48:19 +02:00
Willy Tarreau
e049bd00ab MEDIUM: config: change default limits to 1024 threads and 32 groups
A test run on a dual-socket EPYC 9845 (2x160 cores) showed that we'll
be facing new limits during the lifetime of 3.2 with our current 16
groups and 256 threads max:

  $ cat test.cfg
  global
      cpu-policy perforamnce

  $ ./haproxy -dc -c -f test.cfg
  ...
  Thread CPU Bindings:
    Tgrp/Thr  Tid        CPU set
    1/1-32    1-32       32: 0-15,320-335
    2/1-32    33-64      32: 16-31,336-351
    3/1-32    65-96      32: 32-47,352-367
    4/1-32    97-128     32: 48-63,368-383
    5/1-32    129-160    32: 64-79,384-399
    6/1-32    161-192    32: 80-95,400-415
    7/1-32    193-224    32: 96-111,416-431
    8/1-32    225-256    32: 112-127,432-447

Raising the default limit to 1024 threads and 32 groups is sufficient
to buy us enough margin for a long time (hopefully, please don't laugh,
you, reader from the future):

  $ ./haproxy -dc -c -f test.cfg
  ...
  Thread CPU Bindings:
    Tgrp/Thr  Tid        CPU set
    1/1-32    1-32       32: 0-15,320-335
    2/1-32    33-64      32: 16-31,336-351
    3/1-32    65-96      32: 32-47,352-367
    4/1-32    97-128     32: 48-63,368-383
    5/1-32    129-160    32: 64-79,384-399
    6/1-32    161-192    32: 80-95,400-415
    7/1-32    193-224    32: 96-111,416-431
    8/1-32    225-256    32: 112-127,432-447
    9/1-32    257-288    32: 128-143,448-463
    10/1-32   289-320    32: 144-159,464-479
    11/1-32   321-352    32: 160-175,480-495
    12/1-32   353-384    32: 176-191,496-511
    13/1-32   385-416    32: 192-207,512-527
    14/1-32   417-448    32: 208-223,528-543
    15/1-32   449-480    32: 224-239,544-559
    16/1-32   481-512    32: 240-255,560-575
    17/1-32   513-544    32: 256-271,576-591
    18/1-32   545-576    32: 272-287,592-607
    19/1-32   577-608    32: 288-303,608-623
    20/1-32   609-640    32: 304-319,624-639

We can change this default now because it has no functional effect
without any configured cpu-policy, so this will only be an opt-in
and it's better to do it now than to have an effect during the
maintenance phase. A tiny effect is a doubling of the number of
pool buckets and stick-table shards internally, which means that
aside slightly reducing contention in these areas, a dump of tables
can enumerate keys in a different order (hence the adjustment in the
vtc).

The only really visible effect is a slightly higher static memory
consumption (29->35 MB on a small config), but that difference
remains even with 50k servers so that's pretty much acceptable.

Thanks to Erwan Velu for the quick tests and the insights!
2025-05-13 18:15:33 +02:00
Willy Tarreau
8a96216847 MEDIUM: sock-inet: re-check IPv6 connectivity every 30s
IPv6 connectivity might start off (e.g. network not fully up when
haproxy starts), so for features like resolvers, it would be nice to
periodically recheck.

With this change, instead of having the resolvers code rely on a variable
indicating connectivity, it will now call a function that will check for
how long a connectivity check hasn't been run, and will perform a new one
if needed. The age was set to 30s which seems reasonable considering that
the DNS will cache results anyway. There's no saving in spacing it more
since the syscall is very check (just a connect() without any packet being
emitted).

The variables remain exported so that we could present them in show info
or anywhere else.

This way, "dns-accept-family auto" will now stay up to date. Warning
though, it does perform some caching so even with a refreshed IPv6
connectivity, an older record may be returned anyway.
2025-05-09 15:45:44 +02:00
Olivier Houchard
388539faa3 MEDIUM: stick-tables: defer adding updates to a tasklet
There is a lot of contention trying to add updates to the tree. So
instead of trying to add the updates to the tree right away, just add
them to a mt-list (with one mt-list per thread group, so that the
mt-list does not become the new point of contention that much), and
create a tasklet dedicated to adding updates to the tree, in batchs, to
avoid keeping the update lock for too long.
This helps getting stick tables perform better under heavy load.
2025-05-02 15:27:55 +02:00
Amaury Denoyelle
0f9b3daf98 MEDIUM: quic: limit global Tx memory
Define a new settings tune.quic.frontend.max-tot-window. It contains a
size argument which can be used to set a limit on the sum of all QUIC
connections congestion window. This is applied both on
quic_cc_path_set() and quic_cc_path_inc().

Note that this limitation cannot reduce a congestion window more than
the minimal limit which is set to 2 datagrams.
2025-04-29 15:19:32 +02:00
Willy Tarreau
12c7189bc8 MEDIUM: thread: set DEBUG_THREAD to 1 by default
Setting DEBUG_THREAD to 1 allows recording the lock history for each
thread. Tests have shown that (as predicted) the cost of updating a
single thread-local variable is not perceptible in the noise, especially
when compared to the cost of obtaining a lock. Since this can provide
useful value when debugging deadlocks, let's enable it by default when
threads are enabled.
2025-04-28 16:50:34 +02:00
Willy Tarreau
903a6b14ef MINOR: threads: prepare DEBUG_THREAD to receive more values
We now default the value to zero and make sure all tests properly take
care of values above zero. This is in preparation for supporting several
degrees of debugging.
2025-04-28 16:50:34 +02:00
Willy Tarreau
61d633a3ac DEBUG: rename DEBUG_GLITCHES to DEBUG_COUNTERS and enable it by default
Till now the per-line glitches counters were only enabled with the
confusingly named DEBUG_GLITCHES (which would not turn glitches off
when disabled). Let's instead change it to DEBUG_COUNTERS and make sure
it's enabled by default (though it can still be disabled with
-DDEBUG_GLITCHES=0 just like for DEBUG_STRICT). It will later be
expanded to cover more counters.
2025-04-14 19:02:13 +02:00
Olivier Houchard
9fe72bba3c MAJOR: leastconn; Revamp the way servers are ordered.
For leastconn, servers used to just be stored in an ebtree.
Each server would be one node.
Change that so that nodes contain multiple mt_lists. Each list
will contain servers that share the same key (typically meaning
they have the same number of connections). Using mt_lists means
that as long as tree elements already exist, moving a server from
one tree element to another does no longer require the lbprm write
lock.
We use multiple mt_lists to reduce the contention when moving
a server from one tree element to another. A list in the new
element will be chosen randomly.
We no longer remove a tree element as soon as they no longer
contain any server. Instead, we keep a list of all elements,
and when we need a new element, we look at that list only if it
contains a number of elements already, otherwise we'll allocate
a new one. Keeping nodes in the tree ensures that we very
rarely have to take the lbrpm write lock (as it only happens
when we're moving the server to a position for which no
element is currently in the tree).

The number of mt_lists used is defined as FWLC_NB_LISTS.
The number of tree elements we want to keep is defined as
FWLC_MIN_FREE_ENTRIES, both in defaults.h.
The value used were picked afrer experimentation, and
seems to be the best choice of performances vs memory
usage.

Doing that gives a good boost in performances when a lot of
servers are used.
With a configuration using 500 servers, before that patch,
about 830000 requests per second could be processed, with
that patch, about 1550000 requests per second are
processed, on an 64-cores AMD, using 1200 concurrent connections.
2025-04-01 18:05:30 +02:00
Aurelien DARRAGON
0846638f7f MEDIUM: stream: interrupt costly rulesets after too many evaluations
It is not rare to see configurations with a large number of "tcp-request
content" or "http-request" rules for instance. A large number of rules
combined with cpu-demanding actions (e.g.: actions that work on content)
may create thread contention as all the rules from a given ruleset are
evaluated under the same polling loop if the evaluation is not interrupted

Thus, in this patch we add extra logic around "tcp-request content",
"tcp-response content", "http-request" and "http-response" rulesets, so
that when a certain number of rules are evaluated under the single polling
loop, we force the evaluating function to yield. As such, the rule which
was about to be evaluated is saved, and the function starts evaluating
rules from the save pointer when it returns (in the next polling loop).

We use task_wakeup(task, TASK_WOKEN_MSG) to explicitly wake the task so
that no time is wasted and the processing is resumed ASAP. TASK_WOKEN_MSG
is mandatory here because process_stream() expects TASK_WOKEN_MSG for
explicit analyzers re-evaluation.

rules_bcount stream's attribute was added to count how manu rules were
evaluated since last interruption (yield). Also, SF_RULE_FYIELD flag
was added to know that the s->current_rule was assigned due to forced
yield and not regular yield.

By default haproxy will enforce a yield every 50 rules, this behavior
can be configured using the "tune.max-rules-at-once" global keyword.

There is a limitation though: for now, if the ACT_OPT_FINAL flag is set
on act_opts, we consider it is not safe to yield (as it is already the
case for automatic yield). In this case instead of yielding an taking
the risk of not being called back, we skip the yield and hope it will
not create contention. This is something we should ideally try to
improve in order to yield in all conditions.
2025-02-03 17:09:48 +01:00
Olivier Houchard
26b3e5236f MEDIUM: servers/proxies: Switch to using per-tgroup queues.
For both servers and proxies, use one connection queue per thread-group,
instead of only one. Having only one can lead to severe performance
issues on NUMA machines, it is actually trivial to get the watchdog to
trigger on an AMD machine, having a server with a maxconn of 96, and an
injector that uses 160 concurrent connections.
We now have one queue per thread-group, however when dequeueing, we're
dequeuing MAX_SELF_USE_QUEUE (currently 9) pendconns from our own queue,
before dequeueing one from another thread group, if available, to make
sure everybody is still running.
2025-01-28 12:49:41 +01:00
Valentine Krasnobaeva
fb7bef781d MINOR: defaults: update MASTER_MAXCONN description
This is a one of the commits to prepare the removal of MODE_MWORKER_WAIT
support, as it became redundant with MODE_MWORKER due to moving master-worker
fork in init().
2024-10-16 22:02:39 +02:00
Amaury Denoyelle
d0d8e57d47 MINOR: quic: define sbuf pool
Define a new buffer pool reserved to allocate smaller memory area. For
the moment, its usage will be restricted to QUIC, as such it is declared
in quic_stream module.

Add a new config option "tune.bufsize.small" to specify the size of the
allocated objects. A special check ensures that it is not greater than
the default bufsize to avoid unexpected effects.
2024-08-20 18:12:27 +02:00
Valentine Krasnobaeva
8b1dfa9def MINOR: cfgparse: limit file size loaded via /dev/stdin
load_cfg_in_mem() can continuously reallocate memory in order to load an
extremely large input from /dev/stdin, until it fails with ENOMEM, which means
that process has consumed all available RAM. In case of containers and
virtualized environments it's not very good.

So, in order to prevent this, let's introduce MAX_CFG_SIZE as 10MB, which will
limit the size of input supplied via /dev/stdin.
2024-08-20 14:28:34 +02:00
Valentine Krasnobaeva
41275a6918 MEDIUM: init: set default for fd_hard_limit via DEFAULT_MAXFD
Let's provide a default value for fd_hard_limit, if it's not set in the
configuration. With this patch we could set some specific default via
compile-time variable DEFAULT_MAXFD as well. Hope, this will be helpfull for
haproxy package maintainers.

    make -j 8 TARGET=linux-glibc DEBUG=-DDEFAULT_MAXFD=50000

If haproxy is comipled without DEFAULT_MAXFD defined, the default will be set
to 1048576.

This is done to avoid killing the process by its watchdog, while it started
without any limitations in its configuration or in the command line and the
hard RLIMIT_NOFILE is extremely huge (~1000000000). We use in this case
compute_ideal_maxconn() to calculate maxconn and maxsock, maxsock defines the
size of internal fdtab, which becames very-very large as well. When
the process starts to simply loop over this fdtab (0(n)), this takes a lot of
time, so watchdog does it job.

To avoid this, maxconn now is always reduced to some reasonable value either
by explicit global.fd-hard-limit from configuration, or by its default. The
default may be changed at build-time and overwritten then by
global.fd-hard-limit at runtime. Explicit global.fd-hard-limit from the
configuration has always precedence over DEFAULT_MAXFD, if set.

Must be backported in all stable versions until v2.6.0, including v2.6.0.
2024-07-04 07:52:42 +02:00
Willy Tarreau
290659ffd3 MINOR: activity: make the memory profiling hash size configurable at build time
The MEMPROF_HASH_BITS variable was set to 10 without a possibility to
change it (beyond patching the code). After seeing a few reports already
with "other" being listed and a list with close to 1024 entries, it looks
like it's about time to either increase the hash size, or at least make
it configurable for special cases. As a reminder, in order to remain
fast, the algorithm searches no more than 16 places after the hash, so
when a table is almost full, searches are long and new places are rare.

The present patch just makes it possible to redefine it by passing
"-DMEMPROF_HASH_BITS=11" or "-DMEMPROF_HASH_BITS=12" in CFLAGS, and
moves the definition to defaults.h to make it easier to find. Such
values should be way sufficient for the vast majority of use cases.
Maybe in the future we'd change the default. At least this version
should be backported to ease rebuilds, say, till 2.8 or so.
2024-06-27 18:01:27 +02:00
Willy Tarreau
0ce51dc93b MEDIUM: dynbuf: implement emergency buffers
The buffer reserve set by tune.buffers.reserve has long been unused, and
in order to deal gracefully with failed memory allocations we'll need to
resort to a few emergency buffers that are pre-allocated per thread.

These buffers are only for emergency use, so every time their count is
below the configured number a b_free() will refill them. For this reason
their count can remain pretty low. We changed the default number from 2
to 4 per thread, and the minimum value is now zero (e.g. for low-memory
systems). The tune.buffers.limit setting has always been a problem when
trying to deal with the reserve but now we could simplify it by simply
pushing the limit (if set) to match the reserve. That was already done in
the past with a static value, but now with threads it was a bit trickier,
which is why the per-thread allocators increment the limit on the fly
before allocating their own buffers. This also means that the configured
limit is saner and now corresponds to the regular buffers that can be
allocated on top of emergency buffers.

At the moment these emergency buffers are not used upon allocation
failure. The only reason is to ease bisecting later if needed, since
this commit only has to deal with resource management.
2024-05-10 17:18:13 +02:00
Willy Tarreau
772f9a5874 BUILD: pools: make DEBUG_MEMORY_POOLS=1 the default option
This option has been set by default for a very long time and also
complicates the manipulation of the DEBUG variable. Let's make it
the official default and permit to unset it by setting it to zero.
The other pool-related DEBUG options were adjusted to also explicitly
check for the zero value for consistency.
2024-04-11 17:25:45 +02:00
Willy Tarreau
b70981532a BUILD: debug: make DEBUG_STRICT=1 the default
We continue to carry it in the makefile, which adds to the difficulty
of passing new options. Let's make DEBUG_STRICT=1 the default so that
one has to explicitly pass DEBUG_STRICT=0 to disable it. This allows us
to remove the option from the default DEBUG variable in the makefile.
2024-04-11 17:25:45 +02:00
Willy Tarreau
1a088da7c2 MAJOR: stktable: split the keys across multiple shards to reduce contention
In order to reduce the contention on the table when keys expire quickly,
we're spreading the load over multiple trees. That counts for keys and
expiration dates. The shard number is calculated from the key value
itself, both when looking up and when setting it.

The "show table" dump on the CLI iterates over all shards so that the
output is not fully sorted, it's only sorted within each shard. The Lua
table dump just does the same. It was verified with a Lua program to
count stick-table entries that it works as intended (the test case is
reproduced here as it's clearly not easy to automate as a vtc):

  function dump_stk()
    local dmp = core.proxies['tbl'].stktable:dump({});
    local count = 0
    for _, __ in pairs(dmp) do
        count = count + 1
    end
    core.Info('Total entries: ' .. count)
  end

  core.register_action("dump_stk", {'tcp-req', 'http-req'}, dump_stk, 0);

  ##
  global
    tune.lua.log.stderr on
    lua-load-per-thread lua-cnttbl.lua

  listen front
    bind :8001
    http-request lua.dump_stk if { path_beg /stk }
    http-request track-sc1 rand(),upper,hex table tbl
    http-request redirect location /

  backend tbl
    stick-table size 100k type string len 12 store http_req_cnt

  ##
  $ h2load -c 16 -n 10000 0:8001/
  $ curl 0:8001/stk

  ## A count close to 100k appears on haproxy's stderr
  ## On the CLI, "show table tbl" | wc will show the same.

Some large parts were reindented only to add a top-level loop to iterate
over shards (e.g. process_table_expire()). Better check the diff using
git show -b.

The number of shards is decided just like for the pools, at build time
based on the max number of threads, so that we can keep a constant. Maybe
this should be done differently. For now CONFIG_HAP_TBL_BUCKETS is used,
and defaults to CONFIG_HAP_POOL_BUCKETS to keep the benefits of all the
measurements made for the pools. It turns out that this value seems to
be the most reasonable one without inflating the struct stktable too
much. By default for 1024 threads the value is 32 and delivers 980k RPS
in a test involving 80 threads, while adding 1kB to the struct stktable
(roughly doubling it). The same test at 64 gives 1008 kRPS and at 128
it gives 1040 kRPS for 8 times the initial size. 16 would be too low
however, with 675k RPS.

The stksess already have a shard number, it's the one used to decide which
peer connection to send the entry. Maybe we should also store the one
associated with the entry itself instead of recalculating it, though it
does not happen that often. The operation is done by hashing the key using
XXH32().

The peers also take and release the table's lock but the way it's used
it not very clear yet, so at this point it's sure this will not work.

At this point, this allowed to completely unlock the performance on a
80-thread setup:

 before: 5.4 Gbps, 150k RPS, 80 cores
  52.71%  haproxy    [.] stktable_lookup_key
  36.90%  haproxy    [.] stktable_get_entry.part.0
   0.86%  haproxy    [.] ebmb_lookup
   0.18%  haproxy    [.] process_stream
   0.12%  haproxy    [.] process_table_expire
   0.11%  haproxy    [.] fwrr_get_next_server
   0.10%  haproxy    [.] eb32_insert
   0.10%  haproxy    [.] run_tasks_from_lists

 after: 36 Gbps, 980k RPS, 80 cores
  44.92%  haproxy    [.] stktable_get_entry
   5.47%  haproxy    [.] ebmb_lookup
   2.50%  haproxy    [.] fwrr_get_next_server
   0.97%  haproxy    [.] eb32_insert
   0.92%  haproxy    [.] process_stream
   0.52%  haproxy    [.] run_tasks_from_lists
   0.45%  haproxy    [.] conn_backend_get
   0.44%  haproxy    [.] __pool_alloc
   0.35%  haproxy    [.] process_table_expire
   0.35%  haproxy    [.] connect_server
   0.35%  haproxy    [.] h1_headers_to_hdr_list
   0.34%  haproxy    [.] eb_delete
   0.31%  haproxy    [.] srv_add_to_idle_list
   0.30%  haproxy    [.] h1_snd_buf

WIP: uint64_t -> long

WIP: ulong -> uint

code is much smaller
2024-04-03 17:34:47 +02:00
Willy Tarreau
6c1b29d06f MINOR: ring: make the number of queues configurable
Now the rings have one wait queue per group. This should limit the
contention on systems such as EPYC CPUs where the performance drops
dramatically when using more than one CCX.

Tests were run with different numbers and it was showed that value
6 outperforms all other ones at 12, 24, 48, 64 and 80 threads on an
EPYC, a Xeon and an Ampere CPU. Value 7 sometimes comes close and
anything around these values degrades quickly. The value has been
left tunable in the global section.

This commit only introduces everything needed to set up the queue count
so that it's easier to adjust it in the forthcoming patches, but it was
initially added after the series, making it harder to compare.

It was also shown that trying to group the threads in queues by their
thread groups is counter-productive and that it was more efficient to
do that by applying a modulo on the thread number. As surprising as it
seems, it does have the benefit of well balancing any number of threads.
2024-03-25 17:34:19 +00:00
Willy Tarreau
0a0041d195 BUILD: tree-wide: fix a few missing includes in a few files
Some include files, mostly types definitions, are missing a few includes
to define the types they're using, causing include ordering dependencies
between files, which are most often not seen due to the alphabetical
order of includes. Let's just fix them.

These were spotted by building pre-compiled headers for all these files
to .h.gch.
2024-03-05 11:50:34 +01:00
Aurelien DARRAGON
cb3ec978fd MINOR: event_hdl: add global tunables
The local variable "event_hdl_async_max_notif_at_once" which was
introduced with the event_hdl API was left as is but with a TODO note
telling that we should make it a global tunable.

Well, we're doing this now. To prepare for upcoming tunables related to
event_hdl API, we add a dedicated struct named event_hdl_tune which is
globally exposed through the event_hdl header file so that it may be used
from everywhere. The struct is automatically initialized in
event_hdl_init() according to defaults.h.

"event_hdl_async_max_notif_at_once" now becomes
"event_hdl_tune.max_events_at_once" with it's dedicated
configuation keyword: "tune.events.max-events-at-once".

We're also taking this opportunity to raise the default value from 10
to 100 since it's seems quite reasonnable given existing async event_hdl
users.

The documentation was updated accordingly.
2023-11-29 08:59:27 +01:00
Remi Tricot-Le Breton
48f81ec09d MAJOR: cache: Delay cache entry delete in reserve_hot function
A reference counter on the cache_entry was added in a previous commit.
Its value is atomically increased and decreased via the retain_entry and
release_entry functions.

This is needed because of the latest cache and shared_context
modifications that introduced two separate locks instead of the
preexisting single shctx_lock one.
With the new logic, we have two main blocks competing for the two locks:
- the one in the http_action_req_cache_use that performs a lookup in the
  cache tree (locked by the cache lock) and then tries to remove the
  corresponding blocks from the shared_context's 'avail' list until the
  response is sent to the client by the cache applet,
- the shctx_row_reserve_hot that traverses the 'avail' list and gives
  them back to the caller, while removing previous row heads from the
  cache tree
Those two blocks require the two locks but one of them would take the
cache lock first, and the other one the shctx_lock first, which would
end in a deadlock without the current patch.

The way this conflict is resolved in this patch is by ensuring that at
least one of those uses works without taking the two locks at the same
time.
The solution found was to keep taking the two locks in the cache_use
case. We first lock the cache to lookup for an entry and we then take
the shctx lock as well to detach the corresponding blocks from the
'avail' list. The subtlety is that between the cache lookup and the
actual locking of the shctx, another thread might have called the
reserve_hot function in which we only take the shctx lock.
In this function we traverse the 'avail' list to remove blocks that are
then given to the caller. If one of those blocks corresponds to a
previous row head, we call the 'free_blocks' callback that used to
delete the cache entry from the tree.
We now avoid deleting directly the cache entries in reserve_hot and we
rather set the cache entries 'complete' param to 0 so that no other
thread tries to work with this entry. This way, when we release the
shctx lock in reserve_hot, the first thread that had performed the cache
lookup and had found an entry that we just gave to another thread will
see that the 'complete' field is 0 and it won't try to work with this
response.

The actual removal of entries from the cache tree will now be performed
in the new 'reserve_finish' callback called at the end of the
shctx_row_reserve_hot function. It will iterate on all the row head that
were inserted in a dedicated list in the 'free_block' callback and
perform the actual delete.
2023-11-16 19:35:10 +01:00
Willy Tarreau
06885aaea7 MINOR: pools: introduce the use of multiple buckets
On many threads and without the shared cache, there can be extreme
contention on the ->allocated counter, the ->free_list pointer, and
the ->used counter. It's possible to limit this contention by spreading
the counters a little bit over multiple entries, that are summed up when
a consultation is needed. The criterion used to spread the values cannot
be related to the thread ID due to migrations, since we need to keep
consistent stats (allocated vs used).

Instead we'll just hash the pointer, it provides an index that does the
job and that is consistent for the object. When having just a few entries
(16 here as it showed almost identical performance between global and
non-global pools) even iterations should be short enough during
measurements to not be a problem.

A pair of functions designed to ease pointer hash bucket calculation were
added, with one of them doing it for thread IDs because allocation failures
will be associated with a thread and not a pointer.

For now this patch only brings in the relevant parts of the infrastructure,
the CONFIG_HAP_POOL_BUCKETS_BITS macro that defaults to 6 bits when 512
threads or more are supported, 5 bits when 128 or more are supported, 4
bits when 16 or more are supported, otherwise 3 bits for small setups.
The array in the pool_head and the two utility functions are already
added. It should have no measurable impact beyond inflating the pool_head
structure.
2023-08-12 19:04:34 +02:00
Willy Tarreau
59c347c15e BUILD: defaults: use __WORDSIZE not LONGBITS for MAX_THREADS_PER_GROUP
LONGBITS was defined long ago with old compilers that didn't provide the
word size. It's still present as being referenced in various places in the
code, but we must not use it to define other macros that may be evaluated
at pre-processing time since it contains sizeof() and casts that are not
compatible with preprocessor conditions. Let's switch MAX_THREADS_PER_GROUP
to __WORDSIZE so that we can condition blocks of code on it if needed.

LONGBITS should really be removed by now, given that we don't support
compilers not providing __WORDSIZE anymore (gcc < 4.2).
2023-08-12 19:04:34 +02:00
Patrick Hemmer
7fccccccea MINOR: acl: add acl() sample fetch
This provides a sample fetch which returns the evaluation result
of the conjunction of named ACLs.
2023-08-01 10:49:06 +02:00
William Lallemand
2078d4b1f7 BUG/MINOR: mworker: use MASTER_MAXCONN as default maxconn value
In environments where SYSTEM_MAXCONN is defined when compiling, the
master will use this value instead of the original minimal value which
was set to 100. When this happens, the master process could allocate
RAM excessively since it does not need to have an high maxconn. (For
example if SYSTEM_MAXCONN was set to 100000 or more)

This patch fixes the issue by using the new define MASTER_MAXCONN which
define a default maxconn of 100 for the master process.

Must be backported as far as 2.5.
2023-03-09 14:28:44 +01:00
Willy Tarreau
28360dc53f MEDIUM: clock: force internal time to wrap early after boot
GH issue #2034 clearly indicates yet another case of time roll-over
that went badly. Issues that happen only once every 50 days are hard
to detect and debug, and are usually reported more or less synchronized
from multiple sources. This patch finally does what had long been planned
but never done yet, which is to force the time to wrap early after boot
so that any such remaining issue can be spotted quicker. The margin delay
here is 20s (it may be changed by setting BOOT_TIME_WRAP_SEC to another
value). This value seems sufficient to permit failed health checks to
succeed and traffic to come in and possibly start to update some time
stamps (accept dates in logs, freq counters, stick-tables expiration
dates etc).

It could theoretically be helpful to have this in 2.7, but as can be
seen with the two patches below, we've already had incorrect use cases
of the internal monotonic time when the wall-clock one was needed, so
we could expect to detect other ones in the future. Note that this will
*not* induce bugs, it will only make them happen much faster (i.e. no
need to wait for 50 days before seeing them). If it were to eventually
be backported, these two previous patches must also be backported:

    BUG/MINOR: clock: use distinct wall-clock and monotonic start dates
    BUG/MEDIUM: cache: use the correct time reference when comparing dates
2023-02-08 11:10:33 +01:00
Willy Tarreau
b8b243ac6a MINOR: trace: add the long awaited TRACE_PRINTF()
TRACE_PRINTF() can be used to produce arbitrary trace contents at any
trace level. It uses the exact same arguments as other TRACE_* macros,
but here they are mandatory since they are followed by the format-string,
though they may be filled with zeroes. The reason for the arguments is to
match tracking or filtering and not pollute other non-inspected objects.

It will probably be used inside loops, in which case there are two points
to be careful about:
  - output atomicity is only per-message, so competing threads may see
    their messages interleaved. As such, it is recommended that the
    caller places a recognizable unique context at the beginning of the
    message such as a connection pointer.
  - iterating over arrays or lists for all requests could be very
    expensive. In order to avoid this it is best to condition the call
    via TRACE_ENABLED() with the same arguments, which will return the
    same decision.
  - messages longer than TRACE_MAX_MSG-1 (1023 by default) will be
    truncated.

For example, in order to dump the list of HTTP headers between hpack
and h2:

  if (outlen > 0 &&
      TRACE_ENABLED(TRACE_LEVEL_DEVELOPER,
      H2_EV_RX_FRAME|H2_EV_RX_HDR, h2c->conn, 0, 0, 0)) {
          int i;
          for (i = 0; list[i].n.len; i++)
              TRACE_PRINTF(TRACE_LEVEL_DEVELOPER, H2_EV_RX_FRAME|H2_EV_RX_HDR,
                           h2c->conn, 0, 0, 0, "h2c=%p hdr[%d]=%s:%s", h2c, i,
                           list[i].n.ptr, list[i].v.ptr);
  }

In addition, a lower-level TRACE_PRINTF_LOC() macro is provided, that takes
two extra arguments, the caller's location and the caller's function name.
This will allow to emit composite traces from central functions on the
behalf of another one.
2023-01-26 15:49:43 +01:00
Willy Tarreau
4da51bd190 CLEANUP: pools: get rid of CONFIG_HAP_POOLS
This one was set in defaults.h only when neither DEBUG_NO_POOLS nor
DEBUG_UAF were set. This was not the most convenient location to look
for it, and it was only used in pool.c to decide on the initial value
of POOL_DBG_NO_CACHE.

Let's just use DEBUG_NO_POOLS || DEBUG_UAF directly on this flag and
get rid of the intermediary condition. This also has the benefit of
removing a double inversion, which is always nice for understanding.
2022-12-08 17:45:08 +01:00
William Lallemand
eba6a54cd4 MINOR: logs: startup-logs can use a shm for logging the reload
When compiled with USE_SHM_OPEN=1 the startup-logs are now able to use
an shm which is used to keep the logs when switching to mworker wait
mode. This allows to keep the failed reload logs.

When allocating the startup-logs at first start of the process, haproxy
will do a shm_open with a unique path using the PID of the process, the
file is unlink immediatly so we don't let unwelcomed files be. The fd
resulting from this shm is stored in the HAPROXY_STARTUPLOGS_FD
environment variable so it can be mmap again when switching to wait
mode.

When forking children, the process is copying the mmap to a a mallocated
ring so we never share the same memory section between the master and
the workers. When switching to wait mode, the shm is not used anymore as
it is also copied to a mallocated structure.

This allow to use the "show startup-logs" command over the master CLI,
to get the logs of the latest startup or reload. This way the logs of
the latest failed reload are also kept.

This is only activated on the linux-glibc target for now.
2022-10-13 16:50:22 +02:00
Willy Tarreau
693688e734 MINOR: config: automatically preset MAX_THREADS based on MAX_TGROUPS
MAX_THREADS was not changed when setting MAX_TGROUPS, which still limits
some possibilities. Let's preset it to 4 * LONGBITS when MAX_TGROUPS is
larger than 1, or LONGBITS when it's set to 1. This means that the new
default value is 256 threads.

The rationale behind this is that the main use of thread groups is
mostly to address NUMA issues and that we don't necessarily need large
thread counts when using many groups, and 256 threads is already plenty
even on quite large systems.

For now it's important not to go too far because some internal structs
are arrays of MAX_THREADS entries, for example accept_queue_ring, which
is around 8kB per thread. Such structures will need to become dynamic
before defaulting to large thread counts (at 4096 threads max the
accept queues would require 32 MB RAM alone).
2022-08-06 16:51:20 +02:00
Willy Tarreau
856d56d2d2 MINOR: config: change default MAX_TGROUPS to 16
This will allows nbtgroups > 1 to be declared in the config without
recompiling. The theoretical limit is 64, though we'd rather not push
it too far for now as some structures might be enlarged to be indexed
per group. Let's start with 16 groups max, allowing to experiment with
dual-socket machines suffering from up to 8 loosely coupled L3 caches.
It's a good start and doesn't engage us too far.
2022-07-15 21:51:48 +02:00
Willy Tarreau
3ccb14d60d MINOR: thread: get rid of MAX_THREADS_MASK
This macro was used both for binding and for lookups. When binding tasks
or FDs, using all_threads_mask instead is better as it will later be per
group. For lookups, ~0UL always does the job. Thus in practice the macro
was already almost not used anymore since the rest of the code could run
fine with a constant of all ones there.
2022-06-14 11:18:40 +02:00
Willy Tarreau
197715ae21 CLEANUP: compression: move the default setting of maxzlibmem to defaults
__comp_fetch_init() only presets the maxzlibmem, and only when both
USE_ZLIB and DEFAULT_MAXZLIBMEM are set. The intent is to preset a
default value to protect the system against excessive memory usage
when no setting is set by the user.

Nowadays the entry in the global struct is always there so there's no
point anymore in passing via a constructor to possibly set this value.
Let's go the cleaner way by always presetting DEFAULT_MAXZLIBMEM to 0
in defaults.h unless these conditions are met, and always assigning it
instead of pre-setting the entry to zero. This is more straightforward
and removes some ifdefs and the last constructor. In addition, now the
setting has a chance of being found.
2022-04-25 19:42:43 +02:00
Remi Tricot-Le Breton
1d6338ea96 MEDIUM: ssl: Disable DHE ciphers by default
DHE ciphers do not present a security risk if the key is big enough but
they are slow and mostly obsoleted by ECDHE. This patch removes any
default DH parameters. This will effectively disable all DHE ciphers
unless a global ssl-dh-param-file is defined, or
tune.ssl.default-dh-param is set, or a frontend has DH parameters
included in its PEM certificate. In this latter case, only the frontends
that have DH parameters will have DHE ciphers enabled.
Adding explicitely a DHE ciphers in a "bind" line will not be enough to
actually enable DHE. We would still need to know which DH parameters to
use so one of the three conditions described above must be met.

This request was described in GitHub issue #1604.
2022-04-20 17:30:55 +02:00
Willy Tarreau
b61fccdc3f CLEANUP: init: remove the ifdef on HAPROXY_MEMMAX
It's ugly, let's move it to defaults.h with all other ones and preset
it to zero if not defined.
2022-02-23 17:11:33 +01:00
Remi Tricot-Le Breton
55d7e782ee MINOR: ssl: Set default dh size to 2048
Starting from OpenSSLv3, we won't rely on the
SSL_CTX_set_tmp_dh_callback mechanism so we will need to know the DH
size we want to use during init. In order for the default DH param size
to be used when no RSA or DSA private key can be found for a given bind
line, we will need to know the default size we want to use (which was
not possible the way the code was built, since the global default dh
size was set too late.
2022-02-14 10:07:14 +01:00
Willy Tarreau
39fd546d4b MINOR: pools: enable pools with DEBUG_FAIL_ALLOC as well
During 2.4-dev, fault injection was enabled for cached pools with commit
207c09509 ("MINOR: pools: move the fault injector to __pool_alloc()"),
except that the condition for CONFIG_HAP_POOLS still depended on
DEBUG_FAIL_ALLOC not being set, which limits the usability to cases
where the define is set by hand. Let's remove it from the equation as
this is not a constraint anymore. While a bit old, there's no need to
backport this as it's only used during development.
2022-01-12 17:31:01 +01:00
Willy Tarreau
f5e94b2f47 OPTIM: pools: reduce local pool cache size to 512kB
Now that we support batched allocations/releases, it appears that we can
reach the same performance on H2 with shared pools and 256kB thread-local
cache as without shared pools, a fast allocator and 1MB thread-local cache.
With 512kB we're up to 10% faster on highly multiplexed H2 than without the
shared cache. This was tested on a 16-core ARM machine. Thus it's time to
slightly reduce the per-thread memory cost, which may also improve the
performance on machines with smaller L2 caches. It essentially reverts
commit f587003fe ("MINOR: pools: double the local pool cache size to 1 MB").
2022-01-02 19:52:15 +01:00
Willy Tarreau
43937e920f MEDIUM: pools: start to batch eviction from local caches
Since previous patch we can forcefully evict multiple objects from the
local cache, even when evicting basd on the LRU entries. Let's define
a compile-time configurable setting to batch releasing of objects. For
now we set this value to 8 items per round.

This is marked medium because eviction from the LRU will slightly change
in order to group the last items that are freed within a single cache
instead of accurately scanning only the oldest ones exactly in their
order of appearance. But this is required in order to evolve towards
batched removals.
2022-01-02 19:35:26 +01:00
Willy Tarreau
f9662848f2 MINOR: threads: introduce a minimalistic notion of thread-group
This creates a struct tgroup_info which knows the thread ID of the first
thread in a group, and the number of threads in it. For now there's only
one thread group supported in the configuration, but it may be forced to
other values for development purposes by defining MAX_TGROUPS, and it's
enabled even when threads are disabled and will need to remain accessible
during boot to keep a simple enough internal API.

For the purpose of easing the configurations which do not specify a thread
group, we're starting group numbering at 1 so that thread group 0 can be
"undefined" (i.e. for "bind" lines or when binding tasks).

The goal will be to later move there some global items that must be
made per-group.
2021-10-08 17:22:26 +02:00
Willy Tarreau
561958c17c CLEANUP: time: move a few configurable defines to defaults.h
TV_ETERNITY, TV_ETERNITY_MS and MAX_DELAY_MS may be configured and
ought to be in defaults.h so that they can be inherited from everywhere
without including time.h and could also be redefined if neede
(particularly for MAX_DELAY_MS).
2021-10-07 01:41:14 +02:00
Willy Tarreau
8de6dc9926 REORG: pools: move default settings to defaults.h
There's no reason CONFIG_HAP_POOLS and its opposite are located into
pools-t.h, it forces those that depend on them to inlcude the file.
Other similar options are normally dealt with in defaults.h, which is
part of the default API, so let's do that.
2021-09-28 19:31:16 +02:00
Tim Duesterhus
d5fc8fcb86 CLEANUP: Add haproxy/xxhash.h to avoid modifying import/xxhash.h
This solves setting XXH_INLINE_ALL in a cleaner way, because the imported
header is not modified, easing future updates.

see 6f7cc11e6d
2021-09-11 19:58:45 +02:00
Willy Tarreau
33056436c7 BUILD/MINOR: defaults: eliminate warning on MAXHOSTNAMELEN with -Wundef
As reported in GH issue #1369, there is a single case of #if with a
possibly undefined value in defaults.h which is on MAXHOSTNAMELEN. Let's
turn it to a #ifdef.
2021-08-28 12:05:32 +02:00
Willy Tarreau
dc70c18ddc BUG/MEDIUM: cfgcond: limit recursion level in the condition expression parser
Oss-fuzz reports in issue 36328 that we can recurse too far by passing
extremely deep expressions to the ".if" parser. I thought we were still
limited to the 1024 chars per line, that would be highly sufficient, but
we don't have any limit now :-/

Let's just pass a maximum recursion counter to the recursive parsers.
It's decremented for each call and the expression fails if it reaches
zero. On the most complex paths it can add 3 levels per parenthesis,
so with a limit of 1024, that's roughly 343 nested sub-expressions that
are supported in the worst case. That's more than sufficient, for just
a few kB of RAM.

No backport is needed.
2021-07-20 18:03:08 +02:00
Willy Tarreau
06987f4238 CLEANUP: global: remove unused definition of MAX_PROCS
This one was forced to 1 and the only reference was a test to verify it
was comprised between 1 and LONGBITS.
2021-06-15 16:52:42 +02:00