haproxy

mirror of https://github.com/haproxy/haproxy.git synced 2026-03-15 07:02:28 -04:00

Author	SHA1	Message	Date
Aurelien DARRAGON	c91d93ed1c	MINOR: stats-file: introduce shm-stats-file directive Add initial support for the "shm-stats-file" directive and the associated "shm-stats-file-max-objects" directive. For now they are flagged as experimental directives. The shared memory file is automatically created by the first process. The file is created using open(), so it is up to the user to provide a relevant path (either on a regular filesystem or on ramfs for performance reasons). The directive takes only one argument, which is the path of the shared memory file. It is passed as-is to open(). The maximum number of objects per thread-group (hard limit) that can be stored in the shm is defined by the "shm-stats-file-max-objects" directive and defaults to 2k, which means approximately 1 MB max per thread group and should cover most setups. Upon initial creation, the main shm stats file header is provisioned with the version, which must remain the same to be compatible between processes. When the limit is reached (during startup), an error is reported by haproxy which invites the user to increase "shm-stats-file-max-objects" if desired, but this means more memory will be allocated. Actual memory usage is low at start, because only the mmap (mapping) is provisioned with the maximum number of objects to avoid relocating the memory area during runtime, but the actual shared memory file is dynamically resized when objects are added (resized by following a half power of 2 curve when new objects are added, see upcoming commits). For now only the file is created; further logic will be implemented in upcoming commits.	2025-09-03 15:59:22 +02:00
Valentine Krasnobaeva	0c63883be1	MINOR: debug: add distro name and version in postmortem Since 2012, systemd compliant distributions contain /etc/os-release file. This file has some standardized format, see details at https://www.freedesktop.org/software/systemd/man/latest/os-release.html. Let's read it in feed_post_mortem_linux() to gather more info about the distribution. (cherry picked from commit f1594c41368baf8f60737b229e4359fa7e1289a9) Signed-off-by: Willy Tarreau <w@1wt.eu>	2025-07-11 11:48:19 +02:00
Willy Tarreau	e049bd00ab	MEDIUM: config: change default limits to 1024 threads and 32 groups A test run on a dual-socket EPYC 9845 (2x160 cores) showed that we'll be facing new limits during the lifetime of 3.2 with our current 16 groups and 256 threads max: $ cat test.cfg global cpu-policy performance $ ./haproxy -dc -c -f test.cfg ... Thread CPU Bindings: Tgrp/Thr Tid CPU set 1/1-32 1-32 32: 0-15,320-335 2/1-32 33-64 32: 16-31,336-351 3/1-32 65-96 32: 32-47,352-367 4/1-32 97-128 32: 48-63,368-383 5/1-32 129-160 32: 64-79,384-399 6/1-32 161-192 32: 80-95,400-415 7/1-32 193-224 32: 96-111,416-431 8/1-32 225-256 32: 112-127,432-447 Raising the default limit to 1024 threads and 32 groups is sufficient to buy us enough margin for a long time (hopefully, please don't laugh, you, reader from the future): $ ./haproxy -dc -c -f test.cfg ... Thread CPU Bindings: Tgrp/Thr Tid CPU set 1/1-32 1-32 32: 0-15,320-335 2/1-32 33-64 32: 16-31,336-351 3/1-32 65-96 32: 32-47,352-367 4/1-32 97-128 32: 48-63,368-383 5/1-32 129-160 32: 64-79,384-399 6/1-32 161-192 32: 80-95,400-415 7/1-32 193-224 32: 96-111,416-431 8/1-32 225-256 32: 112-127,432-447 9/1-32 257-288 32: 128-143,448-463 10/1-32 289-320 32: 144-159,464-479 11/1-32 321-352 32: 160-175,480-495 12/1-32 353-384 32: 176-191,496-511 13/1-32 385-416 32: 192-207,512-527 14/1-32 417-448 32: 208-223,528-543 15/1-32 449-480 32: 224-239,544-559 16/1-32 481-512 32: 240-255,560-575 17/1-32 513-544 32: 256-271,576-591 18/1-32 545-576 32: 272-287,592-607 19/1-32 577-608 32: 288-303,608-623 20/1-32 609-640 32: 304-319,624-639 We can change this default now because it has no functional effect without any configured cpu-policy, so this will only be an opt-in and it's better to do it now than to have an effect during the maintenance phase. 
A tiny effect is a doubling of the number of pool buckets and stick-table shards internally, which means that aside from slightly reducing contention in these areas, a dump of tables can enumerate keys in a different order (hence the adjustment in the vtc). The only really visible effect is a slightly higher static memory consumption (29->35 MB on a small config), but that difference remains even with 50k servers so that's pretty much acceptable. Thanks to Erwan Velu for the quick tests and the insights!	2025-05-13 18:15:33 +02:00
Willy Tarreau	8a96216847	MEDIUM: sock-inet: re-check IPv6 connectivity every 30s IPv6 connectivity might start off (e.g. network not fully up when haproxy starts), so for features like resolvers, it would be nice to periodically recheck. With this change, instead of having the resolvers code rely on a variable indicating connectivity, it will now call a function that will check for how long a connectivity check hasn't been run, and will perform a new one if needed. The age was set to 30s which seems reasonable considering that the DNS will cache results anyway. There's no saving in spacing it more since the syscall is very cheap (just a connect() without any packet being emitted). The variables remain exported so that we could present them in show info or anywhere else. This way, "dns-accept-family auto" will now stay up to date. Warning though, it does perform some caching so even with a refreshed IPv6 connectivity, an older record may be returned anyway.	2025-05-09 15:45:44 +02:00
Olivier Houchard	388539faa3	MEDIUM: stick-tables: defer adding updates to a tasklet There is a lot of contention trying to add updates to the tree. So instead of trying to add the updates to the tree right away, just add them to a mt-list (with one mt-list per thread group, so that the mt-list does not become the new point of contention that much), and create a tasklet dedicated to adding updates to the tree, in batches, to avoid keeping the update lock for too long. This helps stick tables perform better under heavy load.	2025-05-02 15:27:55 +02:00
Amaury Denoyelle	0f9b3daf98	MEDIUM: quic: limit global Tx memory Define a new setting tune.quic.frontend.max-tot-window. It takes a size argument which can be used to set a limit on the sum of all QUIC connections' congestion windows. This is applied both on quic_cc_path_set() and quic_cc_path_inc(). Note that this limitation cannot reduce a congestion window below the minimal limit, which is set to 2 datagrams.	2025-04-29 15:19:32 +02:00
Willy Tarreau	12c7189bc8	MEDIUM: thread: set DEBUG_THREAD to 1 by default Setting DEBUG_THREAD to 1 allows recording the lock history for each thread. Tests have shown that (as predicted) the cost of updating a single thread-local variable is not perceptible in the noise, especially when compared to the cost of obtaining a lock. Since this can provide useful value when debugging deadlocks, let's enable it by default when threads are enabled.	2025-04-28 16:50:34 +02:00
Willy Tarreau	903a6b14ef	MINOR: threads: prepare DEBUG_THREAD to receive more values We now default the value to zero and make sure all tests properly take care of values above zero. This is in preparation for supporting several degrees of debugging.	2025-04-28 16:50:34 +02:00
Willy Tarreau	61d633a3ac	DEBUG: rename DEBUG_GLITCHES to DEBUG_COUNTERS and enable it by default Till now the per-line glitches counters were only enabled with the confusingly named DEBUG_GLITCHES (which would not turn glitches off when disabled). Let's instead change it to DEBUG_COUNTERS and make sure it's enabled by default (though it can still be disabled with -DDEBUG_COUNTERS=0 just like for DEBUG_STRICT). It will later be expanded to cover more counters.	2025-04-14 19:02:13 +02:00
Olivier Houchard	9fe72bba3c	MAJOR: leastconn: Revamp the way servers are ordered. For leastconn, servers used to just be stored in an ebtree. Each server would be one node. Change that so that nodes contain multiple mt_lists. Each list will contain servers that share the same key (typically meaning they have the same number of connections). Using mt_lists means that as long as tree elements already exist, moving a server from one tree element to another no longer requires the lbprm write lock. We use multiple mt_lists to reduce the contention when moving a server from one tree element to another. A list in the new element will be chosen randomly. We no longer remove a tree element as soon as it no longer contains any server. Instead, we keep a list of all elements, and when we need a new element, we look at that list only if it contains a number of elements already, otherwise we'll allocate a new one. Keeping nodes in the tree ensures that we very rarely have to take the lbprm write lock (as it only happens when we're moving the server to a position for which no element is currently in the tree). The number of mt_lists used is defined as FWLC_NB_LISTS. The number of tree elements we want to keep is defined as FWLC_MIN_FREE_ENTRIES, both in defaults.h. The values used were picked after experimentation, and seem to be the best trade-off between performance and memory usage. Doing that gives a good boost in performance when a lot of servers are used. With a configuration using 500 servers, before that patch, about 830000 requests per second could be processed; with that patch, about 1550000 requests per second are processed, on a 64-core AMD, using 1200 concurrent connections.	2025-04-01 18:05:30 +02:00
Aurelien DARRAGON	0846638f7f	MEDIUM: stream: interrupt costly rulesets after too many evaluations It is not rare to see configurations with a large number of "tcp-request content" or "http-request" rules for instance. A large number of rules combined with cpu-demanding actions (e.g.: actions that work on content) may create thread contention as all the rules from a given ruleset are evaluated under the same polling loop if the evaluation is not interrupted. Thus, in this patch we add extra logic around "tcp-request content", "tcp-response content", "http-request" and "http-response" rulesets, so that when a certain number of rules are evaluated under a single polling loop, we force the evaluating function to yield. As such, the rule which was about to be evaluated is saved, and the function starts evaluating rules from the saved pointer when it returns (in the next polling loop). We use task_wakeup(task, TASK_WOKEN_MSG) to explicitly wake the task so that no time is wasted and the processing is resumed ASAP. TASK_WOKEN_MSG is mandatory here because process_stream() expects TASK_WOKEN_MSG for explicit analyzers re-evaluation. The stream's rules_bcount attribute was added to count how many rules were evaluated since the last interruption (yield). Also, the SF_RULE_FYIELD flag was added to know that s->current_rule was assigned due to a forced yield and not a regular yield. By default haproxy will enforce a yield every 50 rules; this behavior can be configured using the "tune.max-rules-at-once" global keyword. There is a limitation though: for now, if the ACT_OPT_FINAL flag is set on act_opts, we consider it is not safe to yield (as it is already the case for automatic yield). In this case, instead of yielding and taking the risk of not being called back, we skip the yield and hope it will not create contention. This is something we should ideally try to improve in order to yield in all conditions.	2025-02-03 17:09:48 +01:00
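The forced-yield logic described in 0846638f7f can be sketched roughly as follows. This is only an illustration under stated assumptions: `struct ruleset_state`, `eval_ruleset()` and `MAX_RULES_AT_ONCE` are hypothetical names, the rules are reduced to indexes, and the real code works on the stream's `current_rule` pointer and wakes the task with task_wakeup() rather than returning to a caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the "tune.max-rules-at-once" default (50). */
#define MAX_RULES_AT_ONCE 50

/* Hypothetical per-stream state; not HAProxy's real structures. */
struct ruleset_state {
    size_t next_rule;      /* rule to resume from (the saved pointer) */
    unsigned rules_bcount; /* rules evaluated since the last yield */
};

/* Walks rules [st->next_rule, nb_rules).
 * Returns 1 once the whole ruleset was processed, or 0 when the function
 * yielded and expects to be called again in a later polling loop.
 * can_yield == 0 models the ACT_OPT_FINAL case where yielding is skipped. */
static int eval_ruleset(struct ruleset_state *st, size_t nb_rules, int can_yield)
{
    assert(st != NULL);

    while (st->next_rule < nb_rules) {
        if (can_yield && st->rules_bcount >= MAX_RULES_AT_ONCE) {
            st->rules_bcount = 0;
            return 0; /* forced yield: next_rule is kept for the resume */
        }
        /* a real implementation would evaluate rule st->next_rule here */
        st->next_rule++;
        st->rules_bcount++;
    }

    /* ruleset fully processed: reset the state for the next evaluation */
    st->next_rule = 0;
    st->rules_bcount = 0;
    return 1;
}
```

With 120 rules, such a function would return 0 twice (after rules 50 and 100) before completing on the third call, whereas with yielding disabled it completes in a single call.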
Olivier Houchard	26b3e5236f	MEDIUM: servers/proxies: Switch to using per-tgroup queues. For both servers and proxies, use one connection queue per thread-group, instead of only one. Having only one can lead to severe performance issues on NUMA machines: it is actually trivial to get the watchdog to trigger on an AMD machine, having a server with a maxconn of 96, and an injector that uses 160 concurrent connections. We now have one queue per thread-group, however when dequeuing, we're dequeuing MAX_SELF_USE_QUEUE (currently 9) pendconns from our own queue, before dequeuing one from another thread group, if available, to make sure everybody is still running.	2025-01-28 12:49:41 +01:00
Valentine Krasnobaeva	fb7bef781d	MINOR: defaults: update MASTER_MAXCONN description This is one of the commits to prepare the removal of MODE_MWORKER_WAIT support, as it became redundant with MODE_MWORKER due to moving master-worker fork in init().	2024-10-16 22:02:39 +02:00
Amaury Denoyelle	d0d8e57d47	MINOR: quic: define sbuf pool Define a new buffer pool reserved for allocating smaller memory areas. For the moment, its usage will be restricted to QUIC, as such it is declared in the quic_stream module. Add a new config option "tune.bufsize.small" to specify the size of the allocated objects. A special check ensures that it is not greater than the default bufsize to avoid unexpected effects.	2024-08-20 18:12:27 +02:00
Valentine Krasnobaeva	8b1dfa9def	MINOR: cfgparse: limit file size loaded via /dev/stdin load_cfg_in_mem() can continuously reallocate memory in order to load an extremely large input from /dev/stdin, until it fails with ENOMEM, which means that process has consumed all available RAM. In case of containers and virtualized environments it's not very good. So, in order to prevent this, let's introduce MAX_CFG_SIZE as 10MB, which will limit the size of input supplied via /dev/stdin.	2024-08-20 14:28:34 +02:00
Valentine Krasnobaeva	41275a6918	MEDIUM: init: set default for fd_hard_limit via DEFAULT_MAXFD Let's provide a default value for fd_hard_limit, if it's not set in the configuration. With this patch we could set some specific default via the compile-time variable DEFAULT_MAXFD as well. Hopefully this will be helpful for haproxy package maintainers. make -j 8 TARGET=linux-glibc DEBUG=-DDEFAULT_MAXFD=50000 If haproxy is compiled without DEFAULT_MAXFD defined, the default will be set to 1048576. This is done to avoid the process being killed by its watchdog when it is started without any limitations in its configuration or on the command line and the hard RLIMIT_NOFILE is extremely huge (~1000000000). We use in this case compute_ideal_maxconn() to calculate maxconn and maxsock; maxsock defines the size of the internal fdtab, which becomes very large as well. When the process starts to simply loop over this fdtab (O(n)), this takes a lot of time, so the watchdog does its job. To avoid this, maxconn now is always reduced to some reasonable value either by an explicit global.fd-hard-limit from the configuration, or by its default. The default may be changed at build time and then overridden by global.fd-hard-limit at runtime. An explicit global.fd-hard-limit from the configuration always has precedence over DEFAULT_MAXFD, if set. Must be backported in all stable versions until v2.6.0, including v2.6.0.	2024-07-04 07:52:42 +02:00
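The precedence rule stated in 41275a6918 (explicit fd-hard-limit first, then the build-time DEFAULT_MAXFD, then the 1048576 fallback) might be pictured with the minimal sketch below. The helper name `effective_fd_limit()` and its parameters are assumptions made for illustration; the actual computation in haproxy goes through compute_ideal_maxconn() and involves more inputs.

```c
#include <assert.h>

/* Build-time default; 1048576 is the fallback quoted in the commit message.
 * It may be overridden with e.g. -DDEFAULT_MAXFD=50000. */
#ifndef DEFAULT_MAXFD
#define DEFAULT_MAXFD 1048576
#endif

/* Hypothetical helper, not HAProxy's real API.
 * rlimit_nofile_hard: the process's hard RLIMIT_NOFILE.
 * cfg_fd_hard_limit:  value of global "fd-hard-limit", or 0 when unset. */
static unsigned long effective_fd_limit(unsigned long rlimit_nofile_hard,
                                        unsigned long cfg_fd_hard_limit)
{
    unsigned long cap;

    assert(rlimit_nofile_hard > 0);

    /* an explicit configuration value always wins over the default */
    cap = cfg_fd_hard_limit ? cfg_fd_hard_limit : DEFAULT_MAXFD;

    /* never exceed what the system actually allows */
    return rlimit_nofile_hard < cap ? rlimit_nofile_hard : cap;
}
```

Under these assumptions, a hard RLIMIT_NOFILE around one billion with no configured limit would be capped to the default, which keeps the fdtab size bounded.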
Willy Tarreau	290659ffd3	MINOR: activity: make the memory profiling hash size configurable at build time The MEMPROF_HASH_BITS variable was set to 10 without a possibility to change it (beyond patching the code). After seeing a few reports already with "other" being listed and a list with close to 1024 entries, it looks like it's about time to either increase the hash size, or at least make it configurable for special cases. As a reminder, in order to remain fast, the algorithm searches no more than 16 places after the hash, so when a table is almost full, searches are long and new places are rare. The present patch just makes it possible to redefine it by passing "-DMEMPROF_HASH_BITS=11" or "-DMEMPROF_HASH_BITS=12" in CFLAGS, and moves the definition to defaults.h to make it easier to find. Such values should be way sufficient for the vast majority of use cases. Maybe in the future we'd change the default. At least this version should be backported to ease rebuilds, say, till 2.8 or so.	2024-06-27 18:01:27 +02:00
Willy Tarreau	0ce51dc93b	MEDIUM: dynbuf: implement emergency buffers The buffer reserve set by tune.buffers.reserve has long been unused, and in order to deal gracefully with failed memory allocations we'll need to resort to a few emergency buffers that are pre-allocated per thread. These buffers are only for emergency use, so every time their count is below the configured number a b_free() will refill them. For this reason their count can remain pretty low. We changed the default number from 2 to 4 per thread, and the minimum value is now zero (e.g. for low-memory systems). The tune.buffers.limit setting has always been a problem when trying to deal with the reserve but now we could simplify it by simply pushing the limit (if set) to match the reserve. That was already done in the past with a static value, but now with threads it was a bit trickier, which is why the per-thread allocators increment the limit on the fly before allocating their own buffers. This also means that the configured limit is saner and now corresponds to the regular buffers that can be allocated on top of emergency buffers. At the moment these emergency buffers are not used upon allocation failure. The only reason is to ease bisecting later if needed, since this commit only has to deal with resource management.	2024-05-10 17:18:13 +02:00
Willy Tarreau	772f9a5874	BUILD: pools: make DEBUG_MEMORY_POOLS=1 the default option This option has been set by default for a very long time and also complicates the manipulation of the DEBUG variable. Let's make it the official default and permit to unset it by setting it to zero. The other pool-related DEBUG options were adjusted to also explicitly check for the zero value for consistency.	2024-04-11 17:25:45 +02:00
Willy Tarreau	b70981532a	BUILD: debug: make DEBUG_STRICT=1 the default We continue to carry it in the makefile, which adds to the difficulty of passing new options. Let's make DEBUG_STRICT=1 the default so that one has to explicitly pass DEBUG_STRICT=0 to disable it. This allows us to remove the option from the default DEBUG variable in the makefile.	2024-04-11 17:25:45 +02:00
Willy Tarreau	1a088da7c2	MAJOR: stktable: split the keys across multiple shards to reduce contention In order to reduce the contention on the table when keys expire quickly, we're spreading the load over multiple trees. That counts for keys and expiration dates. The shard number is calculated from the key value itself, both when looking up and when setting it. The "show table" dump on the CLI iterates over all shards so that the output is not fully sorted, it's only sorted within each shard. The Lua table dump just does the same. It was verified with a Lua program to count stick-table entries that it works as intended (the test case is reproduced here as it's clearly not easy to automate as a vtc): function dump_stk() local dmp = core.proxies['tbl'].stktable:dump({}); local count = 0 for _, __ in pairs(dmp) do count = count + 1 end core.Info('Total entries: ' .. count) end core.register_action("dump_stk", {'tcp-req', 'http-req'}, dump_stk, 0); ## global tune.lua.log.stderr on lua-load-per-thread lua-cnttbl.lua listen front bind :8001 http-request lua.dump_stk if { path_beg /stk } http-request track-sc1 rand(),upper,hex table tbl http-request redirect location / backend tbl stick-table size 100k type string len 12 store http_req_cnt ## $ h2load -c 16 -n 10000 0:8001/ $ curl 0:8001/stk ## A count close to 100k appears on haproxy's stderr ## On the CLI, "show table tbl" | wc will show the same. Some large parts were reindented only to add a top-level loop to iterate over shards (e.g. process_table_expire()). Better check the diff using git show -b. The number of shards is decided just like for the pools, at build time based on the max number of threads, so that we can keep a constant. Maybe this should be done differently. For now CONFIG_HAP_TBL_BUCKETS is used, and defaults to CONFIG_HAP_POOL_BUCKETS to keep the benefits of all the measurements made for the pools. 
It turns out that this value seems to be the most reasonable one without inflating the struct stktable too much. By default for 1024 threads the value is 32 and delivers 980k RPS in a test involving 80 threads, while adding 1kB to the struct stktable (roughly doubling it). The same test at 64 gives 1008 kRPS and at 128 it gives 1040 kRPS for 8 times the initial size. 16 would be too low however, with 675k RPS. The stksess already have a shard number, it's the one used to decide which peer connection to send the entry to. Maybe we should also store the one associated with the entry itself instead of recalculating it, though it does not happen that often. The operation is done by hashing the key using XXH32(). The peers also take and release the table's lock but the way it's used is not very clear yet, so at this point it's sure this will not work. At this point, this completely unlocked the performance on an 80-thread setup: before: 5.4 Gbps, 150k RPS, 80 cores 52.71% haproxy [.] stktable_lookup_key 36.90% haproxy [.] stktable_get_entry.part.0 0.86% haproxy [.] ebmb_lookup 0.18% haproxy [.] process_stream 0.12% haproxy [.] process_table_expire 0.11% haproxy [.] fwrr_get_next_server 0.10% haproxy [.] eb32_insert 0.10% haproxy [.] run_tasks_from_lists after: 36 Gbps, 980k RPS, 80 cores 44.92% haproxy [.] stktable_get_entry 5.47% haproxy [.] ebmb_lookup 2.50% haproxy [.] fwrr_get_next_server 0.97% haproxy [.] eb32_insert 0.92% haproxy [.] process_stream 0.52% haproxy [.] run_tasks_from_lists 0.45% haproxy [.] conn_backend_get 0.44% haproxy [.] __pool_alloc 0.35% haproxy [.] process_table_expire 0.35% haproxy [.] connect_server 0.35% haproxy [.] h1_headers_to_hdr_list 0.34% haproxy [.] eb_delete 0.31% haproxy [.] srv_add_to_idle_list 0.30% haproxy [.] h1_snd_buf WIP: uint64_t -> long WIP: ulong -> uint code is much smaller	2024-04-03 17:34:47 +02:00
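The idea in 1a088da7c2 of deriving the shard from the key itself, so that lookups and insertions always agree, can be illustrated with the sketch below. It is not the haproxy code: the commit hashes the key with XXH32(), while this sketch swaps in a small FNV-1a hash to stay dependency-free, and `TBL_BUCKETS` and `key_to_shard()` are hypothetical names standing in for CONFIG_HAP_TBL_BUCKETS and the real helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for CONFIG_HAP_TBL_BUCKETS (32 for 1024 threads). */
#define TBL_BUCKETS 32

/* 32-bit FNV-1a, used here only as a substitute for XXH32(). */
static uint32_t fnv1a32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u; /* FNV offset basis */

    while (len--) {
        h ^= *p++;
        h *= 16777619u;       /* FNV prime */
    }
    return h;
}

/* Maps a key to one of TBL_BUCKETS shards. The result depends only on the
 * key bytes, so looking a key up and storing it pick the same tree. */
static unsigned int key_to_shard(const void *key, size_t len)
{
    assert(key != NULL);
    return fnv1a32(key, len) % TBL_BUCKETS;
}
```

Because each shard has its own tree, a full dump has to visit every shard in turn, which is why the commit notes that the output is only sorted within each shard.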
Willy Tarreau	6c1b29d06f	MINOR: ring: make the number of queues configurable Now the rings have one wait queue per group. This should limit the contention on systems such as EPYC CPUs where the performance drops dramatically when using more than one CCX. Tests were run with different numbers and it was shown that value 6 outperforms all other ones at 12, 24, 48, 64 and 80 threads on an EPYC, a Xeon and an Ampere CPU. Value 7 sometimes comes close and anything around these values degrades quickly. The value has been left tunable in the global section. This commit only introduces everything needed to set up the queue count so that it's easier to adjust it in the forthcoming patches, but it was initially added after the series, making it harder to compare. It was also shown that trying to group the threads in queues by their thread groups is counter-productive and that it was more efficient to do that by applying a modulo on the thread number. As surprising as it seems, it does have the benefit of well balancing any number of threads.	2024-03-25 17:34:19 +00:00
Willy Tarreau	0a0041d195	BUILD: tree-wide: fix a few missing includes in a few files Some include files, mostly types definitions, are missing a few includes to define the types they're using, causing include ordering dependencies between files, which are most often not seen due to the alphabetical order of includes. Let's just fix them. These were spotted by building pre-compiled headers for all these files to .h.gch.	2024-03-05 11:50:34 +01:00
Aurelien DARRAGON	cb3ec978fd	MINOR: event_hdl: add global tunables The local variable "event_hdl_async_max_notif_at_once" which was introduced with the event_hdl API was left as is but with a TODO note telling that we should make it a global tunable. Well, we're doing this now. To prepare for upcoming tunables related to the event_hdl API, we add a dedicated struct named event_hdl_tune which is globally exposed through the event_hdl header file so that it may be used from everywhere. The struct is automatically initialized in event_hdl_init() according to defaults.h. "event_hdl_async_max_notif_at_once" now becomes "event_hdl_tune.max_events_at_once" with its dedicated configuration keyword: "tune.events.max-events-at-once". We're also taking this opportunity to raise the default value from 10 to 100 since it seems quite reasonable given existing async event_hdl users. The documentation was updated accordingly.	2023-11-29 08:59:27 +01:00
Remi Tricot-Le Breton	48f81ec09d	MAJOR: cache: Delay cache entry delete in reserve_hot function A reference counter on the cache_entry was added in a previous commit. Its value is atomically increased and decreased via the retain_entry and release_entry functions. This is needed because of the latest cache and shared_context modifications that introduced two separate locks instead of the preexisting single shctx_lock one. With the new logic, we have two main blocks competing for the two locks: - the one in the http_action_req_cache_use that performs a lookup in the cache tree (locked by the cache lock) and then tries to remove the corresponding blocks from the shared_context's 'avail' list until the response is sent to the client by the cache applet, - the shctx_row_reserve_hot that traverses the 'avail' list and gives them back to the caller, while removing previous row heads from the cache tree. Those two blocks require the two locks but one of them would take the cache lock first, and the other one the shctx_lock first, which would end in a deadlock without the current patch. The way this conflict is resolved in this patch is by ensuring that at least one of those uses works without taking the two locks at the same time. The solution found was to keep taking the two locks in the cache_use case. We first lock the cache to look up an entry and we then take the shctx lock as well to detach the corresponding blocks from the 'avail' list. The subtlety is that between the cache lookup and the actual locking of the shctx, another thread might have called the reserve_hot function in which we only take the shctx lock. In this function we traverse the 'avail' list to remove blocks that are then given to the caller. If one of those blocks corresponds to a previous row head, we call the 'free_blocks' callback that used to delete the cache entry from the tree. 
We now avoid directly deleting the cache entries in reserve_hot and we rather set the cache entries' 'complete' param to 0 so that no other thread tries to work with this entry. This way, when we release the shctx lock in reserve_hot, the first thread that had performed the cache lookup and had found an entry that we just gave to another thread will see that the 'complete' field is 0 and it won't try to work with this response. The actual removal of entries from the cache tree will now be performed in the new 'reserve_finish' callback called at the end of the shctx_row_reserve_hot function. It will iterate on all the row heads that were inserted in a dedicated list in the 'free_block' callback and perform the actual delete.	2023-11-16 19:35:10 +01:00
Willy Tarreau	06885aaea7	MINOR: pools: introduce the use of multiple buckets On many threads and without the shared cache, there can be extreme contention on the ->allocated counter, the ->free_list pointer, and the ->used counter. It's possible to limit this contention by spreading the counters a little bit over multiple entries, that are summed up when a consultation is needed. The criterion used to spread the values cannot be related to the thread ID due to migrations, since we need to keep consistent stats (allocated vs used). Instead we'll just hash the pointer, it provides an index that does the job and that is consistent for the object. When having just a few entries (16 here as it showed almost identical performance between global and non-global pools) even iterations should be short enough during measurements to not be a problem. A pair of functions designed to ease pointer hash bucket calculation were added, with one of them doing it for thread IDs because allocation failures will be associated with a thread and not a pointer. For now this patch only brings in the relevant parts of the infrastructure, the CONFIG_HAP_POOL_BUCKETS_BITS macro that defaults to 6 bits when 512 threads or more are supported, 5 bits when 128 or more are supported, 4 bits when 16 or more are supported, otherwise 3 bits for small setups. The array in the pool_head and the two utility functions are already added. It should have no measurable impact beyond inflating the pool_head structure.	2023-08-12 19:04:34 +02:00
Willy Tarreau	59c347c15e	BUILD: defaults: use __WORDSIZE not LONGBITS for MAX_THREADS_PER_GROUP LONGBITS was defined long ago with old compilers that didn't provide the word size. It's still present as being referenced in various places in the code, but we must not use it to define other macros that may be evaluated at pre-processing time since it contains sizeof() and casts that are not compatible with preprocessor conditions. Let's switch MAX_THREADS_PER_GROUP to __WORDSIZE so that we can condition blocks of code on it if needed. LONGBITS should really be removed by now, given that we don't support compilers not providing __WORDSIZE anymore (gcc < 4.2).	2023-08-12 19:04:34 +02:00
Patrick Hemmer	7fccccccea	MINOR: acl: add acl() sample fetch This provides a sample fetch which returns the evaluation result of the conjunction of named ACLs.	2023-08-01 10:49:06 +02:00
William Lallemand	2078d4b1f7	BUG/MINOR: mworker: use MASTER_MAXCONN as default maxconn value In environments where SYSTEM_MAXCONN is defined when compiling, the master will use this value instead of the original minimal value which was set to 100. When this happens, the master process could allocate RAM excessively since it does not need to have a high maxconn. (For example if SYSTEM_MAXCONN was set to 100000 or more) This patch fixes the issue by using the new define MASTER_MAXCONN which defines a default maxconn of 100 for the master process. Must be backported as far as 2.5.	2023-03-09 14:28:44 +01:00
Willy Tarreau	28360dc53f	MEDIUM: clock: force internal time to wrap early after boot GH issue #2034 clearly indicates yet another case of time roll-over that went badly. Issues that happen only once every 50 days are hard to detect and debug, and are usually reported more or less synchronized from multiple sources. This patch finally does what had long been planned but never done yet, which is to force the time to wrap early after boot so that any such remaining issue can be spotted quicker. The margin delay here is 20s (it may be changed by setting BOOT_TIME_WRAP_SEC to another value). This value seems sufficient to permit failed health checks to succeed and traffic to come in and possibly start to update some time stamps (accept dates in logs, freq counters, stick-tables expiration dates etc). It could theoretically be helpful to have this in 2.7, but as can be seen with the two patches below, we've already had incorrect use cases of the internal monotonic time when the wall-clock one was needed, so we could expect to detect other ones in the future. Note that this will not induce bugs, it will only make them happen much faster (i.e. no need to wait for 50 days before seeing them). If it were to eventually be backported, these two previous patches must also be backported: BUG/MINOR: clock: use distinct wall-clock and monotonic start dates BUG/MEDIUM: cache: use the correct time reference when comparing dates	2023-02-08 11:10:33 +01:00
Willy Tarreau	b8b243ac6a	MINOR: trace: add the long awaited TRACE_PRINTF() TRACE_PRINTF() can be used to produce arbitrary trace contents at any trace level. It uses the exact same arguments as other TRACE_* macros, but here they are mandatory since they are followed by the format-string, though they may be filled with zeroes. The reason for the arguments is to match tracking or filtering and not pollute other non-inspected objects. It will probably be used inside loops, in which case there are three points to be careful about: - output atomicity is only per-message, so competing threads may see their messages interleaved. As such, it is recommended that the caller places a recognizable unique context at the beginning of the message such as a connection pointer. - iterating over arrays or lists for all requests could be very expensive. In order to avoid this it is best to condition the call via TRACE_ENABLED() with the same arguments, which will return the same decision. - messages longer than TRACE_MAX_MSG-1 (1023 by default) will be truncated. For example, in order to dump the list of HTTP headers between hpack and h2: if (outlen > 0 && TRACE_ENABLED(TRACE_LEVEL_DEVELOPER, H2_EV_RX_FRAME|H2_EV_RX_HDR, h2c->conn, 0, 0, 0)) { int i; for (i = 0; list[i].n.len; i++) TRACE_PRINTF(TRACE_LEVEL_DEVELOPER, H2_EV_RX_FRAME|H2_EV_RX_HDR, h2c->conn, 0, 0, 0, "h2c=%p hdr[%d]=%s:%s", h2c, i, list[i].n.ptr, list[i].v.ptr); } In addition, a lower-level TRACE_PRINTF_LOC() macro is provided, that takes two extra arguments, the caller's location and the caller's function name. This will allow emitting composite traces from central functions on behalf of another one.	2023-01-26 15:49:43 +01:00
Willy Tarreau	4da51bd190	CLEANUP: pools: get rid of CONFIG_HAP_POOLS This one was set in defaults.h only when neither DEBUG_NO_POOLS nor DEBUG_UAF were set. This was not the most convenient location to look for it, and it was only used in pool.c to decide on the initial value of POOL_DBG_NO_CACHE. Let's just use DEBUG_NO_POOLS || DEBUG_UAF directly on this flag and get rid of the intermediary condition. This also has the benefit of removing a double inversion, which is always nice for understanding.	2022-12-08 17:45:08 +01:00
William Lallemand	eba6a54cd4	MINOR: logs: startup-logs can use a shm for logging the reload When compiled with USE_SHM_OPEN=1 the startup-logs are now able to use an shm which is used to keep the logs when switching to mworker wait mode. This makes it possible to keep the failed reload logs. When allocating the startup-logs at first start of the process, haproxy will do a shm_open with a unique path using the PID of the process; the file is unlinked immediately so that no stray file is left behind. The fd resulting from this shm is stored in the HAPROXY_STARTUPLOGS_FD environment variable so it can be mmapped again when switching to wait mode. When forking children, the process copies the mmap to a mallocated ring so we never share the same memory section between the master and the workers. When switching to wait mode, the shm is not used anymore as it is also copied to a mallocated structure. This allows the "show startup-logs" command to be used over the master CLI, to get the logs of the latest startup or reload. This way the logs of the latest failed reload are also kept. This is only activated on the linux-glibc target for now.	2022-10-13 16:50:22 +02:00
Willy Tarreau	693688e734	MINOR: config: automatically preset MAX_THREADS based on MAX_TGROUPS MAX_THREADS was not changed when setting MAX_TGROUPS, which still limits some possibilities. Let's preset it to 4 * LONGBITS when MAX_TGROUPS is larger than 1, or LONGBITS when it's set to 1. This means that the new default value is 256 threads. The rationale behind this is that the main use of thread groups is mostly to address NUMA issues and that we don't necessarily need large thread counts when using many groups, and 256 threads is already plenty even on quite large systems. For now it's important not to go too far because some internal structs are arrays of MAX_THREADS entries, for example accept_queue_ring, which is around 8kB per thread. Such structures will need to become dynamic before defaulting to large thread counts (at 4096 threads max the accept queues would require 32 MB RAM alone).	2022-08-06 16:51:20 +02:00
Willy Tarreau	856d56d2d2	MINOR: config: change default MAX_TGROUPS to 16 This will allow nbtgroups > 1 to be declared in the config without recompiling. The theoretical limit is 64, though we'd rather not push it too far for now as some structures might be enlarged to be indexed per group. Let's start with 16 groups max, allowing to experiment with dual-socket machines suffering from up to 8 loosely coupled L3 caches. It's a good start and doesn't engage us too far.	2022-07-15 21:51:48 +02:00
Willy Tarreau	3ccb14d60d	MINOR: thread: get rid of MAX_THREADS_MASK This macro was used both for binding and for lookups. When binding tasks or FDs, using all_threads_mask instead is better as it will later be per group. For lookups, ~0UL always does the job. Thus in practice the macro was already almost not used anymore since the rest of the code could run fine with a constant of all ones there.	2022-06-14 11:18:40 +02:00
Willy Tarreau	197715ae21	CLEANUP: compression: move the default setting of maxzlibmem to defaults __comp_fetch_init() only presets the maxzlibmem, and only when both USE_ZLIB and DEFAULT_MAXZLIBMEM are set. The intent is to preset a default value to protect the system against excessive memory usage when no setting is set by the user. Nowadays the entry in the global struct is always there so there's no point anymore in passing via a constructor to possibly set this value. Let's go the cleaner way by always presetting DEFAULT_MAXZLIBMEM to 0 in defaults.h unless these conditions are met, and always assigning it instead of pre-setting the entry to zero. This is more straightforward and removes some ifdefs and the last constructor. In addition, now the setting has a chance of being found.	2022-04-25 19:42:43 +02:00
Remi Tricot-Le Breton	1d6338ea96	MEDIUM: ssl: Disable DHE ciphers by default DHE ciphers do not present a security risk if the key is big enough but they are slow and mostly obsoleted by ECDHE. This patch removes any default DH parameters. This will effectively disable all DHE ciphers unless a global ssl-dh-param-file is defined, or tune.ssl.default-dh-param is set, or a frontend has DH parameters included in its PEM certificate. In this latter case, only the frontends that have DH parameters will have DHE ciphers enabled. Explicitly adding DHE ciphers to a "bind" line will not be enough to actually enable DHE. We would still need to know which DH parameters to use so one of the three conditions described above must be met. This request was described in GitHub issue #1604.	2022-04-20 17:30:55 +02:00
Willy Tarreau	b61fccdc3f	CLEANUP: init: remove the ifdef on HAPROXY_MEMMAX It's ugly, let's move it to defaults.h with all other ones and preset it to zero if not defined.	2022-02-23 17:11:33 +01:00
Remi Tricot-Le Breton	55d7e782ee	MINOR: ssl: Set default dh size to 2048 Starting from OpenSSLv3, we won't rely on the SSL_CTX_set_tmp_dh_callback mechanism so we will need to know the DH size we want to use during init. In order for the default DH param size to be used when no RSA or DSA private key can be found for a given bind line, we will need to know the default size we want to use (which was not possible the way the code was built, since the global default dh size was set too late).	2022-02-14 10:07:14 +01:00
Willy Tarreau	39fd546d4b	MINOR: pools: enable pools with DEBUG_FAIL_ALLOC as well During 2.4-dev, fault injection was enabled for cached pools with commit `207c09509` ("MINOR: pools: move the fault injector to __pool_alloc()"), except that the condition for CONFIG_HAP_POOLS still depended on DEBUG_FAIL_ALLOC not being set, which limits the usability to cases where the define is set by hand. Let's remove it from the equation as this is not a constraint anymore. While a bit old, there's no need to backport this as it's only used during development.	2022-01-12 17:31:01 +01:00
Willy Tarreau	f5e94b2f47	OPTIM: pools: reduce local pool cache size to 512kB Now that we support batched allocations/releases, it appears that we can reach the same performance on H2 with shared pools and 256kB thread-local cache as without shared pools, a fast allocator and 1MB thread-local cache. With 512kB we're up to 10% faster on highly multiplexed H2 than without the shared cache. This was tested on a 16-core ARM machine. Thus it's time to slightly reduce the per-thread memory cost, which may also improve the performance on machines with smaller L2 caches. It essentially reverts commit `f587003fe` ("MINOR: pools: double the local pool cache size to 1 MB").	2022-01-02 19:52:15 +01:00
Willy Tarreau	43937e920f	MEDIUM: pools: start to batch eviction from local caches Since the previous patch we can forcefully evict multiple objects from the local cache, even when evicting based on the LRU entries. Let's define a compile-time configurable setting to batch releasing of objects. For now we set this value to 8 items per round. This is marked medium because eviction from the LRU will slightly change in order to group the last items that are freed within a single cache instead of accurately scanning only the oldest ones exactly in their order of appearance. But this is required in order to evolve towards batched removals.	2022-01-02 19:35:26 +01:00
Willy Tarreau	f9662848f2	MINOR: threads: introduce a minimalistic notion of thread-group This creates a struct tgroup_info which knows the thread ID of the first thread in a group, and the number of threads in it. For now there's only one thread group supported in the configuration, but it may be forced to other values for development purposes by defining MAX_TGROUPS, and it's enabled even when threads are disabled and will need to remain accessible during boot to keep a simple enough internal API. For the purpose of easing the configurations which do not specify a thread group, we're starting group numbering at 1 so that thread group 0 can be "undefined" (i.e. for "bind" lines or when binding tasks). The goal will be to later move there some global items that must be made per-group.	2021-10-08 17:22:26 +02:00
Willy Tarreau	561958c17c	CLEANUP: time: move a few configurable defines to defaults.h TV_ETERNITY, TV_ETERNITY_MS and MAX_DELAY_MS may be configured and ought to be in defaults.h so that they can be inherited from everywhere without including time.h and could also be redefined if needed (particularly for MAX_DELAY_MS).	2021-10-07 01:41:14 +02:00
Willy Tarreau	8de6dc9926	REORG: pools: move default settings to defaults.h There's no reason for CONFIG_HAP_POOLS and its opposite to be located in pools-t.h, as it forces those that depend on them to include the file. Other similar options are normally dealt with in defaults.h, which is part of the default API, so let's do that.	2021-09-28 19:31:16 +02:00
Tim Duesterhus	d5fc8fcb86	CLEANUP: Add haproxy/xxhash.h to avoid modifying import/xxhash.h This solves setting XXH_INLINE_ALL in a cleaner way, because the imported header is not modified, easing future updates. see `6f7cc11e6d`	2021-09-11 19:58:45 +02:00
Willy Tarreau	33056436c7	BUILD/MINOR: defaults: eliminate warning on MAXHOSTNAMELEN with -Wundef As reported in GH issue #1369, there is a single case of #if with a possibly undefined value in defaults.h which is on MAXHOSTNAMELEN. Let's turn it to a #ifdef.	2021-08-28 12:05:32 +02:00
Willy Tarreau	dc70c18ddc	BUG/MEDIUM: cfgcond: limit recursion level in the condition expression parser Oss-fuzz reports in issue 36328 that we can recurse too far by passing extremely deep expressions to the ".if" parser. I thought we were still limited to the 1024 chars per line, that would be highly sufficient, but we don't have any limit now :-/ Let's just pass a maximum recursion counter to the recursive parsers. It's decremented for each call and the expression fails if it reaches zero. On the most complex paths it can add 3 levels per parenthesis, so with a limit of 1024, that's roughly 343 nested sub-expressions that are supported in the worst case. That's more than sufficient, for just a few kB of RAM. No backport is needed.	2021-07-20 18:03:08 +02:00
Willy Tarreau	06987f4238	CLEANUP: global: remove unused definition of MAX_PROCS This one was forced to 1 and the only reference was a test to verify it was comprised between 1 and LONGBITS.	2021-06-15 16:52:42 +02:00

65 commits