/*
 * Fast Weighted Least Connection load balancing algorithm.
 *
 * Copyright 2000-2009 Willy Tarreau <w@1wt.eu>
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 *
 */

#include <import/eb32tree.h>
#include <haproxy/api.h>
#include <haproxy/backend.h>
#include <haproxy/queue.h>
#include <haproxy/server-t.h>
MAJOR: leastconn: postpone the server's repositioning under contention
When leastconn is used under many threads, there can be a lot of
contention on leastconn, because the same node has to be moved around
all the time (when picking it and when releasing it). In GH issue #2861
it was noticed that 46 threads out of 64 were waiting on the same lock
in fwlc_srv_reposition().
In such a case, the accuracy of the server's key becomes quite irrelevant
because nobody cares if the same server is picked twice in a row and the
next one twice again.
While other approaches in the past considered using a floating key to
avoid moving the server each time (which was not compatible with the
round-robin rule for equal keys), here a more drastic solution is needed.
What we're doing instead is that we turn this lock into a trylock. If we
can grab it, we do the job. If we can't, then we just wake up a server's
tasklet dedicated to this. That tasklet will then try again slightly
later, knowing that during this short time frame, the server's position
in the queue is slightly inaccurate. Note that any thread touching the
same server will also reposition it and save that work for next time.
Also if multiple threads wake the tasklet up, then that's fine, their
calls will be merged and a single lock will be taken in the end.
Testing this on a 24-core EPYC 74F3 showed a significant performance
boost from 382krps to 610krps. The performance profile reported by
perf top dropped from 43% to 2.5%:
Before:
Overhead Shared Object Symbol
43.46% haproxy-master-inlineebo [.] fwlc_srv_reposition
21.20% haproxy-master-inlineebo [.] fwlc_get_next_server
0.91% haproxy-master-inlineebo [.] process_stream
0.75% [kernel] [k] ice_napi_poll
0.51% [kernel] [k] tcp_recvmsg
0.50% [kernel] [k] ice_start_xmit
0.50% [kernel] [k] tcp_ack
After:
Overhead Shared Object Symbol
30.37% haproxy [.] fwlc_get_next_server
2.51% haproxy [.] fwlc_srv_reposition
1.91% haproxy [.] process_stream
1.46% [kernel] [k] ice_napi_poll
1.36% [kernel] [k] tcp_recvmsg
1.04% [kernel] [k] tcp_ack
1.00% [kernel] [k] skb_release_data
0.96% [kernel] [k] ice_start_xmit
0.91% haproxy [.] conn_backend_get
0.82% haproxy [.] connect_server
0.82% haproxy [.] run_tasks_from_lists
Tested on an Ampere Altra with 64 aarch64 cores dedicated to haproxy,
the gain is even more visible (3.6x):
Before: 311-323k rps, 3.16-3.25ms, 6400% CPU
Overhead Shared Object Symbol
55.69% haproxy-master [.] fwlc_srv_reposition
33.30% haproxy-master [.] fwlc_get_next_server
0.89% haproxy-master [.] process_stream
0.45% haproxy-master [.] h1_snd_buf
0.34% haproxy-master [.] run_tasks_from_lists
0.32% haproxy-master [.] connect_server
0.31% haproxy-master [.] conn_backend_get
0.31% haproxy-master [.] h1_headers_to_hdr_list
0.24% haproxy-master [.] srv_add_to_idle_list
0.23% haproxy-master [.] http_request_forward_body
0.22% haproxy-master [.] __pool_alloc
0.21% haproxy-master [.] http_wait_for_response
0.21% haproxy-master [.] h1_send
After: 1.21M rps, 0.842ms, 6400% CPU
Overhead Shared Object Symbol
17.44% haproxy [.] fwlc_get_next_server
6.33% haproxy [.] process_stream
4.40% haproxy [.] fwlc_srv_reposition
3.64% haproxy [.] conn_backend_get
2.75% haproxy [.] connect_server
2.71% haproxy [.] h1_snd_buf
2.66% haproxy [.] srv_add_to_idle_list
2.33% haproxy [.] run_tasks_from_lists
2.14% haproxy [.] h1_headers_to_hdr_list
1.56% haproxy [.] stream_set_backend
1.37% haproxy [.] http_request_forward_body
1.35% haproxy [.] http_wait_for_response
1.34% haproxy [.] h1_send
And at similar loads, the CPU usage considerably drops (3.55x), as
well as the response time (10x):
After: 320k rps, 0.322ms, 1800% CPU
Overhead Shared Object Symbol
7.62% haproxy [.] process_stream
4.64% haproxy [.] h1_headers_to_hdr_list
3.09% haproxy [.] h1_snd_buf
3.08% haproxy [.] h1_process_demux
2.22% haproxy [.] __pool_alloc
2.14% haproxy [.] connect_server
1.87% haproxy [.] h1_send
> 1.84% haproxy [.] fwlc_srv_reposition
1.84% haproxy [.] run_tasks_from_lists
1.77% haproxy [.] sock_conn_iocb
1.75% haproxy [.] srv_add_to_idle_list
1.66% haproxy [.] http_request_forward_body
1.65% haproxy [.] wake_expired_tasks
1.59% haproxy [.] h1_parse_msg_hdrs
1.51% haproxy [.] http_wait_for_response
> 1.50% haproxy [.] fwlc_get_next_server
The cost of fwlc_get_next_server() naturally increases as the server
count increases, but now has no visible effect on updates. The load
distribution remains unchanged compared to the previous approach,
the weight still being respected.
For further improvements to the fwlc algo, please consult github
issue #881 which centralizes everything related to this algorithm.
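The trylock-plus-deferred-work pattern described above can be sketched in a standalone form. All names here are hypothetical; a tiny spinlock and a "pending" flag stand in for HAProxy's lbprm lock and the per-server tasklet:

```c
static int tree_lock;           /* 0 = free, 1 = held */
static int reposition_pending;  /* set when work had to be deferred */
static int repositions;         /* number of repositionings performed */

static int demo_trylock(void)
{
	int expected = 0;
	return __atomic_compare_exchange_n(&tree_lock, &expected, 1, 0,
	                                   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static void demo_unlock(void)
{
	__atomic_store_n(&tree_lock, 0, __ATOMIC_RELEASE);
}

static void do_reposition(void)
{
	repositions++; /* stand-in for moving the server's tree node */
}

/* Fast path: do the work if the lock is free, otherwise just mark it
 * pending. Concurrent failed attempts merge into one deferred pass. */
void srv_reposition(void)
{
	if (demo_trylock()) {
		do_reposition();
		demo_unlock();
	} else {
		__atomic_store_n(&reposition_pending, 1, __ATOMIC_RELEASE);
	}
}

/* Deferred path, run later by the "tasklet": performs at most one
 * repositioning no matter how many threads requested it. */
void deferred_tasklet(void)
{
	if (__atomic_exchange_n(&reposition_pending, 0, __ATOMIC_ACQ_REL)) {
		while (!demo_trylock())
			; /* spin; the real code would re-schedule instead */
		do_reposition();
		demo_unlock();
	}
}
```

The key property is in `deferred_tasklet()`: the atomic exchange of the pending flag merges any number of wake-ups into a single locked repositioning.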
#include <haproxy/task.h>
MAJOR: leastconn: Revamp the way servers are ordered.
For leastconn, servers used to just be stored in an ebtree.
Each server would be one node.
Change that so that nodes contain multiple mt_lists. Each list
will contain servers that share the same key (typically meaning
they have the same number of connections). Using mt_lists means
that as long as tree elements already exist, moving a server from
one tree element to another no longer requires the lbprm write
lock.
We use multiple mt_lists to reduce the contention when moving
a server from one tree element to another. A list in the new
element will be chosen randomly.
We no longer remove a tree element as soon as it no longer
contains any server. Instead, we keep a list of all elements,
and when we need a new element, we look at that list only if it
already contains a number of elements, otherwise we'll allocate
a new one. Keeping nodes in the tree ensures that we very
rarely have to take the lbprm write lock (as it only happens
when we're moving the server to a position for which no
element is currently in the tree).
The number of mt_lists used is defined as FWLC_LISTS_NB.
The number of tree elements we want to keep is defined as
FWLC_MIN_FREE_ENTRIES, both in defaults.h.
The values used were picked after experimentation, and
seem to be the best compromise between performance and
memory usage.
Doing that gives a good performance boost when a lot of
servers are used.
With a configuration using 500 servers, before that patch,
about 830000 requests per second could be processed; with
that patch, about 1550000 requests per second are
processed, on a 64-core AMD, using 1200 concurrent connections.
#include <haproxy/tools.h>
struct fwlc_tree_elt {
	struct mt_list srv_list[FWLC_LISTS_NB];
	struct mt_list free_list;
	struct eb32_node lb_node;
	unsigned int elements;
};
MEDIUM: tree-wide: replace most DECLARE_POOL with DECLARE_TYPED_POOL
This will make the pool's size and alignment automatically inherit
those of the type declaration. It was done like this:
sed -i -e 's:DECLARE_POOL(\([^,]*,[^,]*,\s*\)sizeof(\([^)]*\))):DECLARE_TYPED_POOL(\1\2):g' $(git grep -lw DECLARE_POOL src addons)
sed -i -e 's:DECLARE_STATIC_POOL(\([^,]*,[^,]*,\s*\)sizeof(\([^)]*\))):DECLARE_STATIC_TYPED_POOL(\1\2):g' $(git grep -lw DECLARE_STATIC_POOL src addons)
81 replacements were made. The only remaining ones are those which set
their own size without depending on a structure. The few ones with an
extra size were manually handled.
It also means that the requested alignments are now checked against the
type's. Given that none is specified for now, no issue is reported.
It was verified with "show pools detailed" that the definitions are
exactly the same, and that the binaries are similar.
DECLARE_STATIC_TYPED_POOL(pool_head_fwlc_elt, "fwlc_tree_elt", struct fwlc_tree_elt);
#define FWLC_LBPRM_SEQ(lbprm) ((lbprm) & 0xffffffff)
#define FWLC_LBPRM_SMALLEST(lbprm) ((lbprm) >> 32)

/*
 * Atomically try to update the sequence number, and the smallest key for
 * which there is at least one server.
 * Returns 1 on success, and 0 on failure.
 */
static int fwlc_set_seq_and_smallest(struct lbprm *lbprm, uint64_t current, unsigned int seq, unsigned int smallest)
{
	uint64_t dst_nb = seq | ((uint64_t)smallest << 32);
	int ret;

#if defined(HA_CAS_IS_8B)
	ret = _HA_ATOMIC_CAS(&lbprm->lb_seq, &current, dst_nb);
#elif defined(HA_HAVE_CAS_DW)
	ret = _HA_ATOMIC_DWCAS(&lbprm->lb_seq, &current, &dst_nb);
#else
	__decl_thread(static HA_SPINLOCK_T seq_lock);

	HA_SPIN_LOCK(OTHER_LOCK, &seq_lock);
	if (lbprm->lb_seq == current) {
		lbprm->lb_seq = dst_nb;
		ret = 1;
	} else
		ret = 0;
	HA_SPIN_UNLOCK(OTHER_LOCK, &seq_lock);
#endif
	return ret;
}

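The 64-bit packing manipulated by the macros and function above can be illustrated standalone. The `DEMO_*` names are hypothetical stand-ins mirroring `FWLC_LBPRM_SEQ`/`FWLC_LBPRM_SMALLEST`:

```c
#include <stdint.h>

/* The lower 32 bits hold a generation sequence that changes on every
 * update; the upper 32 bits cache the smallest occupied key. Packing
 * both into one word lets them be swapped with a single 8-byte CAS. */
#define DEMO_SEQ(v)      ((uint64_t)(v) & 0xffffffff)
#define DEMO_SMALLEST(v) ((uint64_t)(v) >> 32)

static uint64_t demo_pack(uint32_t seq, uint32_t smallest)
{
	return seq | ((uint64_t)smallest << 32);
}
```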
/* Remove a server from a tree. It must have previously been dequeued. This
 * function is meant to be called when a server is going down or has its
 * weight disabled.
 *
 * The server's lock and the lbprm's lock must be held.
 */
static inline void fwlc_remove_from_tree(struct server *s)
{
	s->lb_tree = NULL;
}

/*
 * Remove anything allocated by the proxy
 */
static void fwlc_proxy_deinit(struct proxy *p)
{
	struct fwlc_tree_elt *tree_elt;

	while ((tree_elt = MT_LIST_POP(&p->lbprm.lb_free_list, struct fwlc_tree_elt *, free_list)) != NULL) {
		pool_free(pool_head_fwlc_elt, tree_elt);
	}
}

/*
 * Remove anything allocated by the server
 */
static void fwlc_server_deinit(struct server *s)
{
	if (s->free_elt) {
		pool_free(pool_head_fwlc_elt, s->free_elt);
		s->free_elt = NULL;
	}
}

/* simply removes a server from a tree.
 *
 * The lbprm's lock must be held.
 */
static inline void fwlc_dequeue_srv(struct server *s)
{
	struct fwlc_tree_elt *tree_elt = s->tree_elt;
	unsigned int elts;

	MT_LIST_DELETE(&s->lb_mt_list);
	if (tree_elt) {
		elts = _HA_ATOMIC_FETCH_SUB(&tree_elt->elements, 1);
		/* We are the last element, we can nuke the node */
		if (elts == 1) {
			if (FWLC_LBPRM_SMALLEST(s->proxy->lbprm.lb_seq) == tree_elt->lb_node.key) {
				/*
				 * We were the smallest one, and now we're
				 * gone, reset it
				 */
				/*
				 * We're holding the lbprm lock so this should never fail,
				 * as nobody should be around to modify it
				 */
				do {
				} while (fwlc_set_seq_and_smallest(&s->proxy->lbprm, s->proxy->lbprm.lb_seq, FWLC_LBPRM_SEQ(s->proxy->lbprm.lb_seq) + 1, 0) == 0 && __ha_cpu_relax());

			}
			eb32_delete(&tree_elt->lb_node);
		}
	}
	s->tree_elt = NULL;
	if (s->free_elt) {
		pool_free(pool_head_fwlc_elt, s->free_elt);
		s->free_elt = NULL;
	}
}

/*
 * Allocate a tree element, either from the free list, from an element provided, or
 * from allocation.
 * Must be called with the wrlock.
 */
static struct fwlc_tree_elt *fwlc_alloc_tree_elt(struct proxy *p, struct fwlc_tree_elt *allocated_elt)
{
	struct fwlc_tree_elt *tree_elt = NULL;
	int i = 0;

	if (p->lbprm.lb_free_list_nb >= FWLC_MIN_FREE_ENTRIES) {
		while ((tree_elt = MT_LIST_POP(&p->lbprm.lb_free_list, struct fwlc_tree_elt *, free_list)) != NULL) {
			MT_LIST_APPEND(&p->lbprm.lb_free_list, &tree_elt->free_list);
			if (tree_elt->elements == 0) {
				eb32_delete(&tree_elt->lb_node);
				if (i == 0) {
					struct fwlc_tree_elt *tmptree;

					tmptree = MT_LIST_POP(&p->lbprm.lb_free_list, struct fwlc_tree_elt *, free_list);
					/*
					 * Check if the next element still contains servers, and if not,
					 * just free it, to do some cleanup.
					 */
					if (tmptree && tmptree->elements == 0) {
						eb32_delete(&tmptree->lb_node);
						pool_free(pool_head_fwlc_elt, tmptree);
						p->lbprm.lb_free_list_nb--;
					} else if (tmptree)
						MT_LIST_APPEND(&p->lbprm.lb_free_list, &tmptree->free_list);
				}
				return tree_elt;
			}
			i++;
			if (i > 3)
				break;
		}
	}
	if (!allocated_elt) {
		tree_elt = pool_alloc(pool_head_fwlc_elt);
		if (!tree_elt)
			return NULL;
	} else
		tree_elt = allocated_elt;

	for (i = 0; i < FWLC_LISTS_NB; i++) {
		MT_LIST_INIT(&tree_elt->srv_list[i]);
	}
	MT_LIST_INIT(&tree_elt->free_list);
	MT_LIST_APPEND(&p->lbprm.lb_free_list, &tree_elt->free_list);
	p->lbprm.lb_free_list_nb++;
	tree_elt->elements = 0;
	return tree_elt;
}

/*
 * Return the tree element for the provided key, allocate it first if needed.
 * Must be called with the lbprm lock held.
 */
static struct fwlc_tree_elt *fwlc_get_tree_elt(struct server *s, u32 key)
{
	struct eb32_node *node;
	struct fwlc_tree_elt *tree_elt = NULL;

	node = eb32_lookup(s->lb_tree, key);
	if (node)
		tree_elt = container_of(node, struct fwlc_tree_elt, lb_node);
	if (!tree_elt) {
		/* No element available, we have to allocate one */
		tree_elt = fwlc_alloc_tree_elt(s->proxy, NULL);
		if (!tree_elt)
			return NULL;
		tree_elt->lb_node.key = key;
		eb32_insert(s->lb_tree, &tree_elt->lb_node);
	}
	return tree_elt;
}

BUG/MEDIUM: lb-leastconn: Reposition a server using the right eweight
Depending on the context, either the current eweight or the next one must
be used to reposition a server in the tree. When the server state is
updated, for instance its weight, the next eweight must be used because it
is not yet committed. However, when the server is used under normal
conditions, the current eweight must be used.
In fact, it is only a bug in 1.8. On newer versions, changes to a server
are performed synchronously. But it is safer to rely on the right eweight
value to avoid any future bugs.
On 1.8 it is important to do so, because the server state is updated and
committed inside the rendez-vous point. Thus, the next server state may be
out of sync with the current state for a short time, while waiting for all
threads to join the rendez-vous point. It is especially a problem if the
next eweight is set to 0, in which case it must not be used to reposition
the server in the tree, as that would lead to a divide by zero.
This patch must be backported as far as 1.8.
|
|
|
/* Queue a server in its associated tree, assuming the <eweight> is >0.
|
2018-12-14 02:33:28 -05:00
|
|
|
 * Servers are sorted by (#conns+1)/weight. To ensure maximum accuracy,
 * we use (#conns+1)*SRV_EWGHT_MAX/eweight as the sorting key. The reason
 * for using #conns+1 is to sort by weights in case the server is picked
 * and not before it is picked. This provides a better load accuracy for
 * low connection counts when weights differ, and makes sure the round-robin
 * applies between servers of highest weight first. However, servers with no
 * connection are always picked first, so that under low loads, it's not
 * always the single server with the highest weight that gets picked.
 *
 * NOTE: Depending on the calling context, we use s->next_eweight or
 * s->cur_eweight. The next value is used when the server state is updated
 * (because the weight changed, for instance). During this step, the server
 * state is not yet committed. The current value is used to reposition the
 * server in the tree. This happens when the server is used.
 *
 * The lbprm's lock must be held.
 */
static inline void fwlc_queue_srv(struct server *s, unsigned int eweight)
{
	struct fwlc_tree_elt *tree_elt;
	unsigned int inflight = _HA_ATOMIC_LOAD(&s->served) + _HA_ATOMIC_LOAD(&s->queueslength);
	unsigned int list_nb;
	u32 key;

	key = inflight ? (inflight + 1) * SRV_EWGHT_MAX / eweight : 0;
	tree_elt = fwlc_get_tree_elt(s, key);
	if (tree_elt == NULL) {
		/*
		 * We failed to allocate memory for the tree_elt, so just stop
		 * now and schedule the requeue tasklet, which will take care
		 * of the queueing later.
		 * If the tasklet doesn't exist yet, then there is nothing to
		 * do, as it will eventually be scheduled after being created.
		 */
		if (s->requeue_tasklet)
			tasklet_wakeup(s->requeue_tasklet);
		return;
	}
	list_nb = statistical_prng_range(FWLC_LISTS_NB);
	MT_LIST_APPEND(&tree_elt->srv_list[list_nb], &s->lb_mt_list);
	s->tree_elt = tree_elt;
	_HA_ATOMIC_INC(&tree_elt->elements);
	if (FWLC_LBPRM_SMALLEST(s->proxy->lbprm.lb_seq) > key) {
		/*
		 * We're holding the lbprm lock, so this should never fail,
		 * as nobody should be around to modify it.
		 */
		do {
		} while (fwlc_set_seq_and_smallest(&s->proxy->lbprm, s->proxy->lbprm.lb_seq, FWLC_LBPRM_SEQ(s->proxy->lbprm.lb_seq) + 1, key) == 0);
	}
}

/*
 * Loop across the different lists until we find an unlocked one, and lock it.
 */
static __inline struct mt_list fwlc_lock_target_list(struct fwlc_tree_elt *tree_elt)
{
	struct mt_list list = {NULL, NULL};
	int i;
	int dst_list;

	dst_list = statistical_prng_range(FWLC_LISTS_NB);

	while (list.next == NULL) {
		for (i = 0; i < FWLC_LISTS_NB; i++) {
			list = mt_list_try_lock_prev(&tree_elt->srv_list[(dst_list + i) % FWLC_LISTS_NB]);
			if (list.next != NULL)
				break;
		}
	}
	return list;
}
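The lock-scanning strategy above can be illustrated with a standalone sketch. This is a hedged, generic illustration only, not HAProxy's implementation: it uses pthread mutexes instead of mt_list locking, and `NB_LISTS` is an assumed stand-in for FWLC_LISTS_NB.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NB_LISTS 4 /* assumed stand-in for FWLC_LISTS_NB */

/* Try-lock each list starting from a random index, wrapping around, and
 * retry the whole scan until one lock is acquired; return its index. */
static int lock_one_of(pthread_mutex_t locks[NB_LISTS])
{
	int start = rand() % NB_LISTS;
	int i;

	for (;;) {
		for (i = 0; i < NB_LISTS; i++) {
			int idx = (start + i) % NB_LISTS;
			if (pthread_mutex_trylock(&locks[idx]) == 0)
				return idx;
		}
	}
}
```

Starting the scan at a random offset spreads contending threads across the lists instead of having them all queue on list 0, which is the point of using several lists per tree element.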

/*
 * Calculate the key to be used for a given server.
 */
static inline unsigned int fwlc_get_key(struct server *s)
{
	unsigned int inflight;
	unsigned int eweight;
	unsigned int new_key;

	inflight = _HA_ATOMIC_LOAD(&s->served) + _HA_ATOMIC_LOAD(&s->queueslength);
	eweight = _HA_ATOMIC_LOAD(&s->cur_eweight);
	new_key = inflight ? (inflight + 1) * SRV_EWGHT_MAX / (eweight ? eweight : 1) : 0;

	return new_key;
}
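As a hedged standalone illustration of the key arithmetic computed above (a minimal sketch, not HAProxy's code; `EXAMPLE_EWGHT_MAX` is an assumed stand-in for the real SRV_EWGHT_MAX constant):

```c
#include <assert.h>

/* Assumed stand-in for SRV_EWGHT_MAX; the real value comes from HAProxy. */
#define EXAMPLE_EWGHT_MAX 4096u

/* Sorting key: (#conns+1)*SRV_EWGHT_MAX/eweight, with idle servers pinned
 * to key 0 so they are always picked first under low load, and a zero
 * eweight clamped to 1 to avoid a division by zero. */
static unsigned int example_key(unsigned int inflight, unsigned int eweight)
{
	return inflight ? (inflight + 1) * EXAMPLE_EWGHT_MAX / (eweight ? eweight : 1) : 0;
}
```

With one in-flight connection each, a weight-200 server gets a smaller key than a weight-100 one, so among equally loaded servers the higher weight sorts first, as the header comment describes.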

/*
 * Only one thread will try to update a server's position at a given time,
 * thanks to the lb_lock. However, that means that by the time we are done
 * with the update, a new one might be needed, so check for that and
 * schedule the tasklet if needed, once we have dropped the lock.
 */
static inline void fwlc_check_srv_key(struct server *s, unsigned int expected)
{
	unsigned int key = fwlc_get_key(s);

	if (key != expected && s->requeue_tasklet)
		tasklet_wakeup(s->requeue_tasklet);
}

/* Re-position the server in the FWLC tree after it has been assigned one
 * connection or after it has released one. Note that it is possible that
 * the server has been moved out of the tree due to failed health-checks.
 * The lbprm's lock will be used.
 */
static void fwlc_srv_reposition(struct server *s)
{
	struct mt_list to_unlock;
	struct fwlc_tree_elt *tree_elt = NULL, *allocated_elt = NULL;
	struct eb32_node *node;
	struct mt_list list;
	uint64_t cur_seq = 0;
	unsigned int eweight = _HA_ATOMIC_LOAD(&s->cur_eweight);
	unsigned int new_key;
	unsigned int smallest;
	int srv_lock;

	HA_RWLOCK_RDLOCK(LBPRM_LOCK, &s->proxy->lbprm.lock);
	new_key = fwlc_get_key(s);
	/* Some calls will be made for no change (e.g. connect_server() after
	 * assign_server()). Let's check that first.
	 */
	if ((s->tree_elt && s->tree_elt->lb_node.node.leaf_p && eweight &&
	     s->tree_elt->lb_node.key == new_key) || !s->lb_tree) {
		HA_RWLOCK_RDUNLOCK(LBPRM_LOCK, &s->proxy->lbprm.lock);
		return;
	}

	srv_lock = HA_ATOMIC_XCHG(&s->lb_lock, 1);
	/* Somebody else is updating that server, give up */
	if (srv_lock == 1) {
		HA_RWLOCK_RDUNLOCK(LBPRM_LOCK, &s->proxy->lbprm.lock);
		return;
	}
|
OPTIM: lb-leastconn: do not unlink the server if it did not change
Due to the two-phase server reservation, there are 3 calls to
fwlc_srv_reposition() per request, one during assign_server() to reserve
the slot, one in connect_server() to commit it, and one in process_stream()
to release it. However only one of the first two will change the key, so
it's needlessly costly to take the lock, remove a server and insert it
again at the same place when we can already figure we ought not to even
have taken the lock.
Furthermore, even when the server needs to move, there can be quite some
contention on the lbprm lock forcing the thread to wait. During this time
the served and nbpend server values might have changed, just like the
lb_node.key itself. Thus we measure the values again under the lock
before updating the tree. Measurements have shown that under contention
with 16 servers and 16 threads, 50% of the updates can be avoided there.
This patch makes the function compute the new key and compare it to
the current one before deciding to move the entry (and does it again
under the lock forthe second test).
This removes between 40 and 50% of the tree updates depending on the
thread contention and the number of servers. The performance gain due
to this (on 16 threads) was:
16 servers: 415 krps -> 440 krps (6%, contention on lbprm)
4 servers: 554 krps -> 714 krps (+29%, little contention)
One point worth thinking about is that it's not logic to update the
tree 2-3 times per request while it's only read once. half to 2/3 of
these updates are not needed. An experiment consisting in subscribing
the server to a list and letting the readers reinsert them on the fly
showed further degradation instead of an improvement.
A better approach would probably consist in avoinding writes to shared
cache lines by having the leastconn nodes distinct from the servers,
with one node per value, and having them hold an mt-list with all the
servers having that number of connections. The connection count tree
would then be read-mostly instead of facing heavy writes, and most
write operations would be performed on 1-3 list heads which are way
cheaper to migrate than a tree node, and do not require updating the
last two updated neighbors' cache lines.
2021-02-17 10:26:55 -05:00
|
|
|
|
MAJOR: leastconn: Revamp the way servers are ordered.
For leastconn, servers used to be stored directly in an ebtree,
each server being one node.
Change that so that nodes contain multiple mt_lists. Each list
contains servers that share the same key (typically meaning
they have the same number of connections). Using mt_lists means
that as long as tree elements already exist, moving a server from
one tree element to another no longer requires the lbprm write
lock.
We use multiple mt_lists to reduce contention when moving
a server from one tree element to another. A list in the new
element is chosen randomly.
We no longer remove a tree element as soon as it no longer
contains any server. Instead, we keep a list of all elements,
and when we need a new element we take one from that list if
it already holds enough of them, otherwise we allocate a new
one. Keeping nodes in the tree ensures that we very
rarely have to take the lbprm write lock (as it only happens
when we're moving the server to a position for which no
element is currently in the tree).
The number of mt_lists used is defined as FWLC_NB_LISTS.
The number of tree elements we want to keep is defined as
FWLC_MIN_FREE_ENTRIES, both in defaults.h.
The values used were picked after experimentation, and
seem to be the best trade-off between performance and memory
usage.
Doing that gives a good performance boost when a lot of
servers are used.
With a configuration using 500 servers, before this patch
about 830000 requests per second could be processed; with
this patch, about 1550000 requests per second are
processed, on a 64-core AMD machine, using 1200 concurrent connections.
2025-03-26 12:16:48 -04:00
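As a hedged miniature of the structure this commit describes, the sketch below uses a single list per connection count instead of a tree element holding FWLC_NB_LISTS mt_lists; the names (`struct srv`, `struct bucket`, `srv_move`, `MAX_CONNS`) are invented for illustration and are not HAProxy's. The point it shows is that moving a server between two counts only rewrites two list heads, leaving the count index itself untouched.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONNS 16

struct srv {
	struct srv *prev, *next;
	int conns;
};

struct bucket {
	struct srv *head;  /* servers currently at this connection count */
};

static struct bucket buckets[MAX_CONNS + 1];

static void srv_unlink(struct srv *s)
{
	if (s->prev)
		s->prev->next = s->next;
	else
		buckets[s->conns].head = s->next;
	if (s->next)
		s->next->prev = s->prev;
	s->prev = s->next = NULL;
}

static void srv_link(struct srv *s, int conns)
{
	s->conns = conns;
	s->prev = NULL;
	s->next = buckets[conns].head;
	if (s->next)
		s->next->prev = s;
	buckets[conns].head = s;
}

/* moving a server only touches two list heads, never the index itself */
static void srv_move(struct srv *s, int new_conns)
{
	srv_unlink(s);
	srv_link(s, new_conns);
}
```

In the real structure the buckets are tree elements kept in an eb32 tree keyed by the server key, and each holds several mt_lists to spread contention; the two-list-head property is the same.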
	node = eb32_lookup(s->lb_tree, new_key);
	if (node)
		tree_elt = container_of(node, struct fwlc_tree_elt, lb_node);

	/*
	 * It is possible that s->tree_elt was changed since we checked.
	 * As s->tree_elt is only changed while holding s->lb_lock,
	 * check again now that we acquired it, and if we're using
	 * the right element, do nothing.
	 */
	if (s->tree_elt && tree_elt == s->tree_elt) {
		HA_RWLOCK_RDUNLOCK(LBPRM_LOCK, &s->proxy->lbprm.lock);
		_HA_ATOMIC_STORE(&s->lb_lock, 0);
		fwlc_check_srv_key(s, new_key);
		return;
	}

	/*
	 * We have to allocate a new tree element and/or remove the
	 * previous one; we will modify the tree, so let's get the write
	 * lock.
	 */
	if (!tree_elt) {
		unsigned int new_new_key;

		/*
		 * We don't want to allocate something while holding the lock,
		 * so make sure we have something allocated before.
		 */
		if (s->free_elt != NULL) {
			allocated_elt = s->free_elt;
			s->free_elt = NULL;
		} else
			allocated_elt = pool_alloc(pool_head_fwlc_elt);
		if (HA_RWLOCK_TRYRDTOWR(LBPRM_LOCK, &s->proxy->lbprm.lock) != 0) {
			/* there's already some contention on the tree's lock, there's
			 * no point insisting. Better wake up the server's tasklet that
			 * will let this or another thread retry later. For the time
			 * being, the server's apparent load is slightly inaccurate but
			 * we don't care, if there is contention, it will self-regulate.
			 */
			if (s->requeue_tasklet)
				tasklet_wakeup(s->requeue_tasklet);
			HA_RWLOCK_RDUNLOCK(LBPRM_LOCK, &s->proxy->lbprm.lock);
			s->free_elt = allocated_elt;
			_HA_ATOMIC_STORE(&s->lb_lock, 0);
			return;
		}
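The trylock-or-defer pattern used here can be sketched in miniature. This is a hedged, single-threaded illustration with invented names (`try_lock`, `retry_pending`, `reposition_attempt`): HAProxy actually upgrades a read lock on the lbprm rwlock and wakes a per-server tasklet, but the shape is the same, and deferred requests merge into one later update.

```c
#include <assert.h>

/* toy stand-ins for the lbprm lock and the server's requeue tasklet */
static int tree_locked;       /* 0 = free, 1 = held */
static int retry_pending;     /* plays the role of the tasklet wakeup */
static int repositions_done;

static int try_lock(void)  { if (tree_locked) return 0; tree_locked = 1; return 1; }
static void unlock(void)   { tree_locked = 0; }

static void reposition_attempt(void)
{
	if (!try_lock()) {
		/* contended: defer rather than wait; several deferred
		 * requests merge into a single pending retry */
		retry_pending = 1;
		return;
	}
	repositions_done++;
	unlock();
}

static void retry_task(void)
{
	if (!retry_pending)
		return;
	retry_pending = 0;
	if (try_lock()) { /* a real tasklet would re-arm itself on failure */
		repositions_done++;
		unlock();
	}
}
```

During the deferral window the server's position in the tree is slightly stale, which leastconn tolerates: picking the same server twice in a row costs nothing compared to serializing every thread on the lock.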

		/* we might have been waiting for a while on the lock above,
		 * so it's worth testing again because other threads are very
		 * likely to have released a connection or taken one leading
		 * to our target value (50% of the cases in measurements).
		 */
		new_new_key = fwlc_get_key(s);
		if (new_new_key != new_key) {
			if (s->tree_elt &&
			    s->tree_elt->lb_node.node.leaf_p &&
			    eweight && s->tree_elt->lb_node.key == new_new_key) {
				/* Okay after all we have nothing to do */
				HA_RWLOCK_WRUNLOCK(LBPRM_LOCK, &s->proxy->lbprm.lock);
				s->free_elt = allocated_elt;
				_HA_ATOMIC_STORE(&s->lb_lock, 0);
				fwlc_check_srv_key(s, new_new_key);
				return;
			}
			node = eb32_lookup(s->lb_tree, new_new_key);
			if (node) {
				tree_elt = container_of(node, struct fwlc_tree_elt, lb_node);
				HA_RWLOCK_WRTORD(LBPRM_LOCK, &s->proxy->lbprm.lock);
				s->free_elt = allocated_elt;
				allocated_elt = NULL;
			} else
				tree_elt = NULL;
			new_key = new_new_key;
		}
	}

	/*
	 * Now we increment the number of elements in the new tree_elt,
	 * we change our sequence number and smallest, and we then
	 * decrement the number of elements in the old tree_elt.
	 * It is important to keep this sequencing, as fwlc_get_next_server()
	 * uses the number of elements to know if there is something to look for,
	 * and we want to make sure we do not miss a server.
	 */
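The sequencing constraint above can be shown with two toy counters (illustrative names, not HAProxy's fields): the server is counted in the new element before it is uncounted from the old one, so a concurrent reader summing the element counters never observes the moving server in neither place.

```c
#include <assert.h>
#include <stdatomic.h>

/* toy stand-ins for the 'elements' fields of the old and new tree
 * elements; the server starts out counted in the old one */
static _Atomic int old_elements = 1;
static _Atomic int new_elements;

static int visible_total(void) /* what a concurrent reader could sum */
{
	return atomic_load(&new_elements) + atomic_load(&old_elements);
}

static void move_server(void)
{
	atomic_fetch_add(&new_elements, 1); /* step 1: count in new element */
	assert(visible_total() >= 1);       /* the server is never invisible */
	atomic_fetch_sub(&old_elements, 1); /* step 2: only then uncount old */
}
```

Reversing the two steps would open a window where both counters read zero and fwlc_get_next_server() could conclude there is nothing to look for.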
	if (!tree_elt) {
		/*
		 * There was no tree element matching our key,
		 * allocate one and insert it into the tree.
		 */
		tree_elt = fwlc_alloc_tree_elt(s->proxy, allocated_elt);
		if (tree_elt == NULL) {
			/* We failed to allocate memory, just try again later */
			HA_RWLOCK_RDUNLOCK(LBPRM_LOCK, &s->proxy->lbprm.lock);
			_HA_ATOMIC_STORE(&s->lb_lock, 0);
			if (s->requeue_tasklet)
				tasklet_wakeup(s->requeue_tasklet);
			return;
		}
		if (tree_elt == allocated_elt)
			allocated_elt = NULL;
		tree_elt->lb_node.key = new_key;
		tree_elt->elements = 1;
		__ha_barrier_store();
		/* If we allocated, then we hold the write lock */
		eb32_insert(s->lb_tree, &tree_elt->lb_node);
		HA_RWLOCK_WRTORD(LBPRM_LOCK, &s->proxy->lbprm.lock);
	} else {
		_HA_ATOMIC_INC(&tree_elt->elements);
	}
|
|
|
|
|
|
|
|
|
|
__ha_barrier_store();
|
|
|
|
|
	/*
	 * Update the sequence number, and the smallest if needed.
	 * We always have to do it, even if we're not actually
	 * updating the smallest one, otherwise we'll get an
	 * ABA problem and a server may be missed when looked up.
	 * The only time we don't have to do it is if another thread
	 * increased it, and the new smallest element is not
	 * higher than our new key.
	 */
	do {
		unsigned int tmpsmallest;
		uint64_t newcurseq = _HA_ATOMIC_LOAD(&s->proxy->lbprm.lb_seq);

		if (cur_seq != 0 && FWLC_LBPRM_SEQ(newcurseq) >
		    FWLC_LBPRM_SEQ(cur_seq) && new_key >= FWLC_LBPRM_SMALLEST(newcurseq))
			break;

		cur_seq = newcurseq;
		tmpsmallest = FWLC_LBPRM_SMALLEST(cur_seq);
		if (new_key > tmpsmallest)
			smallest = tmpsmallest;
		else
			smallest = new_key;

	} while (fwlc_set_seq_and_smallest(&s->proxy->lbprm, cur_seq, FWLC_LBPRM_SEQ(cur_seq) + 1, smallest) == 0 && __ha_cpu_relax());

	__ha_barrier_store();
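The retry loop above relies on the sequence number and the smallest key sharing a single atomically-updated 64-bit word, so both can be read and compare-and-swapped together, which is what prevents the ABA problem the comment mentions. A minimal sketch, assuming a low/high 32-bit split; the `SK_*` names are hypothetical stand-ins for the real `FWLC_LBPRM_SEQ()` and `FWLC_LBPRM_SMALLEST()` macros in the HAProxy headers:

```c
#include <stdint.h>

/* Low 32 bits: monotonic sequence number; high 32 bits: smallest key.
 * Packing both into one word lets a single 64-bit CAS update them
 * consistently. Layout is an assumption for illustration. */
#define SK_SEQ(v)      ((uint32_t)((v) & 0xffffffffULL))
#define SK_SMALLEST(v) ((uint32_t)((v) >> 32))

static inline uint64_t sk_pack(uint32_t seq, uint32_t smallest)
{
	return ((uint64_t)smallest << 32) | seq;
}
```

Because the sequence occupies the low bits, any concurrent update changes the packed value even when the smallest key is unchanged, so a stale CAS always fails and the loop retries with fresh data.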
	if (likely(s->tree_elt)) {
		_HA_ATOMIC_DEC(&s->tree_elt->elements);
		/*
		 * Now lock the existing element, and its target list.
		 * To prevent a deadlock, we always lock the one
		 * with the lowest key first.
		 */
		if (new_key < s->tree_elt->lb_node.key) {
			to_unlock = mt_list_lock_full(&s->lb_mt_list);
			list = fwlc_lock_target_list(tree_elt);
		} else {
			list = fwlc_lock_target_list(tree_elt);
			to_unlock = mt_list_lock_full(&s->lb_mt_list);
		}
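The branch above applies the classic deadlock-avoidance pattern: when two locks must both be held, every thread acquires them in a globally consistent order (lowest key first), so two threads moving servers between the same pair of elements in opposite directions can never each hold one lock while waiting for the other. A small sketch with hypothetical names, deciding only which of two keyed locks to take first:

```c
#include <stdint.h>

/* Returns 0 when the lock guarding key_a must be taken first,
 * 1 when the lock guarding key_b must be. Every thread applying
 * this rule acquires the pair in the same global order, which
 * makes a circular wait impossible. */
static inline int sketch_lock_first(uint32_t key_a, uint32_t key_b)
{
	return key_a < key_b ? 0 : 1;
}
```

On a tie the rule still picks the same side for every thread, so the ordering stays total and the no-deadlock guarantee holds.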
		/*
		 * Unlock the old list, the element is now
		 * no longer in it.
		 */
		mt_list_unlock_link(to_unlock);
	} else
		list = fwlc_lock_target_list(tree_elt);
	/*
	 * Add the element to the new list, and unlock it.
	 */
	mt_list_unlock_full(&s->lb_mt_list, list);

	s->tree_elt = tree_elt;

	HA_RWLOCK_RDUNLOCK(LBPRM_LOCK, &s->proxy->lbprm.lock);
	if (allocated_elt)
		s->free_elt = allocated_elt;

	__ha_barrier_store();
	_HA_ATOMIC_STORE(&s->lb_lock, 0);

	fwlc_check_srv_key(s, new_key);
}
/* This function updates the server trees according to server <srv>'s new
 * state. It should be called when server <srv>'s status changes to down.
 * It is not important whether the server was already down or not. It is not
 * important either that the new state is completely down (the caller may not
 * know all the variables of a server's state).
 *
 * The server's lock must be held. The lbprm's lock will be used.
 */
static void fwlc_set_server_status_down(struct server *srv)
{
	struct proxy *p = srv->proxy;

	if (!srv_lb_status_changed(srv))
		return;

	if (srv_willbe_usable(srv))
		goto out_update_state;

	HA_RWLOCK_WRLOCK(LBPRM_LOCK, &p->lbprm.lock);

	if (!srv_currently_usable(srv))
		/* server was already down */
		goto out_update_backend;

	if (srv->flags & SRV_F_BACKUP) {
		p->lbprm.tot_wbck -= srv->cur_eweight;
		p->srv_bck--;

		if (srv == p->lbprm.fbck) {
			/* we lost the first backup server in a single-backup
			 * configuration, we must search another one.
			 */
			struct server *srv2 = p->lbprm.fbck;
			do {
				srv2 = srv2->next;
			} while (srv2 &&
				 !((srv2->flags & SRV_F_BACKUP) &&
				   srv_willbe_usable(srv2)));
			p->lbprm.fbck = srv2;
		}
	} else {
		p->lbprm.tot_wact -= srv->cur_eweight;
		p->srv_act--;
	}

	fwlc_dequeue_srv(srv);
	fwlc_remove_from_tree(srv);

 out_update_backend:
	/* check/update tot_used, tot_weight */
	update_backend_weight(p);
	HA_RWLOCK_WRUNLOCK(LBPRM_LOCK, &p->lbprm.lock);

 out_update_state:
	srv_lb_commit_status(srv);
}
/* This function updates the server trees according to server <srv>'s new
 * state. It should be called when server <srv>'s status changes to up.
 * It is not important whether the server was already down or not. It is not
 * important either that the new state is completely UP (the caller may not
 * know all the variables of a server's state). This function will not change
 * the weight of a server which was already up.
 *
 * The server's lock must be held. The lbprm's lock will be used.
 */
static void fwlc_set_server_status_up(struct server *srv)
{
	struct proxy *p = srv->proxy;

	if (!srv_lb_status_changed(srv))
		return;

	if (!srv_willbe_usable(srv))
		goto out_update_state;

	HA_RWLOCK_WRLOCK(LBPRM_LOCK, &p->lbprm.lock);

	if (srv_currently_usable(srv))
		/* server was already up */
		goto out_update_backend;

	if (srv->flags & SRV_F_BACKUP) {
		srv->lb_tree = &p->lbprm.fwlc.bck;
		p->lbprm.tot_wbck += srv->next_eweight;
		p->srv_bck++;

		if (!(p->options & PR_O_USE_ALL_BK)) {
			if (!p->lbprm.fbck) {
				/* there was no backup server anymore */
				p->lbprm.fbck = srv;
			} else {
				/* we may have restored a backup server prior to fbck,
				 * in which case it should replace it.
				 */
				struct server *srv2 = srv;
				do {
					srv2 = srv2->next;
				} while (srv2 && (srv2 != p->lbprm.fbck));
				if (srv2)
					p->lbprm.fbck = srv;
			}
		}
	} else {
		srv->lb_tree = &p->lbprm.fwlc.act;
		p->lbprm.tot_wact += srv->next_eweight;
		p->srv_act++;
	}

	/* note that eweight cannot be 0 here */
BUG/MEDIUM: lb-leastconn: Reposition a server using the right eweight

Depending on the context, either the current eweight or the next one must
be used to reposition a server in the tree. When the server state is
updated, for instance its weight, the next eweight must be used because
it is not yet committed. However, when the server is used under normal
conditions, the current eweight must be used.

In fact, it is only a bug on 1.8. On newer versions, changes to a server
are performed synchronously. But it is safer to rely on the right eweight
value to avoid any future bugs.

On 1.8, it is important to do so because the server state is updated and
committed inside the rendez-vous point. Thus, the next server state may
be out of sync with the current state for a short time, while all threads
join the rendez-vous point. It is especially a problem if the next
eweight is set to 0, in which case it must not be used to reposition the
server in the tree, as that would lead to a divide by zero.

This patch must be backported as far as 1.8.
2020-12-11 09:36:01 -05:00
	fwlc_queue_srv(srv, srv->next_eweight);

 out_update_backend:
	/* check/update tot_used, tot_weight */
	update_backend_weight(p);
	HA_RWLOCK_WRUNLOCK(LBPRM_LOCK, &p->lbprm.lock);

 out_update_state:
	srv_lb_commit_status(srv);
}
/* This function must be called after an update to server <srv>'s effective
 * weight. It may be called after a state change too.
 *
 * The server's lock must be held. The lbprm's lock will be used.
 */
static void fwlc_update_server_weight(struct server *srv)
{
	int old_state, new_state;
	struct proxy *p = srv->proxy;

	if (!srv_lb_status_changed(srv))
		return;

	/* If changing the server's weight changes its state, we simply apply
	 * the procedures we already have for status change. If the state
	 * remains down, the server is not in any tree, so it's as easy as
	 * updating its values. If the state remains up with different weights,
	 * there are some computations to perform to find a new place and
	 * possibly a new tree for this server.
	 */
	old_state = srv_currently_usable(srv);
	new_state = srv_willbe_usable(srv);

	if (!old_state && !new_state) {
		srv_lb_commit_status(srv);
		return;
	}
	else if (!old_state && new_state) {
		fwlc_set_server_status_up(srv);
		return;
	}
	else if (old_state && !new_state) {
		fwlc_set_server_status_down(srv);
		return;
	}

	HA_RWLOCK_WRLOCK(LBPRM_LOCK, &p->lbprm.lock);

	if (srv->lb_tree)
		fwlc_dequeue_srv(srv);

	if (srv->flags & SRV_F_BACKUP) {
		p->lbprm.tot_wbck += srv->next_eweight - srv->cur_eweight;
		srv->lb_tree = &p->lbprm.fwlc.bck;
	} else {
		p->lbprm.tot_wact += srv->next_eweight - srv->cur_eweight;
		srv->lb_tree = &p->lbprm.fwlc.act;
	}

	fwlc_queue_srv(srv, srv->next_eweight);

	update_backend_weight(p);
	HA_RWLOCK_WRUNLOCK(LBPRM_LOCK, &p->lbprm.lock);

	srv_lb_commit_status(srv);
}
/* This function is responsible for building the trees in case of fast
 * weighted least-conns. It also sets p->lbprm.wdiv to the eweight to
 * uweight ratio. Both active and backup groups are initialized.
 */
void fwlc_init_server_tree(struct proxy *p)
{
	struct server *srv;
	struct eb_root init_head = EB_ROOT;

	p->lbprm.set_server_status_up = fwlc_set_server_status_up;
	p->lbprm.set_server_status_down = fwlc_set_server_status_down;
	p->lbprm.update_server_eweight = fwlc_update_server_weight;
	p->lbprm.server_take_conn = fwlc_srv_reposition;
	p->lbprm.server_drop_conn = fwlc_srv_reposition;
	p->lbprm.server_requeue = fwlc_srv_reposition;
	p->lbprm.server_deinit = fwlc_server_deinit;
	p->lbprm.proxy_deinit = fwlc_proxy_deinit;

	p->lbprm.wdiv = BE_WEIGHT_SCALE;
	for (srv = p->srv; srv; srv = srv->next) {
		srv->next_eweight = (srv->uweight * p->lbprm.wdiv + p->lbprm.wmult - 1) / p->lbprm.wmult;
		srv_lb_commit_status(srv);
	}

	p->lbprm.lb_seq = 0;

	recount_servers(p);
	update_backend_weight(p);

	p->lbprm.fwlc.act = init_head;
	p->lbprm.fwlc.bck = init_head;

	/* queue active and backup servers in two distinct groups */
	for (srv = p->srv; srv; srv = srv->next) {
		if (!srv_currently_usable(srv))
			continue;
		srv->lb_tree = (srv->flags & SRV_F_BACKUP) ? &p->lbprm.fwlc.bck : &p->lbprm.fwlc.act;
		fwlc_queue_srv(srv, srv->next_eweight);
	}
}
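The `next_eweight` computation in the first loop above is a rounding-up integer division, scaling the user weight by the wdiv/wmult ratio with the classic `(a + b - 1) / b` idiom so a non-zero user weight never rounds down to zero. A small sketch with a hypothetical helper name:

```c
/* next_eweight = ceil(uweight * wdiv / wmult), computed in pure
 * integer arithmetic. Rounding up guarantees a usable server with
 * uweight >= 1 always keeps an effective weight >= 1. */
static inline unsigned int sketch_scale_weight(unsigned int uweight,
                                               unsigned int wdiv,
                                               unsigned int wmult)
{
	return (uweight * wdiv + wmult - 1) / wmult;
}
```

Rounding down instead would risk a zero effective weight whenever `wmult` exceeds `uweight * wdiv`, and a zero weight later used as a divisor is exactly the divide-by-zero hazard the eweight fix above warns about.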
|
|
|
|
|
|
|
|
|
|
/* Return next server from the FWLC tree in backend <p>. If the tree is empty,
|
|
|
|
|
* return NULL. Saturated servers are skipped.
|
2018-08-21 13:44:53 -04:00
|
|
|
*
|
2021-06-01 10:58:31 -04:00
|
|
|
* The lbprm's lock will be used in R/O mode. The server's lock is not used.
|
2009-10-01 05:19:37 -04:00
|
|
|
*/
|
|
|
|
|
struct server *fwlc_get_next_server(struct proxy *p, struct server *srvtoavoid)
|
|
|
|
|
{
|
|
|
|
|
struct server *srv, *avoided;
|
|
|
|
|
struct eb32_node *node;
|
MAJOR: leastconn; Revamp the way servers are ordered.
For leastconn, servers used to just be stored in an ebtree.
Each server would be one node.
Change that so that nodes contain multiple mt_lists. Each list
will contain servers that share the same key (typically meaning
they have the same number of connections). Using mt_lists means
that as long as tree elements already exist, moving a server from
one tree element to another does no longer require the lbprm write
lock.
We use multiple mt_lists to reduce the contention when moving
a server from one tree element to another. A list in the new
element will be chosen randomly.
We no longer remove a tree element as soon as they no longer
contain any server. Instead, we keep a list of all elements,
and when we need a new element, we look at that list only if it
contains a number of elements already, otherwise we'll allocate
a new one. Keeping nodes in the tree ensures that we very
rarely have to take the lbrpm write lock (as it only happens
when we're moving the server to a position for which no
element is currently in the tree).
The number of mt_lists used is defined as FWLC_NB_LISTS.
The number of tree elements we want to keep is defined as
FWLC_MIN_FREE_ENTRIES, both in defaults.h.
The value used were picked afrer experimentation, and
seems to be the best choice of performances vs memory
usage.
Doing that gives a good boost in performances when a lot of
servers are used.
With a configuration using 500 servers, before that patch,
about 830000 requests per second could be processed, with
that patch, about 1550000 requests per second are
processed, on an 64-cores AMD, using 1200 concurrent connections.
2025-03-26 12:16:48 -04:00
|
|
|
uint64_t curseq;
|
|
|
|
|
int found = 0;
|
2009-10-01 05:19:37 -04:00
|
|
|
|
|
|
|
|
srv = avoided = NULL;
|
|
|
|
|
|
2020-10-17 13:32:09 -04:00
|
|
|
HA_RWLOCK_RDLOCK(LBPRM_LOCK, &p->lbprm.lock);
|
MAJOR: leastconn; Revamp the way servers are ordered.
For leastconn, servers used to just be stored in an ebtree.
Each server would be one node.
Change that so that nodes contain multiple mt_lists. Each list
will contain servers that share the same key (typically meaning
they have the same number of connections). Using mt_lists means
that as long as tree elements already exist, moving a server from
one tree element to another does no longer require the lbprm write
lock.
We use multiple mt_lists to reduce the contention when moving
a server from one tree element to another. A list in the new
element will be chosen randomly.
We no longer remove a tree element as soon as they no longer
contain any server. Instead, we keep a list of all elements,
and when we need a new element, we look at that list only if it
contains a number of elements already, otherwise we'll allocate
a new one. Keeping nodes in the tree ensures that we very
rarely have to take the lbrpm write lock (as it only happens
when we're moving the server to a position for which no
element is currently in the tree).
The number of mt_lists used is defined as FWLC_NB_LISTS.
The number of tree elements we want to keep is defined as
FWLC_MIN_FREE_ENTRIES, both in defaults.h.
The value used were picked afrer experimentation, and
seems to be the best choice of performances vs memory
usage.
Doing that gives a good boost in performances when a lot of
servers are used.
With a configuration using 500 servers, before that patch,
about 830000 requests per second could be processed, with
that patch, about 1550000 requests per second are
processed, on an 64-cores AMD, using 1200 concurrent connections.
2025-03-26 12:16:48 -04:00
|
|
|
curseq = _HA_ATOMIC_LOAD(&p->lbprm.lb_seq);
|
|
|
|
|
redo:
|
2009-10-01 05:19:37 -04:00
|
|
|
if (p->srv_act)
|
MAJOR: leastconn: Revamp the way servers are ordered.
For leastconn, servers used to be stored directly in an ebtree, with
one node per server.
Change that so that nodes contain multiple mt_lists. Each list
contains servers that share the same key (typically meaning they have
the same number of connections). Using mt_lists means that as long as
the tree elements already exist, moving a server from one tree element
to another no longer requires the lbprm write lock.
We use multiple mt_lists to reduce the contention when moving a server
from one tree element to another. A list in the new element is chosen
randomly.
We no longer remove a tree element as soon as it no longer contains
any server. Instead, we keep a list of all elements, and when we need
a new element, we take one from that list if it already contains
enough entries, otherwise we allocate a new one. Keeping nodes in the
tree ensures that we very rarely have to take the lbprm write lock
(it only happens when we're moving a server to a position for which
no element is currently in the tree).
The number of mt_lists used is defined as FWLC_LISTS_NB, and the
number of free tree elements we want to keep as FWLC_MIN_FREE_ENTRIES,
both in defaults.h. The values used were picked after experimentation
and seem to be the best compromise between performance and memory
usage.
This gives a good performance boost when many servers are used: with a
configuration using 500 servers, about 830000 requests per second
could be processed before this patch, and about 1550000 requests per
second with it, on a 64-core AMD machine using 1200 concurrent
connections.
2025-03-26 12:16:48 -04:00
		node = eb32_lookup_ge(&p->lbprm.fwlc.act, FWLC_LBPRM_SMALLEST(curseq));
	else if (p->lbprm.fbck) {
		srv = p->lbprm.fbck;
		goto out;
	}
	else if (p->srv_bck)
		node = eb32_lookup_ge(&p->lbprm.fwlc.bck, FWLC_LBPRM_SMALLEST(curseq));
	else {
		srv = NULL;
		goto out;
	}

	while (node) {
		struct fwlc_tree_elt *tree_elt;
		struct server *s;
		int unusable = 0;
		int orig_nb;
		int i = 0;

		tree_elt = eb32_entry(node, struct fwlc_tree_elt, lb_node);
		orig_nb = statistical_prng_range(FWLC_LISTS_NB);

		while (_HA_ATOMIC_LOAD(&tree_elt->elements) > unusable) {
			struct mt_list mt_list;

			mt_list.next = _HA_ATOMIC_LOAD(&tree_elt->srv_list[(i + orig_nb) % FWLC_LISTS_NB].next);

			if (mt_list.next != &tree_elt->srv_list[(i + orig_nb) % FWLC_LISTS_NB] && mt_list.next != MT_LIST_BUSY) {
				unsigned int eweight;
				unsigned int planned_inflight;

				s = container_of(mt_list.next, struct server, lb_mt_list);
				eweight = _HA_ATOMIC_LOAD(&s->cur_eweight);

				planned_inflight = tree_elt->lb_node.key * eweight / SRV_EWGHT_MAX;

				if (!s->maxconn || s->served + s->queueslength < srv_dynamic_maxconn(s) + s->maxqueue) {
					if (_HA_ATOMIC_LOAD(&s->served) + _HA_ATOMIC_LOAD(&s->queueslength) > planned_inflight + 2) {
BUG/MAJOR: leastconn: never reuse the node after dropping the lock
On ARM with 80 cores and a single server, it's sometimes possible to
see a segfault in fwlc_get_next_server() around 600-700k RPS. It
seldom happens as well on x86 with 128 threads with the same config
around 1M RPS. It turns out that in fwlc_get_next_server(), before
calling fwlc_srv_reposition(), we have to drop the lock, and that
function takes it back again.
The problem is that anything can happen to our node during this time,
and it can be freed. Then when continuing our work, we later iterate
over it and its successors to find a node with an acceptable key, and
by doing so we can visit either uninitialized memory or simply nodes
that are no longer in the tree.
A first attempt at fixing this consisted in artificially incrementing
the elements count before dropping the lock, but that turned out to be
even worse because other threads could loop forever on such an element
looking for an entry that does not exist. Maintaining a separate
refcount didn't work well either, and it required dealing with the
memory release while dropping it, which is really not convenient.
Here we're taking a different approach consisting in simply not
trusting this node anymore and going back to the beginning of the
loop, as is done at a few other places as well. This way we can safely
ignore the possibly released node, and the test runs reliably on both
the ARM and x86 platforms mentioned above. No performance regression
was observed either, likely because this operation is quite rare.
No backport is needed since this appeared with the leastconn rework
in 3.2.
2025-05-19 09:57:03 -04:00
						/*
						 * The server has more requests than expected,
						 * let's try to reposition it, to avoid too
						 * many threads using the same server at the
						 * same time. From the moment we release the
						 * lock, we cannot trust the node nor tree_elt
						 * anymore, so we need to loop back to the
						 * beginning.
						 */
						if (i >= FWLC_LISTS_NB) {
							HA_RWLOCK_RDUNLOCK(LBPRM_LOCK, &p->lbprm.lock);
							fwlc_srv_reposition(s);
							HA_RWLOCK_RDLOCK(LBPRM_LOCK, &p->lbprm.lock);
							goto redo;
						}
						i++;
						continue;
					}
					if (s != srvtoavoid) {
						srv = s;
						found = 1;
						break;
					}
					avoided = s;
				}
				unusable++;
				i++;
			} else if (mt_list.next == &tree_elt->srv_list[(i + orig_nb) % FWLC_LISTS_NB]) {
				/* this list is empty, try the next one */
				i++;
				continue;
			} else {
				/* the list head is busy (MT_LIST_BUSY), try the next one */
				i++;
				continue;
			}
		}

		if (found)
			break;
		do {
			node = eb32_next(node);
		} while (node && node->key < FWLC_LBPRM_SMALLEST(curseq));

		if (node) {
			uint64_t newcurseq = HA_ATOMIC_LOAD(&p->lbprm.lb_seq);

			/*
			 * If we have a bigger element than the smallest
			 * recorded, and we're up to date, update the
			 * smallest one.
			 */
			if (likely(newcurseq == curseq && FWLC_LBPRM_SMALLEST(newcurseq) < node->key)) {
				if (fwlc_set_seq_and_smallest(&p->lbprm, curseq, FWLC_LBPRM_SEQ(curseq), node->key) != 0) {
					curseq = FWLC_LBPRM_SEQ(curseq) | ((uint64_t)node->key << 32);
					__ha_barrier_store();
					continue;
				}
			}

			/*
			 * Somebody added a new server in a node we already
			 * skipped, so retry from the beginning.
			 */
			if (unlikely(FWLC_LBPRM_SMALLEST(newcurseq) < node->key && FWLC_LBPRM_SEQ(newcurseq) != FWLC_LBPRM_SEQ(curseq))) {
				curseq = newcurseq;
				goto redo;
			}
			curseq = newcurseq;
		} else {
			uint64_t newcurseq = _HA_ATOMIC_LOAD(&p->lbprm.lb_seq);

			/*
			 * No more node, but somebody changed the tree, so
			 * it's worth trying again.
			 */
			if (FWLC_LBPRM_SEQ(newcurseq) != FWLC_LBPRM_SEQ(curseq)) {
				curseq = newcurseq;
				goto redo;
			}
		}
	}

	if (!srv)
		srv = avoided;
 out:
	HA_RWLOCK_RDUNLOCK(LBPRM_LOCK, &p->lbprm.lock);

	return srv;
}
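The lb_seq word manipulated above packs two values into a single atomically loadable 64-bit integer. The following is a hypothetical reconstruction, not copied from haproxy's headers: judging from the update `curseq = FWLC_LBPRM_SEQ(curseq) | ((uint64_t)node->key << 32);`, the low 32 bits carry a sequence number (bumped when the tree changes) and the high 32 bits carry the smallest known tree key, so both can be published and re-read in one atomic operation. The `fwlc_pack()` helper is illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the assumed lb_seq layout: sequence number in the low
 * 32 bits, smallest tree key in the high 32 bits.
 */
#define FWLC_LBPRM_SEQ(x)       ((uint64_t)(uint32_t)(x))
#define FWLC_LBPRM_SMALLEST(x)  ((uint32_t)((uint64_t)(x) >> 32))

/* combine a sequence number and a smallest key into one lb_seq word */
static inline uint64_t fwlc_pack(uint32_t seq, uint32_t smallest)
{
	return (uint64_t)seq | ((uint64_t)smallest << 32);
}
```

With this layout, a reader that loads lb_seq once gets a consistent (sequence, smallest) pair, and updating the smallest key while keeping the sequence is exactly the `FWLC_LBPRM_SEQ(curseq) | (key << 32)` expression seen in the function.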
/*
 * Local variables:
 *  c-indent-level: 8
 *  c-basic-offset: 8
 * End:
 */
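The `(i + orig_nb) % FWLC_LISTS_NB` indexing in the scan above is what spreads concurrent pickers across a tree element's lists: each thread draws a random starting list `orig_nb`, then walks the remaining lists round-robin. A toy model, with illustrative names rather than haproxy's:

```c
#include <assert.h>

/* same role as haproxy's FWLC_LISTS_NB, value chosen arbitrarily here */
#define FWLC_LISTS_NB 4

/* record the order in which a scan starting at orig_nb visits the lists */
static void fwlc_scan_order(int orig_nb, int out[FWLC_LISTS_NB])
{
	int i;

	for (i = 0; i < FWLC_LISTS_NB; i++)
		out[i] = (i + orig_nb) % FWLC_LISTS_NB;
}
```

Two threads with different `orig_nb` values start on different list heads (so they rarely contend on the same mt_list), yet each scan still visits every list of the element exactly once.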