HAProxy - Load balancer
Find a file
Willy Tarreau 2a4523f6f4 BUG/MAJOR: pools: fix possible race with free() in the lockless variant
In GH issue #1275, Fabiano Nunes Parente provided a nicely detailed
report showing reproducible crashes under musl. Musl is one of the libs
coming with a simple allocator for which we prefer to keep the shared
cache. On x86 we have a DWCAS so the lockless implementation is enabled
for such libraries.

And this implementation has had a small race since day one: the allocator
will need to read the first object's <next> pointer to place it into the
free list's head. If another thread picks the same element and immediately
releases it, while both the local and the shared pools are too crowded, it
will be freed to the OS. If the libc's allocator immediately releases it,
the memory area is unmapped and we can have a crash while trying to read
that pointer. However there is no problem as long as the item remains
mapped in memory because whatever value found there will not be placed
into the head since the counter will have changed.

The probability for this to happen is extremely low, but as analyzed by
Fabiano, it increases with the buffer size. On 16 threads it's relatively
easy to reproduce with 2MB buffers above 200k req/s, where it should
happen within the first 20 seconds of traffic usually.

This is a structural issue for which there are two non-trivial solutions:
  - place a read lock in the alloc call and a barrier made of lock/unlock
    in the free() call to force to serialize operations; this will have
    a big performance impact since free() is already one of the contention
    points;

  - change the allocator to use a self-locked head, similar to what is
    done in the MT_LISTS. This requires two memory writes to the head
    instead of a single one, thus the overhead is exactly one memory
    write during alloc and one during free;

This patch implements the second option. A new POOL_DUMMY pointer was
defined for the locked pointer value, allowing to both read and lock it
with a single xchg call. The code was carefully optimized so that the
locked period remains the shortest possible and that bus writes are
avoided as much as possible whenever the lock is held.

Tests show that while a bit slower than the original lockless
implementation on large buffers (2MB), it's 2.6 times faster than both
the no-cache and the locked implementation on such large buffers, and
remains as fast or faster than the all implementations when buffers are
48k or higher. Tests were also run on arm64 with similar results.

Note that this code is not used on modern libcs featuring a fast allocator.

A nice benefit of this change is that since it removes a dependency on
the DWCAS, it will be possible to remove the locked implementation and
replace it with this one, that is then usable on all systems, thus
significantly increasing their performance with large buffers.

Given that lockless pools were introduced in 1.9 (not supported anymore),
this patch will have to be backported as far as 2.0. The code changed
several times in this area and is subject to many ifdefs which will
complicate the backport. What is important is to remove all the DWCAS
code from the shared cache alloc/free lockless code and replace it with
this one. The pool_flush() code is basically the same code as the
allocator, retrieving the whole list at once. If in doubt regarding what
barriers to use in older versions, it's safe to use the generic ones.

This patch depends on the following previous commits:

 - MINOR: pools: do not maintain the lock during pool_flush()
 - MINOR: pools: call malloc_trim() under thread isolation
 - MEDIUM: pools: use a single pool_gc() function for locked and lockless

The last one also removes one occurrence of an unneeded DWCAS in the
code that was incompatible with this fix. The removal of the now unused
seq field will happen in a future patch.

Many thanks to Fabiano for his detailed report, and to Olivier for
his help on this issue.
2021-06-10 17:46:50 +02:00
.github CI: Make matrix.py executable and add shebang 2021-06-08 21:30:36 +02:00
addons BUG/MEDIUM: opentracing: initialization before establishing daemon and/or chroot mode 2021-06-10 06:45:39 +02:00
admin DOC: fix a few remainig cases of "Haproxy" and "HAproxy" in doc and comments 2021-05-09 06:50:46 +02:00
dev CLEANUP: dev/flags: remove useless test in the stdin number parser 2021-04-03 15:29:10 +02:00
doc BUG/MINOR: server: explicitly set "none" init-addr for dynamic servers 2021-06-10 17:44:05 +02:00
examples EXAMPLES: add a trivial config for quick testing 2021-05-12 17:54:56 +02:00
include BUG/MAJOR: pools: fix possible race with free() in the lockless variant 2021-06-10 17:46:50 +02:00
reg-tests REGTESTS: ssl: Add "show ssl ocsp-response" test 2021-06-10 16:44:11 +02:00
scripts SCRIPTS: opentracing: enable parallel builds in build-ot.sh 2021-06-10 07:35:15 +02:00
src BUG/MAJOR: pools: fix possible race with free() in the lockless variant 2021-06-10 17:46:50 +02:00
tests CLEANUP: lists/tree-wide: rename some list operations to avoid some confusion 2021-04-21 09:20:17 +02:00
.cirrus.yml CI: introduce scripts/build-vtest.sh for installing VTest 2021-05-18 10:48:30 +02:00
.gitattributes MINOR: Configure the cpp userdiff driver for *.[ch] in .gitattributes 2021-02-22 18:17:57 +01:00
.gitignore ADDONS: make addons/ discoverable by git via .gitignore 2021-05-07 16:48:14 +02:00
.travis.yml CI: introduce scripts/build-vtest.sh for installing VTest 2021-05-18 10:48:30 +02:00
BRANCHES DOC: fix some spelling issues over multiple files 2021-01-08 14:53:47 +01:00
CHANGELOG [RELEASE] Released version 2.5-dev0 2021-05-14 09:36:37 +02:00
CONTRIBUTING CLEANUP: contrib: remove the last references to the now dead contrib/ directory 2021-04-21 15:13:58 +02:00
INSTALL MINOR: version: it's development again 2021-05-14 09:36:08 +02:00
LICENSE LICENSE: add licence exception for OpenSSL 2012-09-07 13:52:26 +02:00
MAINTAINERS CONTRIB: move spoa_example out of the tree 2021-04-21 09:39:06 +02:00
Makefile REORG: errors: split errors reporting function from log.c 2021-06-07 16:58:15 +02:00
README DOC: create a BRANCHES file to explain the life cycle 2019-06-15 22:00:14 +02:00
ROADMAP DOC: update the outdated ROADMAP file 2019-06-15 21:59:54 +02:00
SUBVERS BUILD: use format tags in VERDATE and SUBVERS files 2013-12-10 11:22:49 +01:00
VERDATE [RELEASE] Released version 2.4.0 2021-05-14 09:03:30 +02:00
VERSION [RELEASE] Released version 2.5-dev0 2021-05-14 09:36:37 +02:00

The HAProxy documentation has been split into a number of different files for
ease of use.

Please refer to the following files depending on what you're looking for :

  - INSTALL for instructions on how to build and install HAProxy
  - BRANCHES to understand the project's life cycle and what version to use
  - LICENSE for the project's license
  - CONTRIBUTING for the process to follow to submit contributions

The more detailed documentation is located into the doc/ directory :

  - doc/intro.txt for a quick introduction on HAProxy
  - doc/configuration.txt for the configuration's reference manual
  - doc/lua.txt for the Lua's reference manual
  - doc/SPOE.txt for how to use the SPOE engine
  - doc/network-namespaces.txt for how to use network namespaces under Linux
  - doc/management.txt for the management guide
  - doc/regression-testing.txt for how to use the regression testing suite
  - doc/peers.txt for the peers protocol reference
  - doc/coding-style.txt for how to adopt HAProxy's coding style
  - doc/internals for developer-specific documentation (not all up to date)