As reported in GH issue #99, when hard-stop-after triggers and threads
are in use, the chance that any thread releases the resources in use by
the other ones is non-null. Thus no thread should be allowed to deinit()
nor exit by itself.
Here we take a different approach. We simply use a 3rd possible value
for the "killed" variable so that all threads know they must break out
of the run-poll-loop and immediately stop.
This patch was tested by commenting the stream_shutdown() calls in
hard_stop() to increase the chances to see a stream use released
resources. With this fix applied, it never crashes anymore.
This fix should be backported to 1.9 and 1.8.
Many times we've been missing per-process traffic statistics. While it
didn't make sense in multi-process mode, with threads it does. Thus we
now have a counter of bytes emitted by raw_sock, and a freq counter for
these as well. However, freq_ctr are limited to 32 bits, and given that
loads of 300 Gbps have already been reached over a loopback using
splicing, we need to downscale this a bit. Here we're storing 1/32 of
the byte rate, which gives a theorical limit of 128 GB/s or ~1 Tbps,
which is more than enough. Let's have fun re-reading this sentence in
2029 :-) The values can be read in "show info" output on the CLI.
These commands don't follow the same flow as the rest of the commands,
each of them iterates over all header lines before switching to the
next directive. In addition they make no distinction between start
line and headers and can lead to unparsable rewrites which are very
difficult to deal with internally.
Most of them are still occasionally found in configurations, mainly
because of the usual "we've always done this way". By marking them
deprecated and emitting a warning and recommendation on first use of
each of them, we will raise users' awareness of users regarding the
cleaner, faster and more reliable alternatives.
Some use cases of "reqrep" still appear from time to time for URL
rewriting that is not so convenient with other rules. But at least
users facing this requirement will explain their use case so that we
can best serve them. Some discussion started on this subject in a
thread linked to from github issue #100.
The goal is to remove them in 2.1 since they require to reparse the
result before indexing it and we don't want this hack to live long.
The following directives were marked deprecated :
-reqadd
-reqallow
-reqdel
-reqdeny
-reqiallow
-reqidel
-reqideny
-reqipass
-reqirep
-reqitarpit
-reqpass
-reqrep
-reqtarpit
-rspadd
-rspdel
-rspdeny
-rspidel
-rspideny
-rspirep
-rsprep
We currently have the ability to register functions to be called early
on thread creation and at thread deinitialization. It turns out this is
not sufficient because certain such functions may use resources that are
being allocated by the other ones, thus creating a race condition depending
only on the linking order. For example the mworker needs to register a
file descriptor while the pollers will reallocate the fd_updt[] array.
Similarly logs and trashes may be used by some init functions while it's
unclear whether they have been deduplicated.
The same issue happens on deinit, if the fd_updt[] or trash is released
before some functions finish to use them, we'll get into trouble.
This patch creates a couple of early and late callbacks for per-thread
allocation/freeing of resources. A few init functions were moved there,
and the fd init code was split between the two (since it used to both
allocate and initialize at once). This way the init/deinit sequence is
expected to be safe now.
This patch should be backported to 1.9 as at least the trash/log issue
seems to be present. The run_thread_poll_loop() code is a bit different
there as the mworker is not a callback, but it will have no effect and
it's enough to drop the mworker changes.
This bug was reported by Ilya Shipitsin in github issue #104.
Event ports are kqueue/epoll polling class for Solaris. Code is based
on https://github.com/joyent/haproxy-1.8/tree/joyent/dev-v1.8.8.
Event ports are available only on SunOS systems derived from
Solaris 10 and later (including illumos systems).
It doesn't make sense to keep this struct thread_info in global.h, it
causes difficulties to access its contents from hathreads.h, let's move
it to the threads where it ought to have been created.
This is the per-thread CPU runtime clock, it will be used to measure
the CPU usage of each thread and by the lockup detection mechanism. It
must only be retrieved at the beginning of run_thread_poll_loop() since
the thread must already have been started for this. But it must be done
before performing any per-thread initcall so that all thread init
functions have access to the clock ID.
Note that it could make sense to always have this clockid available even
in non-threaded situations and place the process' clock there instead.
But it would add portability issues which are currently easy to deal
with by disabling threads so it may not be worth it for now.
This way we'll be able to store more per-thread information than just
the pthread pointer. The storage became an array of struct instead of
an allocated array since it's very small (typically 512 bytes) and not
worth the hassle of dealing with memory allocation on this. The array
was also renamed thread_info to make its intended usage more explicit.
Currently the thread array is a local variable inside a function block
and there is no access to it from outside, which often complicates
debugging. Let's make it global and export it. Also the allocation
return is now checked.
When we initially experimented with threads and processes support, we
needed to implement arrays of threads per process for cpu-map, but this
is not needed anymore since we support either threads or processes.
Let's simply make the thread-based cpu-map per thread and not per
thread and per process since that's not used anymore. Doing so reduces
the global struct from 33kB to 1.5kB.
As by default we add all keepalive connections to the idle pool, if we run
into a pathological case, where all client don't do keepalive, but the server
does, and haproxy is configured to only reuse "safe" connections, we will
soon find ourself having lots of idling, unusable for new sessions, connections,
while we won't have any file descriptors available to create new connections.
To fix this, add 2 new global settings, "pool_low_ratio" and "pool_high_ratio".
pool-low-fd-ratio is the % of fds we're allowed to use (against the maximum
number of fds available to haproxy) before we stop adding connections to the
idle pool, and destroy them instead. The default is 20. pool-high-fd-ratio is
the % of fds we're allowed to use (against the maximum number of fds available
to haproxy) before we start killing idling connection in the event we have to
create a new outgoing connection, and no reuse is possible. The default is 25.
It's always a pain to get a core dump when enabling user/group setting
(which disables the dumpable flag on Linux), when using a chroot and/or
when haproxy is started by a service management tool which requires
complex operations to just raise the core dump limit.
This patch introduces a new "set-dumpable" global directive to work
around these troubles by doing the following :
- remove file size limits (equivalent of ulimit -f unlimited)
- remove core size limits (equivalent of ulimit -c unlimited)
- mark the process dumpable again (equivalent of suid_dumpable=1)
Some of these will depend on the operating system. This way it becomes
much easier to retrieve a core file. Temporarily moving the chroot to
a user-writable place generally enough.
Since the introduction of the options field, we can use it to store the
type of process.
type = 'm' is replaced by PROC_O_TYPE_MASTER
type = 'w' is replaced by PROC_O_TYPE_WORKER
type = 'e' is replaced by PROC_O_TYPE_PROG
The old values are still used in the HAPROXY_PROCESSES environment
variable to pass the information during a reload.
This option is already the default, but its opposite 'no option
start-on-reload' allows the master to keep a previous instance of a
program and don't start a new one upon a reload.
The old program will then appear as a current one in "show proc" and
could also trigger an exit-on-failure upon a segfault.
Previously we were assuming than a process was in a leaving state when
its number of reload was greater than 0. With mworker programs it's not
the case anymore so we need to store a leaving state.
This patch implements the external binary support in the master worker.
To configure an external process, you need to use the program section,
for example:
program dataplane-api
command ./dataplane_api
Those processes are launched at the same time as the workers.
During a reload of HAProxy, those processes are dealing with the same
sequence as a worker:
- the master is re-executed
- the master sends a USR1 signal to the program
- the master launches a new instance of the program
During a stop, or restart, a SIGTERM is sent to the program.
Let's keep a copy of these initial values. They will be useful to
compute automatic maxconn, as well as to restore proper limits when
doing an execve() on external checks.
tune.listener.multi-queue { on | off }
Enables ('on') or disables ('off') the listener's multi-queue accept which
spreads the incoming traffic to all threads a "bind" line is allowed to run
on instead of taking them for itself. This provides a smoother traffic
distribution and scales much better, especially in environments where threads
may be unevenly loaded due to external activity (network interrupts colliding
with one thread for example). This option is enabled by default, but it may
be forcefully disabled for troubleshooting or for situations where it is
estimated that the operating system already provides a good enough
distribution and connections are extremely short-lived.
For some embedded systems, it's pointless to have 32- or even 64- large
arrays of processes when it's known that much fewer processes will be
used in the worst case. Let's introduce this MAX_PROCS define which
contains the highest number of processes allowed to run at once. It
still defaults to LONGBITS but may be lowered.
These two functions return either all_{proc,threads}_mask, or the argument.
This is used to default to all_proc_mask or all_threads_mask when not set
on bind_conf or proxies.
Most calls to hap_register_post_check(), hap_register_post_deinit(),
hap_register_per_thread_init(), hap_register_per_thread_deinit() can
be done using initcalls and will not require a constructor anymore.
Let's create a set of simplified macros for this, called respectively
REGISTER_POST_CHECK, REGISTER_POST_DEINIT, REGISTER_PER_THREAD_INIT,
and REGISTER_PER_THREAD_DEINIT.
Some files were not modified because they wouldn't benefit from this
or because they conditionally register (e.g. the pollers).
Most register_build_opts() calls use static strings. These ones were
replaced with a trivial REGISTER_BUILD_OPTS() statement adding the string
and its call to the STG_REGISTER section. A dedicated section could be
made for this if needed, but there are very few such calls for this to
be worth it. The calls made with computed strings however, like those
which retrieve OpenSSL's version or zlib's version, were moved to a
dedicated function to guarantee they are called late in the process.
For example, the SSL call probably requires that SSL_library_init()
has been called first.
In some situations, especially when dealing with low latency on processors
supporting a variable frequency or when running inside virtual machines,
each time the process waits for an I/O using the poller, the processor
goes back to sleep or is offered to another VM for a long time, and it
causes excessively high latencies.
A solution to this provided by this patch is to enable busy polling using
a global option. When busy polling is enabled, the pollers never sleep and
loop over themselves waiting for an I/O event to happen or for a timeout
to occur. On multi-processor machines it can significantly overheat the
processor but it usually results in much lower latencies.
A typical test consisting in injecting traffic over a single connection at
a time over the loopback shows a bump from 4640 to 8540 connections per
second on forwarded connections, indicating a latency reduction of 98
microseconds for each connection, and a bump from 12500 to 21250 for
locally terminated connections (redirects), indicating a reduction of
33 microseconds.
It is only usable with epoll and kqueue because select() and poll()'s
API is not convenient for such usages, and the level of performance they
are used in doesn't benefit from this anyway.
The option, which obviously remains disabled by default, can be turned
on using "busy-polling" in the global section, and turned off later
using "no busy-polling". Its status is reported in "show info" to help
troubleshooting suspicious CPU spikes.
At the moment the situation with activity measurement is quite tricky
because the struct activity is defined in global.h and declared in
haproxy.c, with operations made in time.h and relying on freq_ctr
which are defined in freq_ctr.h which itself includes time.h. It's
barely possible to touch any of these files without breaking all the
circular dependency.
Let's move all this stuff to activity.{c,h} and be done with it. The
measurement of active and stolen time is now done in a dedicated
function called just after tv_before_poll() instead of mixing the two,
which used to be a lazy (but convenient) decision.
No code was changed, stuff was just moved around.
In the output of 'show fd', the worker CLI's socketpair was still
handled by an "unknown" function. That can be really confusing during
debug. Fixed it by showing "mworker_accept_wrapper" instead.
The mworker waitpid mode (which is used when a reload failed to apply
the new configuration) was still using a specific initialisation path.
That's a problem since we use a polling loop in the master now, the
master proxy is not initialized and the master CLI is not activated.
This patch removes the initialisation code of the wait mode and
introduce the MODE_MWORKER_WAIT in order to use the same init path as
the MODE_MWORKER with some exceptions. It allows to use the master proxy
and the master CLI during the waitpid mode.
This patch allows a process to properly quit when some jobs are still
active, this feature is handled by the unstoppable_jobs variable, which
must be atomically incremented.
During each new iteration of run_poll_loop() the break condition of the
loop is now (jobs - unstoppable_jobs) == 0.
The unique usage of this at the moment is to handle the socketpair CLI
of a the worker during the stopping of the process. During the soft
stop, we could mark the CLI listener as an unstoppable job and still
handle new connections till every other jobs are stopped.
The active peers output indicates both the number of established peers
connections and the number of peers connection attempts. The new counter
"ConnectedPeers" also indicates the number of currently connected peers.
This helps detect that some peers cannot be reached for example. It's
worth mentioning that this value changes over time because unused peers
are often disconnected and reconnected. Most of the time it should be
equal to ActivePeers.
Peers are the last type of activity which can maintain a job present, so
it's important to report that such an entity is still active to explain
why the job count may be higher than zero. Here by "ActivePeers" we report
peers sessions, which include both established connections and outgoing
connection attempts.
Add a struct server pointer in the mworker_proc struct so we can easily
use it as a target for the mworker proxy.
pcli_prefix_to_pid() is used to find the right PID of the worker
when using a prefix in the CLI. (@master, @#<relative pid> , @<pid>)
pcli_pid_to_server() is used to find the right target server for the
CLI proxy.
The master process does not need all the keywords of the cli, add 2
flags to chose which keyword to use.
It might be useful to activate some of them in a debug mode later...
The purpose is to detect if threads or processes are competing for the
same CPU. This can happen when threads are incorrectly bound, or after a
reload if the previous process still has an important activity. With
threads this situation is problematic because a preempted thread holding
a lock will block other ones waiting for this lock to be released.
A first attempt consisted in measuring the cumulated lost time more
precisely but the system's scheduler is smart enough to try to limit the
thread preemption rate by mostly context switching during poll()'s blank
periods, so most of the time lost is not seen. In essence this is good
because it means a thread is not preempted with a lock held, and even
regarding the rendez-vous point it cannot prevent the other ones from
making progress. But still it happens tens to hundreds of times per
second that a thread might be preempted, so it's still possible to detect
that the situation is happening, thus it's interesting to measure and
report its frequency.
Each time we enter the poller, we check the CPU time spent working and
see if we've lost time doing something else. To limit false positives,
we're only interested in losses of 500 microseconds or more (i.e. half
a clock tick on a 1 kHz system). If so, it indicates that some time was
stolen by another thread or process. Note that we purposely store some
sub-millisecond counters so that under heavy traffic with a 1 kHz clock,
it's still possible to measure something without being subject to the
risk of rounding errors (i.e. if exactly 1 ms is stolen it's possible
that the time difference could often be slightly lower).
This counter of lost CPU time slots time is reported in "show activity"
in numbers of milliseconds of CPU lost per second, per 15s, and total
over the process' life. By definition, the per-second counter cannot
report values larger than 1000 per thread per second and the 15s one
will be limited to 15000/s in the worst case, but it's possible that
peak values exceed such thresholds after long pauses.
Add a new pipe, one per thread, so that we can write on it to wake a thread
sleeping in a poller, and use it to wake threads supposed to take care of a
task, if they are all sleeping.
Now all the code used to manipulate chunks uses a struct buffer instead.
The functions are still called "chunk*", and some of them will progressively
move to the generic buffer handling code as they are cleaned up.
There's no real reason to have a specific scheduler for applets anymore, so
nuke it and just use tasks. This comes with some benefits, the first one
being that applets cannot induce high latencies anymore since they share
nice values with other tasks. Later it will be possible to configure the
applets' nice value. The second benefit is that the applet scheduler was
not very thread-friendly, having a big lock around it in prevision of this
change. Thus applet-intensive workloads should now scale much better with
threads.
Some more improvement is possible now : some applets also use a task to
handle timers and timeouts. These ones could now be simplified to use only
one task.
A number of counters have been added at special places helping better
understanding certain bug reports. These counters are maintained per
thread and are shown using "show activity" on the CLI. The "clear
counters" commands also reset these counters. The output is sent as a
single write(), which currently produces up to about 7 kB of data for
64 threads. If more counters are added, it may be necessary to write
into multiple buffers, or to reset the counters.
To backport to 1.8 to help collect more detailed bug reports.