In ae053b30 - BUG/MEDIUM: wdt: don't trigger the watchdog when p is unitialized:
wdt is not triggering until prev_cpu_time
is initialized to prevent unexpected process
termination.
Unfortunately this is not enough, some tasks could start
immediately after process startup, and in such cases
prev_cpu_time could be uninitialized, because
prev_cpu_time is set after the polling loop while
process_runnable_tasks() is executed before the polling loop.
It happens to be the case with lua tasks registered using
register_task function from lua script.
Those tasks are registered in early init stage of haproxy and
they are scheduled to run before the first polling loop,
leading to prev_cpu_time being uninitialized (equals 0)
on the thread when the task is first executed.
Because of this, if such tasks get stuck right away
(e.g: blocking IO) the watchdog won't behave as expected
and the thread will remain stuck indefinitely.
(polling loop for the thread won't run at all as
the thread is already stuck)
To solve this, we're now making sure that prev_cpu_time is first
set before any tasks are processed on the thread.
This is done by setting initial prev_cpu_time value directly
in clock_init_thread_date()
Thanks to Abhijeet Rastogi for reporting this unexpected behavior.
It could be backported in every stable versions.
(everywhere ae053b30 is, because both are related)
Tests with forced wakeups on a 24c/48t machine showed that we're caping
at 7.3M loops/s, which means 6.6 microseconds of loop delay without
having anything to do.
This is caused by two factors:
- the load and update of the now_offset variable
- the update of the global_now variable
What is happening is that threads are not running within the one-
microsecond time precision provided by gettimeofday(), so each thread
waking up sees a slightly different date and causes undesired updates
to global_now. But worse, these undesired updates mean that we then
have to adjust the now_offset to match that, and adds significant noise
to this variable, which then needs to be updated upon each call.
By only allowing sightly less precision we can completely eliminate
that contention. Here we're ignoring the 5 lowest bits of the usec
part, meaning that the global_now variable may be off by up to 31 us
(16 on avg). The variable is only used to correct the time drift some
threads might be observing in environments where CPU clocks are not
synchronized, and it's used by freq counters. In both cases we don't
need that level of precision and even one millisecond would be pretty
fine. We're just 30 times better at almost no cost since the global_now
and now_offset variables now only need to be updated 30000 times a
second in the worst case, which is unnoticeable.
After this change, the wakeup rate jumped from 7.3M/s to 66M/s, meaning
that the loop delay went from 6.6us to 0.73us, that's a 9x improvement
when under load! With real tasks we're seeing a boost from 28M to 52M
wakeups/s. The clock_update_global_date() function now only takes
1.6%, it's good enough so that we don't need to go further.
Pollers that support busy polling spend a lot of time (and cause
contention) updating the global date when they're looping over themselves
while it serves no purpose: what's needed is only an update on the local
date to know when to stop looping.
This patch splits clock_pudate_date() into a pair of local and global
update functions, so that pollers can be easily improved.
Since commit cc7a11ee3 ("MINOR: threads: set the tid, ltid and their bit
in thread_cfg") we ought not use (1UL << thr) to get the group mask for
thread <thr>, but (ha_thread_info[thr].ltid_bit). clock_report_idle()
needs this.
This also implies not using all_threads_mask anymore but taking the mask
from the tgroup since it becomes relative now.
The "thread_info" name was initially chosen to store all info about
threads but since we now have a separate per-thread context, there is
no point keeping some of its elements in the thread_info struct.
As such, this patch moves prev_cpu_time, prev_mono_time and idle_pct to
thread_ctx, into the thread context, with the scheduler parts. Instead
of accessing them via "ti->" we now access them via "th_ctx->", which
makes more sense as they're totally dynamic, and will be required for
future evolutions. There's no room problem for now, the structure still
has 84 bytes available at the end.
This removes the knowledge of clockid_t from anywhere but clock.c, thus
eliminating a source of includes burden. The unused clock_id field was
removed from thread_info, and the definition setting of clockid_t was
removed from compat.h. The most visible change is that the function
now_cpu_time_thread() now takes the thread number instead of a tinfo
pointer.
The code that deals with timer creation for the WDT was moved to clock.c
and is called with the few relevant arguments. This removes the need for
awareness of clock_id from wdt.c and as such saves us from having to
share it outside. The timer_t is also known only from both ends but not
from the public API so that we don't have to create a fake timer_t
anymore on systems which do not support it (e.g. macos).
This was previously open-coded in run_thread_poll_loop(). Now that
we have clock.c dedicated to such stuff, let's move the code there
so that we don't need to keep such ifdefs nor to depend on the
clock_id.
Instead of fiddling with before_poll and after_poll in
activity_count_runtime(), the function is now called by
clock_entering_poll() which passes it the number of microseconds
spent working. This allows to remove all calls to
activity_count_runtime() from the pollers.
The entering_poll/leaving_poll/measure_idle functions that were hard
to classify and used to move to various locations have now been placed
into clock.c since it's precisely about time-keeping. The functions
were renamed to clock_*. The samp_time and idle_time values are now
static since there is no reason for them to be read from outside.
There is currently a problem related to time keeping. We're mixing
the functions to perform calculations with the os-dependent code
needed to retrieve and adjust the local time.
This patch extracts from time.{c,h} the parts that are solely dedicated
to time keeping. These are the "now" or "before_poll" variables for
example, as well as the various now_*() functions that make use of
gettimeofday() and clock_gettime() to retrieve the current time.
The "tv_*" functions moved there were also more appropriately renamed
to "clock_*".
Other parts used to compute stolen time are in other files, they will
have to be picked next.