The reason for a switch in the HA roles was not always directly clear.
This change now introduces additional debug logging, indicating the
reasoning for either taking over or handing over the HA responsibility.
First, some logic was moved from the SQL query selecting active Icinga
DB instances to Go code. This allowed distinguishing between no
available responsible instances and responsible instances with an
expired heartbeat.
As the HA's peer timeout is logically bound to the Redis timeout, it
will now reference this timeout with an additional grace timeout. Doing
so eliminates a race between a handing over and a "forceful" take over.
As the old code indicated a takeover on the fact that no other instance
is active, it will now additionally check if it is already being the
active/responsible node. In this case, the takeover logic - which will
be interrupted at a later point as the node is already responsible - can
be skipped.
Next to the additional logging messages, both the takeover and handover
channel are now transporting a string to communicate the reason instead
of an empty struct{}. By doing so, both the "Taking over" and "Handing
over" log messages are enriched with reason.
This also required a change in the suppressed logging handling of the
HA.realize method, which got its logging enabled through the shouldLog
parameter. Now, there are both recurring events, which might be
suppressed, as well as state changing events, which should be logged.
Therefore, and because the logTicker's functionality was not clear to me
on first glance, I renamed it to routineLogTicker.
While dealing with the code, some function signature documentation were
added, to ease both mine as well as the understanding of future readers.
Additionally, the error handling of the SQL query selecting active
Icinga DB instances was changed slightly to also handle wrapped
sql.ErrNoRows errors.
Closes#688.
I just had the observation that blocking XREADs without timeouts (BLOCK
0) on multiple consecutive Redis restarts and I/O timeouts exceeds Redis
internal retries and eventually leads to fatal errors. @julianbrost
looked at this for clarification, here is his finding:
go-redis only considers a command successful when it returned something,
so a successfully started blocking XREAD consumes a retry attempt each
time the underlying Redis connection is terminated. If this happens
often before any element appears in the stream, this error is
propagated. (This also means that even with this PR, when restarting
Redis often enough so that a query never reaches the BLOCK 1sec, this
would still happen.)
https://github.com/Icinga/icingadb/pull/504#issuecomment-1164589244
Rather than using a JSON structure to convey these values, simply use the
existing format to communicate performance data to Icinga 2.
Also removes the reference to Go in the Redis structure, allowing this string
to be extended with more metrics in the future without running into naming
issues.
https://github.com/Icinga/icinga2/pull/9036 introduced a new environment ID for
Icinga DB that's written to the icinga:stats stream as field
"icingadb_environment". This commit updates the code to make use of this ID
instead of the one derived from the Icinga 2 Environment constant.
This is in preparation for adding foreign key constraints to the history
tables. For this, is is required to insert the rows into the different history
tables in a defined order.
Using Cond does not allow to reliably catch all events as one will only receive
events that occour after starting to listen. For heartbeat loss events it's
import to reliably catch them to not remain in an HA active state incorrectly.
fixes#360
This fixes a data race where the pairs channel was closed too early
when the context is canceled and therefore the outer errgroup
returns from Redis operations before Wait() is called on the inner
errgroup. Unfinished Go methods in the inner errgroup would then
try to work on a closed channel.
YAML is decoded by the structure of the YAML source document, not the Go
destination data structure. Therefore, the old code did not always call
UnmarshalYAML() on all sub-structs. Therefore, defaults were not always set but
zero values were used, resulting in all kind of strange behavior.
This commit changes the code so that it no longer relies on individual
UnmarshalYAML() functions to set the defaults for each sub-struct but instead
just sets all of them when creating the surrounding Config instance. It also
moves the config validation to separate Validate() functions.