redis/tests
Mincho Paskalev e3c38aab66
Handle primary/replica clients in IO threads (#14335)
# Problem

While introducing Async IO
threads(https://github.com/redis/redis/pull/13695) primary and replica
clients were left to be handled inside main thread due to data race and
synchronization issues. This PR solves this issue with the additional
hope it increases performance of replication.

# Overview

## Moving the clients to IO threads

Since clients first participate in a handshake and an RDB replication
phases it was decided they are moved to IO-thread after RDB replication
is done. For primary client this was trivial as the master client is
created only after RDB sync (+ some additional checks one can see in
`isClientMustHandledByMainThread`). Replica clients though are moved to
IO threads immediately after connection (as are all clients) so
currently in `unstable` replication happens while this client is in
IO-thread. In this PR it was moved to main thread after receiving the
first `REPLCONF` message from the replica, but it is a bit hacky and we
can remove it. I didn't find issues between the two versions.

## Primary client (replica node)

We have few issues here:
- during `serverCron` a `replicationCron` is ran which periodically
sends `REPLCONF ACK` message to the master, also checks for timed-out
master. In order to prevent data races we utilize`IOThreadClientsCron`.
The client is periodically sent to main thread and during
`processClientsFromIOThread` it's checked if it needs to run the
replication cron behaviour.

- data races with main thread - specifically `lastinteraction` and
`read_reploff` members of the primary client that are written to in
`readQueryFromClient` could be accessed at the same time from main
thread during execution of `INFO REPLICATION`(`genRedisInfoString`). To
solve this the members were duplicated so if the client is in IO-thread
it writes to the duplicates and they are synced with the original
variables each time the client is send to main thread ( that means `INFO
REPLICATION` could potentially return stale values).

- During `freeClient` the primary client is fetched to main thread but
when caching it(`replicationCacheMaster`) the thread id will remain the
id of the IO thread it was from. This creates problems when resurrecting
the master client. Here the call to `unbindClientFromIOThreadEventLoop`
in `freeClient` was rewritten to call `keepClientInMainThread` which
automatically fixes the problem.

- During `exitScriptTimedoutMode` the master is queued for reprocessing
(specifically process any pending commands ASAP after it's unblocked).
We do that by putting it in the `server.unblocked_clients` list, which
are processed in the next `beforeSleep` cycle in main thread. Since this
will create a contention between main and IO thread, we just skip this
queueing in `unblocked_clients` and just queue the client to main thread
- the `processClientsFromIOThread` will process the pending commands
just as main would have.

## Replica clients (primary node)

We move the client after RDB replication is done and after replication
backlog is fed with its first message.
We do that so that the client's reference to the first replication
backlog node is initialized before it's read from IO-thread, hence no
contention with main thread on it.

### Shared replication buffer

Currently in unstable the replication buffer is shared amongst clients.
This is done via clients holding references to the nodes inside the
buffer. A node from the buffer can be trimmed once each replica client
has read it and send its contents. The reference is
`client->ref_repl_buf_node`. The replication buffer is written to by
main thread in `feedReplicationBuffer` and the refcounting is intrusive
- it's inside the replication-buffer nodes themselves.

Since the replica client changes the refcount (decreases the refcount of
the node it has just read, and increases the refcount of the next node
it starts to read) during `writeToClient` we have a data race with main
thread when it feeds the replication buffer. Moreover, main thread also
updates the `used` size of the node - how much it has written to it,
compared to its capacity which the replica client relies on to know how
much to read. Obviously replica being in IO-thread creates another data
race here. To mitigate these issues a few new variables were added to
the client's struct:

- `io_curr_repl_node` - starting node this replica is reading from
inside IO-thread
- `io_bound_repl_node` - the last node in the replication buffer the
replica sees before being send to IO-thread.

These values are only allowed to be updated in main thread. The client
keeps track of how much it has read into the buffer via the old
`ref_repl_buf_node`. Generally while in IO-thread the replica client
will now keep refcount of the `io_curr_repl_node` until it's processed
all the nodes up to `io_bound_repl_node` - at that point its returned to
main thread which can safely update the refcounts.
The `io_bound_repl_node` reference is there so the replica knows when to
stop reading from the repl buffer - imagine that replica reads from the
last node of the replication buffer while main thread feeds data to it -
we will create a data race on the `used` value
(`_writeToClientSlave`(IO-thread) vs `feedReplicationBuffer`(main)).
That's why this value is updated just before the replica is being send
to IO thread.
*NOTE*, this means that when replicas are handled by IO threads they
will hold more than one node at a time (i.e `io_curr_repl_node` up to
`io_bound_repl_node`) meaning trimming will happen a bit less
frequently. Tests show no significant problems with that.
(tnx to @ShooterIT for the `io_curr_repl_node` and `io_bound_repl_node`
mechanism as my initial implementation had similar semantics but was way
less clear)

Example of how this works:

* Replication buffer state at time N:
   | node 0| ... | node M, used_size K |
* replica caches `io_curr_repl_node`=0, `io_bound_repl_node`=M and
`io_bound_block_pos`=K
* replica moves to IO thread and processes all the data it sees
* Replication buffer state at time N + 1:
| node 0| ... | node M, used_size Full | |node M + 1| |node M + 2,
used_size L|, where Full > M
* replica moves to main thread at time N + 1, at this point following
happens
   - refcount to node 0 (io_curr_repl_node) is decreased
- `ref_repl_buf_node` becomes node M(io_bound_repl_node) (we still have
size-K bytes to process from there)
- refcount to node M is increased (now all nodes from 0 up to M-1
including can be trimmed unless some other replica holds reference to
them)
- And just before the replica is send back to IO thread the following
are updated:
   - `io_bound_repl_node` ref becomes node M+2
   - `io_bound_block_pos` becomes L

Note that replica client is only moved to main if it has processed all
the data it knows about (i.e up to `io_bound_repl_node` +
`io_bound_block_pos`)

### Replica clients kept in main as much as possible

During implementation an issue arose - how fast is the replica client
able to get knowledge about new data from the replication buffer and how
fast can it trim it. In order for that to happen ASAP whenever a replica
is moved to main it remains there until the replication buffer is fed
new data. At that point its put in the pending write queue and special
cased in handleClientsWithPendingWrites so that its send to IO thread
ASAP to write the new data to replica. Also since each time the replica
writes its whole repl data it knows about that means after it's send to
main thread `processClientsFromIOThread` is able to immediately update
the refcounts and trim whatever it can.

### ACK messages from primary

Slave clients need to periodically read `REPLCONF ACK` messages from
client. Since replica can remain in main thread indefinitely if no DB
change occurs, a new atomic `pending_read` was added during
`readQueryFromClient`. If a replica client has a pending read it's
returned back to IO-thread in order to process the read even if there is
no pending repl data to write.

### Replicas during shutdown

During shutdown the main thread pauses write actions and periodically
checks if all replicas have reached the same replication offset as the
primary node. During `finishShutdown` that may or may not be the case.
Either way a client data may be read from the replicas and even we may
try to write any pending data to them inside `flushSlavesOutputBuffers`.
In order to prevent races all the replicas from IO threads are moved to
main via `fetchClientFromIOThread`. A cancel of the shutdown should be
ok, since the mechanism employed by `handleClientsWithPendingWrites`
should return the client back to IO thread when needed.

## Notes

While adding new tests timing issues with Tsan tests were found and
fixed.

Also there is a data race issue caught by Tsan on the `last_error`
member of the `client` struct. It happens when both IO-thread and main
thread make a syscall using a `client` instance - this can happen only
for primary and replica clients since their data can be accessed by
commands send from other clients. Specific example is the `INFO
REPLICATION` command.
Although other such races were fixed, as described above, this once is
insignificant and it was decided to be ignored in `tsan.sup`.

---------

Co-authored-by: Yuan Wang <wangyuancode@163.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2026-01-21 16:19:12 +02:00
..
assets Optimistic locking for string objects - compare-and-set and compare-and-delete (#14435) 2025-10-21 10:32:49 +03:00
cluster Handle primary/replica clients in IO threads (#14335) 2026-01-21 16:19:12 +02:00
helpers Fix daily CI for atomic slot migration (#14459) 2025-10-25 09:00:33 +08:00
integration Handle primary/replica clients in IO threads (#14335) 2026-01-21 16:19:12 +02:00
modules Fix crash when calling internal container command without arguments (#14690) 2026-01-21 08:38:04 +02:00
sentinel Handle primary/replica clients in IO threads (#14335) 2026-01-21 16:19:12 +02:00
support Handle primary/replica clients in IO threads (#14335) 2026-01-21 16:19:12 +02:00
tmp minor fixes to the new test suite, html doc updated 2010-05-14 18:48:33 +02:00
unit Handle primary/replica clients in IO threads (#14335) 2026-01-21 16:19:12 +02:00
vectorset Add daily CI for vectorset (#14302) 2025-12-10 08:52:43 +08:00
instances.tcl Handle primary/replica clients in IO threads (#14335) 2026-01-21 16:19:12 +02:00
README.md Add thread sanitizer run to daily CI (#13964) 2025-06-02 10:13:23 +03:00
test_helper.tcl Add Atomic Slot Migration (ASM) support (#14414) 2025-10-22 15:56:20 +03:00

Redis Test Suite

The normal execution mode of the test suite involves starting and manipulating local redis-server instances, inspecting process state, log files, etc.

The test suite also supports execution against an external server, which is enabled using the --host and --port parameters. When executing against an external server, tests tagged external:skip are skipped.

There are additional runtime options that can further adjust the test suite to match different external server configurations:

Option Impact
--singledb Only use database 0, don't assume others are supported.
--ignore-encoding Skip all checks for specific encoding.
--ignore-digest Skip key value digest validations.
--cluster-mode Run in strict Redis Cluster compatibility mode.
--large-memory Enables tests that consume more than 100mb

Tags

Tags are applied to tests to classify them according to the subsystem they test, but also to indicate compatibility with different run modes and required capabilities.

Tags can be applied in different context levels:

  • start_server context
  • tags context that bundles several tests together
  • A single test context.

The following compatibility and capability tags are currently used:

Tag Indicates
external:skip Not compatible with external servers.
cluster:skip Not compatible with --cluster-mode.
large-memory Test that requires more than 100mb
tls:skip Not compatible with --tls.
tsan:skip Not compatible with running under thread sanitizer.
needs:repl Uses replication and needs to be able to SYNC from server.
needs:debug Uses the DEBUG command or other debugging focused commands (like OBJECT REFCOUNT).
needs:pfdebug Uses the PFDEBUG command.
needs:config-maxmemory Uses CONFIG SET to manipulate memory limit, eviction policies, etc.
needs:config-resetstat Uses CONFIG RESETSTAT to reset statistics.
needs:reset Uses RESET to reset client connections.
needs:save Uses SAVE or BGSAVE to create an RDB file.

When using an external server (--host and --port), filtering using the external:skip tags is done automatically.

When using --cluster-mode, filtering using the cluster:skip tag is done automatically.

When not using --large-memory, filtering using the largemem:skip tag is done automatically.

In addition, it is possible to specify additional configuration. For example, to run tests on a server that does not permit SYNC use:

./runtest --host <host> --port <port> --tags -needs:repl